Theory Assignment 3
COMP 451 - Fundamentals of Machine Learning
Winter 2021
Preamble. The assignment is due April 6th at 11:59pm via MyCourses. Late work will be automatically
subject to a 20% penalty, and can be submitted up to 5 days after the deadline. You may scan written
answers or submit a typeset assignment, as long as you submit a single pdf file with clear indication of
what question each answer refers to. You may consult with other students in the class regarding solution
strategies, but you must list all the students that you consulted with on the first page of your submitted
assignment. You may also consult published papers, textbooks, and other resources, but you must cite any
source that you use in a non-trivial way (except the course notes). You must write the answer in your own
words and be able to explain the solution to the professor, if asked.
Question 1 [13 points]
In class we introduced the Gaussian mixture model (GMM). In this question, we will consider a mixture
of Bernoulli distributions. Here, our data points will be defined as m-dimensional vectors of binary values
$x \in \{0, 1\}^m$.
First, we will introduce a single multivariate Bernoulli distribution, which is defined by a mean vector µ
$$P(x \mid \mu) = \prod_{j=0}^{m-1} \mu[j]^{x[j]} (1 - \mu[j])^{(1 - x[j])}. \tag{1}$$
Thus, we see that the individual binary dimensions are independent for a single multivariate Bernoulli.
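The product over independent dimensions can be sketched in a few lines of Python. This is only an illustration: the function name `bernoulli_prob` and the example mean vector are assumptions made here, not part of the assignment.

```python
from math import prod


def bernoulli_prob(x, mu):
    """Probability of binary vector x under a multivariate Bernoulli with mean mu."""
    # Dimension j contributes mu[j] when x[j] == 1 and (1 - mu[j]) when x[j] == 0.
    return prod(m if xj == 1 else 1.0 - m for xj, m in zip(x, mu))


mu = [0.5, 0.25]  # hypothetical mean vector with m = 2
all_points = [(0, 0), (0, 1), (1, 0), (1, 1)]
probs = {x: bernoulli_prob(x, mu) for x in all_points}

print(probs[(1, 0)])        # 0.5 * (1 - 0.25) = 0.375
print(sum(probs.values()))  # probabilities over all binary vectors sum to 1.0
```

Because every dimension is handled by its own factor, the four probabilities above sum to one without any extra normalization.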
Now, we can define a mixture of K multivariate Bernoulli distributions as follows:
$$P(x) = \sum_{k=0}^{K-1} \pi_k P(x \mid \mu_k), \tag{2}$$
where $\{\mu_k, \pi_k, k = 0, \ldots, K-1\}$ are the parameters of the mixture and $P(x \mid \mu_k)$ is the probability
assigned to the point by each individual component in the model.
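As a rough sketch, and assuming the mixture takes the usual form of a πk-weighted sum of the component probabilities P(x|µk), the mixture probability of a point could be evaluated as follows. The names `mixture_prob`, `mus`, and `pis`, and the parameter values, are hypothetical.

```python
from math import prod


def bernoulli_prob(x, mu):
    """Probability of binary vector x under one multivariate Bernoulli component."""
    return prod(m if xj == 1 else 1.0 - m for xj, m in zip(x, mu))


def mixture_prob(x, mus, pis):
    """Assumed mixture form: sum over components of pi_k * P(x | mu_k)."""
    return sum(pi_k * bernoulli_prob(x, mu_k) for pi_k, mu_k in zip(pis, mus))


mus = [[0.5, 0.25], [0.75, 0.5]]  # hypothetical component means, K = 2, m = 2
pis = [0.5, 0.5]                  # hypothetical mixing weights, must sum to 1

p = mixture_prob((1, 0), mus, pis)
total = sum(mixture_prob(x, mus, pis) for x in [(0, 0), (0, 1), (1, 0), (1, 1)])
print(p, total)
```

As long as the mixing weights sum to one, the mixture probabilities over all binary vectors also sum to one.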
Note that the mean of each individual component distribution P(x|µk) is given by
$$E_k[x] = \mu_k, \tag{5}$$
and the covariance matrix of each component is given by
$$\mathrm{Cov}_k[x] = \Sigma_k = \mathrm{diag}(\mu_k \circ (1 - \mu_k)), \tag{6}$$
where ◦ denotes elementwise multiplication. In other words, the covariance matrix Σk for each component
is a diagonal matrix with diagonal entries given by Σk[j, j] = µk[j](1 − µk[j]). It is a diagonal matrix because
each dimension is independent.
Part 1 [8 points]
Derive expressions for the mean vector and the covariance matrix of the full mixture distribution defined in
Equation 2. That is, give expressions for the following:
E[x] =? Cov[x] =? (7)
Hint: use the fact that
$$\mathrm{Cov}[x] = E\left[(x - E[x])(x - E[x])^\top\right] = E[x x^\top] - E[x] E[x]^\top.$$
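The identity in the hint can be checked numerically on a small example. The sketch below treats a hypothetical list of four binary vectors as a uniform distribution and computes the covariance both ways; the helper names are made up for illustration and the check is not a substitute for the derivation.

```python
def mean_vec(xs):
    """Elementwise mean of a list of equal-length vectors."""
    n = len(xs)
    return [sum(x[j] for x in xs) / n for j in range(len(xs[0]))]


def outer(a, b):
    """Outer product a b^T as a nested list."""
    return [[ai * bj for bj in b] for ai in a]


def mean_mat(mats):
    """Elementwise mean of a list of equal-shape matrices."""
    n = len(mats)
    return [[sum(M[r][c] for M in mats) / n for c in range(len(mats[0][0]))]
            for r in range(len(mats[0]))]


xs = [(0, 0), (1, 0), (1, 1), (1, 1)]  # hypothetical sample, each point equally likely
mu = mean_vec(xs)

# First form: E[(x - E[x])(x - E[x])^T]
centered = [[xj - mj for xj, mj in zip(x, mu)] for x in xs]
cov_a = mean_mat([outer(c, c) for c in centered])

# Second form: E[x x^T] - E[x] E[x]^T
exx = mean_mat([outer(x, x) for x in xs])
mumu = outer(mu, mu)
cov_b = [[exx[r][c] - mumu[r][c] for c in range(len(mu))] for r in range(len(mu))]

print(cov_a)
print(cov_b)
```

Both forms give the same matrix for this sample, which is what the hint asserts in general.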
Part 2 [5 points]
Just as with a GMM, we can use the expectation maximization (EM) algorithm to learn the
parameters of a Bernoulli mixture model. Here, we will provide you with the formula for the expectation
step as well as the log-likelihood of the model. You must derive the formula for the maximization step.
Expectation step. In the expectation step of the Bernoulli mixture model, we compute scores r(x, k), which
tell us how likely it is that point x belongs to component k. These scores are computed as follows:
$$r(x, k) = \frac{\pi_k P(x \mid \mu_k)}{\sum_{j=0}^{K-1} \pi_j P(x \mid \mu_j)}, \tag{8}$$
where $P(x \mid \mu_k)$ is defined as in Equation 1.
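The expectation step can be sketched as below: weight each component's probability of the point by its mixing weight, then normalize over components. The function name `responsibilities` and the parameter values are assumptions made for this illustration only.

```python
from math import prod


def bernoulli_prob(x, mu):
    """Probability of binary vector x under one multivariate Bernoulli component."""
    return prod(m if xj == 1 else 1.0 - m for xj, m in zip(x, mu))


def responsibilities(x, mus, pis):
    """Scores r(x, k) for every component k, normalized to sum to 1."""
    weighted = [pi_k * bernoulli_prob(x, mu_k) for pi_k, mu_k in zip(pis, mus)]
    total = sum(weighted)
    return [w / total for w in weighted]


mus = [[0.9, 0.9], [0.1, 0.1]]  # hypothetical component means
pis = [0.5, 0.5]                # hypothetical mixing weights

r = responsibilities((1, 1), mus, pis)
print(r)  # the first component, whose mean is close to (1, 1), gets most of the weight
```

For this point the weighted terms are 0.405 and 0.005, so the first component receives a score of roughly 0.988.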
Log-likelihood.
(9)
Maximization step. You must find the formula for the µk parameters in the maximization step:
µk =? (10)
Question 2 [5 points]
Recall that the low dimensional codes in PCA are defined as
$$z_i = U^\top (x_i - \mu), \tag{11}$$
where U is a matrix containing the top-k eigenvectors of the covariance matrix and µ is the mean of the n data points,
$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i. \tag{12}$$
Recall that the reconstruction of a point xi using its code zi is given by
$$\tilde{x}_i = U z_i + \mu. \tag{13}$$
Show that
$$(\tilde{x}_i - x_i)^\top (\tilde{x}_i - \mu) = 0. \tag{14}$$
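A quick numerical sanity check can help build intuition before attempting the proof. The sketch below does not compute eigenvectors: it assumes a hand-picked matrix U with orthonormal columns (stored as a list of columns), together with a hypothetical mean and data point, and evaluates the inner product in Equation 14.

```python
from math import sqrt


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def encode(x, U_cols, mu):
    """z = U^T (x - mu): one coordinate per column of U."""
    d = [xj - mj for xj, mj in zip(x, mu)]
    return [dot(u, d) for u in U_cols]


def decode(z, U_cols, mu):
    """x_tilde = U z + mu."""
    return [mj + sum(zk * u[j] for zk, u in zip(z, U_cols))
            for j, mj in enumerate(mu)]


# Hypothetical orthonormal columns standing in for the top-2 eigenvectors (m = 3).
U_cols = [[1 / sqrt(2), 1 / sqrt(2), 0.0], [0.0, 0.0, 1.0]]
mu = [1.0, 2.0, 3.0]  # hypothetical data mean
x = [2.0, 0.0, 5.0]   # hypothetical data point

z = encode(x, U_cols, mu)
x_tilde = decode(z, U_cols, mu)

residual = [a - b for a, b in zip(x_tilde, x)]  # x_tilde - x
offset = [a - b for a, b in zip(x_tilde, mu)]   # x_tilde - mu
inner = dot(residual, offset)
print(inner)  # zero up to floating-point error
```

The check only relies on the columns of U being orthonormal, which is the property the proof is likely to hinge on.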
Question 3 [short answers; 2 points each]
Answer each question with 1-3 sentences for justification, potentially with equations/examples for support.
a) True or false: It is always possible to choose an initialization so that K-means converges in one iteration.
b) Suppose you are learning a decision tree for email spam classification. Your current sample of the training
data has the following distribution of labels:
[43+, 30−], (15)
i.e., the training sample has 43 examples that are spam and 30 that are not spam. Now, you are choosing
between two candidate tests.
Test 1 (T1) tests whether the number of words in the email is greater than 30 and would result in the
following splits:
• num words > 30 : [5+, 15−]
• num words ≤ 30: [38+, 15−]
Test 2 (T2) tests whether the email contains an external URL link and would result in the following splits:
• has link: [25+, 5−]
• not has link: [18+, 25−]
Which test should you use to split the data? I.e., which test provides a higher information gain?
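Information gain is the entropy of the parent label distribution minus the size-weighted average entropy of the splits. A minimal sketch is given below, using base-2 entropy and deliberately using toy counts rather than the splits above; the function names are assumptions for illustration.

```python
from math import log2


def entropy(pos, neg):
    """Binary entropy (in bits) of a label distribution [pos+, neg-]."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count > 0:  # treat 0 * log(0) as 0
            p = count / total
            h -= p * log2(p)
    return h


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child splits."""
    n = sum(parent)
    weighted = sum((pos + neg) / n * entropy(pos, neg) for pos, neg in children)
    return entropy(*parent) - weighted


# Toy examples: a split that separates the classes perfectly, and one that does nothing.
gain_perfect = information_gain((4, 4), [(4, 0), (0, 4)])
gain_useless = information_gain((4, 4), [(2, 2), (2, 2)])
print(gain_perfect, gain_useless)  # 1.0 0.0
```

The same function can be called with a parent count pair and a list of child count pairs for any candidate test.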
c) Which of the following statements is false?
1. If the covariance between two variables is zero, then their mutual information is also zero.
2. Adding more features is a useful strategy to combat underfitting.
3. Decision trees can learn non-linear decision boundaries.
4. The Gaussian mixture model contains more parameters than K-means.