STA314 Fall 2021 Homework 4
Homework 4
Deadline: Monday, Nov. 29, at 11:59pm.
Submission: You need to submit one PDF file through Quercus with your answers to Questions 1, 2,
and 3, as well as the code requested for Question 3. You can produce the PDF file however you like
(e.g. LaTeX, Microsoft Word, scanner), as long as it is readable.
Neatness Point: One point will be given for neatness. You will receive this point as long as we
don’t have a hard time reading your solutions or understanding the structure of your code.
Late Submission: 10% of the total possible marks will be deducted for each day late, up to a
maximum of 3 days. After that, no submissions will be accepted.
Computing: To install Python and required libraries, see the instructions on the course web page.
Homeworks are to be done alone or in pairs. See the Course Information handout
(https://www.cs.toronto.edu/~cmaddis/courses/sta314_f21/sta314_f21_syllabus.pdf) for detailed
policies.
1. [6pts] Categorical Distribution. In this problem you will consider a Bayesian approach to
modelling categorical outcomes. Let's consider fitting the categorical distribution, which is
a discrete distribution over $K$ outcomes, which we'll number 1 through $K$. The probability
of each category is explicitly represented with a parameter $\theta_k$. For it to be a valid probability
distribution, we clearly need $\theta_k \geq 0$ and $\sum_k \theta_k = 1$. We'll represent each observation $x$ as
a 1-of-$K$ encoding, i.e., a vector where one of the entries is 1 and the rest are 0. Under this
model, the probability of an observation can be written in the following form:
\[
p(x \mid \theta) = \prod_{k=1}^{K} \theta_k^{x_k}.
\]
Suppose you observe a dataset,
\[
\mathcal{D} = \{x^{(i)}\}_{i=1}^{N}.
\]
Denote the count for outcome $k$ as $N_k = \sum_{i=1}^{N} x_k^{(i)}$ and $N = \sum_{k=1}^{K} N_k$. Recall that each data
point is in the 1-of-$K$ encoding, i.e., $x_k^{(i)} = 1$ if the $i$th datapoint represents an outcome $k$
and $x_k^{(i)} = 0$ otherwise.
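For concreteness, here is a small, optional NumPy illustration of this notation (the numbers are made up and are not assignment data):

    import numpy as np

    # Toy illustration of the 1-of-K notation (K = 3, made-up numbers).
    theta = np.array([0.2, 0.5, 0.3])      # a valid probability vector
    X = np.array([[0, 1, 0],               # x^(1): outcome 2
                  [1, 0, 0],               # x^(2): outcome 1
                  [0, 1, 0]])              # x^(3): outcome 2

    # p(x | theta) = prod_k theta_k^{x_k}; the 1-of-K encoding picks out one theta_k.
    p_x1 = np.prod(theta ** X[0])          # equals theta[1] = 0.5

    N_k = X.sum(axis=0)                    # counts N_k = [1, 2, 0]
    N = N_k.sum()                          # N = 3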
(a) [2pts] First, derive $\hat{\theta}_k$, which is the maximum likelihood estimator (MLE), for the class
probabilities $\theta_k$. You may assume that $N_k > 0$ for this question. Derivations should be
rigorous.
Hint 1: We saw in lecture that MLE can be thought of as a 'ratio of counts' for the data,
so what should $\hat{\theta}_k$ be counting?
Hint 2: Similar to the binary case, write $p(x^{(i)} \mid \theta) = \prod_{k=1}^{K} \theta_k^{x_k^{(i)}}$. The challenge in
maximizing the log-likelihood is that this problem is a constrained maximization under
the constraint that $\sum_{k=1}^{K} \theta_k = 1$. To overcome this, we will use a generalization of the
idea from the coin flip example in lecture. We can turn the MLE maximization problem
into an easier constrained problem by setting $\theta_K = 1 - \sum_{k=1}^{K-1} \theta_k$. Note that this is a
maximization problem over the set $\{(\theta_k)_{k \neq K} \mid \theta_k > 0, \; \sum_{k \neq K} \theta_k < 1\}$. Although this is
still constrained, the log-likelihood is a concave function that goes to $-\infty$ as we approach
the boundary of the constraint set, so the function attains its maximum in the interior
of the region when $\partial \log p(\mathcal{D} \mid \theta) / \partial \theta_k = 0$.
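If you want to sanity-check your derivation, one optional approach (not a substitute for the requested derivation; the counts below are invented) is to maximize the log-likelihood numerically and compare the result against your closed-form answer:

    import numpy as np
    from scipy.optimize import minimize

    # Made-up counts N_k for K = 3 outcomes.
    counts = np.array([5.0, 2.0, 3.0])

    def neg_log_lik(z):
        # Softmax parametrization keeps theta on the probability simplex automatically.
        theta = np.exp(z) / np.exp(z).sum()
        return -np.sum(counts * np.log(theta))

    res = minimize(neg_log_lik, x0=np.zeros(3))
    theta_hat_numeric = np.exp(res.x) / np.exp(res.x).sum()
    print(theta_hat_numeric)   # compare against your derived closed-form MLE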
(b) [2pts] Now we will take a Bayesian approach. For the prior, we'll use the Dirichlet
distribution, which is defined over the set of probability vectors (i.e. vectors that are
nonnegative and whose entries sum to 1). Its PDF is as follows:
\[
p(\theta) \propto \theta_1^{a_1 - 1} \cdots \theta_K^{a_K - 1}.
\]
What is the probability distribution of the posterior distribution $p(\theta \mid \mathcal{D})$? Don't just
give the density of the posterior, say which family of distributions it belongs to.
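If the Dirichlet distribution is unfamiliar, the following optional snippet shows how to evaluate and sample its density (the hyperparameters are made up, and SciPy is used only for exploration; it is not required for the question):

    import numpy as np
    from scipy.stats import dirichlet

    a = np.array([2.0, 3.0, 4.0])                # made-up hyperparameters a_1, ..., a_K
    theta = np.array([0.2, 0.3, 0.5])            # a point on the probability simplex

    density = dirichlet.pdf(theta, a)            # proportional to theta_1^{a_1-1} * ... * theta_K^{a_K-1}
    samples = np.random.dirichlet(a, size=1000)  # each sampled row is a probability vector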
(c) [1pt] Still assuming the Dirichlet prior distribution, determine the MAP estimate of the
parameter vector $\theta$. For this question, you may assume each $a_k > 1$.
(d) [1pt] Now, suppose that your friend said that they had a hidden $(N+1)$st outcome,
$x^{(N+1)}$, drawn from the same distribution as the previous $N$ outcomes. Your friend does
not want to reveal the value of $x^{(N+1)}$ to you. So, you want to use your Bayesian model
to predict what you think $x^{(N+1)}$ is likely to be. The "proper" Bayesian predictor is the
so-called posterior predictive distribution:
\[
p(x^{(N+1)} \mid \mathcal{D}) = \int p(x^{(N+1)} \mid \theta) \, p(\theta \mid \mathcal{D}) \, d\theta.
\]
What is the probability that the $(N+1)$st outcome was $k$, i.e., the probability that $x_k^{(N+1)} = 1$,
under your posterior predictive distribution? Hint: A useful fact is that if
$\theta \sim \mathrm{Dirichlet}(a_1, \ldots, a_K)$, then
\[
\mathbb{E}[\theta_k] = \frac{a_k}{\sum_{k'} a_{k'}}.
\]
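As in part (a), you can optionally cross-check your analytic answer numerically. The sketch below uses self-normalized importance sampling with the prior as the proposal, so it does not presuppose the form of the posterior (all numbers are invented):

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])                 # prior hyperparameters
    N_k = np.array([4, 1, 5])                     # observed counts

    # Draw theta from the prior and weight each draw by the likelihood prod_k theta_k^{N_k}.
    thetas = np.random.dirichlet(a, size=200_000)
    log_w = (N_k * np.log(thetas)).sum(axis=1)    # log p(D | theta) up to a constant
    w = np.exp(log_w - log_w.max())               # stabilize before normalizing
    w /= w.sum()

    # Approximates the integral above for x_k^{(N+1)} = 1 by a weighted average of theta_k.
    print((w[:, None] * thetas).sum(axis=0))      # compare with your analytic answer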
Report your answers to the above questions.
2. [5pts] Gaussian Naïve Bayes. In this question, you will derive the maximum likelihood
estimates for Gaussian Naïve Bayes, which is just like the naïve Bayes model from lecture,
except that the features are continuous, and the conditional distribution of each feature given
the class is (univariate) Gaussian rather than Bernoulli. Start with the following generative
model for a random discrete class label $t \in \{1, 2, \ldots, K\}$ and a random real-valued vector of
$D$ features $x \in \mathbb{R}^D$:
\begin{align}
p(t = k) &= \alpha_k \tag{0.1} \\
p(x \mid t = k) &= \left( \prod_{d=1}^{D} 2\pi\sigma_d^2 \right)^{-1/2} \exp\left( -\sum_{d=1}^{D} \frac{1}{2\sigma_d^2} (x_d - \mu_{kd})^2 \right) \tag{0.2}
\end{align}
where $\alpha_k \geq 0$ is the prior on class $k$, $\sigma_d^2 > 0$ are the variances for each feature, which are
shared between all classes, and $\mu_{kd} \in \mathbb{R}$ is the mean of the feature $d$ conditioned on class $k$.
We write $\alpha$ to represent the vector with elements $\alpha_k$ and similarly $\sigma$ is the vector of variances.
The matrix of class means is written $\mu$ where the $k$th row of $\mu$ is the mean for class $k$.
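Before differentiating, it may help to see the model evaluated numerically. The following optional snippet computes the log of the class-conditional density in (0.2) for invented parameter values:

    import numpy as np

    # Invented parameters for illustration only; these are not estimates of anything.
    D = 4
    x = np.random.randn(D)
    mu_k = np.zeros(D)              # means of the D features given class k
    sigma2 = np.ones(D)             # shared per-feature variances sigma_d^2

    # log of (0.2): sum over features of the univariate Gaussian log-density.
    log_p_x_given_k = np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                             - (x - mu_k) ** 2 / (2 * sigma2))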
(a) [1pt] Use Bayes’ rule to derive an expression for p(t = k|x). Hint: Use the law of total
probability to derive an expression for p(x).
(b) [1pt] Write down an expression for the log-likelihood function (LL)
\[
\ell(\theta) = \log p(t^{(1)}, x^{(1)}, t^{(2)}, x^{(2)}, \cdots, t^{(N)}, x^{(N)}) \tag{0.3}
\]
of a particular dataset $\mathcal{D} = \{(t^{(1)}, x^{(1)}), (t^{(2)}, x^{(2)}), \cdots, (t^{(N)}, x^{(N)})\}$ with parameters
$\theta = \{\alpha, \mu, \sigma\}$ under this model. (Assume the data are i.i.d.)
(c) [3pts] Take partial derivatives of the log-likelihood with respect to each of the parameters
$\mu_{kd}$ and with respect to the shared variances $\sigma_d^2$. Report these partial derivatives and
find the maximum likelihood estimates for $\mu_{kd}$ and $\sigma_d^2$. You may assume that each class
appears at least once in the dataset, i.e. the number of times $N_k$ that class $k$ appears
in the dataset is $N_k > 0$.
Report your answers to the above questions.
3. [6pts] Gaussian Discriminant Analysis. For this question you will build classifiers to
label images of handwritten digits. Each image is 8 by 8 pixels and is represented as a vector
of dimension 64 by listing all the pixel values in raster scan order. The images are grayscale
and the pixel values are between 0 and 1. The labels $y$ are $0, 1, 2, \ldots, 9$, corresponding to
which character was written in the image. There are 700 training cases and 400 test cases for
each digit; they can be found in the .txt files provided.
A skeleton (q3.py) is provided for each question that you should use to structure your code.
Starter code to help you load the data is provided (data.py). Note: the get_digits_by_label
function in data.py returns the subset of digits that belong to a given class.
Using maximum likelihood, fit a set of 10 class-conditional Gaussians with a separate, full
covariance matrix for each class. Remember that the conditional multivariate Gaussian probability
density is given by
\[
p(x \mid t = k) = (2\pi)^{-D/2} \, |\Sigma_k|^{-1/2} \exp\!\left( -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right) \tag{0.4}
\]
where $\mu_k \in \mathbb{R}^D$ and $\Sigma_k \in \mathbb{R}^{D \times D}$ is positive-definite. You should take $p(t = k) = \frac{1}{10}$. You
will compute parameters $\mu_{kj}$ and $\Sigma_k$ for $k \in \{0, \ldots, 9\}$ and $j \in \{1, \ldots, 64\}$. You should implement the
covariance computation yourself (i.e. without the aid of np.cov). Hint: To ensure numerical
stability you may have to add a small multiple of the identity to each covariance matrix. For
this assignment you should add $0.01 I$ to each covariance matrix.
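As a rough guide (not the required structure of q3.py; the helper name class_mean_and_cov and the exact data layout are assumptions), the per-class mean and regularized covariance could be computed along these lines:

    import numpy as np

    # `digits_k` stands for an (N_k, 64) array of training images for one class,
    # e.g. the output of get_digits_by_label in data.py.
    def class_mean_and_cov(digits_k):
        mu = digits_k.mean(axis=0)                       # (64,)
        centered = digits_k - mu                         # (N_k, 64)
        cov = centered.T @ centered / digits_k.shape[0]  # MLE covariance, without np.cov
        cov += 0.01 * np.eye(digits_k.shape[1])          # numerical-stability term from the hint
        return mu, cov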
(a) [5pts] Complete the 5 functions that are not complete in the starter code ([1pt] each).
Include the function body for each completed function in your report.
(b) [1pt] Report the average conditional log-likelihood, i.e. $\frac{1}{N} \sum_{i=1}^{N} \log p(t^{(i)} \mid x^{(i)})$, of the
trained model on both the train and test set. Report the accuracy, on the train and test set,
of the classifier that selects the most likely posterior class for each data point using
the trained model.
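One possible way to organize part (b) is sketched below, under the assumption that gen_ll is an (N, 10) array of log p(x^(i) | t = k) values from your fitted model and labels holds the true digits; these names and the array layout are assumptions, not the starter code's interface:

    import numpy as np
    from scipy.special import logsumexp

    def avg_conditional_ll_and_accuracy(gen_ll, labels):
        log_joint = gen_ll + np.log(1.0 / 10)             # log p(x, t = k), uniform class prior
        cond_ll = log_joint - logsumexp(log_joint, axis=1, keepdims=True)  # log p(t = k | x)
        avg_cond_ll = cond_ll[np.arange(len(labels)), labels.astype(int)].mean()
        accuracy = np.mean(cond_ll.argmax(axis=1) == labels)
        return avg_cond_ll, accuracy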