辅导COMPSCI 671D、讲解Modeling留学生、辅导R程序设计、R编程语言调试解析R语言编程|辅导R语言编程

COMPSCI 671D Fall 2019
Homework 4
1 EM Algorithm for Topic Modeling (35 points)
In this question we will try to design an algorithm for discovering the abstract topics that occur in a
set of documents. In the problem, we have a set of M abstract documents in the universe, D, which
are mixtures of latent topics. We also have a dictionary of words, W, with size N. Intuitively, W is
the list of all unique words that occur at least once within D. Using these two sets we can denote
the number of times word wj occurs in document di as n(wj , di). It follows that we can represent
a document di as a vector of size N, each entry of which corresponds to how many times word wj
appears in it. The topics z are latent variables in a topic set Z sized K. Intuitively, if a document
is about a certain topic, it will include some particular words with higher probability. For example,
“music”, “show” and “film” appear frequently in documents about Arts while “school”, “student”
and “teachers” usually occur if the document is about Education. Meanwhile, a document can be
a mixture of several topics, e.g., a document can be about arts education for high school students.
Inspired by the intuition, a reasonable approach to model the generative process is as follows.
Pr(di
, zk, wj ) = Pr(di) Pr(zk|di) Pr(wj |zk)
which means the topics only depend on the documents and the words only depend on topics. For
computational convenience, let Pr(di) be a uniform distribution and Pr(zk|di) and Pr(wj |zk) be
Multinomial distributions. i.e.
We are interested in learning αik and βkj because they are substantively interesting: αik represents
the proportion of topic k that makes up document i, and βkj the probability of seeing word j under
topic k. Therefore, a single word wj in a document di
is generated in this way: (1) Choose a topic
zk from the topic distribution P r(zk|di), the probability of choosing topic k is αik; (2) Choose a
word from the vocabulary distribution p(wj |zk) in this topic, the probability is βkj . We want to
learn these parameters by maximizing the likelihood of the data using the EM algorithm.
1
a) For our set of N documents and W word of choices, write down the log-likelihood of the model
above, i.e. log(p(D = di
, W = wj |α, β)).
Remember that in the EM (Expectation-Maximization) algorithm, we can figure out the parameters
and the hidden variables by iteratively computing (1) the distribution of latent variables using
the old parameters and (2) find new parameters that maximize the likelihood using the old latent
variable distribution. So, you can do the following:
b) E-step: Derive the distribution of latent variable p(zk|wj , di
, αold, βold) by fixing the old parameters.
c) M-step: Find the α
new and β
new that optimize the log likelihood using p(zk|wj , di
, αold, βold)
as the distribution of z.
You would iterate these until convergence to get the final parameters in practice.
Just so you know, the model you have just been working with in this problem is a very famous
and very popular model. :)
2 Neural Networks (65 points)
In this problem you will get the chance to construct a neural network with architecture 784 − H −
H − 1, where H is the number of hidden nodes you choose. This neural network can be used for
binary classification, such 0, 1 digits classification for MNIST. The model can be represented as
f(x; θ) : R
784 → [0, 1]
where θ = {W1 ∈ R
784×H, b1 ∈ R
H,W2 ∈ R
H×H, b2 ∈ R
H, w3 ∈ R
H, b3 ∈ R, }. Explicitly,
The loss function for this model (or the negative log-likelihood) is the usual one:

a) Backpropogation
Derive the gradients ∇w3L(θ), ∇b3L(θ), ∇W2L(θ), ∇b2L(θ), ∇W1L(θ), ∇b1L(θ) by backpropagation.
Hints: The problem would be much easier if you define zl = WT
l hl−1 + bl
, and derive the
relationship between ∇zlL(θ) and ∇zl−1L(θ). Also, you can give the final result in a recursive way
as long as it is correct and consistent.
2
b) Weight Initialization
Neural Networks are normally trained by gradient descent based optimization algorithms, for
which we have to give initial values for our parameters (weight and bias). The bias parameters
are usually initialized as 0’s, while the weight parameters are usually initialized by independently
sampling from N(0, σ2).
(i) What are the potential problems if we choose σ to be either too small or too large?
(ii) One nice property we would like our weights to have is that by proper initialization, we
want the values before activation to be similar in distribution for each layer. It turns out we can
achieve this by choosing σ properly. Now suppose we use ReLU activation in our neural network
and the inputs are normalized. We want to choose the proper σl
for the l-th hidden layer such that
V ar(zl) = V ar(zl−1).
We assume the neurons in each layer are mutually independent and share the same distribution
and we use zl to represent the random variable that generates zl
, where zl = WT
l hl−1 + bl and
hl−1 = max(0, zl−1) and Wl ∈ R
nl−1×nl.
Find σt such that V ar(zl) = V ar(zl−1) and prove your result.
c) Implement your Neural Network
Use the preprocessed MNIST data provided in the attachment. MNIST is a dataset of images
of handwritten digits that also contain labels for the number they correspond to. We want to train
a neural network to learn to recognize these handwritten digits. Use gradient descent with the
gradients you have derived in part a) to train a neural network on the data. Report the accuracy
for the network on the train and test data. Since the sample size is large, we can approximate the
true gradient with stochastic gradient for computational convenience:
which means at each step, instead of calculating the gradients with all samples in the dataset,
we randomly pick a mini-batch of B samples and calculate the gradient using these samples. In
implementing the neural network, it would be beneficial to build and train the model in a general
way. That is, build a NN with d hidden layers such that we can easily modify d.
d) Vanishing Gradient
(i) In our current model, we have 2 hidden layers. What is the magnitude of ∇W1L(θ) (you
can use Frobenius norm or induced matrix norm)? Now try to modify the the number of hidden
layers d, from 1 to 10. What happens to the magnitude of ∇W1L(θ) when we increase d? Plot
the magnitude against d on log scale. In this question, you don’t need to train the whole neural
network, just initialize it with random weights, e.g. N (0, 1), and calculate the gradient.
(ii) Suppose that all weights are finite and the weight matrices W2,W3, ...,Wd are diagonalizable
matrices in which the absolute value of the eigenvalues are upper-bounded by 4 − . Prove
that
lim
d→∞
∇W1L(θ) = 0
Hints: Try to use the the relationship between ∇zlL(θ) and ∇zl−1L(θ) and try to apply a common
inequality.
(iii) The result above is known as the vanishing gradient problem in the feedforward neural
network. Will there still be such a problem if we use a tanh activation? What if we use a ReLU
activation?
Give an intuitive explanation of why vanishing gradient happens in light of the shapes of activation
functions and provide 2 ways of dealing with this problem.
e) Sigmoid Activation
One potential problem of using the sigmoid activation function for hidden layers is that it maps
from R to [0, 1], which means the output of the sigmoid function is always positive. What happens
to the signs of the gradients of hidden layer weights in this situation if we only have one example?
What if we are adding gradients up across a batch of data?