CSCI567: Machine Learning USC, Fall 2022
Homework 4
Instructor: Vatsal Sharan Due: November 16 by 2:00 pm PST
We would like to thank Gregory Valiant (Stanford) for kindly sharing some of the problems with us.
A reminder on collaboration policy and academic integrity: Our goal is to maintain an optimal learning envi-
ronment. You can discuss the homework problems at a high level with other groups, but you should not look at any
other group’s solutions. Trying to find solutions online or from any other sources for any homework or project is
prohibited, will result in zero grade and will be reported. To prevent any future plagiarism, uploading any material
from the course (your solutions, quizzes etc.) on the internet is prohibited, and any violations will also be reported.
Please be considerate, and help us help everyone get the best out of this course.
Please remember the Student Conduct Code (Section 11.00 of the USC Student Guidebook). General principles of
academic honesty include the concept of respect for the intellectual property of others, the expectation that individual
work will be submitted unless otherwise allowed by an instructor, and the obligations both to protect one’s own
academic work from misuse by others as well as to avoid using another’s work as one’s own. All students are expected
to understand and abide by these principles. Students will be referred to the Office of Student Judicial Affairs and
Community Standards for further review, should there be any suspicion of academic dishonesty.
Notes on notation:
Unless stated otherwise, scalars are denoted by small letter in normal font, vectors are denoted by small letters
in bold font and matrices are denoted by capital letters in bold font.
∥.∥ means L2-norm unless specified otherwise, i.e., ∥.∥ = ∥.∥2.
Instructions
We recommend that you use LaTeX to write up your homework solution. However, you can also scan handwritten
notes. The homework will need to be submitted on Gradescope.
1
Theory-based Questions
Problem 1: Decision Trees (12pts)
Consider a binary dataset with 400 examples, where half of them belong to class A and the rest belong to class B.
Next, consider two decision stumps (i.e. trees with depth 1) T1 and T2, each with two children. For T1, the left child
has 150 examples in class A and 50 examples in class B. For T2, the left child has 0 examples in class A and 100
examples in class B. (You can infer the number of examples in the right child using the total number of examples.)
1.1 (6 pts) In class, we discussed entropy and Gini impurity as measures of uncertainty at a leaf. Another possible
metric is the classification error at the leaf, assuming that the prediction at the leaf is the majority class among all
examples that belong to that leaf. For each leaf of T1 and T2, compute the entropy (base e), Gini impurity and classi-
fication error. You can either exactly express the final numbers in terms of fractions and logarithms, or round them to
two decimal places.
1.2 (6 pts) Compare the quality of T1 and T2 (that is, the two different splits of the root) based on conditional
entropy (base e), weighted Gini impurity and total classification error. Intuitively, which of T1 or T2 appears to be a
better split to you (there may not necessarily be one correct answer to this)? Based on your conditional entropy, Gini
impurity and classification error calculations, which of the metrics appear to be more suitable choices to decide which
variable to split on?
2
Problem 2: Gaussian Mixture Model and EM (10pts + 5 pts bonus)
In class, we applied EM to learn Gaussian Mixture Models (GMMs) and showed the M-Step without a proof. Now, it
is time that you prove it.
Consider a GMM with the following PDF of xi:
p(xi) =
k∑
j=1
πjN(xi | μj ,Σj) =
k∑
j=1
πj
(
√
2π)d|Σj | 12
exp
(
?1
2
(xi ? μj)TΣ?1j (xi ? μj)
)
where k is the number of Gaussian components, d is dimension of a data point xi and N is the usual Gaussian pdf
(|Σ| in the pdf denotes the determinant of matrix Σ). This GMM has k tuples of model parameters {(μj , Σj , πj)}kj=1,
where the parameters represent the mean vector, covariance matrix, and component weight of the j-th Gaussian com-
ponent. For simplicity, we further assume that all components are isotropic Gaussian, i.e., Σj = σ2j I .
2.1 (10 pts) Find the MLE of the expected complete log-likelihood. Equivalently, find the optimal solution to the
following optimization problem.
argmax
2.2 (Bonus) (5 pts) The posterior probability of z in GMM can be seen as a soft assignment to the clusters; in
contrast, k-means assigns each data point to one cluster at each iteration (hard assignment). Show that if we set
{σj , πj}kj=1 in a particular way in the GMM model, then the cluster assignments given by the GMM reduce in the
limit to the k-means cluster assignments (where the cluster centers {μj}kj=1 remain the same for both the models). To
verify your answer, you should derive p(zi = j|xi) for your choice.
3
Programming-based Questions
As in previous homeworks, you need to have your coding environment setup for this part. We use python3 (version ≥
3.7) in our programming-based questions. There are multiple ways you can install python3, for example:
You can use conda to configure a python3 environment for all programming assignments.
Alternatively, you can also use virtualenv to configure a python3 environment for all programming assignments
After you have a python3 environment, you will need to install the following python packages:
numpy
matplotlib (for plotting figures)
Note: You are not allowed to use other packages such as tensorflow, pytorch, keras, scikit-learn, scipy, etc. for
3.1-3.2. If you have other package requests, please ask first before using them. You are allowed to use any packages
for 3.3-3.5.
Download the files for the programming part from https://vatsalsharan.github.io/fall22/hw4.
zip.
Problem 3: Exploring Decision Trees and Random Forests (12pts)
In this question, we will observe the effect of different hyperparameters in training decision trees and random forests,
and also visualize what features the random forest model is using to make predictions. We will do this on a Colab
notebook HW4-Exploring-Random-Forests.ipynb.
Instructions to run the notebook: Upload this file to your USC Google Drive. Then, add Google Colab to your
Google App by New → More → Connect more apps → Type in Google Colab and install it, as in Fig. 1. After that,
you can run the notebook with your browser.
Figure 1: Screenshots showing how to install Colab.
We use a variant of the MNIST dataset for this problem. As mentioned in HW 3, the MNIST dataset contains
images of handwritten digits (0, 1, 2, ..., 9) and is generally used for the (10) digit classification problem. Here, we
work with a binary classfication task of predicting whether the digit is less than 5 or not. Fig. 2 shows some samples
images from the dataset with original and binary labels, respectively.
You should go through the code we provide and understand what’s happening, but for the purpose of answering
the questions you will mainly have to run the code and understand the results. All the models (decision trees, random
forests, etc.) are imported from the sklearn library. You should feel free to explore the role of other hyperparameters
and other methods (such as bagging, boosting), and go through the documentation to better understand these things,
but for this question, we will focus on decision trees and random forests.
4
Figure 2: Some samples images from the MNIST dataset with original (left) and binary (right) labels, respectively.
We will look at the effect of 5 parameters:
max_depth: The maximum depth of the decision tree.
n_estimators: The number of decision trees in the forest.
min_samples_leaf: The minimum number of samples required to be a leaf node.
max_samples: The number of samples to draw from the training set to train each decision tree in the forest.
max_features: The number of features to consider when looking for the best split for any node.
To observe the effect of a parameter, we look at train and test accuracy for different values of that parameter while
keeping the rest of the parameters fixed.
3.1 (2 pts) The first set of plots shows the train and test accuracy for different values of max_depth using a deci-
sion tree and a random forest, respectively. How do the accuracies and the generalization gap vary with max_depth
in these cases? For a particular value of max_depth, how does the generalization gap for the decision tree compare
with that of the random forest? Explain your observations.
3.2 (2 pts) The next plot shows the train and test accuracy for different values of n_estimators for a random
forest. Comment on your observations regarding the accuracies and the generalization gap. What value(s) (range or
approximate values are enough) would you prefer to use for this parameter? Give reasons why. Hint: Are there any
drawbacks to using very high values of n_estimators?
For the next three parts of this question, we will also look at the how the size of the training set influences the
accuracy trends for a given parameter. In each case, for the first plot, the training set consists of 4000 samples whereas
for the second plot, it contains 1000 samples.
3.3 (2 pts) The next set of plots shows the train and test accuracy for different values of min_samples_leaf
for a random forest. Taking into account the behaviours for different training set sizes, explain your observations for
very low and very high values of min_samples_leaf. What do you conclude from this trend? What could be the
reasons for such a behaviour?
3.4 (2 pts) The next set of plots shows the train and test accuracy for different values of max_samples for a
random forest. What do you observe for very low and very high values of max_samples? Would you prefer to use
5
low, intermediate or high values for this parameter in both the cases? Give reasons why. Hint: How does the size of
the training set influence the choice of this parameter?
3.5 (2 pts) The next set of plots shows the train and test accuracy for different values of max_features for a
random forest. Comment on your observations regarding the accuracies and the generalization gap for the two training
set sizes. What is the best range of values for this parameter in both the cases? Is it similar/different? Explain your
observations.
3.6 (2 pts) In class, we discussed how ensembles are usually not as interpretable as a single decision tree. While
this is true, there are still ways to explore which features are used the most by our ensemble. We will explore one such
technique in this part.
We visualize the feature importances of a random forest model trained for the binary classification task on the
MNIST dataset. Intuitively, features with higher importance are the pixels which are used more often in the decision
trees in the forest and which lead to better splits, i.e. which contribute more in improving the performance of the
model. For more details, you can see Section 18.6.1 of the PML book. The last plot shows the importances of different
pixels/portions of the image for a trained random forest model to make its predictions. What portions of the image does
the model seem to be focusing on? In other words, can you think of reasons why the pixels with higher importance
are indeed important for the prediction task (classifying whether the digit is smaller than 5 or not)? As is usually true
for such open-ended questions, there can be multiple correct answers here and we’re looking more for your reasoning
than a specific answer.
6
Problem 4: PCA for Learning Word Embeddings(30pts+10pts Bonus)
This question is about word embeddings. We saw word embeddings in class in lecture 7. As we discussed then, a
word embedding is simply a vector space representation of words which captures some of the semantic and syntactic
structures in the language—for example, words similar in meaning have representations which are close to each other
in the vector space. Word embeddings have taken natural language processing (NLP) by storm in the past decade or so,
and have become the backbone for numerous NLP tasks such as question answering and machine translation. There
are neural approaches to learning word embeddings, but in this question we will study a simple PCA-based scheme
which does a surprisingly good job at learning word embeddings.
We have created a word co-occurrence matrix M of the 10000 most frequent words from a Wikipedia corpus with
1.5 billion words. The co-occurrences were obtained by using a sliding window of length 5 across the Wikipedia
corpus. Entry Mij of the matrix denotes the number of times words i and j occur in the corpus within the same sliding
window. The file co_occur.csv contains the symmetric co-occurence matrix. dictionary.txt contains the
dictionary for interpreting this matrix, the ith row of the dictionary is the word corresponding to the ith row or column
of M. The dictionary is sorted according to the word frequencies. Therefore the first word in the dictionary—“the”
is the most common word in the corpus and the first row or column of M contains the co-occurrence counts of “the”
with every other word in the dictionary.
We provide some starter code in the file hw4-pca.py with some useful functions (e.g. to read files, generate
plots, etc.) which can be used directly, and instructions on how to complete the functions required for this problem.
4.1 (6 pts) First, read the co-occurrance matrix and the list of all words from the given files. Let the matrix M be
the n × d (n = d = 10000) matrix of word co-occurrences. As we discussed in class, a suitable normalization or
scaling is often very helpful to get PCA to work well. In light of the power law distribution of word occurrences, in
this case we will work with the normalized matrix M? such that each entry M?ij = log(1 +Mij). We regard the i-th
row of M as the datapoint for the i-th word.
We will use PCA to find the first 100 principal components of the data. Let M?c be the centered version of M?.
Use the PCA function from the sklearn library (refer https://scikit-learn.org/stable/modules/
generated/sklearn.decomposition.PCA.html) to get V, i.e. the set of first 100 principal components or
eigenvectors of M?c. Note that you can directly use the fit method with M?c as the input. Also get the eigenvalues of
the covariance matrix (check the documentation of the function) and plot all the 100 eigenvalues. Do the eigenvalues
seem to decay? What percent of the variance in the data is explained by the first 100 eigenvalues we calculated (note
that there are 10,000 eigenvalues in total)?
4.2 (6 pts) In this question, we find embeddings for all 10,000 words in the dictionary using the princpal compo-
nents V. Then, we will use word embeddings to find word(s) which are ‘similar’ to a given word.
To obtain the embeddings of the words, we will project the datapoint corresponding to each word (its co-occurences
with every other word) onto the 100 dimensional space spanned by the first 100 principal components, similar to the
general approach we laid down in class. Here are the two steps you should follow. Recall that we regard the ith row of
M?c as the datapoint for the ith word. Given the 100 PCs (V), we now project each datapoint (row of M?c) onto these
PCs. Denote this n×k (10000×100) matrix as P (you should write P as a matrix operation using M?c and V). Next,
to ensure that each PC gets equal importance, we normalize the vector of the projections of all the words onto the jth
PC (i.e. the jth column of P) to have unit norm, for all j = {1, ..., 100}. Denote this n × k (10000 × 100) matrix as
E. Finally, normalize the rows of E such that each row has unit ?2 norm, to get a new matrix E?.
We regard the ith row of E? as the embedding of the ith word. Next, we will define a similarity metric for the word
embeddings. We will use the cosine-similarity as the similarity metric. As all the vectors have unit ?2 norm, the cosine
similarity between two words i and j with embeddingswi andwj is equal to the inner product ?wi,wj?. Now that we
have a similarity metric defined, we can have some fun with these embeddings by querying for the closest word to any
word we like! Try finding the closest words to some common words, such as “learning”, “university”, “california”,
and comment on your observations.
7
4.3 (6 pts) We’ll now interpret the principal components/eigenvectors (columns of V). For any i, denote vi as the
eigenvector corresponding to the ith largest eigenvalue. Note that the entries of this vector correspond to the 10000
words in our dictionary, we’ll call these our 10000 variables. By sorting the entries of vi by absolute value, and
observing what the top 10 variables and their (signed) entries are, we can infer what information the ith eigenvector
roughly captures. Can you find 5 interesting eigenvectors, and point out what semantic or syntactic structures they
capture? Can you do this for all 100 eigenvectors? Hint: What do you observe about PCs or eigenvectors with small
eigenvalues?
4.4 (12 pts) In this question, we will explore a curious property of the word embeddings—that certain directions
in the embedded space correspond to specific syntactic or semantic concepts. Let w1 be the word embedding for
“woman” and w2 be the word embedding for “man”. Let w = w1 ?w2, and w? = w∥w∥ .
4.4.1 (6 pts) Project the embeddings of the following words onto w?: boy, girl, brother, sister, king, queen, he, she,
john, mary, wall, tree. Present a plot of projections of the embeddings of these words marked on a line. For example,
if the projection of the embedding for “girl” onto w? is 0.1, then you should label 0.1 on the line with “girl”. What do
you observe?
4.4.2 (6 pts) Present a similar plot of the projections of the embeddings of the following words onto w?: math,
history, nurse, doctor, pilot, teacher, engineer, science, arts, literature, bob, alice. What do you observe? Why do you
think this is the case? Do you see a potential problem with this? Remember that word embeddings are extensively used
across NLP. Suppose LinkedIn used such word embeddings to find suitable candidates for a job or to find candidates
who best match a search term or job description. What might be the result of this?
If you want to learn more about this, you might find it interesting to read the original paper1 which pointed out this
issue in word embeddings.
4.5 (Bonus) (10 pts) In this question, we will explore the property that that directions in the embedded space cor-
respond to semantic or syntactic concepts in more depth.
Because word embeddings capture semantic and syntactic concepts, they can be used to solve word analogy tasks.
For example, consider an analogy question— “man is to woman as king is to ”, where the goal is to fill in the
blank space. This can be solved by finding the word whose embedding is closest to wwoman ? wman + wking in cosine
similarity. You can do this by a nearest neighbor search across the entire dictionary—excluding the three words man,
woman, king which already appear in the analogy as they cannot be valid answers to the question. Here wi represents
the word embedding for the word i. Refer to Fig. 3 for why this makes sense in light of the fact that directions in the
embedded space correspond to semantic/syntactic concepts.
Figure 3: Figure denoting how word embeddings might encode gender, tense or country-capital relationships. Image
source.
1https://proceedings.neurips.cc/paper/2016/file/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf
8
We have provided a dataset, analogy_task.txt, which tests the ability of word embeddings to answer anal-
ogy questions, such as those represented in Fig. 3. Using the cosine similarity metric, find and report the accuracy
of the word embeddings you have constructed on the word analogy task. Look at the incorrect/correct answers of the
approach and comment on the results. For example, what types of analogy questions seem to be harder to answer
correctly for this approach?
Deliverables: Discussions for all the parts. Plots for parts 4.1, 4.4.1, 4.4.2. Code for all the parts as a separate
Python file hw4-pca.py.
9
Problem 5: Project Launchpad (16pts + 10pts bonus)
In this part, you will implement a simple linear regression model for the final project. At the same time, you will also
explore the use of the Pandas package, a very popular Python library for data analysis. Here is a tutorial for the Pandas
package. You should do this part together with your project team (each team will make one submission).
Early bird bonus We encourage you to start early on the project. The first 5 teams among all teams in CSCI-567
class at the HW4 submission deadline can get 10 bonus points. To get this bonus, at least one of the team members
will have to give a 5-minute presentation on their approach in the discussion session on November 17. We will consider
the final ranking on the public leaderboard as of 2pm PST on Nov 16 for this part of bonus points.
Setup We will complete this part of assignment on a jupyter notebook project-starter.ipynb. To load the
notebook into Kaggle, follow these steps:
(a) In the “Code” tag of the competition, click button “New Notebook”.
(b) Then, select “File/Import Notebook”, and import regression.ipynb into Kaggle.
Deliverables Notebook with completed code, output, and the discussion for 5.4.
10
Submission Instructions Each project team (not homework team) will make one submission for this problem (there
will be a separate Gradescope link to submit solutions for this problem). Please submit your notebook along with the
output for grading. If you are using Kaggle notebook, you can export the notebook by “File/Download Notebook”,
see figure below:
General Tip 1: before getting started, please read the dataset description carefully. Make sure you understand the
datatype of each entries and the meanings behind them.
General Tip 2: all of the tasks in this part can be done within three lines of code. You should spend most of your time
reading documentations to understand certain methods provided by the Pandas package, instead of implementing the
functionalities yourself.
Part I - Feature Engineering (12 pts)
One of the most important steps in developing a machine learning model is feature engineering. In this part, we will
walk you through the process of constructing a simple set of features for the final project.
5.1 Moving average of oil prices (6 pts) In this task, you are asked to compute the moving average of the oil prices
with window size 7.
5.1.1 (2 pts) In Section 1 of the notebook, after loading the oil price into data oil, compute data oil["ma oil"]
as the moving average of past seven recorded oil prices (note that the oil prices of some dates are missing, and we will
skip them in the computation).
5.1.2 (4 pts) After computing data oil["ma oil"], you should now deal with the missing dates in data oil["ma oil"].
A DataFrame calendar with contiguous dates from 2013-01-01 to 2017-08-31 is created for you. You are asked to
complete the following:
(a) Merge DataFrames calendar and data oil. The merged DataFrame should have contiguous dates from
2013-01-01 to 2017-08-3 as indexes, and the same columns as data oil, where missing values are denoted
by NaN.
(b) Fill in the missing values of data oil["ma oil"]. There are many ways to deal with missing values, such
as fill it with some default value, fill it with the last valid observation, or fill it with the next valid observation.
Here, you are asked to fill in each missing value by the last valid observation. Perform this operation in place
on data oil["ma oil"].
5.2 Workday feature (3 pts) Next, we will create an extra feature indicating whether each date is a workday. Here,
we only make use of the fact that Monday to Friday are usually workdays, and we ignore all the holidays information
in holidays events.csv (which is left for you to explore). In Section 2 of the notebook, your task is to create a
column calendar["wd"] such that a date between Monday and Friday has value True, and a date that is Saturday
or Sunday has value False.
11
5.3 Create trend feature (3 pts) Now we are ready to load all training and testing data into the notebook (Section
3). The last step before training our model is to create the trend feature: note that a linear regression model cannot
read the dates, and we need to map the dates into certain features. Here, you are asked to implement the simplest trend
feature: given a range of dates, map them into a range of integers starting from 1. You need to create a DataFrame X
as the trend feature, which is then extended with the moving average of oil prices and workday indicators to form the
final training features.
Part II - Train Your Model (4 pts)
We are now ready to train the model! Here, we will train a simple linear regression model, which only takes a few
seconds.
5.4 Display training error by the type of product (2 pts) Let’s show the training error to get a sense of how well
the model is fitting the training data. The model is making sales predictions for each store and type of product. You
are asked to show the training error within each type of product. Which family of product does the model has the most
difficulty in predicting the sales?
5.5 Make predictions on the test set (2 pts) Now you can use your trained model to make predictions on the test
set. To do this, you need to first create the features for the test set (denoted by X test) similar to the training set. Fill
in your steps for computing X test in the notebook.
5.6 Make a submission You are now ready to make a submission. Simply open the “Competitions” tag on the right,
and click the submit button to make the submission. You should get a score around 0.45.