
DATA7703 Assignment 4


2021 Semester 2, due 23:59 15 Oct
Instructions. Please read the instructions carefully — not following them may result in a penalty.
(a) Submit your solutions as a single PDF file on Blackboard. Go to Assessment, Assignment 3 to submit. If you don’t know how to convert your file to a PDF, please search for a guide online. You can submit as many times as you want before the deadline. The last submission will be graded.
(b) Write down your name and student number on the first page of your solution report, and write down the question numbers for your solutions. For programming questions, you are welcome to submit your code files or output files in a separate zip file, but you must include both your code and relevant output in your submitted PDF file. Excessive code output may be penalised.
(c) Follow integrity rules, and provide citations as needed. You can discuss with your classmates, but you are required to write your solutions independently, and specify who you have discussed with in your solution. If you do not know how to solve a problem, you can get 15% of the mark by writing down “I don’t know”.
You are encouraged to keep your solutions concise — these questions require thought, not long answers.
1. (30 marks) In lecture, we discussed the architecture and representational power of neural nets, as well as some training objectives and training algorithms. These are important design decisions when building your own neural nets. In this question, you will explore some questions related to these decisions.
(a) (10 marks) Consider a regression MLP that uses identity activation functions for all
neurons. A data scientist trains the neural network to minimize the MSE, and he
observes a much smaller training set MSE for the MLP as compared to that for OLS.
Is this possible? Justify your answer.
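For concreteness, the sketch below (a PyTorch illustration with arbitrary layer sizes, not part of the assignment code) shows what a regression MLP with identity activations looks like: the layers are stacked with no nonlinearity in between.

    import torch
    import torch.nn as nn

    # Illustrative sketch only: a regression MLP whose activations are the
    # identity, i.e. no nonlinearity (no nn.ReLU, nn.Sigmoid, ...) between layers.
    # The layer sizes (10 inputs, 32 hidden units) are arbitrary choices.
    mlp = nn.Sequential(
        nn.Linear(10, 32),   # hidden layer with identity activation
        nn.Linear(32, 1),    # output layer
    )

    x = torch.randn(5, 10)
    print(mlp(x).shape)      # torch.Size([5, 1])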
(b) (5 marks) In lecture, we discussed training a neural net 𝑓w(x) for regression by
minimizing the MSE loss
L(w) = (1/n) ∑_{i=1}^n (f_w(x_i) − y_i)^2,
where (x1, 𝑦1), . . . ,(x𝑛, 𝑦𝑛) are the training examples. However, a large neural net can
easily fit irregularities in the training set, leading to poor generalization performance.
One way to improve generalization performance is to minimize a regularized loss
function
L_λ(w) = L(w) + (1/2)λ‖w‖^2,
where λ > 0 is a user-specified constant. The regularizer (1/2)λ‖w‖^2 assigns a larger
penalty to w with larger norms, thus reducing the network’s flexibility to fit
irregularities in the training set. We can also interpret the regularizer as a way to
encode our preference for simpler models.
Show that a gradient descent step on 𝐿𝜆(w) is equivalent to first multiplying w by a
constant, and then moving along the negative gradient direction of the original MSE
loss 𝐿(w).
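As a concrete illustration (a minimal sketch with made-up data, not the assignment's code), the snippet below computes L_λ(w) for a simple linear model in PyTorch and takes a single gradient-descent step on it; in torch.optim.SGD the same regularizer corresponds to the weight_decay argument.

    import torch

    # Minimal sketch with synthetic data: one gradient-descent step on the
    # regularized loss L_lambda(w) = L(w) + 0.5 * lam * ||w||^2.
    torch.manual_seed(0)
    X, y = torch.randn(50, 3), torch.randn(50)
    w = torch.zeros(3, requires_grad=True)
    lam, eta = 0.1, 0.01                          # regularization constant and learning rate

    mse = ((X @ w - y) ** 2).mean()               # L(w)
    loss = mse + 0.5 * lam * (w ** 2).sum()       # L_lambda(w)
    loss.backward()
    with torch.no_grad():
        w -= eta * w.grad                         # one step of gradient descent on L_lambda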
(c) (10 marks) In lecture, we described how we can convert an output vector (o_1, . . . , o_C) ∈ R^C
for a classification network to a probability distribution. The conversion operation is known as
the softmax function, that is,
softmax(o_1, . . . , o_C) = (e^{o_1}/Z, . . . , e^{o_C}/Z),
where Z = e^{o_1} + . . . + e^{o_C} is the normalization constant.
We can generalize the softmax function to the scaled softmax function
softmax_β(o_1, . . . , o_C) = (e^{βo_1}/Z_β, . . . , e^{βo_C}/Z_β),
where β > 0 is a user-specified constant, and Z_β = e^{βo_1} + . . . + e^{βo_C} is the
normalization constant.
Consider applying the scaled softmax to an output vector in which the elements are
not all identical. Show that when 𝛽 increases, the probability of the class with the
largest output value increases.
Intuitively, the above result implies that when training to maximize the likelihood,
we can give the classifier a larger incentive to be correct using a larger 𝛽.
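A quick numeric illustration (with an arbitrary output vector) of the effect being asked about: as β grows, the scaled softmax puts more and more probability on the largest output.

    import numpy as np

    # Illustrative sketch: scaled softmax of a fixed output vector for growing beta.
    def scaled_softmax(o, beta):
        z = np.exp(beta * (o - o.max()))   # subtracting the max does not change the result
        return z / z.sum()

    o = np.array([1.0, 2.0, 0.5])          # arbitrary, non-identical outputs
    for beta in [0.5, 1.0, 2.0, 5.0]:
        p = scaled_softmax(o, beta)
        print(beta, p.round(3))            # probability of the largest-output class keeps growing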
(d) (5 marks) We consider a binary classification problem with two numeric features
𝑥1 and 𝑥2, and a label 𝑦 ∈ {−1, +1}. Given a dataset consisting of four training
examples (0, 0), (0, 1), (1, 0), (1, 1), with labels −1, 1, 1, −1 respectively, can you find
a perceptron to correctly classify all these examples? That is, can you find w_1, w_2, b ∈ R
such that
f(x_1, x_2) = +1 if w_1 x_1 + w_2 x_2 + b > 0, and −1 otherwise,
correctly classifies all four examples? Justify your answer.
2. (35 marks) In lecture, we demonstrated how to implement a neural net in PyTorch using
the OLS model as an example. In this question, you will implement another basic neural
net, a multi-class logistic regression model, and explore how to train a good model.
Recall that in a multi-class logistic regression model, the probability that an input x ∈ R^{d+1}
belongs to class y ∈ {1, . . . , C} is given by
p(y | x, w_{1:C}) = e^{o_y} / ∑_{y′} e^{o_{y′}},
where o_y = x^⊤ w_y, and w_{1:C} = (w_1, . . . , w_C) are the parameters of the logistic regression
model. Note that as in lecture, here x has a dummy feature with value 1 in addition to
𝑑 given features.
We can train a logistic regression model by minimizing the log-loss
min_{w_{1:C}} −(1/n) ∑_{i=1}^n ln p(y_i | x_i, w_{1:C}),
or equivalently, maximizing the log-likelihood.
(a) (0 marks) You are given a file logistic_regression.py. The file contains code to load
the covtype dataset, create a train-test split, and train a LogisticRegression model
from sklearn. Run the code and play with it to understand how it works.
(b) (5 marks) Implement the predict and predict_proba functions in logistic_regression.py.
A naive way to calculate the class distribution is to compute the e^{o_i} values first, then
normalize them to a distribution. This approach often suffers from numerical overflow
in practice. To avoid this problem in your implementation, you can first subtract the
maximum from all the o_i values, and then apply softmax to turn them into a
probability distribution.
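A minimal sketch of the max-subtraction trick in NumPy (this is not the provided logistic_regression.py code; it assumes W is a (d+1) × C weight matrix and X already contains the dummy feature):

    import numpy as np

    def predict_proba(X, W):
        # X: n x (d+1) with dummy feature, W: (d+1) x C  (assumed shapes)
        O = X @ W                                   # n x C matrix of outputs o_y
        O = O - O.max(axis=1, keepdims=True)        # subtract each row's maximum
        E = np.exp(O)                               # all exponents are now <= 0, no overflow
        return E / E.sum(axis=1, keepdims=True)     # softmax over classes

    def predict(X, W):
        return predict_proba(X, W).argmax(axis=1)   # index of the most probable class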
(c) (10 marks) Implement the fit function in logistic_regression.py to support training a logistic regression model by using gradient descent to minimize the log-loss.
Your implementation should allow users to specify the learning rate and number of
iterations as written in the documentation for the fit function.
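One possible shape for such a training loop, sketched with PyTorch autograd (the actual fit signature is the one documented in logistic_regression.py; the interface below is an assumption for illustration):

    import torch
    import torch.nn.functional as F

    def fit(X, y, n_classes, lr=0.1, n_iter=1000):
        # X: n x (d+1) with dummy feature, y: integer labels 0..C-1 (assumed encoding)
        X = torch.as_tensor(X, dtype=torch.float32)
        y = torch.as_tensor(y, dtype=torch.long)
        W = torch.zeros(X.shape[1], n_classes, requires_grad=True)
        for _ in range(n_iter):
            loss = F.cross_entropy(X @ W, y)   # log-loss; cross_entropy applies the softmax
            loss.backward()
            with torch.no_grad():
                W -= lr * W.grad               # plain gradient-descent step
                W.grad.zero_()
        return W.detach()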
(d) (5 marks) Tune the learning rate and number of learning iterations to minimize the
log-loss as much as possible. Describe how you do this. Report the training log-loss
of the model that you obtain, and the training and test accuracies.
(e) (5 marks) Gradient descent can converge very slowly in practice, and various more
efficient variants have been proposed. One variant is the momentum method.
Recall that when minimizing a general loss function 𝐿(w), gradient descent with a
constant learning rate η updates the t-th iterate w_t to w_{t+1} = w_t − η g_t,
where g_t = ∇L(w_t). Gradient descent with momentum further moves along the previous
direction w_t − w_{t−1}, and has an update rule of the form
w_{t+1} = w_t − η g_t + β(w_t − w_{t−1}),   (1)
where 𝛽 > 0 is a constant. Intuitively, the momentum method tries to keep the
‘momentum’ by moving along a previous direction.
Show that if w_1 = 0, w_2 = w_1 − η g_1, and we apply the momentum method to obtain
w_t for t ≥ 3, then for any t ≥ 2, we have
w_{t+1} = w_t − η(g_t + β g_{t−1} + . . . + β^{t−1} g_1).   (2)
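To make update rule (1) concrete, here is a toy sketch that applies it to a simple quadratic loss L(w) = 0.5‖w − w*‖^2, whose gradient is g_t = w_t − w*; the constants η and β are arbitrary illustrative choices.

    import numpy as np

    # Toy sketch of update rule (1) on L(w) = 0.5 * ||w - w_star||^2, so g_t = w_t - w_star.
    w_star = np.array([1.0, -2.0])
    eta, beta = 0.1, 0.9

    w_prev = np.zeros(2)                                       # w_1 = 0
    w = w_prev - eta * (w_prev - w_star)                       # w_2 = w_1 - eta * g_1
    for _ in range(200):
        g = w - w_star                                         # gradient at the current iterate
        w, w_prev = w - eta * g + beta * (w - w_prev), w       # update rule (1)
    print(w)                                                   # close to w_star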
(f) (5 marks) Modify your fit function to further support the momentum trick and tune
the momentum constant 𝛽 to try to make the log-loss as small as possible. Describe
how you do this. Report the training log-loss of the model that you obtain, and the
training and test accuracies.
You may find the torch.optim.SGD optimizer helpful.
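A sketch of what using torch.optim.SGD with momentum could look like inside the fit loop (toy tensors here; in the assignment X, y and W come from your own fit implementation). With a constant learning rate, PyTorch's velocity-buffer formulation of momentum matches update rule (1) with β equal to the momentum argument.

    import torch
    import torch.nn.functional as F

    # Toy sketch: the same log-loss minimization, but with torch.optim.SGD + momentum.
    torch.manual_seed(0)
    X = torch.randn(100, 4)
    y = torch.randint(0, 3, (100,))
    W = torch.zeros(4, 3, requires_grad=True)

    optimizer = torch.optim.SGD([W], lr=0.1, momentum=0.9)   # momentum constant beta = 0.9
    for _ in range(500):
        optimizer.zero_grad()
        loss = F.cross_entropy(X @ W, y)
        loss.backward()
        optimizer.step()           # gradient step with the momentum buffer handled for you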
(g) (5 marks) Another useful trick to speed up convergence is to normalize all features
to have mean 0 and unit variance first. Implement this, and repeat (d) to try to use
gradient descent to find a good model. Comment on the effectiveness of this trick.
You may find sklearn.preprocessing.StandardScaler helpful.
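A short sketch of this normalization step (toy arrays here; in the assignment X_train and X_test come from the covtype train-test split in logistic_regression.py):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Toy sketch: fit the scaler on the training set only, then reuse its statistics.
    X_train = np.random.rand(100, 5) * 1000   # features on very different scales
    X_test = np.random.rand(20, 5) * 1000

    scaler = StandardScaler().fit(X_train)    # learns per-feature mean and variance
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)  # apply the training-set statistics to the test set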
3. (35 marks) In lecture, we discussed the Theil-Sen estimator and RANSAC as two
subsampling-based regression algorithms for dealing with outliers. In this question, we study
the robustness of another subsampling-based algorithm, which we call ENOLS (ENsemble of
OLS). For training, ENOLS builds an ensemble of OLS models by sampling 𝑁 random
subsets (that is, subsets with no repeated examples) of 𝑚 training examples, and training an OLS
model on each of the subsets. Here 𝑁 and 𝑚 are user-specified hyper-parameters. For
prediction, ENOLS has two different strategies: (a) predict the mean of all the OLS
models’ outputs; (b) predict the median of all the OLS models’ outputs.
You will implement ENOLS and study its performance below.
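The sketch below illustrates the ENOLS idea described above (the ENOLS class in enols.py has its own interface and defaults; the function names, argument names and default values here are illustrative assumptions only):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def enols_fit(X, y, n_estimators=100, sample_size=10, random_state=0):
        # Train n_estimators OLS models, each on a random subset of sample_size
        # distinct training examples (sampling without replacement). X, y are numpy arrays.
        rng = np.random.default_rng(random_state)
        models = []
        for _ in range(n_estimators):
            idx = rng.choice(len(X), size=sample_size, replace=False)
            models.append(LinearRegression().fit(X[idx], y[idx]))
        return models

    def enols_predict(models, X, method="mean"):
        # Aggregate the individual OLS predictions by their mean or median.
        preds = np.stack([m.predict(X) for m in models])    # n_estimators x n_samples
        return preds.mean(axis=0) if method == "mean" else np.median(preds, axis=0)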
(a) (0 marks) You are given a file enols.py. In the main function, the code loads the
Boston house price dataset, generates a train-test split, generates a training set with
outliers, and compares the test MSEs for an OLS model trained on the original
training set and an OLS model trained on the training set with outliers. Run the code
and play with it to understand how it works.
(b) (5 marks) Implement ENOLS’s training algorithm as described above in the fit
function of the ENOLS class.
(c) (5 marks) Implement ENOLS’s two prediction strategies as described above in the
predict function of the ENOLS class.
(d) (5 marks) For each proportion 𝑝 from 0, 0.01, 0.02, . . . to 0.5, use the corrupt function
to generate a corrupted training set by making a proportion 𝑝 of the examples
outliers, then train an OLS model, a Theil-Sen estimator, and an ENOLS model (using
the default hyper-parameters), and measure their test MSEs. Plot the three models’
test MSEs against 𝑝, and compare their performance.
For reproducibility, set the random state of the corrupt function and fit function to
a number of your choice.
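A self-contained toy version of this experiment loop is sketched below. It uses synthetic data, a made-up corrupt_targets helper, and only OLS and Theil-Sen, because corrupt, ENOLS and the house price data are defined in enols.py; in your solution, use those instead and add the ENOLS model to the list of models being compared.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression, TheilSenRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    def corrupt_targets(y, p, random_state=0):
        # Made-up stand-in for the corrupt function in enols.py: shift a proportion
        # p of the targets far away from their true values.
        rng = np.random.default_rng(random_state)
        y = y.copy()
        idx = rng.choice(len(y), size=int(p * len(y)), replace=False)
        y[idx] += 100 * y.std()
        return y

    X, y = make_regression(n_samples=300, n_features=5, noise=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    ps = np.arange(0, 0.51, 0.01)
    results = {"OLS": [], "Theil-Sen": []}        # add ENOLS here in your solution
    for p in ps:
        yc = corrupt_targets(y_train, p, random_state=42)
        for name, model in [("OLS", LinearRegression()),
                            ("Theil-Sen", TheilSenRegressor(random_state=42))]:
            model.fit(X_train, yc)
            results[name].append(mean_squared_error(y_test, model.predict(X_test)))

    for name, mses in results.items():
        plt.plot(ps, mses, label=name)
    plt.xlabel("outlier proportion p")
    plt.ylabel("test MSE")
    plt.legend()
    plt.show()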
(e) (5 marks) Repeat (d), but set method="median" when using ENOLS for prediction.
How does the performance of ENOLS change as compared to (d)? Explain why.
(f) (5 marks) Repeat (e), but set n_estimators=500 when constructing an ENOLS model.
How does the performance of ENOLS change as compared to (e)? Explain why.
(g) (5 marks) Repeat (f), but set sample_size=42. How does the performance of ENOLS
change as compared to (f)? Explain why.
(h) (5 marks) The subset size 𝑚 has an important effect on the performance of ENOLS.
Here we consider the strategy of setting 𝑚 = 𝑛𝑞, where 0 < 𝑞 ≤ 1 is a fixed constant.
Suppose we are working on a problem with an outlier ratio of 𝑝 > 0, that is, in a
training set of 𝑛 examples, there are 𝑛𝑝 outliers. Show that the expected proportion
of subsets that do not contain an outlier can be arbitrarily close to 0 for large 𝑛.
In other words, sampling a fixed proportion 𝑞 of the training set is a bad choice in
this case.
For simplicity, you can assume that 𝑛𝑞 and 𝑛𝑝 are integers in your analysis.
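Before writing the proof, a quick numeric check may help build intuition (the values of p and q below are arbitrary illustrative choices): for a subset of m = nq examples drawn without replacement, the fraction of outlier-free subsets shrinks rapidly as n grows.

    # Numeric illustration for part (h): probability that a subset of m = n*q examples,
    # drawn without replacement, contains none of the n*p outliers.
    p, q = 0.1, 0.2
    for n in [50, 100, 500, 1000, 5000]:
        m, k = int(n * q), int(n * p)        # subset size and number of outliers
        prob_clean = 1.0
        for i in range(m):                   # each successive draw must avoid the outliers
            prob_clean *= (n - k - i) / (n - i)
        print(n, prob_clean)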
 