
DATA7703 Assignment 4


2021 Semester 2, due 23:59 15 Oct
Instructions. Please read the instructions carefully — not following them may result in a penalty.
(a) Submit your solutions as a single PDF file on Blackboard. Go to Assessment, Assignment 3 to submit. If you don’t know how to convert your file to a PDF, please search for a guide online. You can submit as many times as you want before the deadline. The last submission will be graded.
(b) Write down your name and student number on the first page of your solution report, and write down the question numbers for your solutions. For programming questions, you are welcome to submit your code files or output files in a separate zip file, but you must include both your code and relevant output in your submitted PDF file. Excessive code output may be penalised.
(c) Follow integrity rules, and provide citations as needed. You can discuss with your classmates, but you are required to write your solutions independently, and specify who you have discussed with in your solution. If you do not know how to solve a problem, you can get 15% of the mark by writing down “I don’t know”.
You are encouraged to keep your solutions concise — these questions require thought, not long answers.
1. (30 marks) In lecture, we discussed the architecture and representational power of neural nets, as well as some training objectives and training algorithms. These are important design decisions when building your own neural nets. In this question, you will explore some questions related to these decisions.
(a) (10 marks) Consider a regression MLP that uses identity activation functions for all
neurons. A data scientist trains the neural network to minimize the MSE, and he
observes a much smaller training set MSE for the MLP as compared to that for OLS.
Is this possible? Justify your answer.
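For concreteness, the sketch below (a PyTorch illustration with arbitrary layer sizes, not part of the assignment code) shows what a regression MLP with identity activations looks like: the layers are stacked with no nonlinearity in between.

    import torch
    import torch.nn as nn

    # Illustrative sketch only: a regression MLP whose activations are the
    # identity, i.e. no nonlinearity (no nn.ReLU, nn.Sigmoid, ...) between layers.
    # The layer sizes (10 inputs, 32 hidden units) are arbitrary choices.
    mlp = nn.Sequential(
        nn.Linear(10, 32),   # hidden layer with identity activation
        nn.Linear(32, 1),    # output layer
    )

    x = torch.randn(5, 10)
    print(mlp(x).shape)      # torch.Size([5, 1])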
(b) (5 marks) In lecture, we discussed training a neural net 𝑓w(x) for regression by
minimizing the MSE loss
L(w) = (1/n) ∑_{i=1}^n (f_w(x_i) − y_i)^2,
where (x1, 𝑦1), . . . ,(x𝑛, 𝑦𝑛) are the training examples. However, a large neural net can
easily fit irregularities in the training set, leading to poor generalization performance.
One way to improve generalization performance is to minimize a regularized loss
function
L_λ(w) = L(w) + (1/2)λ‖w‖^2,
where λ > 0 is a user-specified constant. The regularizer (1/2)λ‖w‖^2 assigns a larger
penalty to w with larger norms, thus reducing the network’s flexibility to fit
irregularities in the training set. We can also interpret the regularizer as a way to
encode our preference for simpler models.
Show that a gradient descent step on 𝐿𝜆(w) is equivalent to first multiplying w by a
constant, and then moving along the negative gradient direction of the original MSE
loss 𝐿(w).
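As a concrete illustration (a minimal sketch with made-up data, not the assignment's code), the snippet below computes L_λ(w) for a simple linear model in PyTorch and takes a single gradient-descent step on it; in torch.optim.SGD the same regularizer corresponds to the weight_decay argument.

    import torch

    # Minimal sketch with synthetic data: one gradient-descent step on the
    # regularized loss L_lambda(w) = L(w) + 0.5 * lam * ||w||^2.
    torch.manual_seed(0)
    X, y = torch.randn(50, 3), torch.randn(50)
    w = torch.zeros(3, requires_grad=True)
    lam, eta = 0.1, 0.01                          # regularization constant and learning rate

    mse = ((X @ w - y) ** 2).mean()               # L(w)
    loss = mse + 0.5 * lam * (w ** 2).sum()       # L_lambda(w)
    loss.backward()
    with torch.no_grad():
        w -= eta * w.grad                         # one step of gradient descent on L_lambda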
(c) (10 marks) In lecture, we described how we can convert an output vector (o_1, . . . , o_C) ∈ R^C
for a classification network to a probability distribution. The conversion operation is known as
the softmax function, that is,
softmax(o_1, . . . , o_C) = (e^{o_1}/Z, . . . , e^{o_C}/Z),
where Z = e^{o_1} + . . . + e^{o_C} is the normalization constant.
We can generalize the softmax function to the scaled softmax function
softmax_β(o_1, . . . , o_C) = (e^{βo_1}/Z_β, . . . , e^{βo_C}/Z_β),
where β > 0 is a user-specified constant, and Z_β = e^{βo_1} + . . . + e^{βo_C} is the
normalization constant.
Consider applying the scaled softmax to an output vector in which the elements are
not all identical. Show that when 𝛽 increases, the probability of the class with the
largest output value increases.
Intuitively, the above result implies that when training to maximize the likelihood,
we can give the classifier a larger incentive to be correct using a larger 𝛽.
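A quick numeric illustration (with an arbitrary output vector) of the effect being asked about: as β grows, the scaled softmax puts more and more probability on the largest output.

    import numpy as np

    # Illustrative sketch: scaled softmax of a fixed output vector for growing beta.
    def scaled_softmax(o, beta):
        z = np.exp(beta * (o - o.max()))   # subtracting the max does not change the result
        return z / z.sum()

    o = np.array([1.0, 2.0, 0.5])          # arbitrary, non-identical outputs
    for beta in [0.5, 1.0, 2.0, 5.0]:
        p = scaled_softmax(o, beta)
        print(beta, p.round(3))            # probability of the largest-output class keeps growing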
(d) (5 marks) We consider a binary classification problem with two numeric features
𝑥1 and 𝑥2, and a label 𝑦 ∈ {−1, +1}. Given a dataset consisting of four training
examples (0, 0), (0, 1), (1, 0), (1, 1), with labels −1, 1, 1, −1 respectively, can you find
a perceptron to correctly classify all these examples? That is, can you find w_1, w_2, b ∈ R
such that
f(x_1, x_2) = +1 if w_1 x_1 + w_2 x_2 + b > 0, and −1 otherwise,
correctly classifies all four examples? Justify your answer.
2. (35 marks) In lecture, we demonstrated how to implement a neural net in PyTorch using
the OLS model as an example. In this question, you will implement another basic neural
net, a multi-class logistic regression model, and explore how to train a good model.
Recall that in a multi-class logistic regression model, the probability that an input x ∈ R^{d+1}
belongs to class y ∈ {1, . . . , C} is given by
p(y | x, w_{1:C}) = e^{o_y} / ∑_{y′} e^{o_{y′}},
where o_y = x^⊤ w_y, and w_{1:C} = (w_1, . . . , w_C) are the parameters of the logistic regression
model. Note that as in lecture, here x has a dummy feature with value 1 in addition to
𝑑 given features.
We can train a logistic regression model by minimizing the log-loss
min_{w_{1:C}} −(1/n) ∑_{i=1}^n ln p(y_i | x_i, w_{1:C}),
or equivalently, maximizing the log-likelihood.
(a) (0 marks) You are given a file logistic_regression.py. The file contains code to load
the covtype dataset, create a train-test split, and train a LogisticRegression model
from sklearn. Run the code and play with it to understand how it works.
(b) (5 marks) Implement the predict and predict_proba functions in logistic_regression.py.
A naive way to calculate the class distribution is to compute the e^{o_i} values first, then
normalize them to a distribution. This approach often suffers from numerical overflow
in practice. To avoid this problem in your implementation, you can first subtract the
maximum from all the o_i values, and then apply softmax to turn them into a
probability distribution.
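A minimal sketch of the max-subtraction trick in NumPy (this is not the provided logistic_regression.py code; it assumes W is a (d+1) × C weight matrix and X already contains the dummy feature):

    import numpy as np

    def predict_proba(X, W):
        # X: n x (d+1) with dummy feature, W: (d+1) x C  (assumed shapes)
        O = X @ W                                   # n x C matrix of outputs o_y
        O = O - O.max(axis=1, keepdims=True)        # subtract each row's maximum
        E = np.exp(O)                               # all exponents are now <= 0, no overflow
        return E / E.sum(axis=1, keepdims=True)     # softmax over classes

    def predict(X, W):
        return predict_proba(X, W).argmax(axis=1)   # index of the most probable class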
(c) (10 marks) Implement the fit function in logistic_regression.py to support training a logistic regression model by using gradient descent to minimize the log-loss.
Your implementation should allow users to specify the learning rate and number of
iterations as written in the documentation for the fit function.
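One possible shape for such a training loop, sketched with PyTorch autograd (the actual fit signature is the one documented in logistic_regression.py; the interface below is an assumption for illustration):

    import torch
    import torch.nn.functional as F

    def fit(X, y, n_classes, lr=0.1, n_iter=1000):
        # X: n x (d+1) with dummy feature, y: integer labels 0..C-1 (assumed encoding)
        X = torch.as_tensor(X, dtype=torch.float32)
        y = torch.as_tensor(y, dtype=torch.long)
        W = torch.zeros(X.shape[1], n_classes, requires_grad=True)
        for _ in range(n_iter):
            loss = F.cross_entropy(X @ W, y)   # log-loss; cross_entropy applies the softmax
            loss.backward()
            with torch.no_grad():
                W -= lr * W.grad               # plain gradient-descent step
                W.grad.zero_()
        return W.detach()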
(d) (5 marks) Tune the learning rate and number of learning iterations to minimize the
log-loss as much as possible. Describe how you do this. Report the training log-loss
of the model that you obtain, and the training and test accuracies.
(e) (5 marks) Gradient descent can converge very slowly in practice, and various more
efficient variants have been proposed. One variant is the momentum method.
Recall that when minimizing a general loss function 𝐿(w), gradient descent with a
constant learning rate η updates the t-th iterate w_t to w_{t+1} = w_t − η g_t,
where g_t = ∇L(w_t). Gradient descent with momentum further moves along the previous
direction w_t − w_{t−1}, and has an update rule of the form
w_{t+1} = w_t − η g_t + β(w_t − w_{t−1}),   (1)
where 𝛽 > 0 is a constant. Intuitively, the momentum method tries to keep the
‘momentum’ by moving along a previous direction.
Show that if w_1 = 0, w_2 = w_1 − η g_1, and we apply the momentum method to obtain
w_t for t ≥ 3, then for any t ≥ 2, we have
w_{t+1} = w_t − η(g_t + β g_{t−1} + . . . + β^{t−1} g_1).   (2)
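To make update rule (1) concrete, here is a toy sketch that applies it to a simple quadratic loss L(w) = 0.5‖w − w*‖^2, whose gradient is g_t = w_t − w*; the constants η and β are arbitrary illustrative choices.

    import numpy as np

    # Toy sketch of update rule (1) on L(w) = 0.5 * ||w - w_star||^2, so g_t = w_t - w_star.
    w_star = np.array([1.0, -2.0])
    eta, beta = 0.1, 0.9

    w_prev = np.zeros(2)                                       # w_1 = 0
    w = w_prev - eta * (w_prev - w_star)                       # w_2 = w_1 - eta * g_1
    for _ in range(200):
        g = w - w_star                                         # gradient at the current iterate
        w, w_prev = w - eta * g + beta * (w - w_prev), w       # update rule (1)
    print(w)                                                   # close to w_star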
(f) (5 marks) Modify your fit function to further support the momentum trick and tune
the momentum constant 𝛽 to try to make the log-loss as small as possible. Describe
how you do this. Report the training log-loss of the model that you obtain, and the
training and test accuracies.
You may find the torch.optim.SGD optimizer helpful.
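A sketch of what using torch.optim.SGD with momentum could look like inside the fit loop (toy tensors here; in the assignment X, y and W come from your own fit implementation). With a constant learning rate, PyTorch's velocity-buffer formulation of momentum matches update rule (1) with β equal to the momentum argument.

    import torch
    import torch.nn.functional as F

    # Toy sketch: the same log-loss minimization, but with torch.optim.SGD + momentum.
    torch.manual_seed(0)
    X = torch.randn(100, 4)
    y = torch.randint(0, 3, (100,))
    W = torch.zeros(4, 3, requires_grad=True)

    optimizer = torch.optim.SGD([W], lr=0.1, momentum=0.9)   # momentum constant beta = 0.9
    for _ in range(500):
        optimizer.zero_grad()
        loss = F.cross_entropy(X @ W, y)
        loss.backward()
        optimizer.step()           # gradient step with the momentum buffer handled for you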
(g) (5 marks) Another useful trick to speed up convergence is to normalize all features
to have mean 0 and unit variance first. Implement this, and repeat (d) to try to use
gradient descent to find a good model. Comment on the effectiveness of this trick.
You may find sklearn.preprocessing.StandardScaler helpful.
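A short sketch of this normalization step (toy arrays here; in the assignment X_train and X_test come from the covtype train-test split in logistic_regression.py):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Toy sketch: fit the scaler on the training set only, then reuse its statistics.
    X_train = np.random.rand(100, 5) * 1000   # features on very different scales
    X_test = np.random.rand(20, 5) * 1000

    scaler = StandardScaler().fit(X_train)    # learns per-feature mean and variance
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)  # apply the training-set statistics to the test set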
3. (35 marks) In lecture, we discussed the Theil-Sen estimator and RANSAC as two
subsampling-based regression algorithms for dealing with outliers. In this question, we study
the robustness of another subsampling-based algorithm, which we call ENOLS (ENsemble of
OLS). For training, ENOLS builds an ensemble of OLS models by sampling 𝑁 random
subsets (that is, subsets with no repeated examples) of 𝑚 training examples, and training an OLS
model on each of the subsets. Here 𝑁 and 𝑚 are user-specified hyper-parameters. For
prediction, ENOLS has two different strategies: (a) predict the mean of all the OLS
models’ outputs; (b) predict the median of all the OLS models’ outputs.
You will implement ENOLS and study its performance below.
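The sketch below illustrates the ENOLS idea described above (the ENOLS class in enols.py has its own interface and defaults; the function names, argument names and default values here are illustrative assumptions only):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def enols_fit(X, y, n_estimators=100, sample_size=10, random_state=0):
        # Train n_estimators OLS models, each on a random subset of sample_size
        # distinct training examples (sampling without replacement). X, y are numpy arrays.
        rng = np.random.default_rng(random_state)
        models = []
        for _ in range(n_estimators):
            idx = rng.choice(len(X), size=sample_size, replace=False)
            models.append(LinearRegression().fit(X[idx], y[idx]))
        return models

    def enols_predict(models, X, method="mean"):
        # Aggregate the individual OLS predictions by their mean or median.
        preds = np.stack([m.predict(X) for m in models])    # n_estimators x n_samples
        return preds.mean(axis=0) if method == "mean" else np.median(preds, axis=0)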
(a) (0 marks) You are given a file enols.py. In the main function, the code loads the
Boston house price dataset, generates a train-test split, generates a training set with
outliers, and compares the test MSEs for an OLS model trained on the original
training set and an OLS model trained on the training set with outliers. Run the code
and play with it to understand how it works.
(b) (5 marks) Implement ENOLS’s training algorithm as described above in the fit
function of the ENOLS class.
(c) (5 marks) Implement ENOLS’s two prediction strategies as described above in the
predict function of the ENOLS class.
(d) (5 marks) For each proportion 𝑝 from 0, 0.01, 0.02, . . . to 0.5, use the corrupt function
to generate a corrupted training set by making a proportion 𝑝 of the examples
outliers, then train an OLS model, a Theil-Sen estimator, and an ENOLS model (using
the default hyper-parameters), and measure their test MSEs. Plot the three models’
test MSEs against 𝑝, and compare their performance.
For reproducibility, set the random state of the corrupt function and fit function to
a number of your choice.
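A self-contained toy version of this experiment loop is sketched below. It uses synthetic data, a made-up corrupt_targets helper, and only OLS and Theil-Sen, because corrupt, ENOLS and the house price data are defined in enols.py; in your solution, use those instead and add the ENOLS model to the list of models being compared.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression, TheilSenRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    def corrupt_targets(y, p, random_state=0):
        # Made-up stand-in for the corrupt function in enols.py: shift a proportion
        # p of the targets far away from their true values.
        rng = np.random.default_rng(random_state)
        y = y.copy()
        idx = rng.choice(len(y), size=int(p * len(y)), replace=False)
        y[idx] += 100 * y.std()
        return y

    X, y = make_regression(n_samples=300, n_features=5, noise=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    ps = np.arange(0, 0.51, 0.01)
    results = {"OLS": [], "Theil-Sen": []}        # add ENOLS here in your solution
    for p in ps:
        yc = corrupt_targets(y_train, p, random_state=42)
        for name, model in [("OLS", LinearRegression()),
                            ("Theil-Sen", TheilSenRegressor(random_state=42))]:
            model.fit(X_train, yc)
            results[name].append(mean_squared_error(y_test, model.predict(X_test)))

    for name, mses in results.items():
        plt.plot(ps, mses, label=name)
    plt.xlabel("outlier proportion p")
    plt.ylabel("test MSE")
    plt.legend()
    plt.show()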
(e) (5 marks) Repeat (d), but set method="median" when using ENOLS for prediction.
How does the performance of ENOLS change as compared to (d)? Explain why.
(f) (5 marks) Repeat (e), but set n_estimators=500 when constructing an ENOLS model.
How does the performance of ENOLS change as compared to (e)? Explain why.
(g) (5 marks) Repeat (f), but set sample_size=42. How does the performance of ENOLS
change as compared to (f)? Explain why.
(h) (5 marks) The subset size 𝑚 has an important effect on the performance of ENOLS.
Here we consider the strategy of setting 𝑚 = 𝑛𝑞, where 0 < 𝑞 ≤ 1 is a fixed constant.
Suppose we are working on a problem with an outlier ratio of 𝑝 > 0, that is, in a
training set of 𝑛 examples, there are 𝑛𝑝 outliers. Show that the expected proportion
of subsets that do not contain an outlier can be arbitrarily close to 0 for large 𝑛.
In other words, sampling a fixed proportion 𝑞 of the training set is a bad choice in
this case.
For simplicity, you can assume that 𝑛𝑞 and 𝑛𝑝 are integers in your analysis.
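Before writing the proof, a quick numeric check may help build intuition (the values of p and q below are arbitrary illustrative choices): for a subset of m = nq examples drawn without replacement, the fraction of outlier-free subsets shrinks rapidly as n grows.

    # Numeric illustration for part (h): probability that a subset of m = n*q examples,
    # drawn without replacement, contains none of the n*p outliers.
    p, q = 0.1, 0.2
    for n in [50, 100, 500, 1000, 5000]:
        m, k = int(n * q), int(n * p)        # subset size and number of outliers
        prob_clean = 1.0
        for i in range(m):                   # each successive draw must avoid the outliers
            prob_clean *= (n - k - i) / (n - i)
        print(n, prob_clean)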
 