
Assignment 1 Problem 2


 
Problem 2 [50%] 
 
 
In this problem, we will explore gradient descent optimization for linear regression, applied to Boston house price prediction. The dataset is loaded by
import sklearn.datasets as datasets
where sklearn is a popular machine learning toolkit. Note that, in the coding assignment, we CANNOT use any other API functions from existing machine learning toolkits, including sklearn. Instead, we shall use linear algebra routines (e.g., numpy) for the assignment.
 
The dataset contains 506 samples, each with 13 features. We first randomly shuffle all samples, then take 300 samples as the training set, 100 samples as the validation set, and the remaining 106 as the test set.
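The shuffle-and-split step might be sketched as follows. Note that recent sklearn versions (>= 1.2) removed `load_boston`, so this sketch uses random placeholder arrays of the same shape (506 samples, 13 features) purely to make the split logic self-contained; in the actual assignment, `X` and `y` come from `sklearn.datasets`.

```python
import numpy as np

# Placeholder data standing in for the Boston dataset (506 x 13);
# in the assignment, X and y are loaded via sklearn.datasets.
rng = np.random.default_rng(0)
X = rng.standard_normal((506, 13))
y = rng.standard_normal(506)

# Randomly shuffle all samples once, then split 300 / 100 / 106.
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]

X_train, y_train = X[:300], y[:300]
X_val,   y_val   = X[300:400], y[300:400]
X_test,  y_test  = X[400:],    y[400:]

print(X_train.shape, X_val.shape, X_test.shape)
```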
 
We normalize the features and the output by the standard score

$x_j \leftarrow \frac{x_j - \mu_j}{\sigma_j}, \qquad y \leftarrow \frac{y - \mu_y}{\sigma_y}$

where $\mu$ and $\sigma$ denote the mean and standard deviation of the corresponding feature or output.
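A minimal normalization sketch, assuming z-score statistics are computed on the training set and reused for the validation and test splits (a common convention; placeholder arrays stand in for the real split here):

```python
import numpy as np

# Placeholder splits with non-trivial mean and scale.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((300, 13)) * 5.0 + 2.0
X_val   = rng.standard_normal((100, 13)) * 5.0 + 2.0
y_train = rng.standard_normal(300) * 9.0 + 22.0

# Statistics from the TRAINING split only (assumed convention).
mean_X, std_X = X_train.mean(axis=0), X_train.std(axis=0)
mean_y, std_y = y_train.mean(), y_train.std()

X_train_n = (X_train - mean_X) / std_X   # normalized training features
X_val_n   = (X_val - mean_X) / std_X     # same statistics reused
y_train_n = (y_train - mean_y) / std_y   # normalized training output
```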
We use mean square error as our loss to train our model. The measure of success, however, is the mean of the absolute difference between the predicted price and true price. Here, we call the measure of success the risk or error. This reflects how much money we would lose for a bad prediction. The lower, the better.
 
In other words, the training loss is

$\mathcal{L} = \frac{1}{M}\sum_{m=1}^{M}\left(\hat{y}_m - y_m\right)^2$

where we compute the loss on the normalized output $y_m$ and the prediction $\hat{y}_m$.
 
The measure of success (the lower, the better) is

$\text{Risk} = \frac{1}{N}\sum_{n=1}^{N}\left|\hat{y}_n^{\text{orig}} - y_n^{\text{orig}}\right|$

Here, the risk is defined on the original (de-normalized) output, since it represents real money.
 
Notice that we will use mini-batch gradient descent, and thus $M$ in the training loss should be the number of samples in a batch.
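The two quantities above could be implemented as small helpers. The de-normalization inside the risk, using the training-set statistics `mean_y` and `std_y`, is an assumption matching the normalization scheme sketched earlier:

```python
import numpy as np

def mse_loss(y_hat, y):
    """Training loss: mean square error over the M samples in a
    batch, computed on the normalized outputs."""
    y_hat, y = np.asarray(y_hat), np.asarray(y)
    return np.mean((y_hat - y) ** 2)

def mae_risk(y_hat_norm, y_orig, mean_y, std_y):
    """Risk: mean absolute error on the ORIGINAL price scale, so
    normalized predictions are de-normalized first."""
    y_hat_orig = np.asarray(y_hat_norm) * std_y + mean_y
    return np.mean(np.abs(y_hat_orig - np.asarray(y_orig)))
```

For example, `mse_loss([1.0, 2.0], [1.0, 4.0])` gives 2.0, and a perfectly de-normalized prediction yields a risk of 0.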
 
(a)[30%] We implement the train-validation-test framework, where we train the model by mini-batch gradient descent, and validate model performance after each epoch. After reaching the maximum number of iterations, we pick the epoch that yields the best validation performance (the lowest risk), and test the model on the test set. 
 
Without changing the default hyperparameters, we report three numbers:
1. The epoch that yields the best validation performance,
2. The validation performance (risk) in that epoch, and
3. The test performance (risk) in that epoch,
and two plots:
1. The learning curve of the training loss, and
2. The learning curve of the validation risk,
where the x-axis is the number of epochs, and the y-axis is the training loss and the validation risk, respectively.
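A condensed sketch of the train-validation loop on placeholder data. The learning rate, batch size, and epoch count are assumed hyperparameters (not the assignment's defaults), and the risk is computed on the normalized scale here only to keep the sketch short:

```python
import numpy as np

# Synthetic linear data standing in for the normalized Boston splits.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((300, 13))
w_true = rng.standard_normal(13)
y_train = X_train @ w_true + 0.1 * rng.standard_normal(300)
X_val = rng.standard_normal((100, 13))
y_val = X_val @ w_true + 0.1 * rng.standard_normal(100)

alpha, batch_size, max_epochs = 0.05, 10, 50   # assumed values
w, b = np.zeros(13), 0.0
best_epoch, best_risk = -1, np.inf
train_losses, val_risks = [], []               # the two learning curves

for epoch in range(max_epochs):
    perm = rng.permutation(len(X_train))
    epoch_loss = 0.0
    for start in range(0, len(X_train), batch_size):
        idx = perm[start:start + batch_size]
        err = X_train[idx] @ w + b - y_train[idx]     # (M,) residuals
        epoch_loss += np.mean(err ** 2)
        # Gradient of the mean square error over the mini-batch.
        w -= alpha * 2.0 / len(idx) * X_train[idx].T @ err
        b -= alpha * 2.0 / len(idx) * err.sum()
    train_losses.append(epoch_loss / (len(X_train) // batch_size))
    # Validate after each epoch; remember the best-performing one.
    val_risk = np.mean(np.abs(X_val @ w + b - y_val))
    val_risks.append(val_risk)
    if val_risk < best_risk:
        best_epoch, best_risk = epoch, val_risk
        best_w, best_b = w.copy(), b

print(best_epoch, best_risk)
```

After training, `best_w`/`best_b` would be evaluated once on the test set, and `train_losses`/`val_risks` plotted against the epoch number.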
 
(b)[10%] We now explore non-linear features in the linear regression model. In particular, we adopt point-wise quadratic features. Suppose the original features are $(x_1, \ldots, x_{13})$. We now extend them to $(x_1, \ldots, x_{13}, x_1^2, \ldots, x_{13}^2)$.
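The point-wise quadratic expansion is a single concatenation in numpy, shown here on a toy matrix with 3 features in place of the actual 13:

```python
import numpy as np

# Append x_j^2 for every original feature: (N, d) -> (N, 2d).
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
X_aug = np.concatenate([X, X ** 2], axis=1)
print(X_aug)
```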
 
At the same time, we tune an $\ell_2$-penalty to prevent overfitting by

$\mathcal{L} = \frac{1}{M}\sum_{m=1}^{M}\left(\hat{y}_m - y_m\right)^2 + \frac{\lambda}{2}\,\lVert\mathbf{w}\rVert_2^2$

The hyperparameter $\lambda$ should be tuned from the set {3, 1, 0.3, 0.1, 0.03, 0.01}.
 
Report the best hyperparameter $\lambda$, i.e., the one that yields the best validation performance, and, under this hyperparameter, the three numbers and two plots required in Problem 2(a).
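The grid search could be sketched as below. The condensed `train(lam)` helper is hypothetical (standing in for the full Problem 2(a) loop); the key change from part (a) is the extra `lam * w` term in the weight gradient, which comes from differentiating the penalty $(\lambda/2)\lVert\mathbf{w}\rVert_2^2$. The bias is conventionally left unpenalized:

```python
import numpy as np

# Placeholder 26-dimensional data (13 original + 13 quadratic features).
rng = np.random.default_rng(0)
X_train = rng.standard_normal((300, 26)); y_train = rng.standard_normal(300)
X_val = rng.standard_normal((100, 26));   y_val = rng.standard_normal(100)

def train(lam, alpha=0.01, epochs=20, batch_size=10):
    """Hypothetical condensed trainer: mini-batch GD with an l2
    penalty (lam/2)*||w||^2 on the weights (bias unpenalized).
    Returns the best validation risk seen over all epochs."""
    w, b = np.zeros(X_train.shape[1]), 0.0
    best = np.inf
    for _ in range(epochs):
        perm = rng.permutation(len(X_train))
        for s in range(0, len(X_train), batch_size):
            idx = perm[s:s + batch_size]
            err = X_train[idx] @ w + b - y_train[idx]
            # MSE gradient plus the penalty's contribution lam * w.
            w -= alpha * (2.0 / len(idx) * X_train[idx].T @ err + lam * w)
            b -= alpha * (2.0 / len(idx) * err.sum())
        best = min(best, np.mean(np.abs(X_val @ w + b - y_val)))
    return best

grid = [3, 1, 0.3, 0.1, 0.03, 0.01]
results = {lam: train(lam) for lam in grid}
best_lam = min(results, key=results.get)
print(best_lam)
```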
 
(c)[10%] Ask a meaningful scientific question on this task by yourself, design an experimental protocol, report experimental results, and draw a conclusion.
 