CS5487 Programming Assignment 1

Regression

Antoni Chan

Department of Computer Science

City University of Hong Kong

In this programming assignment you will implement and test some of the regression methods that

were introduced in the lecture and in the problem sets. Let f(x, θ) be a function with input x ∈ Rd

and parameters θ ∈ RD, and

f(x, θ) = φ(x)T

θ, (1)

where φ(x) : Rd → RD is a feature transformation of x. For example, the K-th order polynomial

with input x ∈ R can be expressed as

f(x, θ) = XKk=0

xkθk = φ(x)T

θ, (2)

where the feature transformation and parameters are

φ(x) = 1, x, x2, · · · , xKT ∈ RK+1, θ = θ0, · · · , θKT ∈ RK+1

. (3)

Our goal is to obtain the best estimate of the function given iid samples D = {(x1, y1), . . . ,(xn, yn)},

where yi are the noisy observations of f(xi

, θ). We have seen several ways of doing the regression,

using different assumed noise models and formulations. For convenience, define the following quantities,

y = y1, · · · , ynT , Φ = φ(x1), · · · , φ(xn)

, X = x1, · · · , xn . (4)

Here is a summary of the various regression algorithms we have seen so far:

method objective function parameter estimate ˆθ prediction f∗ for input x∗

least-squares (LS)

y y ΦT θ

2 ˆθLS = (ΦΦT )1Φy f∗ = φ(x∗)T ˆθ

regularized LS (RLS)

y y ΦT θ

2 + λ kθk2 ˆθRLS = (ΦΦT + λI) 1Φy f∗ = φ(x∗)T ˆθ

L1-regularized LS (LASSO)

y y ΦT θ

2 + λ kθk1 QP solver (see Prob. 3.12) f∗ = φ(x∗)T ˆθ

robust regression (RR)

y y ΦT θ

LP solver (see Prob. 2.10) f∗ = φ(x∗)T ˆθ

distributions posterior predictive

Bayesian regression (BR) θ ∼ N (0, αI) θ|X, y ∼ N (ˆµθ, σˆ2θ ) f∗|X, y, x∗ ∼ N (ˆµ∗, σˆ2∗) y|x, θ ∼ N (f(x, θ), σ2) µˆθ = 1σ2 Σˆ θΦy µˆ∗ = φ(x∗)T µˆθ Σˆ θ = ( 1α I + 1σ2 ΦΦT ) 1 σˆ2∗ = φ(x∗)T Σˆ θφ(x∗) 1

−2 −1.5 −1 −0.5 0 0.5 1 1.5 2

−20

−10

x y

samples

function

Part 1 Polynomial function

In the first problem, you will test these regression methods on a simple dataset, and look at the

effects of using different objective functions, regularization terms, and formulations. The MATLAB

data file poly data.mat (or poly data *.txt if you are not using MATLAB) contains a set of

samples generated from a 5th order polynomial function, with observation noise σ2 = 5. The file

contains the following variables:

• sampx - sample input values (xi), each entry is an input value.

• sampy - sample output values (yi), each entry is an output.

• polyx - input values for the true function (also these are the test inputs x∗).

• polyy - output values for the true function.

• thtrue - the true value of the θ used to generate the data.

A plot of the samples (sampx,sampy) and the true polynomial function (polyx, polyy) are shown

below.

The goal is to try to estimate the polynomial function from the samples.

(a) Implement the above 5 regression algorithms for the K-th order polynomial given in (2). In

the next problem, you will use these regression methods with a different feature transformation φ(x). Hence, it would be better to separate the regression algorithm and the feature

transformation in your implementation.

(b) For each regression method, use the sample data (sampx, sampy) to estimate the parameters of

a 5th order polynomial function. Make a plot of the estimated function using polyx as inputs,

along with the sample data. For BR, also plot the standard deviation around the mean. What

is the mean-squared error between the learned function outputs and the true function outputs

(polyy), averaged over all input values in polyx? For algorithms with hyperparameters, select

some values that tend to work well.

samples (e.g., 10%, 25%, 50%, 75%). Plot the estimated functions. Which models are more

robust with less data, and which tend to overfit? Make a plot of error versus training size, and

comment on any important trends and findings. (You should run multiple trials with different

random subsets, and take the average error).

(d) Add some outliers output values (e.g., add large numbers to a few values in sampy), and repeat

(b). Which methods are robust to the presence of outliers, and which are most sensitive? Why?

(e) Repeat (b) but estimate a higher-order polynomial (e.g., 10th order). Which models tend to

overfit the data when learning a more complex model, and which models do not? Verify your

observations by examining the estimated parameter values.

. . . . . . . . .

Part 2 A real world regression problem – counting people

Now we will consider a real world regression problem – estimating the number of people in an

image. Below is an example image of a sidewalk with people.

From each image, we extract a feature vector xi ∈ R9

, which is based on some properties of the

crowd (e.g., the area covered by the crowd, the perimeter of the crowd, etc). We also have the

number of people in the image, yi ∈ R. The goal is to learn a regression function between this

feature vector and the number of people in the image. (Let’s ignore the fact that the count is

actually a non-negative integer, and just apply our standard regression methods).

The data can be found in the count data.mat MATLAB file (or count data *.txt if you

are not using MATLAB). The mat file contains the following variables:

• trainx – training inputs. Each column is a 9-dimensional feature vector.

• trainy – training outputs (mean-subtracted counts for each input vector).

• testx – test inputs.

• testy – true outputs for the test inputs.

Note that the output values are actually the mean-subtracted counts, so you will see some negative

outputs.

(a) Let’s first look at using the features directly, i.e., set φ(x) = x. Use the training set (trainx,

trainy), estimate a function with some of the regression algorithms above. Use the test set

inputs testx to predict the output, and compare the predictions to the true output testy. (You

can round the function outputs so they are counting numbers). Calculate the mean-absolute

error and mean-squared error. Which method works the best? Plot the test predictions and

the true counts, and discuss any interesting findings.

(b) Now try some other feature transformations. For example, you can create a simple 2nd order

polynomial as φ(x) = x1, · · · , x9, x21, · · · , x29T

. This could be expanded to include more crossterms xixj . Try some other non-linear transformation of the features. Can you improve your

results from (a)?

. . . . . . . . .

Part 3 Estimating hyperparameters (optional)

So far you probably have been setting the hyperparameters of each method (e.g., λ) by hand to get

the best performance on the test set. This is not a fair way to do an experiment, since it amounts

to optimizing to the test set, and doesn’t reflect whether the model can generalize well to unseen

data. In a real scenario, you wouldn’t have the true outputs for the test set! Here are two common

ways to set the hyperparameters using only the training set:

1. N-fold cross-validation: With this method, the original training set is split into a new

training set and a validation set. The regression function is estimated from the new (smaller)

training set, and tested on the validation set, for different hyperparameter values. This is

repeated for different partitions of the original training set. Formally, separate the training set

D into N equal partitions, {D1, · · · , DN }. For each partition i and candidate hyperparameter

value λj : 1) train a regression function using λj on D\Di (all the data except for Di); 2) test

the function on Di

; 3) record the error ei,j . Finally, sum the error over the N partitions,

Ej = PNi=1 ei,j . Select the λj with the lowest overall error, and train the final regression

function on the full training set D. (usually people use between N = 3 to N = 10 folds,

depending on the size of the training set).

2. maximum marginal likelihood: For Bayesian regression, another way to select the hyperparameters is to maximize the marginal likelihood of the training data (see Problem 3.11 for

details),

α, ˆ σˆ2 = argmax

α,σ2 p(y|X, σ2

, α) = argmax

α,σ2

log p(y|X, σ2

, α) (5)

where the marginal log-likelihood is

log p(y|X, σ2

, α) = log Z p(y|θ, X, σ2)p(θ|α)dθ (6)

= =D2

log α n2

log σ2 · 12σ2

y y ΦT µˆθ

2 · 12α kµˆθk2 · 12

log

Σˆ^ 1 θ n2

log(2π). (7)

The procedure is as follows: for each candidate hyperparameter pair {αj , σ2j }, calculate the

marginal likelihood in (7). Select the hyperparameter pair that results in the largest marginal

likelihood.

(a) For Problem 2, use cross-validation or maximum marginal likelihood to select the hyperparameters using only the training data. Does the accuracy increase or decrease? How close were the

automatically selected parameters to your hand-selected ones?

. . . . . . . . .

• What to hand in – You need to turn in the following things:

1. Answers, plots, analysis, discussion, etc. for the above problems.

2. Source code files.

You must submit your programming assignment using the Canvas website. Go to “Assignments” ⇒ “Programming Assignment 1”.

• Plagiarism – You should implement each regression algorithm using your own code. Do

not use someone else’s implementation of the regression algorithms (including MATLAB’s

implementation). It is okay to use common library components and functions, e.g. standard

matrix operations (e.g., inv, eig), general purpose optimization toolboxes (e.g., quadprog,

linprog), etc. If you use any special toolboxes or libraries, make sure to note them in your

report.