# CS5487 Programming Assignment 1

CS5487 Programming Assignment 1

Regression
Antoni Chan
Department of Computer Science
City University of Hong Kong
In this programming assignment you will implement and test some of the regression methods that
were introduced in the lecture and in the problem sets. Let f(x, θ) be a function with input x ∈ Rd
and parameters θ ∈ RD, and
f(x, θ) = φ(x)T
θ, (1)
where φ(x) : Rd → RD is a feature transformation of x. For example, the K-th order polynomial
with input x ∈ R can be expressed as
f(x, θ) = XKk=0
xkθk = φ(x)T
θ, (2)
where the feature transformation and parameters are
φ(x) = 1, x, x2, · · · , xKT ∈ RK+1, θ = θ0, · · · , θKT ∈ RK+1
. (3)
Our goal is to obtain the best estimate of the function given iid samples D = {(x1, y1), . . . ,(xn, yn)},
where yi are the noisy observations of f(xi
, θ). We have seen several ways of doing the regression,
using different assumed noise models and formulations. For convenience, define the following quan￾tities,
y = y1, · · · , ynT , Φ = φ(x1), · · · , φ(xn)
, X = x1, · · · , xn . (4)
Here is a summary of the various regression algorithms we have seen so far:
method objective function parameter estimate ˆθ prediction f∗ for input x∗
least-squares (LS)

y y ΦT θ

2 ˆθLS = (ΦΦT )1Φy f∗ = φ(x∗)T ˆθ
regularized LS (RLS)

y y ΦT θ

2 + λ kθk2 ˆθRLS = (ΦΦT + λI) 1Φy f∗ = φ(x∗)T ˆθ
L1-regularized LS (LASSO)

y y ΦT θ

2 + λ kθk1 QP solver (see Prob. 3.12) f∗ = φ(x∗)T ˆθ
robust regression (RR)

y y ΦT θ

1
LP solver (see Prob. 2.10) f∗ = φ(x∗)T ˆθ
distributions posterior predictive
Bayesian regression (BR) θ ∼ N (0, αI) θ|X, y ∼ N (ˆµθ, σˆ2θ ) f∗|X, y, x∗ ∼ N (ˆµ∗, σˆ2∗) y|x, θ ∼ N (f(x, θ), σ2) µˆθ = 1σ2 Σˆ θΦy µˆ∗ = φ(x∗)T µˆθ Σˆ θ = ( 1α I + 1σ2 ΦΦT ) 1 σˆ2∗ = φ(x∗)T Σˆ θφ(x∗) 1
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
−20
−10
0
10
20
30
40
x y
samples
function
Part 1 Polynomial function
In the first problem, you will test these regression methods on a simple dataset, and look at the
effects of using different objective functions, regularization terms, and formulations. The MATLAB
data file poly data.mat (or poly data *.txt if you are not using MATLAB) contains a set of
samples generated from a 5th order polynomial function, with observation noise σ2 = 5. The file
contains the following variables:
• sampx - sample input values (xi), each entry is an input value.
• sampy - sample output values (yi), each entry is an output.
• polyx - input values for the true function (also these are the test inputs x∗).
• polyy - output values for the true function.
• thtrue - the true value of the θ used to generate the data.
A plot of the samples (sampx,sampy) and the true polynomial function (polyx, polyy) are shown
below.
The goal is to try to estimate the polynomial function from the samples.
(a) Implement the above 5 regression algorithms for the K-th order polynomial given in (2). In
the next problem, you will use these regression methods with a different feature transforma￾tion φ(x). Hence, it would be better to separate the regression algorithm and the feature
(b) For each regression method, use the sample data (sampx, sampy) to estimate the parameters of
a 5th order polynomial function. Make a plot of the estimated function using polyx as inputs,
along with the sample data. For BR, also plot the standard deviation around the mean. What
is the mean-squared error between the learned function outputs and the true function outputs
(polyy), averaged over all input values in polyx? For algorithms with hyperparameters, select
some values that tend to work well.
2
(c) Repeat (b), but reduce the amount of training data available, by selecting a subset of the
samples (e.g., 10%, 25%, 50%, 75%). Plot the estimated functions. Which models are more
robust with less data, and which tend to overfit? Make a plot of error versus training size, and
comment on any important trends and findings. (You should run multiple trials with different
random subsets, and take the average error).
(d) Add some outliers output values (e.g., add large numbers to a few values in sampy), and repeat
(b). Which methods are robust to the presence of outliers, and which are most sensitive? Why?
(e) Repeat (b) but estimate a higher-order polynomial (e.g., 10th order). Which models tend to
overfit the data when learning a more complex model, and which models do not? Verify your
observations by examining the estimated parameter values.
. . . . . . . . .
Part 2 A real world regression problem – counting people
Now we will consider a real world regression problem – estimating the number of people in an
image. Below is an example image of a sidewalk with people.
From each image, we extract a feature vector xi ∈ R9
, which is based on some properties of the
crowd (e.g., the area covered by the crowd, the perimeter of the crowd, etc). We also have the
number of people in the image, yi ∈ R. The goal is to learn a regression function between this
feature vector and the number of people in the image. (Let’s ignore the fact that the count is
actually a non-negative integer, and just apply our standard regression methods).
The data can be found in the count data.mat MATLAB file (or count data *.txt if you
are not using MATLAB). The mat file contains the following variables:
• trainx – training inputs. Each column is a 9-dimensional feature vector.
• trainy – training outputs (mean-subtracted counts for each input vector).
• testx – test inputs.
• testy – true outputs for the test inputs.
Note that the output values are actually the mean-subtracted counts, so you will see some negative
outputs.
3
(a) Let’s first look at using the features directly, i.e., set φ(x) = x. Use the training set (trainx,
trainy), estimate a function with some of the regression algorithms above. Use the test set
inputs testx to predict the output, and compare the predictions to the true output testy. (You
can round the function outputs so they are counting numbers). Calculate the mean-absolute
error and mean-squared error. Which method works the best? Plot the test predictions and
the true counts, and discuss any interesting findings.
(b) Now try some other feature transformations. For example, you can create a simple 2nd order
polynomial as φ(x) = x1, · · · , x9, x21, · · · , x29T
. This could be expanded to include more cross￾terms xixj . Try some other non-linear transformation of the features. Can you improve your
results from (a)?
. . . . . . . . .
Part 3 Estimating hyperparameters (optional)
So far you probably have been setting the hyperparameters of each method (e.g., λ) by hand to get
the best performance on the test set. This is not a fair way to do an experiment, since it amounts
to optimizing to the test set, and doesn’t reflect whether the model can generalize well to unseen
data. In a real scenario, you wouldn’t have the true outputs for the test set! Here are two common
ways to set the hyperparameters using only the training set:
1. N-fold cross-validation: With this method, the original training set is split into a new
training set and a validation set. The regression function is estimated from the new (smaller)
training set, and tested on the validation set, for different hyperparameter values. This is
repeated for different partitions of the original training set. Formally, separate the training set
D into N equal partitions, {D1, · · · , DN }. For each partition i and candidate hyperparameter
value λj : 1) train a regression function using λj on D\Di (all the data except for Di); 2) test
the function on Di
; 3) record the error ei,j . Finally, sum the error over the N partitions,
Ej = PNi=1 ei,j . Select the λj with the lowest overall error, and train the final regression
function on the full training set D. (usually people use between N = 3 to N = 10 folds,
depending on the size of the training set).
2. maximum marginal likelihood: For Bayesian regression, another way to select the hyper￾parameters is to maximize the marginal likelihood of the training data (see Problem 3.11 for
details),
α, ˆ σˆ2 = argmax
α,σ2 p(y|X, σ2
, α) = argmax
α,σ2
log p(y|X, σ2
, α) (5)
where the marginal log-likelihood is
log p(y|X, σ2
, α) = log Z p(y|θ, X, σ2)p(θ|α)dθ (6)
= =D2
log α n2
log σ2 · 12σ2

y y ΦT µˆθ

2 · 12α kµˆθk2 · 12
log
Σˆ^ 1 θ n2
log(2π). (7)
The procedure is as follows: for each candidate hyperparameter pair {αj , σ2j }, calculate the
marginal likelihood in (7). Select the hyperparameter pair that results in the largest marginal
likelihood.
4
(a) For Problem 2, use cross-validation or maximum marginal likelihood to select the hyperparam￾eters using only the training data. Does the accuracy increase or decrease? How close were the
automatically selected parameters to your hand-selected ones?
. . . . . . . . .
• What to hand in – You need to turn in the following things:
1. Answers, plots, analysis, discussion, etc. for the above problems.
2. Source code files.
You must submit your programming assignment using the Canvas website. Go to “Assign￾ments” ⇒ “Programming Assignment 1”.
• Plagiarism – You should implement each regression algorithm using your own code. Do
not use someone else’s implementation of the regression algorithms (including MATLAB’s
implementation). It is okay to use common library components and functions, e.g. standard
matrix operations (e.g., inv, eig), general purpose optimization toolboxes (e.g., quadprog,
linprog), etc. If you use any special toolboxes or libraries, make sure to note them in your
report.
• Grading – The marks for this assignment will be distributed as follows:
– 10% – Implementation of the regression algorithms (1a).
– 40% – Results testing the regression functions (1b, 1c, 1d, 1e).
– 20% – Results on counting people (2a).
– 10% – Trying out other features for counting people (2b).
– 20% – Quality of the written report. More points given to insightful observations and
analysis.
Note: if you really cannot implement the algorithms correctly, it is okay to use 3rd party
software. You will not get marks for implementation, but you can still get marks for presenting
the results.
• Optional Problems – The optional problem 3 will not be graded.

• QQ：99515681
• 邮箱：99515681@qq.com
• 工作时间：8:00-23:00
• 微信：codinghelp