DS 5220 - Spring 2019
Homework 1
Please follow the homework submission instructions provided on Piazza.
Due on Blackboard before midnight on Friday, January 18, 2019.
Each part of each problem is worth 5 points.
1. [Analytical question] Consider two Normally distributed random variables $Y_1$ and $Y_2$ with
expected values $\mu_1$ and $\mu_2$, variances $\sigma_1^2$ and $\sigma_2^2$, and correlation $\rho$.
(a) State the joint probability distribution of these random variables. State it twice:
once in a non-matrix and the second time in a matrix form. Explain the meaning
of each term.
(b) Use Bayes' theorem to derive the conditional probability distributions of $Y_1 \mid Y_2$ and
of $Y_2 \mid Y_1$.
(c) Does the correlation or the parameter of linear regression depend on whether we
want to predict $Y_1$ as a function of $Y_2$, or $Y_2$ as a function of $Y_1$?
(d) Use the derivations above to explain the difference between the coefficient of correlation
and the slope of linear regression.
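For reference when fixing notation in Question 1, the standard bivariate Normal density can be written as follows. This is a sketch of the usual textbook forms, assuming $|\rho| < 1$ and $\sigma_1, \sigma_2 > 0$; the symbols $\mathbf{y}$, $\boldsymbol{\mu}$, and $\Sigma$ are introduced here and do not appear in the problem statement.

```latex
% Non-matrix form
f(y_1, y_2) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1-\rho^2}}
  \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[
    \frac{(y_1-\mu_1)^2}{\sigma_1^2}
    - \frac{2\rho\,(y_1-\mu_1)(y_2-\mu_2)}{\sigma_1 \sigma_2}
    + \frac{(y_2-\mu_2)^2}{\sigma_2^2} \right] \right\}

% Matrix form, with y = (y_1, y_2)^T and mu = (mu_1, mu_2)^T
f(\mathbf{y}) = \frac{1}{2\pi\,|\Sigma|^{1/2}}
  \exp\left\{ -\tfrac{1}{2} (\mathbf{y}-\boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{y}-\boldsymbol{\mu}) \right\},
\qquad
\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\,\sigma_1\sigma_2 \\ \rho\,\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}
```

The two forms agree because $|\Sigma| = \sigma_1^2 \sigma_2^2 (1-\rho^2)$.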
2. [Analytical question] Consider the following loss functions for error terms $e_i$, $i = 1, \ldots, N$,
in linear regression. For each loss function, (i) state whether it is convex,
(ii) provide a mathematical proof, and (iii) explain how it can be useful in the context
of linear regression.
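(a) Quadratic loss (related to mean squared error, L2 norm): $L = \sum_{i=1}^{N} e_i^2$
(b) Mean absolute error (L1 norm): $L = \sum_{i=1}^{N} |e_i|$
(c) Huber loss (smooth mean absolute error) with parameter $\delta$:
$L = \sum_{i=1}^{N} l(e_i)$, where
$l(e) = \begin{cases} \frac{1}{2} e^2, & \text{if } |e| \le \delta \\ \delta |e| - \frac{1}{2} \delta^2, & \text{if } |e| > \delta \end{cases}$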
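The three loss functions above can be sketched directly in Python. This is a minimal illustration using only the standard library; the function names are hypothetical and the losses are the unaveraged sums exactly as written in the question.

```python
def quadratic_loss(errors):
    """Sum of squared errors."""
    return sum(e * e for e in errors)


def absolute_loss(errors):
    """Sum of absolute errors."""
    return sum(abs(e) for e in errors)


def huber_loss(errors, delta):
    """Huber loss: quadratic for |e| <= delta, linear beyond it."""
    total = 0.0
    for e in errors:
        if abs(e) <= delta:
            total += 0.5 * e * e
        else:
            total += delta * abs(e) - 0.5 * delta * delta
    return total


errors = [1.0, -2.0, 3.0]
print(quadratic_loss(errors))   # 14.0
print(absolute_loss(errors))    # 6.0
print(huber_loss(errors, 2.0))  # 6.5
```

The two branches of the Huber loss meet at $|e| = \delta$, where both evaluate to $\frac{1}{2}\delta^2$.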
3. [Analytical question] For linear regression $Y_i = \theta_0 + \theta_1 X_i + e_i$, $i = 1, \ldots, N$, minimizing
squared loss:
(a) Write down the likelihood on the training data, and analytically derive the maximum
likelihood solution for parameter estimates.
(b) Calculate the gradient with respect to the parameter vector.
(c) Write down the steps of the (batch) gradient descent rule.
(d) Write down the steps of the stochastic gradient descent rule.
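As a notational sketch for parts (b) through (d), the gradient and the two update rules can be written as below. This assumes the unaveraged squared loss $L(\theta) = \sum_i e_i^2$ with $e_i = Y_i - \theta_0 - \theta_1 X_i$ and a fixed learning rate $\alpha$; the iteration index $t$ is introduced here, and other texts scale the loss by $1/N$ or $1/(2N)$, which only rescales $\alpha$.

```latex
% Gradient of L(theta) = sum_i e_i^2 with respect to theta = (theta_0, theta_1)^T
\nabla_{\theta} L(\theta) = -2 \sum_{i=1}^{N} e_i \begin{pmatrix} 1 \\ X_i \end{pmatrix}

% Batch gradient descent: use all N observations at every step
\theta^{(t+1)} = \theta^{(t)} - \alpha \, \nabla_{\theta} L\big(\theta^{(t)}\big)
             = \theta^{(t)} + 2\alpha \sum_{i=1}^{N} e_i^{(t)} \begin{pmatrix} 1 \\ X_i \end{pmatrix}

% Stochastic gradient descent: pick one index i (at random or in shuffled order) per step
\theta^{(t+1)} = \theta^{(t)} + 2\alpha \, e_i^{(t)} \begin{pmatrix} 1 \\ X_i \end{pmatrix}
```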
4. [Implementation question]
(a) Overlay graphs of the loss functions in question 2 for a range of e (consider two
different values of δ for Huber loss). Use the graph to discuss the relative advantages
and disadvantages of these loss functions for linear regression.
(b) Implement gradient descent for the loss functions above.
(c) Implement stochastic gradient descent for the loss functions above.
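One possible shape for parts (b) and (c) is sketched below for the model $Y_i = \theta_0 + \theta_1 X_i + e_i$. It is an illustration, not a reference solution, and makes several assumptions not stated in the assignment: the function names are hypothetical, the batch gradient is averaged over $N$ for step-size stability, 0 is used as the subgradient of $|e|$ at $e = 0$, parameters start at zero, and the iteration counts are fixed rather than chosen by a convergence check.

```python
import random


def sign(e):
    """Sign of e, with sign(0) = 0 (a valid subgradient of |e| at 0)."""
    return (e > 0) - (e < 0)


def loss_derivative(e, loss, delta=1.0):
    """Derivative (or subgradient) of the per-observation loss l(e) in e."""
    if loss == "quadratic":
        return 2.0 * e
    if loss == "absolute":
        return float(sign(e))
    if loss == "huber":
        return e if abs(e) <= delta else delta * sign(e)
    raise ValueError("unknown loss: " + loss)


def batch_gd(x, y, loss, alpha=0.01, n_iter=2000, delta=1.0):
    """Batch gradient descent using the gradient averaged over all points."""
    theta0, theta1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            d = loss_derivative(yi - theta0 - theta1 * xi, loss, delta)
            g0 -= d       # de/dtheta0 = -1
            g1 -= d * xi  # de/dtheta1 = -x_i
        theta0 -= alpha * g0 / n
        theta1 -= alpha * g1 / n
    return theta0, theta1


def sgd(x, y, loss, alpha=0.01, n_epochs=200, delta=1.0, seed=0):
    """Stochastic gradient descent: one observation per update."""
    rng = random.Random(seed)
    theta0, theta1 = 0.0, 0.0
    indices = list(range(len(x)))
    for _ in range(n_epochs):
        rng.shuffle(indices)  # visit observations in a fresh random order
        for i in indices:
            d = loss_derivative(y[i] - theta0 - theta1 * x[i], loss, delta)
            theta0 += alpha * d
            theta1 += alpha * d * x[i]
    return theta0, theta1


# Noise-free check: data generated exactly from y = 3 + 2x
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [3.0 + 2.0 * xi for xi in x]
print(batch_gd(x, y, "quadratic"))
print(sgd(x, y, "quadratic"))
print(batch_gd(x, y, "huber", delta=1.0))
```

With a constant step size, the absolute-error updates do not shrink near the optimum, so those estimates tend to oscillate around the minimizer; a decaying learning rate is a common remedy.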
5. [Implementation question] In this question we will revisit JW Figure 3.3, and empirically
evaluate various approaches to fitting linear regression.
(a) Simulate $N = 50$ values of $X_i$, distributed Uniformly on the interval $(-2, 2)$. Simulate the
values of $Y_i = 3 + 2X_i + e_i$, where $e_i$ is drawn from $N(0, 4)$. Fit linear regression
with squared loss to the simulated data using (i) analytical solution, (ii) batch
gradient descent, and (iii) stochastic gradient descent implemented in Question 4.
Set learning rate $\alpha$ to a small value (say, $\alpha = 0.01$).
(b) Repeat (a) 1,000 times, overlay the histograms of the estimates of the slopes, and
overlay the true value. Comment on how the choice of the algorithm affects the
estimates of the slope parameter.
(c) Simulate $N = 50$ values of $X_i$, distributed Uniformly on the interval $(-2, 2)$. Simulate the
values of $Y_i = 3 + 2X_i + e_i$, where $e_i$ is drawn from $N(0, 4)$. Fit linear regression with
(i) squared loss with the analytical solution, (ii) mean absolute error with batch
gradient descent, and (iii) Huber loss with batch gradient descent implemented in
Question 4. Set learning rate $\alpha$ to a small value (say, $\alpha = 0.01$).
(d) Repeat (c) 1,000 times, overlay the histograms of the estimates of the slopes, and
overlay the true value. Comment on how the choice of the loss function in the case
of Normal distribution affects the estimates of the slope parameter.
(e) Simulate $N = 50$ values of $X_i$, distributed Uniformly on the interval $(-2, 2)$. Simulate
the values of $Y_i = 3 + 2X_i + e_i$, where $e_i$ is drawn from $N(0, 4)$. Modify the
simulated values of $Y$ to introduce outliers, as follows. With probability 0.1, select
an observation for modification. If it is selected, increase its value by 200% with
probability 0.5, and decrease its value by 200% with probability 0.5. Fit linear
regression to the modified data, with (i) squared loss with the analytical solution,
(ii) mean absolute error with batch gradient descent, and (iii) Huber loss with batch
gradient descent implemented in Question 4. Set learning rate $\alpha$ to a small value
(say, $\alpha = 0.01$).
(f) Repeat (e) 1,000 times, overlay the histograms of the estimates of the slopes, and
overlay the true value. Comment on how the choice of the loss function in the presence
of outliers affects the estimates of the slope parameter.
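The simulation and contamination steps in Question 5 can be sketched with the standard library alone. Several readings here are assumptions rather than statements from the assignment: $N(0, 4)$ is taken to mean variance 4 (standard deviation 2), "increase by 200%" is read literally as $y + 2y = 3y$ and "decrease by 200%" as $y - 2y = -y$, and all function names are hypothetical. Only the analytical squared-loss fit is shown; the gradient-based fits would plug in the same way.

```python
import random


def simulate(n=50, theta0=3.0, theta1=2.0, sigma=2.0, rng=random):
    """Draw X ~ Uniform(-2, 2) and Y = theta0 + theta1 * X + e, e ~ N(0, sigma^2)."""
    x = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    y = [theta0 + theta1 * xi + rng.gauss(0.0, sigma) for xi in x]
    return x, y


def add_outliers(y, p_select=0.1, rng=random):
    """Select each value with probability p_select, then triple it or
    flip its sign with equal probability (assumed reading of +/-200%)."""
    modified = []
    for yi in y:
        if rng.random() < p_select:
            yi = 3.0 * yi if rng.random() < 0.5 else -yi
        modified.append(yi)
    return modified


def ols(x, y):
    """Closed-form least-squares estimates (intercept, slope)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def slope_estimates(n_rep=1000, contaminate=False, seed=0):
    """Repeat the simulation n_rep times and collect the fitted slopes."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_rep):
        x, y = simulate(rng=rng)
        if contaminate:
            y = add_outliers(y, rng=rng)
        slopes.append(ols(x, y)[1])
    return slopes


clean = slope_estimates(contaminate=False)
dirty = slope_estimates(contaminate=True)
print(sum(clean) / len(clean), sum(dirty) / len(dirty))
```

The two lists of slopes can then be passed to any histogram routine and overlaid with a vertical line at the true slope of 2.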