
Statistical Machine Learning GR5241

Spring 2021

Homework 2

Due: Monday, February 22nd by 11:59pm

Homework submission: Please submit your homework as a pdf on Canvas.

Problem 1 (Training Error vs. Test Error, ESL 2.9)

In this problem, we want to use the least squares estimator to illustrate the point that the training error is generally an underestimate of the prediction error (or test error).

Consider a linear regression model with p parameters, Y = βᵀX + ε. We fit the model by least squares to a set of training data (x1, y1), . . . , (xN, yN) drawn independently from a population. Let β̂ be the least squares estimate obtained from the training data. Suppose we have some test data (x̃1, ỹ1), . . . , (x̃M, ỹM) (N ≥ M > p) drawn at random from the same population as the training data. Writing the training error as Rtr(β̂) = (1/N) Σᵢ₌₁ᴺ (yᵢ − β̂ᵀxᵢ)² and the test error as Rte(β̂) = (1/M) Σᵢ₌₁ᴹ (ỹᵢ − β̂ᵀx̃ᵢ)², show that

E[Rtr(β̂)] ≤ E[Rte(β̂)],

where the expectations are over all that is random in each expression.
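Before attempting the formal argument, the claimed gap can be checked empirically. The sketch below is not part of the required solution; the sample sizes, noise level, and true β are arbitrary illustrative choices. It repeatedly fits least squares on fresh training data and compares the average training and test errors:

```python
import numpy as np

rng = np.random.default_rng(0)
p, N, M, reps = 5, 50, 30, 500      # N >= M > p, arbitrary illustrative values
beta = np.ones(p)                   # arbitrary "true" coefficient vector

tr_err, te_err = [], []
for _ in range(reps):
    # training data drawn from the population
    X = rng.normal(size=(N, p))
    y = X @ beta + rng.normal(size=N)
    # least squares estimate from the training data
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    # test data drawn at random from the same population
    Xt = rng.normal(size=(M, p))
    yt = Xt @ beta + rng.normal(size=M)
    tr_err.append(np.mean((y - X @ beta_hat) ** 2))
    te_err.append(np.mean((yt - Xt @ beta_hat) ** 2))

# the average training error falls below the average test error
print(np.mean(tr_err), np.mean(te_err))
```

With unit noise variance, the average training error lands near σ²(N − p)/N while the test error sits above σ², matching the inequality the problem asks you to prove.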

Problem 2 (k-Nearest Neighbor Regression)

Consider the k-nearest-neighbor regression fit

f̂k(x) = (1/k) Σ_{xi ∈ Nk(x)} yi,

where Nk(x) is the neighborhood of x defined by the k closest points xi in the training sample. Assuming x is not random, derive an expression for the prediction error using squared error loss, i.e., compute and simplify

E[(Y − f̂k(x0))² | X = x0].

Note that x0 is a single query point (or test point).
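For concreteness, the estimator can be written in a few lines of Python; the 1-D training data below are made up purely for illustration:

```python
import numpy as np

def knn_fit(x0, x_train, y_train, k):
    """k-nearest-neighbor regression fit f_hat_k(x0): the average of the
    y-values of the k training points closest to the query point x0."""
    dist = np.abs(x_train - x0)        # 1-D inputs for simplicity
    nearest = np.argsort(dist)[:k]     # indices of the k closest points
    return y_train[nearest].mean()

x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([0.0, 1.0, 4.0, 9.0, 16.0])
# the 3 closest points to x0 = 2.1 are x = 2, 3, 1, so the fit is (4+9+1)/3
print(knn_fit(x0=2.1, x_train=x_train, y_train=y_train, k=3))
```

The derivation asks you to decompose the squared error of this average into its irreducible, bias, and variance pieces.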

Problem 3 (K-Means Clustering Proof)

Consider the traditional k-means clustering algorithm and let d(xi, xj) = ‖xi − xj‖² be the squared Euclidean distance. Prove the following identity:

(1/|Nk|) Σ_{i: C(i)=k} Σ_{j: C(j)=k} d(xi, xj) = 2 Σ_{i: C(i)=k} ‖xi − x̄k‖²,

where x̄k is the mean of the observations assigned to cluster k.

• The syntax C(i) = k, or equivalently {i : C(i) = k}, represents the set of all indices (or observations) having cluster assignment k. The symbol |Nk| represents the number of elements in the set {i : C(i) = k}. For example, suppose our data matrix consists of n = 10 observations and each observation (or row) is assigned to one of K = 3 clusters:

Case: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
Cluster Assignment: {3, 1, 3, 1, 2, 2, 2, 1, 3, 2}

Then

{i : C(i) = 1} = {2, 4, 8}, |N1| = 3
{i : C(i) = 2} = {5, 6, 7, 10}, |N2| = 4
{i : C(i) = 3} = {1, 3, 9}, |N3| = 3
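To build intuition before writing the proof, you can verify numerically that, within each cluster, the sum of pairwise squared Euclidean distances divided by |Nk| equals twice the sum of squared deviations from the cluster mean. The observations below are random stand-ins; the assignment vector reuses the toy example:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))                  # n = 10 random observations in 2-D
C = np.array([3, 1, 3, 1, 2, 2, 2, 1, 3, 2])  # cluster assignments from the example

for k in (1, 2, 3):
    Xk = X[C == k]                            # observations with C(i) = k
    Nk = len(Xk)                              # |N_k|
    # left side: all pairwise squared distances within cluster k, divided by |N_k|
    diffs = Xk[:, None, :] - Xk[None, :, :]
    lhs = (diffs ** 2).sum(axis=2).sum() / Nk
    # right side: twice the within-cluster sum of squares about the cluster mean
    rhs = 2 * ((Xk - Xk.mean(axis=0)) ** 2).sum()
    assert np.isclose(lhs, rhs)
    print(k, lhs, rhs)
```

The check passes for any data and any assignment, which is exactly what the algebraic proof should establish.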

Problem 4 (PCA, LDA and Logistic Regression)

The zipcode data are high dimensional, and hence linear discriminant analysis suffers from high variance. Using the training and test data for the 3s, 5s, and 8s, compare the following procedures:

1. LDA on the original 256-dimensional space.

2. LDA on the leading 49 principal components of the features.

3. Multiple linear logistic regression (multinomial regression) using the same filtered data as in the previous question.

Note:

• For all the above exercises, use R or Python functions to perform the PCA, LDA and multinomial regression, i.e., there is no need to manually code these procedures.

• For all the above exercises, compare the procedures with respect to training and test misclassification error. You need to report both training and test misclassification error in your submission.

• When evaluating the test error based on the filtered trained model, don't forget to first project the test features onto the space generated by the leading 49 principal components. In R, use the predict() function.

• The data of interest is already split into a training and testing set.
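A minimal sketch of procedure 2 in Python, assuming the scikit-learn API and using random stand-in data in place of the actual zipcode files: the key point is that PCA is fit on the training features only, and the identical projection is then applied to the test features before LDA.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# random stand-in for the zipcode digits: 3 classes, 256-dimensional features
X_train = rng.normal(size=(300, 256)) + np.repeat(np.arange(3), 100)[:, None]
y_train = np.repeat([3, 5, 8], 100)
X_test = rng.normal(size=(90, 256)) + np.repeat(np.arange(3), 30)[:, None]
y_test = np.repeat([3, 5, 8], 30)

pca = PCA(n_components=49).fit(X_train)   # leading 49 PCs from training data only
Z_train = pca.transform(X_train)          # project the training features
Z_test = pca.transform(X_test)            # project test features with the SAME loadings

lda = LinearDiscriminantAnalysis().fit(Z_train, y_train)
train_err = np.mean(lda.predict(Z_train) != y_train)
test_err = np.mean(lda.predict(Z_test) != y_test)
print(train_err, test_err)                # report both misclassification errors
```

Refitting (or re-running) PCA on the test set instead of reusing the training loadings is the mistake the third note warns against.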


Problem 5 (PCA: Finance)

1. For each of the 30 stocks in the Dow Jones Industrial Average, download the closing prices for every trading day from January 1, 2020 to January 1, 2021. You can use http://finance.yahoo.com to find the data. To download the prices, for example for symbol AAPL, we use the R package quantmod. The R code is as follows:

library(quantmod)
data <- getSymbols("AAPL", auto.assign = F, from = "2020-01-01", to = "2021-01-01")

Please find a way to download data for the 30 stocks efficiently.

2. Perform a PCA on the unscaled closing prices and create the biplot. Do you see any structure in the biplot, perhaps in terms of the types of stocks? How about the scree plot – how many important components seem to be in the data?

3. Repeat part 2 using the scaled variables.

4. Use the closing prices to calculate the return for each stock, and repeat part 3 on the return data. In looking at the scree plot, what does this tell you about the 30 stocks in the DJIA? If each stock were fluctuating up and down randomly and independently of all the other stocks, what would you expect the scree plot to look like?
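The return calculation and the scaled PCA in parts 3–4 can be sketched with NumPy alone. The prices below are simulated with a single common market factor rather than downloaded, so the dominant first component is built in by construction:

```python
import numpy as np

rng = np.random.default_rng(0)
# simulated closing prices: 252 trading days x 30 stocks,
# driven by one common market factor plus idiosyncratic noise
market = rng.normal(scale=0.01, size=252).cumsum()
noise = rng.normal(scale=0.005, size=(252, 30)).cumsum(axis=0)
prices = 100 * np.exp(market[:, None] + noise)

# daily simple returns: r_t = p_t / p_{t-1} - 1
returns = prices[1:] / prices[:-1] - 1

# scaled PCA via SVD of the standardized return matrix
Z = (returns - returns.mean(axis=0)) / returns.std(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained = s**2 / np.sum(s**2)   # proportions plotted in a scree plot
print(explained[:5])              # first PC dominates when stocks co-move
```

When the stocks share a common factor, the first scree-plot bar towers over the rest; with fully independent stocks, the proportions would be roughly flat, which is the contrast part 4 is probing.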

Problem 6 (PCA on Digits)

In this problem students will run PCA on the zipcode training data from Problem 4. This problem will help solidify how to interpret PCA in a high-dimensional setting.

1. Open and run the file "Problem 6 Setup.Rmd" to establish your data matrix X and unobserved test cases. Note that the data matrix X has dimension 1753 × 256 after removing three cases.

2. Run PCA on the data matrix X. Produce a graphic showing the cumulative explained variance as a function of the number of PCs (or features). Identify how many PCs yield 90% explained variance.

3. Display the first 16 principal components as images. Try to combine all 16 images in a single plot.

4. Approximate the test cases ConstructCase 1, ConstructCase 2 and ConstructCase 3 by projecting these cases into the subspace (eigenspace) generated by d = 3, 58 and 256 principal components. Your final result should be represented as 9 images, 3 per test case.
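The projection in part 4 amounts to a rank-d reconstruction: center the case, project onto the leading d loadings, and map back. The sketch below uses a random 200 × 256 stand-in matrix instead of the 1753 × 256 digit matrix, so d is capped at 200 rather than 256:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 256))   # random stand-in for the digit matrix
mu = X.mean(axis=0)
# principal component loadings are the right singular vectors of the centered matrix
U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)

def reconstruct(case, d):
    """Project one case into the span of the leading d PCs and map back."""
    V = Vt[:d].T                  # 256 x d loading matrix
    return mu + (case - mu) @ V @ V.T

case = X[0]
# reconstruction error shrinks as d grows (58 here stands in for the assigned values)
errs = {d: np.linalg.norm(reconstruct(case, d) - case) for d in (3, 58, 200)}
print(errs)
```

Reshaping each reconstruction back to a 16 × 16 grid and plotting it as an image gives the 9-panel figure the problem asks for.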
