
School of Mathematics and Statistics

MAST90083: Computational Statistics and Data Science

Assignment 1

Due date: No later than 11:59pm on Monday 5th September 2022

Weight: 15%

Question 1: Linear Regression

This question relates to methods that address a shortcoming of linear regression by performing variable selection, so that predictors that fail to significantly explain the response can be dropped. You will find that ridge regression, although it penalizes the coefficients, still fails to perform variable selection. The lasso, on the other hand, resolves this issue by shrinking the insignificant coefficients all the way to zero. This question makes use of the Hitters dataset.

1. Load the Hitters dataset. Remove all rows from the Hitters dataset that have an NA entry in the Salary column.

2. To construct the design matrix, use the function model.matrix to read all variables in the Hitters dataset excluding Salary and store them in the variable x. Also, read the Salary variable and store it in the variable y. Generate a sequence of 100 values of λ between 10^10 and 10^-2 and call the function glmnet from the glmnet library. You can generate the sequence as 10^seq(10, -2, length = 100). For glmnet, set α = 0 and estimate the ridge coefficients for the 100 λ values. Then observe the set of coefficients for the two extreme values of λ, i.e. 10^10 and 10^-2. For which of these two values of λ are the coefficient values closer to zero?
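A minimal R sketch of steps 1 and 2, assuming the Hitters data come from the ISLR package (the assignment does not name the data source, so adjust the loading step to your setup):

```r
library(ISLR)    # assumed source of the Hitters dataset
library(glmnet)

Hitters <- na.omit(Hitters)                   # drops the rows whose Salary is NA

x <- model.matrix(Salary ~ ., Hitters)[, -1]  # all predictors, intercept column dropped
y <- Hitters$Salary

grid <- 10^seq(10, -2, length = 100)          # 100 lambda values from 1e10 down to 1e-2
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)

coef(ridge.mod)[, 1]    # coefficients at lambda = 1e10 (largest penalty)
coef(ridge.mod)[, 100]  # coefficients at lambda = 1e-2 (smallest penalty)
```

Because the supplied lambda grid is decreasing, the first column of coefficients corresponds to the largest penalty and the last column to the smallest.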

3. Now draw a plot of the l2-norm of the coefficient values (excluding the intercept's coefficient) against the logarithm of the λ values. Does this plot show that you cannot really decide on the optimal λ value between 10^10 and 10^-2, and that it is better to use a plot of the mean squared error (MSE) against the λ values? Explain how you can say that.
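One way to draw this plot, repeating the fit from the previous step for self-containedness (variable names are illustrative, and ISLR is an assumed data source):

```r
library(ISLR); library(glmnet)               # ISLR assumed as the data source
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]
y <- Hitters$Salary
ridge.mod <- glmnet(x, y, alpha = 0, lambda = 10^seq(10, -2, length = 100))

# l2-norm of the coefficients at each lambda, intercept (row 1) excluded
l2norm <- apply(coef(ridge.mod)[-1, ], 2, function(b) sqrt(sum(b^2)))
plot(log(ridge.mod$lambda), l2norm, type = "l",
     xlab = "log(lambda)", ylab = "l2-norm of coefficients")
```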

4. The glmnet library already has a function cv.glmnet that performs ten-fold cross-validation (CV). You are going to use this function to select an optimal λ. First, set the seed of the random number generator to 10. Then randomly pick 131 samples from x (for all variables), together with the corresponding samples from y, to construct a training dataset. The rest of the samples are kept as the testing dataset. Using this training dataset, plot the cross-validation results, and find the best λ (the one that results in the smallest CV error) and its corresponding test MSE (the MSE obtained using the testing dataset and the best λ); you may want to use the predict function here. Now refit the ridge regression model on the full dataset using the λ chosen by CV. Examine the coefficients: are they all present, as in the linear regression case?
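A sketch of the cross-validation workflow, under the same ISLR assumption as before (names are illustrative):

```r
library(ISLR); library(glmnet)               # ISLR assumed as the data source
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]
y <- Hitters$Salary

set.seed(10)
train <- sample(1:nrow(x), 131)              # 131 training rows, as specified
test <- setdiff(1:nrow(x), train)

cv.out <- cv.glmnet(x[train, ], y[train], alpha = 0)
plot(cv.out)                                 # CV error against log(lambda)
bestlam <- cv.out$lambda.min                 # lambda with smallest CV error

ridge.pred <- predict(cv.out, s = bestlam, newx = x[test, ])
mean((ridge.pred - y[test])^2)               # test MSE

out <- glmnet(x, y, alpha = 0)               # refit on the full data
predict(out, type = "coefficients", s = bestlam)
```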


5. This time set α = 1 (the lasso case) and again plot the cross-validation results, find the best λ value (using the training set) and its corresponding MSE (using the testing set). Now predict the coefficients again using the best λ just selected. Were all the coefficients selected again? Most of them are zero, are they not?
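The lasso sketch mirrors the ridge one with α = 1 (again assuming ISLR as the data source; names are illustrative):

```r
library(ISLR); library(glmnet)               # ISLR assumed as the data source
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]
y <- Hitters$Salary

set.seed(10)
train <- sample(1:nrow(x), 131)
test <- setdiff(1:nrow(x), train)

cv.out <- cv.glmnet(x[train, ], y[train], alpha = 1)   # alpha = 1: lasso
plot(cv.out)
bestlam <- cv.out$lambda.min

lasso.pred <- predict(cv.out, s = bestlam, newx = x[test, ])
mean((lasso.pred - y[test])^2)               # test MSE

out <- glmnet(x, y, alpha = 1)               # refit on the full data
predict(out, type = "coefficients", s = bestlam)  # many entries are exactly zero
```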

Question 2: Model Selection

In this question we consider the analysis of three model selection criteria for selecting the order p of the following model

y_t = φ_1 y_{t−1} + ... + φ_p y_{t−p} + ε_t,   t = p + 1, ..., n,   y_t ∈ R

where the ε_t are independent and identically distributed (i.i.d.) from N(0, σ²). The criteria we consider are

IC1 = log(σ̂²_p) + 2(p + 1)/T

IC2 = log(σ̂²_p) + (T + p)/(T − p − 2)

IC3 = log(σ̂²_p) + p log(T)/T

where σ̂²_p = RSS_p/T = ‖y − ŷ‖²/T.

1. In the ICs given above, T represents the number of effective samples. For the model of order p above, what is T?

2. Find the least squares estimator of φ = (φ_1, ..., φ_p)^⊤.

3. Provide the expression for σ̂²_p.
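One way to organize the derivations in questions 2 and 3 is to stack the T effective equations into matrix form; this is only a sketch of the setup, not the required derivation:

```latex
% Stack the equations y_t = \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \epsilon_t,
% for t = p+1, \dots, n, into
\mathbf{y} = \begin{pmatrix} y_{p+1} \\ \vdots \\ y_n \end{pmatrix}, \qquad
\mathbf{X} = \begin{pmatrix}
  y_p     & y_{p-1} & \cdots & y_1     \\
  \vdots  &         &        & \vdots  \\
  y_{n-1} & y_{n-2} & \cdots & y_{n-p}
\end{pmatrix}, \qquad
\mathbf{y} = \mathbf{X}\boldsymbol{\phi} + \boldsymbol{\epsilon},
```

from which the least squares estimator of φ and RSS_p follow by the standard regression formulas.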

4. Generate two sets of 100 samples using the models

M1: y_t = 0.434 y_{t−1} + 0.217 y_{t−2} + 0.145 y_{t−3} + 0.108 y_{t−4} + 0.087 y_{t−5} + ε_t,   ε_t ~ N(0, 1)

M2: y_t = 0.682 y_{t−1} + 0.346 y_{t−2} + ε_t,   ε_t ~ N(0, 1)
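A sketch of one way to generate the samples in R, iterating the recursion directly with the first p values initialized to zero (the assignment does not fix the initial conditions or the seed, so both are assumptions):

```r
set.seed(1)  # seed choice is illustrative

# Simulate y_t = phi_1 y_{t-1} + ... + phi_p y_{t-p} + e_t, e_t ~ N(0, 1),
# starting from y_1 = ... = y_p = 0
sim_ar <- function(phi, n) {
  p <- length(phi)
  y <- numeric(n)
  for (t in (p + 1):n) {
    y[t] <- sum(phi * y[(t - 1):(t - p)]) + rnorm(1)
  }
  y
}

y1 <- sim_ar(c(0.434, 0.217, 0.145, 0.108, 0.087), 100)  # model M1
y2 <- sim_ar(c(0.682, 0.346), 100)                       # model M2
```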

5. Using these two sets, compute the values of IC1, IC2 and IC3 for p = 1, ..., 10 for models M1 and M2. For each model, provide a figure illustrating the variation of IC1, IC2 and IC3 with p (plot the three criteria in a single figure for each model).
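The criteria can be computed by fitting the AR(p) regression by least squares on the T = n − p effective samples; a sketch with illustrative names:

```r
# IC1, IC2, IC3 for a series y and candidate order p
ic_values <- function(y, p) {
  n <- length(y)
  T <- n - p                                            # effective sample size
  X <- sapply(1:p, function(j) y[(p + 1 - j):(n - j)])  # column j holds y_{t-j}
  fit <- lm(y[(p + 1):n] ~ X - 1)                       # no intercept, as in the model
  sigma2 <- sum(resid(fit)^2) / T                       # RSS_p / T
  c(IC1 = log(sigma2) + 2 * (p + 1) / T,
    IC2 = log(sigma2) + (T + p) / (T - p - 2),
    IC3 = log(sigma2) + p * log(T) / T)
}

# e.g. for one simulated series y:
# ics <- sapply(1:10, function(p) ic_values(y, p))   # 3 x 10 matrix
# matplot(1:10, t(ics), type = "l", xlab = "p", ylab = "IC")
```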

6. Using model M1, generate 1000 sets (vectors) of size 100 and provide a table of counts of the model order selected by IC1, IC2 and IC3.
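A self-contained sketch of this counting experiment (model M1, 1000 replications of length 100); the helper functions mirror the steps above, and all names, the seed, and the zero initial conditions are illustrative assumptions:

```r
set.seed(2)
phi1 <- c(0.434, 0.217, 0.145, 0.108, 0.087)  # model M1

sim_ar <- function(phi, n) {                  # recursion started from zeros
  p <- length(phi); y <- numeric(n)
  for (t in (p + 1):n) y[t] <- sum(phi * y[(t - 1):(t - p)]) + rnorm(1)
  y
}

ic_values <- function(y, p) {                 # IC1, IC2, IC3 at order p
  n <- length(y); Teff <- n - p
  X <- sapply(1:p, function(j) y[(p + 1 - j):(n - j)])
  sigma2 <- sum(resid(lm(y[(p + 1):n] ~ X - 1))^2) / Teff
  c(log(sigma2) + 2 * (p + 1) / Teff,
    log(sigma2) + (Teff + p) / (Teff - p - 2),
    log(sigma2) + p * log(Teff) / Teff)
}

picks <- replicate(1000, {
  y <- sim_ar(phi1, 100)
  ics <- sapply(1:10, function(p) ic_values(y, p))  # 3 x 10 matrix
  apply(ics, 1, which.min)                          # order selected by each IC
})
table(criterion = rep(c("IC1", "IC2", "IC3"), 1000), order = as.vector(picks))
```

The same loop with size-15 series and with model M2 covers questions 7 and 8.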

7. Using model M1, generate 1000 sets of size 15 and provide a table of counts of the model order selected by IC1, IC2 and IC3.


8. Repeat questions 6 and 7 using model M2.

9. What do you observe from these tables?

10. Derive expressions for the probabilities of overfitting for the model selection criteria IC1, IC2 and IC3. For the derivation, assume the true model order is p0 and consider overfitting by L extra parameters.

11. Provide tables of the calculated probabilities for M1 in the cases n = 25 and n = 100

with L = 1, ..., 8.

12. What are the important remarks that can be made from these probability tables?

13. The tables obtained in question 11 provide overfitting information as a function of the sample size. We are now interested in the case of large sample size, i.e. when n → ∞ (p0 and L fixed). Derive the expressions for the probabilities of overfitting in this case.

14. What is the important observation that you can make?

Grading:

Total: 15 points

Question 1: 5 points

Question 2: 10 points

The assignment is to be submitted via LMS
