
Math 185 Final Project (Due December 8)

Problem 1

The baseball dataset consists of the statistics of 263 players in Major League Baseball in the 1986 season. The dataset (hitters.csv) consists of 20 variables:

Variable   Description
AtBat      Number of times at bat in 1986
Hits       Number of hits in 1986
HmRun      Number of home runs in 1986
Runs       Number of runs in 1986
RBI        Number of runs batted in in 1986
Walks      Number of walks in 1986
Years      Number of years in the major leagues
CAtBat     Number of times at bat during his career
CHits      Number of hits during his career
CHmRun     Number of home runs during his career
CRuns      Number of runs during his career
CRBI       Number of runs batted in during his career
CWalks     Number of walks during his career
League     A factor with levels A (coded as 1) and N (coded as 2) indicating the player's league at the end of 1986
Division   A factor with levels E (coded as 1) and W (coded as 2) indicating the player's division at the end of 1986
PutOuts    Number of put outs in 1986
Assists    Number of assists in 1986
Errors     Number of errors in 1986
Salary     1987 annual salary on opening day, in thousands of dollars
NewLeague  A factor with levels A (coded as 1) and N (coded as 2) indicating the player's league at the beginning of 1987

In this problem, we use Salary as the response variable and the remaining 19 variables as predictors/covariates, which measure each player's performance in the 1986 season and over his whole career. Write R functions to perform variable selection using subset selection paired with the BIC (Bayesian Information Criterion):

1) Starting from the null model, apply the forward stepwise selection algorithm to produce a sequence of sub-models iteratively, and select a single best model using the BIC. Plot the "BIC vs. Number of Variables" curve. Present the selected model with the corresponding BIC.

2) Starting from the full model (that is, the one obtained from minimizing the MSE/RSS using all the predictors), apply the backward stepwise selection algorithm to produce a sequence of sub-models iteratively, and select a single best model using the BIC. Plot the "BIC vs. Number of Variables" curve. Present the selected model with the corresponding BIC.

3) Are the selected models from 1) and 2) the same?
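The forward search in part 1) can be sketched in base R with no extra packages. The helper below is illustrative only: the function name forward_bic is made up, and it is demonstrated on the built-in mtcars data (response mpg) rather than on hitters.csv. The backward pass in part 2) is symmetric (start from the full model and drop the predictor whose removal most reduces RSS).

```r
# Forward stepwise selection scored by BIC, in base R.
# Demonstrated on mtcars; for the project, pass the hitters data with
# response = "Salary" instead.
forward_bic <- function(data, response) {
  preds <- setdiff(names(data), response)
  chosen <- character(0)
  bics <- numeric(0)
  for (size in seq_along(preds)) {
    remaining <- setdiff(preds, chosen)
    # Try adding each remaining predictor; keep the one with smallest RSS.
    fits <- lapply(remaining, function(v) {
      lm(reformulate(c(chosen, v), response), data = data)
    })
    rss <- sapply(fits, function(f) sum(resid(f)^2))
    best <- which.min(rss)
    chosen <- c(chosen, remaining[best])
    bics <- c(bics, BIC(fits[[best]]))
  }
  list(order = chosen, bic = bics, best_size = which.min(bics))
}

res <- forward_bic(mtcars, "mpg")
plot(seq_along(res$bic), res$bic, type = "b",
     xlab = "Number of Variables", ylab = "BIC")
res$order[seq_len(res$best_size)]   # variables in the BIC-selected model
```

Note that within each model size the candidate models have the same number of parameters, so ranking by RSS and ranking by BIC agree; BIC only matters when comparing across sizes.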

Problem 2

In this problem, we fit ridge regression on the same dataset as in Problem 1. First, standardize the variables so that they are on the same scale. Next, choose a grid of λ values ranging from λ = 10^10 to λ = 10^−2, essentially covering the full range of scenarios from the null model containing only the intercept to the least squares fit. For example:

> grid = 10^seq(10, -2, length=100)

1) Write an R function to do the following: for each value of λ, compute the vector of ridge regression coefficients (including the intercept), stored in a 20 × 100 matrix with 20 rows (one for each predictor, plus an intercept) and 100 columns (one for each value of λ).

2) To find the "best" λ, use ten-fold cross-validation to choose the tuning parameter from the previous grid of values. Set a random seed first – set.seed(1) – so your results will be reproducible, since the choice of the cross-validation folds is random. Plot the "Cross-Validation Error versus λ" curve, and report the selected λ.

3) Finally, refit the ridge regression model on the full dataset, using the value of λ chosen by cross-validation, and report the coefficient estimates.

Remark: You should expect that none of the coefficients are zero – ridge regression does not perform variable selection.
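For part 1), the coefficient path can be sketched via the closed-form ridge solution on standardized predictors. The helper name ridge_path and the mtcars demo data are illustrative stand-ins for the project's hitters.csv (in practice, glmnet with alpha = 0 computes the same path directly):

```r
# Ridge coefficient paths over a lambda grid, via the closed-form solution
# beta(lambda) = (X'X + lambda I)^{-1} X'y on standardized predictors.
ridge_path <- function(X, y, grid) {
  X <- scale(X)                      # standardize so the penalty treats columns equally
  p <- ncol(X)
  coefs <- sapply(grid, function(lam) {
    beta <- solve(crossprod(X) + lam * diag(p), crossprod(X, y))
    c(mean(y), beta)                 # intercept = ybar, since X is centered
  })
  rownames(coefs) <- c("(Intercept)", colnames(X))
  coefs                              # (p + 1) x length(grid) matrix
}

grid <- 10^seq(10, -2, length = 100)
X <- as.matrix(mtcars[, -1]); y <- mtcars$mpg
B <- ridge_path(X, y, grid)
dim(B)   # 11 x 100 here; 20 x 100 for the 19-predictor baseball data
```

At λ = 10^10 the slope coefficients are shrunk essentially to zero (the null model), while at λ = 10^−2 they are close to the least squares fit, matching the range described above.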

Problem 3

In this problem, we revisit the best subset selection problem. We are given a response vector Y = (y1, …, yn)^T and an n × p design matrix X = (x1, …, xn)^T with xi = (xi1, …, xip)^T. For 1 ≤ k ≤ p, let β̂0, β̂ be the solution to the following sparsity-constrained least squares problem:

    minimize over (β0, β):  ‖Y − β0·1 − Xβ‖₂² ⁄ (2n)   subject to ‖β‖₀ ≤ k.

Based on the property β̂0 = ȳ − x̄^T β̂, we can center Y and X first to get rid of the intercept:

    minimize over β:  ‖Ỹ − X̃β‖₂² ⁄ (2n)   subject to ‖β‖₀ ≤ k,

where Ỹ and X̃ represent the centered Y and X, respectively. To solve this, we introduce the Gradient Hard Thresholding Pursuit (GraHTP) algorithm. Let f(β) = ‖Ỹ − X̃β‖₂² ⁄ (2n) be the objective function.

GraHTP Algorithm.
Input: Ỹ, X̃, sparsity k, stepsize η > 0 (Hint: normalize the columns of X̃ to have variance 1).
Initialization: β^0 = 0, t = 1.
repeat
  1) Compute β̃^t = β^(t−1) − η∇f(β^(t−1));
  2) Let 𝒮^t = supp(β̃^t, k) be the indices of the k entries of β̃^t with the largest absolute values;
  3) Compute β^t = argmin{f(β) : supp(β) ⊆ 𝒮^t}; set t = t + 1;
until convergence, i.e. ‖β^t − β^(t−1)‖₂ < 10^−4.
Output: β^t.

1) Write an R function to implement the above GraHTP algorithm.

2) Consider again the baseball dataset in Problem 1 with n = 263, p = 19. For k = 1, …, p, use the above function to find the best k-sparse model, denoted by ℳ_k. Then use BIC to select a single best model among ℳ_1, …, ℳ_p.

3) Compare your result with those obtained in Problem 1.
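As a sketch, the GraHTP iteration above maps to base R roughly as follows. The function name grahtp, the default step size η = 1, and the synthetic test data are assumptions for illustration; on the baseball data one would pass the 263 × 19 design matrix and vary k:

```r
# Gradient Hard Thresholding Pursuit for k-sparse least squares, following
# the statement: gradient step, keep the k largest entries in absolute
# value, then refit least squares restricted to that support.
grahtp <- function(X, y, k, eta = 1, tol = 1e-4, maxit = 1000) {
  n <- nrow(X); p <- ncol(X)
  y <- y - mean(y)                   # center the response
  X <- scale(X)                      # center columns, normalize to variance 1 (per the hint)
  beta <- rep(0, p)
  for (t in seq_len(maxit)) {
    grad <- -crossprod(X, y - X %*% beta) / n   # gradient of ||y - Xb||^2 / (2n)
    btilde <- beta - eta * grad
    S <- order(abs(btilde), decreasing = TRUE)[seq_len(k)]
    beta_new <- rep(0, p)
    beta_new[S] <- lsfit(X[, S, drop = FALSE], y,
                         intercept = FALSE)$coefficients
    if (sqrt(sum((beta_new - beta)^2)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}

# Quick check on synthetic data with a truly 3-sparse signal.
set.seed(1)
X <- matrix(rnorm(200 * 10), 200, 10)
beta_true <- c(3, -2, 1.5, rep(0, 7))
y <- X %*% beta_true + rnorm(200, sd = 0.1)
bhat <- grahtp(X, y, k = 3)
which(bhat != 0)   # support of the fitted 3-sparse model
```

Because step 3 refits least squares exactly on the selected support, the iterate is a fixed point as soon as the support stops changing, which is what the ‖β^t − β^(t−1)‖₂ < 10^−4 criterion detects.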
