Homework Set 2

March 2, 2020

NOTICE: The homework is due on Mar. 13 (Friday) at 11:59 pm. Please provide the R code (R Markdown is highly recommended) and the steps that you use to get your solutions. You are allowed, and even encouraged, to discuss the homework with your classmates. However, you must write up the solutions on your own. Plagiarism and other anti-scholarly behavior will be dealt with severely.

Problem 1. Usually when fitting regression models for explanation, dealing with outliers is a complicated issue. When considering prediction, we can empirically determine what to do. Let us use the Boston data from the MASS package to see how outliers affect prediction.

Use the following code to test-train split the data.

Fit the following linear model that uses medv as the response.

Obtain the studentized residuals from this fitted model (each residual divided by its estimated standard error; in R, use rstudent(fit)).

Refit this model with each of the following modifications:

- Removing observations from the training data with absolute studentized residuals greater than 2.

- Removing observations from the training data with absolute studentized residuals greater than 3.

Use these three fitted models, including the original model fit to the unmodified data, to obtain train RMSE and test RMSE. Summarize these results in a table. Include the number of observations removed for each model. Which model performs best?
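As a rough illustration of the workflow above, the following R sketch filters the training data by studentized residual and tabulates the results. The split (`train_idx`, 400 training rows, seed 1) and the formula `medv ~ .` are placeholders assumed for this sketch, not the split code or model specified by the assignment. `rmse()` and `fit_without_outliers()` are hypothetical helper names. Train RMSE is computed here on the observations actually used in each fit, which is one possible reading of the task.

```r
library(MASS)  # provides the Boston data

# Hypothetical helper: root mean squared error.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# ASSUMPTION: placeholder split; substitute the split code given in the assignment.
set.seed(1)
train_idx <- sample(nrow(Boston), size = 400)
train <- Boston[train_idx, ]
test  <- Boston[-train_idx, ]

# ASSUMPTION: placeholder model; substitute the model given in the assignment.
model_formula <- medv ~ .

fit_full <- lm(model_formula, data = train)
stud_res <- rstudent(fit_full)  # studentized residuals from the original fit

# Hypothetical helper: refit after dropping rows with |studentized residual| > cutoff.
fit_without_outliers <- function(cutoff) {
  keep <- abs(stud_res) <= cutoff
  list(fit = lm(model_formula, data = train[keep, ]), removed = sum(!keep))
}

fits <- list(
  original = list(fit = fit_full, removed = 0L),
  cutoff_2 = fit_without_outliers(2),
  cutoff_3 = fit_without_outliers(3)
)

# One row per model: observations removed, train RMSE, test RMSE.
results <- do.call(rbind, lapply(names(fits), function(name) {
  f <- fits[[name]]
  data.frame(
    model      = name,
    removed    = f$removed,
    train_rmse = sqrt(mean(residuals(f$fit)^2)),
    test_rmse  = rmse(test$medv, predict(f$fit, newdata = test))
  )
}))
print(results)
```

The cutoffs are always applied to the residuals of the original fit, so the two reduced models are comparable to each other.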

Problem 2. An experiment is conducted among students to study the relationship between X = study hours per day and Y = receive a grade A. In the experiment, 120 students are divided into 6 groups of 20 students each. We name the groups Group A, Group B, Group C, Group D, Group E and Group F. The study hours for Group A to Group F are 0, 1, 2, 3, 4 and 5, respectively. The number of students in each group who receive an A is recorded and given below.

| Group | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| Study hours per day | 0 | 1 | 2 | 3 | 4 | 5 |
| Number receiving a grade A | 1 | 4 | 9 | 13 | 18 | 20 |

We use “1” to label the class Y = receive a grade A and “2” to label the class Y = not receive a grade A.

(a) What is the observed proportion of students who receive a grade A, given that the study hours per day is 3? What are the observed odds of Y = 1 at X = 3?

(b) We fit a logistic regression and obtain the estimated coefficients $\hat{\beta}_0 = -2.818555$ and $\hat{\beta}_1 = 1.258949$. Give the logistic regression expression for $\widehat{\Pr}(Y = 1 \mid X = x)$. Predict the posterior probability and the log-odds for X = 5.
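One way to check the quantities in parts (a) and (b) numerically is sketched below in base R. It assumes the model was fit as a binomial regression on the grouped counts from the table. The data frame `dat` and its column names are chosen for this sketch only.

```r
# Grouped data from the table: 20 students per group.
dat <- data.frame(
  hours = 0:5,
  n_a   = c(1, 4, 9, 13, 18, 20)
)
dat$n_not_a <- 20 - dat$n_a

# Part (a): observed proportion and odds of an A at X = 3.
p_obs_3    <- dat$n_a[dat$hours == 3] / 20
odds_obs_3 <- p_obs_3 / (1 - p_obs_3)

# Part (b): binomial logistic regression on (successes, failures) counts.
fit <- glm(cbind(n_a, n_not_a) ~ hours, data = dat, family = binomial)
print(coef(fit))

# Predicted log-odds (linear predictor) and probability at X = 5.
new_x      <- data.frame(hours = 5)
log_odds_5 <- predict(fit, newdata = new_x, type = "link")
prob_5     <- predict(fit, newdata = new_x, type = "response")

# The same probability from the logistic expression written out by hand.
b <- coef(fit)
prob_5_manual <- 1 / (1 + exp(-(b[1] + b[2] * 5)))
```

With `cbind(successes, failures)` as the response, `glm()` treats the first column as the Y = 1 outcome, which matches the class labelling used in this problem.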

(c) Will $\widehat{\Pr}(Y = 1 \mid X = x)$ increase or decrease if X increases? Why?

(d) If we increase X from 4 to 5, by what multiple will the predicted log-odds change? How about the change in the predicted posterior probability? What would be the case if we increase X from 3 to 4?
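For parts (c) and (d), it may help to write out how a fitted logistic model with a single predictor responds to a one-unit increase in x. The identities below are the standard ones, written with the coefficient symbols from part (b). They are a reminder of the general relations rather than a worked solution.

```latex
\begin{align*}
\widehat{\Pr}(Y = 1 \mid X = x)
  &= \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 x}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 x}}, \\
\text{log-odds}(x)
  &= \log\frac{\widehat{\Pr}(Y = 1 \mid X = x)}{1 - \widehat{\Pr}(Y = 1 \mid X = x)}
   = \hat{\beta}_0 + \hat{\beta}_1 x, \\
\text{log-odds}(x + 1) - \text{log-odds}(x)
  &= \hat{\beta}_1, \\
\frac{\text{odds}(x + 1)}{\text{odds}(x)}
  &= e^{\hat{\beta}_1}.
\end{align*}
```

The odds are multiplied by the same factor for every unit step in x, whereas the change in the predicted probability depends on where on the curve the step starts.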

(e) Now we apply linear discriminant analysis to fit the data. What are the discriminant functions $\hat{\delta}_1(x)$ and $\hat{\delta}_2(x)$? (Show the steps of your derivation.)

(f) What is the decision boundary? Predict the class when X = 0, 1, ..., 5. What is the average training error rate?
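The calculations in parts (e) and (f) can be sketched by hand in base R. This assumes the usual one-predictor LDA form with a pooled variance estimate, $\hat{\delta}_k(x) = x \hat{\mu}_k / \hat{\sigma}^2 - \hat{\mu}_k^2 / (2 \hat{\sigma}^2) + \log \hat{\pi}_k$. The helper name `lda_delta()` is hypothetical.

```r
hours   <- 0:5
n_a     <- c(1, 4, 9, 13, 18, 20)
n_not_a <- 20 - n_a

# Expand the grouped counts into one x value per student.
x1 <- rep(hours, times = n_a)      # class 1: received an A
x2 <- rep(hours, times = n_not_a)  # class 2: did not receive an A
n  <- length(x1) + length(x2)

# Plug-in estimates: priors, class means, pooled variance (divisor n - K, K = 2).
prior1 <- length(x1) / n
prior2 <- length(x2) / n
mu1    <- mean(x1)
mu2    <- mean(x2)
sigma2 <- (sum((x1 - mu1)^2) + sum((x2 - mu2)^2)) / (n - 2)

# Hypothetical helper: linear discriminant function for one class.
lda_delta <- function(x, mu, prior) {
  x * mu / sigma2 - mu^2 / (2 * sigma2) + log(prior)
}

# Decision boundary: the x at which the two discriminant functions are equal.
boundary <- (mu1 + mu2) / 2 - sigma2 * log(prior1 / prior2) / (mu1 - mu2)

# Predicted class for each study-hours value.
pred_class <- ifelse(lda_delta(hours, mu1, prior1) > lda_delta(hours, mu2, prior2), 1, 2)

# Training error: students whose true class differs from their group's prediction.
n_wrong     <- sum(ifelse(pred_class == 1, n_not_a, n_a))
train_error <- n_wrong / n
```

Because every student in a group shares the same x, the prediction is made once per group and the error count comes straight from the table.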

Problem 3. Run a simulation study to estimate the bias, variance, and mean squared error of estimating p(x) using logistic regression. Recall that $p(x) = \Pr(Y = 1 \mid X = x)$.

Use the following code to generate the data.

Evaluate estimates of $p(x_1 = 1, x_2 = 1)$ from fitting three models:

Note that, internally in glm(), R considers a binary factor variable as 0 and 1, since logistic regression seeks to model $p(x) = \Pr(Y = 1 \mid X = x)$. But here we have “Blue” and “Orange”. Which is 0 and which is 1?
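The coding can be inspected directly. For a factor response in a binomial `glm()`, R treats the first factor level as failure (0) and all other levels as success (1), and `factor()` orders levels alphabetically by default. The toy vector `y` below is made up for this sketch.

```r
# Toy response vector, used only to show the coding.
y <- factor(c("Blue", "Orange", "Orange", "Blue"))

print(levels(y))             # the level order decides the coding
coding <- as.integer(y) - 1  # first level -> 0, second level -> 1
print(data.frame(y, coding))

# To model the other class as the "1" outcome, change the reference level.
y_flipped <- relevel(y, ref = "Orange")
print(levels(y_flipped))
```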

Use 1000 simulations of datasets with a sample size of 30 to estimate the squared bias, variance, and root mean squared error of estimating $p(x_1 = 1, x_2 = 1)$ using $\hat{p}(x_1 = 1, x_2 = 1)$ for each model. Report your results using a well-formatted table and give some comments. At the beginning of your simulation study, call set.seed(42).
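The overall shape of such a simulation might look like the sketch below. The data generator `make_sim_data()`, the true probability function `true_p()`, and the three formulas in `model_forms` are hypothetical stand-ins invented for this sketch. The assignment supplies its own data-generating code and its own three models, which should replace them.

```r
set.seed(42)

# HYPOTHETICAL true model and data generator (not the assignment's).
true_p <- function(x1, x2) plogis(0.5 + x1 - x2)
make_sim_data <- function(n = 30) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  y  <- rbinom(n, size = 1, prob = true_p(x1, x2))
  data.frame(
    y  = factor(ifelse(y == 1, "Orange", "Blue"), levels = c("Blue", "Orange")),
    x1 = x1,
    x2 = x2
  )
}

# HYPOTHETICAL candidate models (not the assignment's).
model_forms <- list(
  x1_only     = y ~ x1,
  additive    = y ~ x1 + x2,
  interaction = y ~ x1 * x2
)

x0 <- data.frame(x1 = 1, x2 = 1)  # point at which p(x) is estimated
p0 <- true_p(1, 1)                # true value at that point

# One simulated dataset -> one estimate of p(x0) per model.
one_run <- function() {
  dat <- make_sim_data()
  sapply(model_forms, function(form) {
    # Small samples can trigger separation warnings; they are silenced here.
    fit <- suppressWarnings(glm(form, data = dat, family = binomial))
    unname(predict(fit, newdata = x0, type = "response"))
  })
}

estimates <- replicate(1000, one_run())  # rows = models, columns = simulations

bias_sq  <- (rowMeans(estimates) - p0)^2
variance <- apply(estimates, 1, function(e) mean((e - mean(e))^2))
rmse     <- sqrt(rowMeans((estimates - p0)^2))

print(round(data.frame(bias_sq, variance, rmse), 4))
```

The variance here uses the 1/n divisor so that the squared RMSE equals squared bias plus variance exactly, which gives a convenient consistency check on the table.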

Problem 4. This question should be answered using the Weekly data set, which is part of the ISLR package. It contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held-out data (that is, the data from 2009 and 2010).
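A minimal sketch of the held-out evaluation in part (d) follows. It assumes the ISLR package is installed and uses a 0.5 probability threshold, which is the conventional default rather than something this part specifies. The variable names are chosen for the sketch.

```r
library(ISLR)  # provides the Weekly data

train    <- Weekly$Year <= 2008  # 1990 to 2008 for training
held_out <- Weekly[!train, ]     # 2009 and 2010 held out

fit <- glm(Direction ~ Lag2, data = Weekly, subset = train, family = binomial)

# glm() models the second factor level of Direction ("Up") as the "1" outcome.
prob_up <- predict(fit, newdata = held_out, type = "response")
pred    <- factor(ifelse(prob_up > 0.5, "Up", "Down"),
                  levels = levels(Weekly$Direction))

# Confusion matrix and overall fraction of correct predictions.
conf_mat <- table(predicted = pred, actual = held_out$Direction)
accuracy <- mean(pred == held_out$Direction)

print(conf_mat)
print(accuracy)
```

The same `train` and `held_out` objects can be reused for the LDA, QDA and KNN fits in the later parts, so all methods are compared on identical data.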

(e) Repeat (d) using LDA.

(f) Repeat (d) using QDA.

(g) Repeat (d) using KNN with K = 3.

(h) Which of these methods appears to provide the best results on this data?

(i) (Optional) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and error rate that appear to provide the best results on the held-out data.

Problem 5. This question should be answered using the Default data set. We fit a logistic regression model that uses income and balance to predict Default. Do not forget to set a random seed before beginning your analysis.

(a) Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:

i. Split the sample set into a training set (7000 observations) and a validation set (3000 observations).

ii. Fit a multiple logistic regression model using only the training observations.

iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.

iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.
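Steps i to iv can be sketched as follows. This assumes the Default data come from the ISLR package, where the response column is named `default` with levels "No" and "Yes". The seed value 1 is arbitrary.

```r
library(ISLR)  # provides the Default data

set.seed(1)  # arbitrary seed, fixed for reproducibility

# i. Random split: 7000 training rows, the remaining 3000 for validation.
train_idx <- sample(nrow(Default), size = 7000)
valid     <- Default[-train_idx, ]

# ii. Fit the logistic regression on the training rows only.
fit <- glm(default ~ income + balance, data = Default[train_idx, ],
           family = binomial)

# iii. Posterior probability of default, classified at the 0.5 threshold.
prob_default <- predict(fit, newdata = valid, type = "response")
pred         <- ifelse(prob_default > 0.5, "Yes", "No")

# iv. Validation set error: fraction of misclassified observations.
valid_error <- mean(pred != valid$default)
print(valid_error)
```

Changing the seed and rerunning gives a different split each time.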

(b) Repeat the process in (a) two times, using two different splits of the observations into a training set and a validation set. Comment on the results obtained.

(c) Now using the k-fold cross-validation approach with k = 10, estimate the test error of this model.
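One possible way to carry out the 10-fold estimate is with `cv.glm()` from the boot package, sketched below. It assumes the Default data come from the ISLR package. The cost function name `misclass_cost` is hypothetical. It is needed because the default cost in `cv.glm()` is average squared error rather than a misclassification rate.

```r
library(ISLR)  # provides the Default data
library(boot)  # provides cv.glm()

set.seed(1)  # arbitrary seed; the fold assignment is random

fit <- glm(default ~ income + balance, data = Default, family = binomial)

# Hypothetical cost: y is the observed 0/1 response, p_hat the predicted
# probability; an observation is misclassified when they differ by more than 0.5.
misclass_cost <- function(y, p_hat) mean(abs(y - p_hat) > 0.5)

cv_out   <- cv.glm(Default, fit, cost = misclass_cost, K = 10)
cv_error <- cv_out$delta[1]  # raw cross-validation estimate of the error
print(cv_error)
```

The second element of `cv_out$delta` is a bias-adjusted version of the same estimate.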

(d) Now we will compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function.

i. Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.

ii. Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.

iii. Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.

iv. Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.
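The two standard-error calculations can be sketched side by side as below. This assumes the Default data come from the ISLR package. The number of bootstrap replicates, R = 200, is an assumption chosen to keep the run time short, and a larger value would give more stable estimates.

```r
library(ISLR)  # provides the Default data
library(boot)  # provides boot()

# i. Formula-based standard errors from glm().
fit    <- glm(default ~ income + balance, data = Default, family = binomial)
glm_se <- summary(fit)$coefficients[c("income", "balance"), "Std. Error"]

# ii. Statistic for boot(): refit on the resampled rows, return the two coefficients.
boot.fn <- function(data, index) {
  fit_b <- glm(default ~ income + balance, data = data[index, ],
               family = binomial)
  coef(fit_b)[c("income", "balance")]
}

# iii. Bootstrap standard errors: standard deviation of the replicates.
set.seed(1)  # arbitrary seed
boot_out <- boot(Default, boot.fn, R = 200)
boot_se  <- apply(boot_out$t, 2, sd)
names(boot_se) <- c("income", "balance")

print(rbind(glm = glm_se, bootstrap = boot_se))
```

Printing `boot_out` directly also reports these standard errors, together with the original estimates and the bootstrap bias.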
