
Homework Set 2

March 2, 2020

NOTICE: The homework is due on Mar. 13 (Friday) at 11:59pm. Please provide

the R code (R Markdown is highly recommended) and the steps

that you use to get your solutions. You are allowed, and even encouraged,

to discuss the homework with your classmates. However, you must

write up the solutions on your own. Plagiarism and other anti-scholarly

behavior will be dealt with severely.

Problem 1. Usually when fitting regression models for explanation, dealing

with outliers is a complicated issue. When considering prediction, we can

empirically determine what to do. Let us use the Boston data from the

MASS package to see how outliers affect prediction.

Use the following code to test-train split the data.

Fit the following linear model that uses medv as the response.

Obtain the studentized residuals (computed by dividing each residual by its

estimated standard error, in R, use rstudent(fit)) from this fitted model.

Refit this model with each of the following modifications:

- Removing observations from the training data with absolute studentized

residuals greater than 2.

- Removing observations from the training data with absolute studentized

residuals greater than 3.

Use these three fitted models, including the original model fit to unmodified

data, to obtain train RMSE and test RMSE. Summarize these results in a

table. Include the number of observations removed for each. Which performs

the best?
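
The workflow above can be sketched in R as follows. The split code and the model formula referred to in the problem are not reproduced in this sheet, so the seed, the training-set size of 400 and the formula `medv ~ .` are placeholder assumptions, as are the helper names `rmse()` and `fit_for_cutoff()`. Train RMSE is computed here on the observations actually used for fitting, which is one possible reading of the task.

```r
library(MASS)  # provides the Boston data

set.seed(42)                              # assumed seed, not the course's
train_idx <- sample(nrow(Boston), 400)    # assumed split size
trn <- Boston[train_idx, ]
tst <- Boston[-train_idx, ]

# Root mean squared error of a set of predictions
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Studentized residuals from the model fit to the unmodified training data
fit_all <- lm(medv ~ ., data = trn)       # assumed model formula
stud <- rstudent(fit_all)

# Refit after dropping training rows whose |studentized residual| > cutoff
fit_for_cutoff <- function(cutoff) {
  keep <- abs(stud) <= cutoff
  fit <- lm(medv ~ ., data = trn[keep, ])
  data.frame(cutoff     = cutoff,
             removed    = sum(!keep),
             train_rmse = rmse(trn$medv[keep], fitted(fit)),
             test_rmse  = rmse(tst$medv, predict(fit, newdata = tst)))
}

# cutoff = Inf keeps every observation, i.e. the original model
results <- do.call(rbind, lapply(c(Inf, 3, 2), fit_for_cutoff))
print(results)
```

Replacing the assumed split and formula with the ones supplied in the course leaves the rest of the sketch unchanged.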

Problem 2. An experiment is conducted among students to see the relationship

between X = study hours per day and Y = receive a grade A. In

the experiment, 120 students are divided into 6 groups and each group has 20

students. We name the groups by Group A, Group B, Group C, Group D,

Group E and Group F. The study hours for Group A to Group F are 0,

1, 2, 3, 4 and 5, respectively. The number of students in each group who

receive an A is recorded below:

Group                  A    B    C    D    E    F

study hours per day    0    1    2    3    4    5

receive a grade A      1    4    9   13   18   20

We use “1” to label the class: Y = receive a grade A and “2” to label the

class: Y = not receive a grade A.

(a) What is the observed proportion of students who receive a grade A, given

that the study hours per day is 3? What is the observed odds of Y = 1

at X = 3?

(b) We fit a logistic regression and obtain the estimated coefficients

$\hat{\beta}_0 = -2.818555$ and $\hat{\beta}_1 = 1.258949$. Give the logistic regression

expression for $\widehat{\Pr}(Y = 1 \mid X = x)$. Predict the posterior probability and

log-odds for X = 5.

(c) Will $\widehat{\Pr}(Y = 1 \mid X = x)$ increase or decrease if X increases? Why?

(d) If we increase the X from 4 to 5, by what multiple will the predicted

log-odds change? How about the change in the predicted posterior probability?

What would be the case if we increase X from 3 to 4?

(e) Now we apply linear discriminant analysis to fit the data. What

are the discriminant functions $\hat{\delta}_1(x)$ and $\hat{\delta}_2(x)$? (Show the steps of

your derivation.)

(f) What is the decision boundary? Predict the class when X = 0, 1, . . . , 5.

What is the average training error rate?
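
The quantities asked for in parts (a) to (f) can be checked numerically with a short R sketch. It uses the counts from the table and the coefficients quoted in (b). For LDA it assumes the usual one-predictor form with a shared variance, delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 * sigma^2) + log(pi_k), with priors estimated by class proportions and the pooled variance computed with an n - K denominator. All variable names are illustrative.

```r
hours <- 0:5
n_a   <- c(1, 4, 9, 13, 18, 20)   # students receiving an A in each group
n_tot <- 20                       # students per group

# (a) observed proportion and odds of Y = 1 at X = 3
p_obs    <- n_a[hours == 3] / n_tot
odds_obs <- p_obs / (1 - p_obs)

# (b)-(d) fitted logistic model with the coefficients given in the problem
b0 <- -2.818555
b1 <- 1.258949
log_odds <- function(x) b0 + b1 * x
p_hat    <- function(x) plogis(log_odds(x))   # 1 / (1 + exp(-(b0 + b1 * x)))
log_odds(5) - log_odds(4)                     # additive change, equals b1
p_hat(5) - p_hat(4)                           # change in probability
p_hat(4) - p_hat(3)

# (e) LDA: expand the grouped counts into one row per student
x <- c(rep(hours, n_a), rep(hours, n_tot - n_a))
y <- rep(c(1, 2), c(sum(n_a), sum(n_tot - n_a)))   # 1 = A, 2 = no A

pi_k   <- as.numeric(table(y) / length(y))         # prior estimates
mu_k   <- as.numeric(tapply(x, y, mean))           # class means
sigma2 <- sum((x - mu_k[y])^2) / (length(x) - 2)   # pooled variance

delta <- function(x0, k) {
  x0 * mu_k[k] / sigma2 - mu_k[k]^2 / (2 * sigma2) + log(pi_k[k])
}

# (f) boundary where delta_1(x) = delta_2(x), predictions, training error
boundary <- (mu_k[1] + mu_k[2]) / 2 -
  sigma2 * log(pi_k[1] / pi_k[2]) / (mu_k[1] - mu_k[2])
pred_by_hours <- ifelse(delta(hours, 1) > delta(hours, 2), 1, 2)
train_err     <- mean(ifelse(delta(x, 1) > delta(x, 2), 1, 2) != y)
```

Note that moving X up by one unit adds the slope to the log-odds regardless of the starting point, while the change in probability depends on where on the logistic curve the step is taken.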

Problem 3. Run a simulation study to estimate the bias, variance, and

mean squared error of estimating p(x) using logistic regression. Recall that

p(x) = Pr(Y = 1 | X = x).

Use the following code to generate the data.

Evaluate estimates of p(x1 = 1, x2 = 1) from fitting three models:

Note that, internally in glm(), R considers a binary factor variable as 0 and

1 since logistic regression seeks to model p(x) = Pr(Y = 1 | X = x). But here

we have “Blue” and “Orange”. Which is 0 and which is 1?
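
One way to check the coding is to inspect the factor levels. For a binomial glm() with a factor response, R documents that the first level denotes failure (0) and all other levels denote success (1), and factor levels are sorted alphabetically by default. The toy vector `y` below is illustrative only.

```r
# A toy response with the two labels used in this problem
y <- factor(c("Orange", "Blue", "Blue", "Orange"))

levels(y)                        # default level order is alphabetical
as.integer(y != levels(y)[1])    # 0 for the first level, 1 otherwise

# relevel() changes which label is the reference (0) class
y_swapped <- relevel(y, ref = "Orange")
levels(y_swapped)
```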

Use 1000 simulations of datasets with a sample size of 30 to estimate squared

bias, variance, and root mean squared error of estimating p(x1 = 1, x2 =

1) using $\hat{p}(x_1 = 1, x_2 = 1)$ for each model. Report your results in a

well-formatted table and give some comments. At the beginning of your

simulation study, call set.seed(42).
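
The structure of such a simulation study might look like the sketch below. The data-generating code and the three models referred to above are not reproduced in this sheet, so `make_sim_data()`, its true probability function and the three formulas are hypothetical stand-ins; only the seed, the sample size of 30, the 1000 repetitions and the evaluation point come from the problem statement.

```r
set.seed(42)

# Hypothetical data generator standing in for the course-provided code
make_sim_data <- function(n = 30) {
  x1 <- runif(n, 0, 2)
  x2 <- runif(n, 0, 2)
  p  <- plogis(-1 + x1 + 0.5 * x2)          # assumed true p(x)
  y  <- factor(ifelse(rbinom(n, 1, p) == 1, "Orange", "Blue"),
               levels = c("Blue", "Orange"))
  data.frame(y, x1, x2)
}

x0     <- data.frame(x1 = 1, x2 = 1)         # evaluation point
p_true <- plogis(-1 + 1 + 0.5)               # true p(x1 = 1, x2 = 1) here

# Hypothetical candidate models of increasing flexibility
formulas <- list(intercept_only = y ~ 1,
                 additive       = y ~ x1 + x2,
                 quadratic      = y ~ x1 + x2 + I(x1^2) + I(x2^2))

n_sims <- 1000
est <- matrix(NA_real_, n_sims, length(formulas),
              dimnames = list(NULL, names(formulas)))

for (s in seq_len(n_sims)) {
  d <- make_sim_data()
  for (m in names(formulas)) {
    # Small samples can trigger separation warnings; they are suppressed here
    fit <- suppressWarnings(glm(formulas[[m]], data = d, family = binomial))
    est[s, m] <- predict(fit, newdata = x0, type = "response")
  }
}

sq_bias  <- (colMeans(est) - p_true)^2
variance <- colMeans(sweep(est, 2, colMeans(est))^2)   # divisor n_sims
mse      <- colMeans((est - p_true)^2)

results <- data.frame(sq_bias = sq_bias, variance = variance,
                      rmse = sqrt(mse))
print(round(results, 5))
```

Using the divisor `n_sims` for the variance (rather than `var()`, which divides by `n_sims - 1`) makes the decomposition MSE = squared bias + variance hold exactly in the simulated estimates.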

Problem 4. This question should be answered using the Weekly data set,

which is part of the ISLR package. It contains 1,089 weekly returns for 21

years, from the beginning of 1990 to the end of 2010.

(a) Produce some numerical and graphical summaries of the Weekly data.

Do there appear to be any patterns?

(b) Use the full data set to perform a logistic regression with Direction as the

response and the five lag variables plus Volume as predictors. Use the

summary function to print the results. Do any of the predictors appear

to be statistically significant? If so, which ones?

(c) Compute the confusion matrix and overall fraction of correct predictions.

Explain what the confusion matrix is telling you about the types of

mistakes made by logistic regression.

(d) Now fit the logistic regression model using a training data period from

1990 to 2008, with Lag2 as the only predictor. Compute the confusion

matrix and the overall fraction of correct predictions for the held out

data (that is, the data from 2009 and 2010).

(e) Repeat (d) using LDA.

(f) Repeat (d) using QDA.

(g) Repeat (d) using KNN with K = 3.

(h) Which of these methods appears to provide the best results on this data?

(i) (Optional) Experiment with different combinations of predictors, including

possible transformations and interactions, for each of the methods.

Report the variables, method, and error rate that appears to provide the

best results on the held out data.
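
Parts (b) through (g) can be sketched as below. This assumes the ISLR, MASS and class packages are installed; `accuracy()`, the object names and the seed used for KNN tie-breaking are illustrative choices. Because the levels of Direction are "Down" and "Up", glm() models the probability of "Up".

```r
library(ISLR)    # Weekly data
library(MASS)    # lda(), qda()
library(class)   # knn()

accuracy <- function(pred, truth) mean(pred == truth)

# (b)-(c) logistic regression on the full data, confusion matrix
fit_full  <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                 data = Weekly, family = binomial)
summary(fit_full)
pred_full <- ifelse(predict(fit_full, type = "response") > 0.5, "Up", "Down")
conf_full <- table(predicted = pred_full, actual = Weekly$Direction)
conf_full

# (d)-(g) train on 1990-2008, evaluate on 2009-2010, Lag2 only
train    <- Weekly$Year <= 2008
test_set <- Weekly[!train, ]

fit_glm  <- glm(Direction ~ Lag2, data = Weekly[train, ], family = binomial)
pred_glm <- ifelse(predict(fit_glm, test_set, type = "response") > 0.5,
                   "Up", "Down")

fit_lda  <- lda(Direction ~ Lag2, data = Weekly[train, ])
pred_lda <- predict(fit_lda, test_set)$class

fit_qda  <- qda(Direction ~ Lag2, data = Weekly[train, ])
pred_qda <- predict(fit_qda, test_set)$class

set.seed(1)      # knn() breaks ties at random
pred_knn <- knn(train = Weekly[train, "Lag2", drop = FALSE],
                test  = test_set[, "Lag2", drop = FALSE],
                cl    = Weekly$Direction[train], k = 3)

# Held-out confusion matrix for one method, and accuracy for all four
table(predicted = pred_glm, actual = test_set$Direction)
acc <- c(logistic = accuracy(pred_glm, test_set$Direction),
         lda      = accuracy(pred_lda, test_set$Direction),
         qda      = accuracy(pred_qda, test_set$Direction),
         knn3     = accuracy(pred_knn, test_set$Direction))
print(acc)
```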

Problem 5. This question should be answered using the Default data set.

We fit a logistic regression model that uses income and balance to predict

Default. Do not forget to set a random seed before beginning your analysis.

(a) Using the validation set approach, estimate the test error of this

model. In order to do this, you must perform the following steps:

i. Split the sample set into a training set (7000 observations) and a

validation set (3000 observations).

ii. Fit a multiple logistic regression model using only the training observations.

iii. Obtain a prediction of default status for each individual in the validation

set by computing the posterior probability of default for that

individual, and classifying the individual to the default category if

the posterior probability is greater than 0.5.

iv. Compute the validation set error, which is the fraction of the observations

in the validation set that are misclassified.

(b) Repeat the process in (a) two times, using two different splits of the

observations into a training set and a validation set. Comment on the

results obtained.

(c) Now using the k-fold cross-validation approach with k = 10, estimate

the test error of this model.

(d) Now we will compute estimates for the standard errors of the income and

balance logistic regression coefficients in two different ways: (1) using

the bootstrap, and (2) using the standard formula for computing the

standard errors in the glm() function.

i. Using the summary() and glm() functions, determine the estimated

standard errors for the coefficients associated with income and balance

in a multiple logistic regression model that uses both predictors.

ii. Write a function, boot.fn(), that takes as input the Default data

set as well as an index of the observations, and that outputs the

coefficient estimates for income and balance in the multiple logistic

regression model.

iii. Use the boot() function together with your boot.fn() function to

estimate the standard errors of the logistic regression coefficients for

income and balance.

iv. Comment on the estimated standard errors obtained using the glm()

function and using your bootstrap function.
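
The steps in parts (a) to (d) can be sketched as follows, assuming the Default data come from the ISLR package and that the boot package is installed. The seed, the helper `val_error()`, the cost function name and the small number of bootstrap replicates (R = 200, chosen for speed) are illustrative assumptions.

```r
library(ISLR)   # Default data (assumed source of the data set)
library(boot)   # cv.glm(), boot()

set.seed(1)     # any fixed seed; the problem only asks that one is set

# (a)-(b) validation set approach: 7000 training, 3000 validation rows
val_error <- function() {
  train_idx <- sample(nrow(Default), 7000)
  fit  <- glm(default ~ income + balance, data = Default[train_idx, ],
              family = binomial)
  prob <- predict(fit, newdata = Default[-train_idx, ], type = "response")
  pred <- ifelse(prob > 0.5, "Yes", "No")
  mean(pred != Default$default[-train_idx])   # misclassification rate
}
val_errs <- replicate(3, val_error())         # three different splits

# (c) 10-fold cross-validation with a misclassification cost
fit_all   <- glm(default ~ income + balance, data = Default,
                 family = binomial)
miss_cost <- function(y, p) mean(abs(y - p) > 0.5)
cv_err    <- cv.glm(Default, fit_all, cost = miss_cost, K = 10)$delta[1]

# (d) standard errors from glm() and from the bootstrap
glm_se <- summary(fit_all)$coefficients[c("income", "balance"), "Std. Error"]

boot.fn <- function(data, index) {
  fit <- glm(default ~ income + balance, data = data[index, ],
             family = binomial)
  coef(fit)[c("income", "balance")]
}
boot_out <- boot(Default, boot.fn, R = 200)
boot_se  <- apply(boot_out$t, 2, sd)

print(val_errs)
print(cv_err)
print(rbind(glm = glm_se, bootstrap = boot_se))
```

A larger number of bootstrap replicates gives more stable standard error estimates at the cost of run time.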
