Homework Set 2
March 2, 2020
NOTICE: The homework is due on Mar. 13 (Friday) 11:59pm. Please provide
the R code (R Markdown is highly recommended) and the steps
that you use to get your solutions. You are allowed, and even encouraged,
to discuss the homework with your classmates. However, you must
write up the solutions on your own. Plagiarism and other anti-scholarly
behavior will be dealt with severely.
Problem 1. Usually when fitting regression models for explanation, dealing
with outliers is a complicated issue. When considering prediction, we can
empirically determine what to do. Let us use the Boston data from the
MASS package to see how outliers affect prediction.
Use the following code to test-train split the data.
Fit the following linear model that uses medv as the response.
Obtain the studentized residuals from this fitted model. Each studentized
residual is the residual divided by its estimated standard error; in R, use
rstudent(fit).
Refit this model with each of the following modifications:
• Removing observations from the training data with absolute studentized
  residuals greater than 2.
• Removing observations from the training data with absolute studentized
  residuals greater than 3.
Use these three fitted models, including the original model fit to unmodified
data, to obtain train RMSE and test RMSE. Summarize these results in a
table. Include the number of observations removed for each. Which performs
the best?
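
The refit-and-compare workflow described above can be sketched as follows. The split code and the model formula that the problem refers to are not reproduced in this text, so the 50/50 random split, the seed, and the formula `medv ~ .` are placeholder assumptions rather than the assignment's own choices. The helper names `rmse` and `refit` are likewise illustrative.

```r
library(MASS)  # provides the Boston data

# ASSUMPTION: the assignment's split code is not shown here, so this
# 50/50 random split and the seed are placeholders.
set.seed(42)
trn_idx <- sample(nrow(Boston), size = floor(0.5 * nrow(Boston)))
trn <- Boston[trn_idx, ]
tst <- Boston[-trn_idx, ]

# ASSUMPTION: the assignment's model formula is not shown; medv ~ . is a stand-in.
form <- medv ~ .

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

fit_full <- lm(form, data = trn)
stud <- rstudent(fit_full)  # studentized residuals from the original fit

# Refit after dropping training rows whose |studentized residual| exceeds
# the cutoff; cutoff = Inf keeps every row (the original model).
refit <- function(cutoff) {
  keep <- abs(stud) <= cutoff
  fit <- lm(form, data = trn[keep, ])
  data.frame(
    cutoff     = cutoff,
    n_removed  = sum(!keep),
    train_rmse = rmse(trn$medv[keep], predict(fit, trn[keep, ])),
    test_rmse  = rmse(tst$medv, predict(fit, tst))
  )
}

results <- do.call(rbind, lapply(c(Inf, 2, 3), refit))
print(results)
```

In this sketch the train RMSE is computed on the rows that were actually used to fit each model, while the test RMSE always uses the full, unmodified test set, so the test column is the one that is comparable across rows.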
Problem 2. An experiment is conducted among students to see the relationship
between X = study hours per day and Y = receive a grade A. In
the experiment, 120 students are divided into 6 groups and each group has 20
students. We name the groups by Group A, Group B, Group C, Group D,
Group E and Group F. The study hours for Group A to Group F are 0,
1, 2, 3, 4 and 5, respectively. The number of students in each group who
receive an A is recorded below.
Group                    A   B   C   D   E   F
Study hours per day      0   1   2   3   4   5
Students receiving an A  1   4   9  13  18  20
We use “1” to label the class: Y = receive a grade A and “2” to label the
class: Y = not receive a grade A.
(a) What is the observed proportion of students who receive a grade A, given
that the study hours per day is 3? What are the observed odds of Y = 1
at X = 3?
(b) We fit a logistic regression and obtain the estimated coefficients
$\hat\beta_0 = -2.818555$ and $\hat\beta_1 = 1.258949$. Give the logistic
regression expression for $\widehat{\Pr}(Y = 1 \mid X = x)$. Predict the
posterior probability and log-odds for X = 5.
(c) Will $\widehat{\Pr}(Y = 1 \mid X = x)$ increase or decrease if X increases? Why?
(d) If we increase the X from 4 to 5, by what multiple will the predicted
log-odds change? How about the change in the predicted posterior probability?
What would be the case if we increase X from 3 to 4?
(e) Now we apply linear discriminant analysis to fit the data. What
are the discriminant functions $\hat\delta_1(x)$ and $\hat\delta_2(x)$?
(Show the steps of your derivation.)
(f) What is the decision boundary? Predict the class when X = 0, 1, . . . , 5.
What is the average training error rate?
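
The quantities asked for in this problem can be computed along the following lines. This is a minimal sketch: the data and coefficients are taken from the problem statement, while the variable and function names (`log_odds`, `p_hat`, `delta`, and so on) are illustrative. The LDA part assumes the usual one-predictor form $\hat\delta_k(x) = x\hat\mu_k/\hat\sigma^2 - \hat\mu_k^2/(2\hat\sigma^2) + \log\hat\pi_k$ with a pooled variance using divisor n - K.

```r
# Grouped data from the table above
hours <- 0:5
n_A   <- c(1, 4, 9, 13, 18, 20)   # students receiving an A
n_tot <- rep(20, 6)               # students per group

# (a) observed proportions and odds of Y = 1 for each group
p_obs    <- n_A / n_tot
odds_obs <- p_obs / (1 - p_obs)   # Inf where every student received an A

# (b) fitted logistic curve, using the coefficients quoted in the problem
b0 <- -2.818555
b1 <-  1.258949
log_odds <- function(x) b0 + b1 * x
p_hat    <- function(x) plogis(log_odds(x))   # 1 / (1 + exp(-(b0 + b1 * x)))

# (d) a one-unit increase in X adds b1 to the log-odds,
#     which multiplies the odds by exp(b1)
odds_ratio <- exp(b1)

# (e) LDA with one predictor: expand to one row per student,
#     class 1 = grade A, class 2 = no grade A
x <- c(rep(hours, n_A), rep(hours, n_tot - n_A))
y <- c(rep(1, sum(n_A)), rep(2, sum(n_tot - n_A)))

prior  <- as.numeric(table(y) / length(y))
mu     <- as.numeric(tapply(x, y, mean))
sigma2 <- sum((x - mu[y])^2) / (length(x) - 2)   # pooled variance, divisor n - K

delta <- function(x0, k) {
  x0 * mu[k] / sigma2 - mu[k]^2 / (2 * sigma2) + log(prior[k])
}

# (f) classify each study-hours value to the class with the larger score
pred_class <- ifelse(delta(hours, 1) > delta(hours, 2), 1, 2)
train_err  <- mean(ifelse(delta(x, 1) > delta(x, 2), 1, 2) != y)

print(data.frame(hours, p_obs, p_hat = p_hat(hours), pred_class))
```

Expanding the grouped counts into one row per student keeps the LDA estimates (priors, class means, pooled variance) in the same form as the textbook formulas, at the cost of a slightly longer setup.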
Problem 3. Run a simulation study to estimate the bias, variance, and
mean squared error of estimating p(x) using logistic regression. Recall that
$p(x) = \Pr(Y = 1 \mid X = x)$.
Use the following code to generate the data.
Evaluate estimates of $p(x_1 = 1, x_2 = 1)$ from fitting three models:
Note that, internally in glm(), R considers a binary factor variable as 0 and
1 since logistic regression seeks to model $p(x) = \Pr(Y = 1 \mid X = x)$. But here
we have “Blue” and “Orange”. Which is 0 and which is 1?
Use 1000 simulations of datasets with a sample size of 30 to estimate squared
bias, variance, and root mean squared error of estimating $p(x_1 = 1, x_2 = 1)$
using $\hat{p}(x_1 = 1, x_2 = 1)$ for each model. Report your results using a
well-formatted table and give some comments. At the beginning of your
simulation study, call set.seed(42).
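
The structure of such a simulation study might look like the sketch below. The assignment's data-generating code and its three models are not reproduced in this text, so the generator `make_sim_data`, the true coefficients `beta_true`, and the three formulas are all hypothetical stand-ins chosen only to make the example run. The factor is built with levels `c("Blue", "Orange")`; for a factor response, glm() with a binomial family treats the first level as failure (0) and the other as success (1).

```r
set.seed(42)

# ASSUMPTION: stand-in generator; the assignment's own code is not shown.
# True model: logit p(x) = 0.5 + 1 * x1 - 1 * x2
beta_true <- c(0.5, 1, -1)
make_sim_data <- function(n = 30) {
  x1 <- runif(n, -2, 2)
  x2 <- runif(n, -2, 2)
  p  <- plogis(beta_true[1] + beta_true[2] * x1 + beta_true[3] * x2)
  y  <- factor(ifelse(rbinom(n, 1, p) == 1, "Orange", "Blue"),
               levels = c("Blue", "Orange"))   # "Blue" = 0, "Orange" = 1
  data.frame(y, x1, x2)
}

x0     <- data.frame(x1 = 1, x2 = 1)
p_true <- plogis(sum(beta_true * c(1, 1, 1)))   # true p(x1 = 1, x2 = 1)

# ASSUMPTION: three illustrative models, not necessarily the assignment's
forms <- list(
  intercept = y ~ 1,
  additive  = y ~ x1 + x2,
  quadratic = y ~ x1 + x2 + I(x1^2) + I(x2^2)
)

n_sims <- 1000
est <- matrix(NA_real_, nrow = n_sims, ncol = length(forms),
              dimnames = list(NULL, names(forms)))
for (s in seq_len(n_sims)) {
  d <- make_sim_data(30)
  for (m in names(forms)) {
    # small samples can trigger separation warnings; suppressed for brevity
    fit <- suppressWarnings(glm(forms[[m]], data = d, family = binomial))
    est[s, m] <- predict(fit, newdata = x0, type = "response")
  }
}

bias2 <- (colMeans(est) - p_true)^2
vars  <- colMeans(sweep(est, 2, colMeans(est))^2)  # divisor n_sims
mse   <- colMeans((est - p_true)^2)

results <- data.frame(sq_bias = bias2, variance = vars, rmse = sqrt(mse))
print(round(results, 4))
```

Using divisor `n_sims` for the variance (rather than `n_sims - 1`) makes the decomposition MSE = squared bias + variance hold exactly in the simulated estimates, which is a convenient sanity check.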
Problem 4. This question should be answered using the Weekly data set,
which is part of the ISLR package. It contains 1,089 weekly returns for 21
years, from the beginning of 1990 to the end of 2010.
(a) Produce some numerical and graphical summaries of the Weekly data.
Do there appear to be any patterns?
(b) Use the full data set to perform a logistic regression with Direction as the
response and the five lag variables plus Volume as predictors. Use the
summary function to print the results. Do any of the predictors appear
to be statistically significant? If so, which ones?
(c) Compute the confusion matrix and overall fraction of correct predictions.
Explain what the confusion matrix is telling you about the types of
mistakes made by logistic regression.
(d) Now fit the logistic regression model using a training data period from
1990 to 2008, with Lag2 as the only predictor. Compute the confusion
matrix and the overall fraction of correct predictions for the held out
data (that is, the data from 2009 and 2010).
(e) Repeat (d) using LDA.
(f) Repeat (d) using QDA.
(g) Repeat (d) using KNN with K = 3.
(h) Which of these methods appears to provide the best results on this data?
(i) (Optional) Experiment with different combinations of predictors, including
possible transformations and interactions, for each of the methods.
Report the variables, method, and error rate that appears to provide the
best results on the held out data.
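
As an illustration of parts (c) and (d), the sketch below defines a small helper for the confusion matrix and the overall fraction of correct predictions, then applies it to a logistic regression trained on 1990 to 2008 and evaluated on 2009 and 2010. The helper name `confusion_summary` is hypothetical, and the Weekly part only runs if the ISLR package is installed.

```r
# Confusion matrix and overall fraction correct for two label vectors
confusion_summary <- function(predicted, actual) {
  tab <- table(predicted = predicted, actual = actual)
  list(table = tab, accuracy = mean(predicted == actual))
}

# Toy check of the helper: three of four predictions are correct
toy <- confusion_summary(c("Up", "Up", "Down", "Up"),
                         c("Up", "Down", "Down", "Up"))

# Part (d) on the Weekly data; skipped if ISLR is not installed
if (requireNamespace("ISLR", quietly = TRUE)) {
  Weekly <- ISLR::Weekly
  train  <- Weekly$Year <= 2008      # training period: 1990 to 2008
  held   <- Weekly[!train, ]         # held-out period: 2009 and 2010

  fit  <- glm(Direction ~ Lag2, data = Weekly[train, ], family = binomial)
  # Direction has levels "Down", "Up", so this is the estimated P(Up)
  prob <- predict(fit, newdata = held, type = "response")
  pred <- ifelse(prob > 0.5, "Up", "Down")

  res_glm <- confusion_summary(pred, held$Direction)
  print(res_glm$table)
  print(res_glm$accuracy)
}
```

The same helper can be reused for parts (e) to (g) by passing it the class predictions from `MASS::lda()`, `MASS::qda()`, or `class::knn()`, which keeps the comparison in (h) on an equal footing.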
Problem 5. This question should be answered using the Default data set,
which is part of the ISLR package.
We fit a logistic regression model that uses income and balance to predict
Default. Do not forget to set a random seed before beginning your analysis.
(a) Using the validation set approach, estimate the test error of this
model. In order to do this, you must perform the following steps:
i. Split the sample set into a training set (7000 observations) and a
validation set (3000 observations).
ii. Fit a multiple logistic regression model using only the training observations.
iii. Obtain a prediction of default status for each individual in the validation
set by computing the posterior probability of default for that
individual, and classifying the individual to the default category if
the posterior probability is greater than 0.5.
iv. Compute the validation set error, which is the fraction of the observations
in the validation set that are misclassified.
(b) Repeat the process in (a) two times, using two different splits of the
observations into a training set and a validation set. Comment on the
results obtained.
(c) Now using the k-fold cross-validation approach with k = 10, estimate
the test error of this model.
(d) Now we will compute estimates for the standard errors of the income and
balance logistic regression coefficients in two different ways: (1) using
the bootstrap, and (2) using the standard formula for computing the
standard errors in the glm() function.
i. Using the summary() and glm() functions, determine the estimated
standard errors for the coefficients associated with income and balance
in a multiple logistic regression model that uses both predictors.
ii. Write a function, boot.fn(), that takes as input the Default data
set as well as an index of the observations, and that outputs the
coefficient estimates for income and balance in the multiple logistic
regression model.
iii. Use the boot() function together with your boot.fn() function to
estimate the standard errors of the logistic regression coefficients for
income and balance.
iv. Comment on the estimated standard errors obtained using the glm()
function and using your bootstrap function.
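
The steps in parts (a) to (d) can be organized as three small functions, sketched below under the assumption that the data frame has columns `default` (a factor with levels "No" and "Yes"), `income`, and `balance`, as in ISLR's Default data. The names `val_error` and `cv_error` are illustrative; `boot.fn` follows the name given in the problem. The number of bootstrap replicates (R = 100) is an arbitrary choice to keep the run short.

```r
library(boot)   # boot() for the bootstrap

# (a) validation-set estimate of the test error
val_error <- function(dat, n_train) {
  idx  <- sample(nrow(dat), n_train)
  fit  <- glm(default ~ income + balance, data = dat[idx, ], family = binomial)
  prob <- predict(fit, newdata = dat[-idx, ], type = "response")  # P(default = "Yes")
  pred <- ifelse(prob > 0.5, "Yes", "No")
  mean(pred != dat$default[-idx])
}

# (c) k-fold cross-validation estimate of the test error
cv_error <- function(dat, k = 10) {
  fold <- sample(rep(seq_len(k), length.out = nrow(dat)))  # random fold labels
  errs <- sapply(seq_len(k), function(j) {
    fit  <- glm(default ~ income + balance, data = dat[fold != j, ],
                family = binomial)
    prob <- predict(fit, newdata = dat[fold == j, ], type = "response")
    mean(ifelse(prob > 0.5, "Yes", "No") != dat$default[fold == j])
  })
  mean(errs)
}

# (d) statistic for boot(): coefficients refit on the resampled rows
boot.fn <- function(data, index) {
  fit <- glm(default ~ income + balance, data = data[index, ], family = binomial)
  coef(fit)[c("income", "balance")]
}

# Run on the Default data if ISLR is installed
if (requireNamespace("ISLR", quietly = TRUE)) {
  Default <- ISLR::Default
  set.seed(1)
  print(val_error(Default, 7000))
  print(cv_error(Default, 10))
  full_fit <- glm(default ~ income + balance, data = Default, family = binomial)
  print(summary(full_fit)$coefficients)   # formula-based standard errors
  print(boot(Default, boot.fn, R = 100))  # bootstrap standard errors
}
```

Writing the fold loop by hand makes the 0.5 classification threshold explicit in every fold, so the cross-validation error is measured on the same misclassification scale as the validation-set error in (a).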