MA308: Statistical Calculation and Software
Assignment 2 (Oct 9 – Nov 11, 2020)
2.1 For the “PlantGrowth” dataset in R:
(a) Draw boxplots of the weights of the three groups of plants, i.e. the control
(ctrl), treatment1 (trt1) and treatment2 (trt2) groups, placed side by side
in one figure. At the α = 0.05 level of significance, what is the conclusion
of the following test on the mean weight µ of the control group,
H0 : µ = 5 vs. H1 : µ ≠ 5, (2.1)
when the variance is unknown? What if the variance is known to equal the current
sample variance?
(b) Carry out the likelihood-ratio test of (2.1) for the treatment1 group with unknown
variance and draw a conclusion at the α = 0.05 level of significance. Compare
the result with that of the t-test.
(c) Test whether the weights of the control group and the treatment1 group have the
same mean at the α = 0.05 level of significance. What if there is a “pairing”
between the control and treatment1 groups?
(d) Test whether the spread of the weights is the same in the treatment1 group and
the treatment2 group.
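One way the tests in 2.1 could be set up in R is sketched below. It is a minimal sketch, not a model answer. `z_test` and `lrt_normal_mean` are hypothetical helper names introduced here (base R has no built-in z-test), the likelihood-ratio statistic is compared with its asymptotic chi-square(1) distribution, and the F-test via `var.test` is only one possible reading of “spread” in (d) and assumes normal data.

```r
# Sketch for 2.1 using the built-in PlantGrowth data (columns: weight, group).
data(PlantGrowth)

# (a) Side-by-side boxplots of weight by group.
boxplot(weight ~ group, data = PlantGrowth,
        xlab = "Group", ylab = "Weight", main = "PlantGrowth: weight by group")

ctrl <- PlantGrowth$weight[PlantGrowth$group == "ctrl"]
trt1 <- PlantGrowth$weight[PlantGrowth$group == "trt1"]
trt2 <- PlantGrowth$weight[PlantGrowth$group == "trt2"]

# (a) Two-sided one-sample t-test of H0: mu = 5 (variance unknown).
t_ctrl <- t.test(ctrl, mu = 5)

# (a) z-test that treats the sample variance as the known variance.
# Hypothetical helper: the statistic is computed by hand.
z_test <- function(x, mu0, sigma) {
  z <- (mean(x) - mu0) / (sigma / sqrt(length(x)))
  c(statistic = z, p.value = 2 * pnorm(-abs(z)))
}
z_ctrl <- z_test(ctrl, mu0 = 5, sigma = sd(ctrl))

# (b) Likelihood-ratio test of H0: mu = mu0 for a normal sample, variance unknown.
# -2 log(Lambda) = n * log(s2_null / s2_alt), referred to chi-square(1) (asymptotic).
lrt_normal_mean <- function(x, mu0) {
  n <- length(x)
  s2_null <- mean((x - mu0)^2)      # MLE of the variance under H0
  s2_alt  <- mean((x - mean(x))^2)  # unrestricted MLE of the variance
  stat <- n * log(s2_null / s2_alt)
  c(statistic = stat, p.value = pchisq(stat, df = 1, lower.tail = FALSE))
}
lrt_trt1 <- lrt_normal_mean(trt1, mu0 = 5)
t_trt1   <- t.test(trt1, mu = 5)    # for comparison with the LRT

# (c) Two-sample comparisons of ctrl and trt1.
welch  <- t.test(ctrl, trt1)                    # unequal variances (R's default)
pooled <- t.test(ctrl, trt1, var.equal = TRUE)  # equal-variance version
paired <- t.test(ctrl, trt1, paired = TRUE)     # only if the pairing is real

# (d) F-test for equal variances of trt1 and trt2.
f_var <- var.test(trt1, trt2)

print(list(t_ctrl = t_ctrl$p.value, z_ctrl = z_ctrl, lrt_trt1 = lrt_trt1,
           t_trt1 = t_trt1$p.value, welch = welch$p.value,
           paired = paired$p.value, f_var = f_var$p.value))
```

For a normal sample the statistic in (b) satisfies −2 log Λ = n log(1 + t²/(n − 1)), where t is the one-sample t statistic, so the two tests order samples identically and differ only in the reference distribution used for the p-value.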
2.2 This question should be answered using the Carseats.csv data set.
(a) Test whether Sales follows a normal distribution.
(b) Fit a multiple regression model to predict Sales using Price, Urban, and US.
(c) Provide an interpretation of each coefficient in the model. Be careful: some of
the variables in the model are qualitative!
(d) Write out the model in equation form, being careful to handle the qualitative
variables properly.
(e) For which of the predictors can you reject the null hypothesis H0 : βj = 0?
(f) On the basis of your response to the previous question, fit a smaller model that
only uses the predictors for which there is evidence of association with the
outcome.
(g) How well do the models in (b) and (f) fit the data?
(h) Using the model from (f), obtain 95% confidence intervals for the coefficient(s).
(i) Is there evidence of outliers or high leverage observations in the model from (f)?
(j) There is an indicator “US” in the “Carseats” data set. Compare the mean Sales
of the “US” area with that of the “Non-US” area, and show the results of the
likelihood ratio test and the Mann-Whitney test for the equality of
these two mean values. Can we use Wilcoxon’s signed-rank test? Why?
(k) Fit a multiple regression model to predict Sales using all the other variables,
then implement variable selection by stepwise methods and all-subsets regression.
(l) Using all the other variables to predict Sales, find the most important
variable in predicting Sales via the concept of Relative Importance, and
compare with the results in (k).
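The steps in 2.2 could be organised roughly as in the sketch below. Several things here are assumptions rather than part of the assignment: `analyse_carseats` is a hypothetical wrapper, `Carseats.csv` is assumed to sit in the working directory with the usual columns (Sales, Price, Urban, US, ...) and no extra index column, Shapiro-Wilk is just one possible normality test, and the reduced model in (f) drops Urban purely for illustration. The likelihood-ratio test in (j) assumes normal errors with a common variance. The add-on packages `leaps` and `relaimpo` are used only if they are installed.

```r
# Sketch for 2.2. `carseats` is a data frame with Urban and US stored as factors.
analyse_carseats <- function(carseats) {
  out <- list()

  # (a) Shapiro-Wilk normality test (one choice among several).
  out$normality <- shapiro.test(carseats$Sales)

  # (b)-(e) One quantitative and two qualitative predictors;
  # the coefficient table holds the t-tests of H0: beta_j = 0.
  out$fit_b <- lm(Sales ~ Price + Urban + US, data = carseats)
  out$coef_table <- summary(out$fit_b)$coefficients

  # (f) ILLUSTRATION ONLY: keep whichever predictors your own table supports.
  out$fit_f <- lm(Sales ~ Price + US, data = carseats)

  # (g) R^2, adjusted R^2 and residual standard error for both models.
  out$fit_quality <- sapply(list(b = out$fit_b, f = out$fit_f), function(m) {
    s <- summary(m)
    c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared, sigma = s$sigma)
  })

  # (h) 95% confidence intervals for the coefficients.
  out$ci <- confint(out$fit_f, level = 0.95)

  # (i) Studentized residuals (outliers) and hat values (leverage).
  out$rstudent <- rstudent(out$fit_f)
  out$leverage <- hatvalues(out$fit_f)

  # (j) LRT for equal means: intercept-only model vs. Sales ~ US.
  m0 <- lm(Sales ~ 1, data = carseats)
  m1 <- lm(Sales ~ US, data = carseats)
  lr <- as.numeric(2 * (logLik(m1) - logLik(m0)))
  out$lrt <- c(statistic = lr, p.value = pchisq(lr, df = 1, lower.tail = FALSE))
  out$mann_whitney <- wilcox.test(Sales ~ US, data = carseats)

  # (k) Stepwise selection by AIC, starting from the full model.
  full <- lm(Sales ~ ., data = carseats)
  out$stepwise <- step(full, direction = "both", trace = 0)
  out
}

# Runs only when the data file is present (path is an assumption).
if (file.exists("Carseats.csv")) {
  carseats <- read.csv("Carseats.csv", stringsAsFactors = TRUE)
  res <- analyse_carseats(carseats)
  print(res$coef_table)
  print(res$fit_quality)

  # (k) All-subsets regression, if the leaps package is installed.
  if (requireNamespace("leaps", quietly = TRUE)) {
    subsets <- leaps::regsubsets(Sales ~ ., data = carseats, nvmax = 11)
    print(summary(subsets)$adjr2)
  }
  # (l) Relative importance (LMG metric), if relaimpo is installed.
  if (requireNamespace("relaimpo", quietly = TRUE)) {
    print(relaimpo::calc.relimp(lm(Sales ~ ., data = carseats), type = "lmg"))
  }
}
```

Observations with large `|rstudent|` or with hat values well above the average (p + 1)/n are the usual candidates to inspect in (i); the exact cut-offs are a matter of convention.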
2.3 This question should be answered using the weekly.csv data set.
(a) Produce some numerical and graphical summaries of the Weekly data. Do there
appear to be any patterns?
(b) Use the full data set to perform a logistic regression with Direction as the
response and the five lag variables plus Volume as predictors. Use the summary
function to print the results. Do any of the predictors appear to be statistically
significant? If so, which ones?
(c) Compute the confusion matrix and overall fraction of correct predictions. Explain
what the confusion matrix is telling you about the types of mistakes made
by logistic regression.
(d) Now fit the logistic regression model using a training data period from 1990 to
2008, with Lag2 as the only predictor. Compute the confusion matrix and the
overall fraction of correct predictions for the held-out data (that is, the data
from 2009 and 2010).
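A possible outline for 2.3 in R follows, as a sketch under stated assumptions. `analyse_weekly` is a hypothetical wrapper, `weekly.csv` is assumed to be in the working directory with columns Year, Lag1 to Lag5, Volume and Direction (coded "Down"/"Up"), and the 0.5 probability threshold for classifying a week as "Up" is a common convention rather than something the assignment specifies.

```r
# Sketch for 2.3. `weekly` is a data frame with Year, Lag1..Lag5, Volume, Direction.
analyse_weekly <- function(weekly) {
  # With levels c("Down", "Up"), glm models the probability of "Up".
  weekly$Direction <- factor(weekly$Direction, levels = c("Down", "Up"))
  classify <- function(p) {
    # Assumed threshold: predict "Up" when the fitted probability exceeds 0.5.
    factor(ifelse(p > 0.5, "Up", "Down"), levels = c("Down", "Up"))
  }
  out <- list()

  # (b) Logistic regression on the full data set.
  fit_all <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                 data = weekly, family = binomial)
  out$coef_table <- summary(fit_all)$coefficients

  # (c) Confusion matrix and overall fraction correct on the same data.
  pred_all <- classify(predict(fit_all, type = "response"))
  out$confusion_all <- table(Predicted = pred_all, Actual = weekly$Direction)
  out$accuracy_all <- mean(pred_all == weekly$Direction)

  # (d) Train on 1990-2008 with Lag2 only, evaluate on 2009-2010.
  is_train <- weekly$Year <= 2008
  held_out <- weekly[!is_train, ]
  fit_lag2 <- glm(Direction ~ Lag2, data = weekly[is_train, ], family = binomial)
  pred_test <- classify(predict(fit_lag2, newdata = held_out, type = "response"))
  out$confusion_test <- table(Predicted = pred_test, Actual = held_out$Direction)
  out$accuracy_test <- mean(pred_test == held_out$Direction)
  out
}

# Runs only when the data file is present (path is an assumption).
if (file.exists("weekly.csv")) {
  weekly <- read.csv("weekly.csv", stringsAsFactors = TRUE)

  # (a) Numerical and graphical summaries.
  print(summary(weekly))
  numeric_cols <- sapply(weekly, is.numeric)
  print(round(cor(weekly[, numeric_cols]), 2))
  pairs(weekly[, numeric_cols])

  res <- analyse_weekly(weekly)
  print(res$coef_table)
  print(res$confusion_all)
  print(res$accuracy_all)
  print(res$confusion_test)
  print(res$accuracy_test)
}
```

In each confusion matrix the diagonal counts are the correct predictions, so the overall fraction correct equals the sum of the diagonal divided by the number of weeks. The two off-diagonal cells separate the two kinds of mistake asked about in (c).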