MA308: Statistical Calculation and Software
Assignment 3 (Dec 24, 2019 - Jan 02, 2020)
3.1 For the “weightgain” dataset from the HSAUR3 package, the data arise from an experiment to study the gain in weight of rats fed on four different diets, distinguished by
amount of protein (low and high) and by source of protein (beef and cereal). Ten
rats are randomized to each of the four treatments and the weight gain in grams
recorded. The question of interest is how diet affects weight gain.
(a) Summarize the main features of the data by calculating group means and standard deviations. Then use the plotmeans() function in the gplots package to produce
an interaction plot of group means and their confidence intervals.
(b) Use the interaction2wt() function in the HH package, which plots both
main effects and two-way interactions for a factorial design of any order.
Explain whether there is an interaction between source and type.
(c) Carry out two-way factorial ANOVA with and without the interaction
term, and explain the corresponding results.
(d) What assumptions must the data satisfy when we apply
one-way ANOVA? If we use one-way ANOVA to examine the difference in
weight gain between the two sources of protein, are these assumptions satisfied?
(e) Carry out the permutation test version of the two-way factorial ANOVA analysis
of weightgain ~ source * type with the lmPerm package. Compare the result with
that in 3.1(c).
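A minimal sketch of the workflow in 3.1(a) and 3.1(c), using base R only. The data frame `sim` is a simulated stand-in with the same column names as weightgain (source, type, weightgain); its values are made up and are not the HSAUR3 data, so the printed results say nothing about the real experiment.

```r
# Simulated stand-in for the weightgain data (made-up values, NOT the real data):
# 2 x 2 factorial design with 10 observations per cell.
set.seed(1)
sim <- expand.grid(rep = 1:10,
                   source = c("Beef", "Cereal"),
                   type = c("High", "Low"))
sim$source <- factor(sim$source)
sim$type <- factor(sim$type)
sim$weightgain <- round(rnorm(nrow(sim), mean = 85, sd = 15))

# Group means and standard deviations for each source x type cell.
cell_means <- aggregate(weightgain ~ source + type, data = sim, FUN = mean)
cell_sds <- aggregate(weightgain ~ source + type, data = sim, FUN = sd)
print(cell_means)
print(cell_sds)

# Two-way factorial ANOVA without and with the interaction term.
fit_add <- aov(weightgain ~ source + type, data = sim)
fit_int <- aov(weightgain ~ source * type, data = sim)
print(summary(fit_add))
print(summary(fit_int))

# F test comparing the two nested models; this tests the interaction term.
print(anova(fit_add, fit_int))
```

With the real data, replacing `sim` by the weightgain data frame should be the only change needed, assuming the column names match.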
3.2 For the “planets” dataset from the HSAUR3 package,
(a) Apply complete linkage and average linkage hierarchical clustering to the planets
data. Compare the results with the K-means (K=3) clustering results in the
lecture notes.
(b) Construct a three-dimensional drop-line scatterplot of the planets data in which
the points are labelled with a suitable cluster label. The K-means (K=3) method
can be used for clustering.
(c) Write an R function to fit a parametric model based on a two-component normal
mixture model for the eccen variable in the planets data. (Hint: refer to the
“Mixture distribution estimation” section in Chapter 6)
(d) The mclust package offers high-level functionality for estimating mixture
models. Apply Mclust to estimate a normal mixture model for the eccen variable
in the planets data. Compare the result with that in 3.2(c).
(e) Implement principal component analysis on the planets data. Find the coefficients for the first two principal components and the principal component
scores for each planet.
(f) Apply K-means (K=3) clustering to the first two principal components of the
planets data. Compare the clustering result with that based on the original
data mentioned in 3.2(a).
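One possible way to approach 3.2(c) is the EM algorithm, sketched below in base R. This is not necessarily the method used in the Chapter 6 notes (direct maximization of the likelihood would be another option). The function name `fit_mix2`, its arguments, and the starting values are illustrative assumptions, and the vector `x` is simulated as a stand-in for eccen rather than taken from the planets data.

```r
# EM sketch for a two-component normal mixture; fit_mix2 is a hypothetical name.
fit_mix2 <- function(x, max_iter = 500, tol = 1e-8) {
  # Crude starting values: split the data at the median.
  p <- 0.5
  mu <- c(mean(x[x <= median(x)]), mean(x[x > median(x)]))
  sigma <- rep(sd(x), 2)
  loglik_old <- -Inf
  for (iter in seq_len(max_iter)) {
    # E-step: posterior probability that each point belongs to component 1.
    d1 <- p * dnorm(x, mu[1], sigma[1])
    d2 <- (1 - p) * dnorm(x, mu[2], sigma[2])
    w <- d1 / (d1 + d2)
    # Log-likelihood at the parameters used in this E-step.
    loglik <- sum(log(d1 + d2))
    # M-step: weighted updates of the mixing proportion, means and sds.
    p <- mean(w)
    mu <- c(sum(w * x) / sum(w), sum((1 - w) * x) / sum(1 - w))
    sigma <- c(sqrt(sum(w * (x - mu[1])^2) / sum(w)),
               sqrt(sum((1 - w) * (x - mu[2])^2) / sum(1 - w)))
    # Stop once the log-likelihood has stabilised.
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  list(p = p, mu = mu, sigma = sigma, loglik = loglik, iterations = iter)
}

# Simulated stand-in for eccen (made-up values, NOT the planets data).
set.seed(1)
x <- c(rnorm(150, mean = 0.1, sd = 0.05), rnorm(150, mean = 0.6, sd = 0.1))
fit <- fit_mix2(x)
print(fit$p)      # estimated mixing proportion of component 1
print(fit$mu)     # estimated component means
print(fit$sigma)  # estimated component standard deviations
```

EM can converge to a local maximum, so in practice it may be worth trying several starting values and keeping the fit with the largest log-likelihood.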
3.3 For the “Default” dataset from the ISLR package, we consider how to predict default for
any given value of balance and income. In particular, we will now compute estimates
for the standard errors of the income and balance logistic regression coefficients in
two different ways: (1) using the bootstrap, and (2) using the standard formula for
computing the standard errors in the glm() function. Do not forget to set a random
seed before beginning your analysis.
(a) Using the summary() and glm() functions, determine the estimated standard
errors for the coefficients associated with income and balance in a multiple
logistic regression model that uses both predictors.
(b) Write a function, boot.fn(), that takes as input the Default data set as well
as an index of the observations, and that outputs the coefficient estimates for
income and balance in the multiple logistic regression model.
(c) Use the boot() function together with your boot.fn() function to estimate the
standard errors of the logistic regression coefficients for income and balance.
(d) Comment on the estimated standard errors obtained using the glm() function
and using your bootstrap function.
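The comparison in 3.3 can be sketched as follows. To stay self-contained, the sketch simulates a stand-in data frame `sim` (made-up values, not the ISLR Default data) and resamples with a base R loop; `boot.fn()` takes the data and an index vector, which is the form that boot() in the boot package expects for its statistic argument, so `boot::boot(sim, boot.fn, R = 200)` should be a drop-in alternative to the loop. The number of replicates (200) is an arbitrary choice for illustration.

```r
# Simulated stand-in for the Default data (made-up values, NOT the real data).
set.seed(1)
n <- 1000
sim <- data.frame(balance = rnorm(n, 800, 450),
                  income = rnorm(n, 33000, 13000))
eta <- -4 + 0.004 * sim$balance + 0.00001 * sim$income
sim$default <- rbinom(n, 1, plogis(eta))

# Standard errors from the usual glm() formula.
fit <- glm(default ~ income + balance, data = sim, family = binomial)
glm_se <- summary(fit)$coefficients[c("income", "balance"), "Std. Error"]

# Statistic: refit the model on the rows selected by index and
# return the income and balance coefficients.
boot.fn <- function(data, index) {
  refit <- glm(default ~ income + balance, data = data[index, ],
               family = binomial)
  coef(refit)[c("income", "balance")]
}

# Nonparametric bootstrap: resample rows with replacement B times.
B <- 200
boot_coefs <- t(replicate(B, boot.fn(sim, sample(n, n, replace = TRUE))))
boot_se <- apply(boot_coefs, 2, sd)

# Side-by-side comparison of the two sets of standard errors.
print(rbind(glm = glm_se, bootstrap = boot_se))
```

If the logistic model is a reasonable description of the data, the two rows are expected to be close; a large gap would be a reason to examine the model assumptions.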
3.4 For the “Default” dataset from the ISLR package, we consider how to predict default for
any given value of balance and income.
(a) Split the sample set into a training set (70%) and a validation set (30%). Fit a
multiple logistic regression model (default ~ balance + income) using only the
training observations. Obtain a prediction of default status for each individual
in the validation set by computing the posterior probability of default for that
individual, and classifying the individual to the default category if the posterior
probability is greater than 0.5. Compute the validation set error, which is the
fraction of the observations in the validation set that are misclassified.
[10 points]
(b) Apply a classical decision tree and a conditional inference tree to the Default
dataset. Use the plotcp() function to plot the cross-validated error against the
complexity parameter and choose the most appropriate tree size.
(c) A random forest involves sampling cases and variables to create a large
number of decision trees. Write down the algorithm. Implement the random forest
algorithm based on traditional decision trees and on conditional inference trees
respectively. Use the random forest models built to classify the validation
sample and compare the predictive accuracy of the two models.
(d) Fit a support vector machine classifier to the Default dataset. Use the tune.svm()
function to choose a combination of gamma and cost which may lead to a more
effective model. Compare the sensitivity, specificity, positive predictive power
and negative predictive power of the SVM, random forest and logistic regression
classifiers.
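The validation set approach in 3.4(a) and the four measures named in 3.4(d) can be sketched in base R as below. The data frame `sim` is a simulated stand-in (made-up values, not the ISLR Default data), and the helper `performance()` is a hypothetical name introduced only for this illustration. The same helper could be applied to predictions from any of the classifiers, since it needs only the actual and predicted labels.

```r
# Simulated stand-in for the Default data (made-up values, NOT the real data).
set.seed(1)
n <- 2000
sim <- data.frame(balance = rnorm(n, 800, 450),
                  income = rnorm(n, 33000, 13000))
y <- rbinom(n, 1, plogis(-4 + 0.004 * sim$balance))
sim$default <- factor(ifelse(y == 1, "Yes", "No"), levels = c("No", "Yes"))

# 70% training set, 30% validation set.
train_idx <- sample(n, size = round(0.7 * n))
train <- sim[train_idx, ]
valid <- sim[-train_idx, ]

# Fit on the training set only; glm() models P(default = "Yes"),
# the second factor level.
fit <- glm(default ~ balance + income, data = train, family = binomial)
prob <- predict(fit, newdata = valid, type = "response")
pred <- factor(ifelse(prob > 0.5, "Yes", "No"), levels = c("No", "Yes"))

# Validation set error: fraction of misclassified validation observations.
valid_error <- mean(pred != valid$default)
print(valid_error)

# Hypothetical helper: sensitivity, specificity, positive and negative
# predictive value from actual and predicted labels.
performance <- function(actual, predicted, positive = "Yes") {
  tp <- sum(actual == positive & predicted == positive)
  tn <- sum(actual != positive & predicted != positive)
  fp <- sum(actual != positive & predicted == positive)
  fn <- sum(actual == positive & predicted != positive)
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    ppv = tp / (tp + fp),
    npv = tn / (tn + fn))
}

print(performance(valid$default, pred))
```

Note that a measure is NaN when its denominator is zero, for example the positive predictive value when no observation is predicted to default.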