STAT 4620/5620 WINTER 2023
Assignment 4: Due Thursday, March 23, 2023
1. The following questions explore the fundamentals of nonparametric
statistics:
(a) [3] Describe smoothing and give two examples of popular smoothers.
(b) [2] Consider the generalized additive model (GAM) framework. What is
the most significant departure from the GLM framework?
(c) [3] Explain how model estimation proceeds for GAMs.
(d) [4] Suppose that you find yourself in a situation where both a GLM and
a GAM initially seem appropriate for your data. Explain the criteria you
would use to determine which of the two methods to recommend.
2. This question re-examines the hubble data.
(a) [6] Fit the model:
$V_i = f(D_i) + \epsilon_i$
to the Hubble data, where $f$ is a smooth function and the $\epsilon_i$ are i.i.d.
$N(0, \sigma^2)$. Does a straight line model appear to be most appropriate?
How would you interpret the best fitting model?
(b) [4] Examine appropriate residual plots and refit the model with more
appropriate distributional assumptions. How are your conclusions
from part (a) modified?
3. [5] Read the lme4 documentation and provide a one-page summary of it.
4. The data frame Gun (library nlme) is from a trial examining methods for
firing naval guns. Two firing methods were compared, with each of a number
of teams of 3 gunners; the gunners in each team were matched to have
similar physique (Slight, Average, Heavy). The response variable rounds
is rounds fired per minute, and there are 3 explanatory factor variables:
Physique (levels Slight, Average and Heavy), Method (levels M1 and M2)
and Team (9 levels). The main interest is in determining which method
and/or physique results in the highest firing rate and in quantifying
team-to-team variability in firing rate.
(a) [2] Identify which factors should be treated as random and which as
fixed, in the analysis of these data.
(b) [4] Write out a suitable mixed model as a starting point for the analysis
of these data.
(c) [6] Analyse the data using lme in order to answer the main questions
of interest and report your conclusions.
5. The Carseats dataset from the R package ISLR is a simulated dataset of
car seat sales at 400 different stores. Full information on the variables in
this dataset can be found using help(Carseats) after loading the package.
(a) [4] Create a new factor variable in the Carseats data indicating whether
or not Sales is greater than 8. Randomly split the dataset into a training
set and a testing set. On the training set, grow a classification tree using
the R rpart package to classify whether a store had high car seat sales or
not (Hint: remove the Sales variable before fitting). Report the
classification accuracy on the testing set and on the training set.
(b) [4] Prune the tree you grew in part (a). Report the pruned tree's
classification accuracy on the testing set and on the training set. Why
might pruning have improved the classification accuracy on the testing
set? Why might it have reduced accuracy on the training set?
(c) [4] Grow a random forest using the randomForest package, with the same
response, predictors and training set as for the tree. Is performance on
the testing set better than that of the classification trees? Why might
that be the case?
(d) [4] Briefly outline the similarities and differences between CARTs and
random forests.