STAT6030 GENERALISED LINEAR MODELLING
The Australian National University
Assignment 2
2023 Summer Session
Instructions
This assignment is worth 55 marks in total and 25% of your overall marks for this
course. The assignment is compulsory and must be submitted by 5pm on Monday
6 March 2023.
You must write your answers to this assignment individually and by yourself. If you
copy someone else’s work or allow your work to be copied, you will receive a mark of
zero for the assignment and risk severe academic consequences.
Your answers should be individually submitted through Turnitin on Wattle as a
single pdf/Word document (less than 50MB) including the following:
1. The assignment Cover Sheet (available on Wattle).
2. Your answers (no more than 10 pages including graphs, summaries, tables, etc.,
but excluding the Appendix and Cover Sheet, and respecting the other requirements
for each part).
3. An Appendix including all the R commands you used (no page limit).
Assignments should be typed and not handwritten. Your assignment may include some
carefully edited R output (e.g., graphs, summaries, tables, etc.) and appropriate
discussion of these results, as well as some selected R commands. Please be selective about
what you present and only include as many pages and as much R output as necessary
to justify your solution. Clearly label each part and question of your assignment and
appendix with the corresponding numbers.
Unless otherwise advised, use a significance level of 5%.
Round numeric answers to 4 decimal places (e.g., 0.00115 is rounded to 0.0012).
Marks will be deducted if these instructions are not strictly respected, especially when
the total report is of an unreasonable length, i.e., more than the above page limit. The
Appendix will generally not be marked, but may be checked if what you have written
or done needs clarification.
Name your submission “CourseCode Uid”, e.g., “STAT6030 u1234567”.
Try to submit your assignment at least 30 minutes before the deadline in case
something unexpected happens, for instance an internet connection problem.
Late submissions will NOT be accepted. Extensions will usually be granted on
medical or compassionate grounds on production of appropriate evidence, but must
receive the lecturer’s approval at least 24 hours before the deadline.
Part 1 [16 Marks]
Please provide your answers to the following questions and include short working out if there
is any. There is a limit of 3 pages on your answers for Part 1.
(a) [1 mark] What is the definition of the canonical link function in the context of
generalised linear models?
(b) [1 mark] Explain in words and/or by drawing a plot when a link function of a generalised
linear model is valid.
(c) [1 mark] In the context of generalised linear models, does the value of the maximised
log-likelihood for the saturated model depend on the choice of link function and why?
(d) [1 mark] The mean of a generalised linear model is known to lie between 1 and 2
whatever the value of the linear predictor $\eta_i = x_i^\top \beta$ is, i.e. $1 < \mu_i < 2$. Let $\Phi$ denote
the cumulative distribution function of the standard normal distribution $N(0, 1)$ and
$\Phi^{-1}$ denote the inverse function of $\Phi$. Which function below is an appropriate link
function in this setting? Notes: (i) precisely one answer below is correct and the other
ones are incorrect; (ii) an incorrect answer scores zero while the correct answer scores
full marks for the question.
A. $\eta_i = g(\mu_i)$, where $g(\mu_i) = \Phi(\mu_i - 1)$.
B. $\eta_i = g(\mu_i)$, where $g(\mu_i) = \Phi(\mu_i/2)$.
C. $\eta_i = g(\mu_i)$, where $g(\mu_i) = \Phi^{-1}(\mu_i - 1)$.
D. $\eta_i = g(\mu_i)$, where $g(\mu_i) = \Phi^{-1}(\mu_i/2)$.
(e) [1 mark] The gamma distribution has probability density function
$$f(y; \alpha, \beta) = \{\beta^{\alpha}/\Gamma(\alpha)\}\, y^{\alpha - 1} \exp(-\beta y),$$
where $y > 0$, $\alpha > 0$ is a shape parameter, $\beta > 0$ is a rate parameter and $\Gamma(\cdot)$ is the
gamma function. You may assume that
(i) the mean $\mu$ of the gamma distribution is given by $\mu = \alpha/\beta$;
(ii) the gamma distribution is a generalised linear model with dispersion parameter
$\phi = 1/\alpha$, in the notation of equation (4.1) of Topic 4.
What is the canonical link function when the generalised linear model is gamma?
(f) [3 marks] The geometric distribution has probability mass function $f(y; p) = (1 - p)p^y$,
for $y = 0, 1, \ldots$, where $0 < p < 1$. What are the canonical link function and variance
function of the geometric distribution?
The deviance residual for observation $i$ is given by $\mathrm{sign}(y_i - \hat\mu_i)\sqrt{d_i^2}$, where
$$d_i^2 = 2\left[\, y_i\{h(y_i) - h(\hat\mu_i)\} - \{b(h(y_i)) - b(h(\hat\mu_i))\} \,\right]$$
is the deviance associated with observation $i$, which is written as a function of the
response variable $y_i$ and of the fitted value $\hat\mu_i$, while $\mathrm{sign}(\cdot)$ is the sign function defined
in the lecture notes. Also recall that $b'^{-1}(\mu) \equiv h(\mu)$. What is the expression for $d_i^2$, as
a function of $y_i$ and $\hat\mu_i$, when the generalised linear model is geometric? Please simplify
your expression as much as you can.
(g) [1 mark] Consider a generalised linear model with linear predictor $\eta_i = \upsilon_i + x_i^\top \beta$, where
$\upsilon_i$ is an offset, $x_i$ is a vector of covariates of length $p$ and $\beta$ is a parameter vector of
length $p$ to be estimated. Assuming that the model’s dispersion parameter $\phi = 1$ is
known, how many free parameters (i.e., parameters to estimate) are there in this model?
(h) [1 mark] A logistic regression model was fitted to a dataset consisting of a binary
outcome variable, $y_i$, taking values 0 and 1, and a single numerical covariate $x_i$. The
estimated intercept and slope on the linear predictor scale were found to be $-0.47$ and
$1.3$, respectively, so that the linear predictor as a function of $x_i$ is given by
$$\hat\eta(x_i) = -0.47 + 1.3 x_i.$$
Recall the estimated probability $\mathrm{Prob}[y_i = 1 \mid x_i]$ is given by
$$\mathrm{Prob}[y_i = 1 \mid x_i] = \exp\{\hat\eta(x_i)\} / [1 + \exp\{\hat\eta(x_i)\}]$$
and so the estimated probability $\mathrm{Prob}[y_i = 0 \mid x_i]$ is given by $1 - \mathrm{Prob}[y_i = 1 \mid x_i]$. What
is the value of $x_i$ such that the odds of the event $y_i = 1$ are 0.75? Recall that the odds
of an event that occurs with probability $\pi$ are given by $\pi/(1 - \pi)$.
(i) [2 marks] Consider a distribution with the probability density function
$$f(y; \mu) = [1/(2\pi y^3)]^{1/2} \exp[-(y - \mu)^2 / (2\mu^2 y)],$$
where $\mu$ is the mean of the distribution and $y > 0$. What is the variance function, $V(\mu)$,
of this distribution?
(j) [1 mark] The following output from a linear regression model fit in R was obtained.
Calculate the value for ++++ that the R program would give if the sample size is 10.

    Call:
    lm(formula = y ~ x)

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) -0.08888    0.66793  -0.133    0.897
    x            1.06903    0.10765    ????     ++++

(k) [1 mark] Suppose we fit a Poisson regression model A with log link to a dataset whose
response variable is a count. No offset is included. In the fitted model we have included
a covariate $x$ and the estimated coefficient of $x$ is $\hat\beta_A$. Suppose that we then decide to
fit a second model B which is the same as model A but with $x$ included as an offset as
well as included in the linear predictor as before. Suppose the estimated coefficient of
$x$ is $\hat\beta_B$ in model B. Which of the following statements about the second fitted model is
correct?
Notes: (i) precisely one answer below is correct and the other ones are incorrect; (ii) an
incorrect answer scores zero while the correct answer scores full marks for the question.
A. $\hat\beta_B = \hat\beta_A - 1$ and the residual deviance of model B will (usually) change compared
to that of model A.
B. $\hat\beta_B = \hat\beta_A - 1$ and the residual deviance of model B will not change compared to that
of model A.
C. $\hat\beta_B = \hat\beta_A + 1$ and the residual deviance of model B will (usually) change compared
to that of model A.
D. $\hat\beta_B = \hat\beta_A + 1$ and the residual deviance of model B will not change compared to that
of model A.
(l) [2 marks] Suppose we have fitted a Poisson log-linear regression with extra-Poisson
variation and the estimate of the dispersion parameter $\phi$ is greater than 1. If the
standard Poisson model were used in this situation, would this be likely to be a case of
underdispersion or overdispersion, and which assumption about the relationship between
the mean and variance of the Poisson distribution would fail? What would happen to the
estimates of the $\beta$ parameters for the standard Poisson model?
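As a reminder of the probability and odds relations recalled in question (h), the following is a minimal R sketch. The coefficients b0 and b1 are hypothetical values chosen purely for illustration (they are not the estimates given in the question), and the helper names eta, prob and odds are assumptions introduced here.

```r
# Hypothetical logistic-regression coefficients (illustration only,
# not the estimates from question (h)).
b0 <- 0.2
b1 <- 0.5

# Linear predictor and fitted probability at a covariate value x.
eta  <- function(x) b0 + b1 * x
prob <- function(x) plogis(eta(x))   # exp(eta) / (1 + exp(eta))

# Odds of an event that occurs with probability p.
odds <- function(p) p / (1 - p)

x <- 1
p <- prob(x)

# Under the logit link the odds equal exp(eta), so the last two
# entries printed here agree up to floating-point error.
print(c(probability = p, odds = odds(p), exp_eta = exp(eta(x))))
```

The sketch uses plogis() from base R's stats package for the inverse logit; qlogis() is the corresponding logit function if one needs to move from a probability back to the linear predictor scale.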
Part 2 [12 Marks]
Different doses of two chemicals, A and B, were used in a trial whose purpose was to reduce
cockroach numbers. The variable x1 gives the dose of chemical A and the variable x2 gives
the dose of chemical B. In the R code below, the first column of c gives the number of
cockroaches killed and the second column of c gives the number of cockroaches that survived.
The following R outputs were obtained:
Please provide your answers to the following questions and include short working out if there
is any. There is a limit of 2 pages on your answers for Part 2.
(a) [1 mark] What type of generalised linear model is being fitted here and what link
function is being used?
(b) [5 marks] Determine the missing information indicated by the letters A, B, C, D, E, F,
G, H, J and K. Note that for E you are required to specify the link function.
(c) [2 marks] Write down the relevant model in mathematical form, focusing on the contri-
bution of observation i to the model.
(d) [2 marks] Briefly indicate your impressions of the results of the statistical analysis
provided above.
(e) [2 marks] What are the next questions you would investigate in the statistical analysis?
State what your next two steps would be.
Part 3 [12 Marks]
The presence of sprouted or diseased kernels in wheat can reduce the value of a wheat
producer’s entire crop. It is important to identify these kernels after harvest but prior
to sale. To facilitate this identification process, automated systems have been developed to
separate healthy kernels from the rest. Improving these systems requires a better
understanding of the measurable ways in which healthy kernels differ from kernels that have
sprouted prematurely or are infected with a fungus. To this end, Martin et al. (1998)
conducted a study examining numerous physical properties of kernels (density, hardness, size,
weight, and moisture) measured on a sample of wheat kernels from two different classes of
wheat, hard red winter (hrw) and soft red winter (srw), represented by the categorical
variable class in the wheat.csv dataset on Wattle. Each kernel’s condition was also
classified as “Healthy”, “Partly Diseased” or “Diseased” by human visual inspection
(represented by the categorical variable type2).
Please provide your answers to the following questions and include short working out if there
is any. There is a limit of 3 pages on your answers for Part 3.
Throughout the following questions, treat type2 as the response variable.
Suppose that we have conducted the following R analysis and obtained the R output below:
(a) [2 marks] Describe the interpretations of coefficient estimates -10.95451 and -0.6480912
in the summary() output.
(b) [2 marks] What are the null and alternative hypotheses corresponding to the p-value
0.0291 in the Anova() output? What conclusion can you obtain based on the p-value?
(c) [2 marks] Suppose we have a new observation of the following form:
    > xnew = data.frame(class='srw', density=1, hardness=25, size=2,
    +                   weight=25, moisture=12)
    > xnew
      class density hardness size weight moisture
    1   srw       1       25    2     25       12
If we use predict(), what are the predicted probabilities for the different categories
of the response type2 and what is the prediction of the response type2 for this new
observation?
Suppose that we conducted further R analysis and obtained the R output below:
(d) [2 marks] Describe the interpretations of coefficient estimates -0.17370 and 13.50540
in the summary() output, respectively.
(e) [2 marks] What are the null and alternative hypotheses corresponding to the p-value
0.65749 in the Anova() output? What conclusion can you obtain based on the p-value?
(f) [2 marks] Fit a nominal logistic regression model and an ordinal logistic regression model,
respectively, with covariates class, density, hardness, size, weight, moisture,
class:density, class:hardness, class:size, class:weight and class:moisture.
Based on the model fitting results, which model is better? Please explain why this
model is better.
Part 4 [15 Marks]
An analysis of some ship damage data is presented below. The data consists of a factor typ,
corresponding to ship type, with 3 levels, A, B and C; a factor cons, corresponding to the
period of construction of the ship, with 3 levels, 1960-1964, 1965-1969 or 1975-1979; a factor
opr, corresponding to years of operation of the ship, with 2 levels, either 1960-1975 or
1975-1979; a numerical variable mnths, corresponding to the total number of months at risk; and
dmge, corresponding to the number of damage incidents reported for the ship. The following
R output was obtained.
Please provide your answers to the following questions and include short working out if there
is any. There is a limit of 2 pages on your answers for Part 4.
(a) [1 mark] What type of generalised linear model is being fitted here to obtain the output
out1 and what link function is being used?
(b) [7 marks] Determine the missing information indicated by the letters A, B, C, D, E, F,
G, H, J, K, M, N, P and Q. Note that F should consist of either a blank, a dot, one
star, two stars or three stars; and for J you should specify the link function that was
used. All the other letters apart from A represent a number.
(c) [2 marks] Explain what is meant by an offset and the motivation for offsetting L=log(mnths)
rather than mnths itself.
(d) [2 marks] Using the R printout for out1, give the value of the linear predictor for a ship
of Type A that was constructed in the period 1965-1969 and operated in the period
1975-1979, assuming that mnths=1095.
(e) [3 marks] Write down brief notes on what you would conclude about the wave damage
data from the R output. Can we draw any conclusions as to whether overdispersion is
present in this dataset? What action would you consider taking if overdispersion were
suspected to be present?