MATH3029-E1

The University of Nottingham

SCHOOL OF MATHEMATICAL SCIENCES

A LEVEL 3 MODULE, SPRING SEMESTER 2019-2020

APPLIED STATISTICAL MODELLING

Suggested time to complete: TWO Hours THIRTY Minutes

Paper set: 18/05/2020 - 10:00

Paper due: 26/05/2020 - 10:00

Answer ALL questions

Your solutions should be written on white paper using dark ink (not pencil), on a tablet, or

typeset. Do not write close to the margins. Your solutions should include complete

explanations and all intermediate derivations. Your solutions should be based on the material

covered in the module and its prerequisites only. Any notation used should be consistent with

that in the Lecture Notes.

Guidance on the Alternative Assessment Arrangements can be found on the Faculty of Science

Moodle page: https://moodle.nottingham.ac.uk/course/view.php?id=99154#section-2

Submit your answers as a single PDF with each page in the correct orientation, to the

appropriate dropbox on the module’s Moodle page. Use the standard naming

convention for your document: [StudentID]_[ModuleCode].pdf. Please check the

box indicated on Moodle to confirm that you have read and understood the statement

on academic integrity: https://moodle.nottingham.ac.uk/pluginfile.php/6288943/mod_tabbedcontent/tabcontent/8496/FoS%20Statement%20on%20Academic%20Integrity.pdf

A scan of handwritten notes is completely acceptable. Make sure your PDF is easily readable

and does not require magnification. Text which is not in focus or is not legible for any other

reason will be ignored. If your scan is larger than 20 MB, please see if it can easily be reduced

in size (e.g. scan in black and white, or use a lower dpi, but not so low that readability is

compromised).

Staff are not permitted to answer assessment or teaching queries during the assessment

period. If you spot what you think may be an error on the exam paper, note this in your

submission but answer the question as written. Where necessary, minor clarifications or

general guidance may be posted on Moodle for all students to access.

Students with approved accommodations are permitted an extension of 3 days.

The standard University of Nottingham penalty of 5% deduction per working day will

apply to any late submission.

Academic Integrity in Alternative Assessments

The alternative assessment tasks for summer 2020 are to replace exams that would have

assessed your individual performance. You will work remotely on your alternative assessment

tasks and they will all be undertaken in “open book” conditions. Work submitted for

assessment should be entirely your own work. You must not collude with others or employ the

services of others to work on your assessment. As with all assessments, you also need to avoid

plagiarism. Plagiarism, collusion and false authorship are all examples of academic misconduct.

They are defined in the University Academic Misconduct Policy at: https://www.nottingham.ac.uk/academicservices/qualitymanual/assessmentandawards/academic-misconduct.aspx

Plagiarism: representing another person’s work or ideas as your own. You could do this by

failing to correctly acknowledge others’ ideas and work as sources of information in an

assignment or neglecting to use quotation marks. This also applies to the use of graphical

material, calculations etc. in that plagiarism is not limited to text-based sources. There is

further guidance about avoiding plagiarism on the University of Nottingham website.

False Authorship: where you are not the author of the work you submit. This may include

submitting the work of another student or submitting work that has been produced (in whole

or in part) by a third party such as through an essay mill website. As it is the authorship of an

assignment that is contested, there is no requirement to prove that the assignment has been

purchased for this to be classed as false authorship.

Collusion: cooperation in order to gain an unpermitted advantage. This may occur where you

have consciously collaborated on a piece of work, in part or whole, and passed it off as your

own individual effort or where you authorise another student to use your work, in part or

whole, and to submit it as their own. Note that working with one or more other students to

plan your assignment would be classed as collusion, even if you go on to complete your

assignment independently after this preparatory work. Allowing someone else to copy your

work and submit it as their own is also a form of collusion.

Statement of Academic Integrity

By submitting a piece of work for assessment you are agreeing to the following statements:

1. I confirm that I have read and understood the definitions of plagiarism, false authorship

and collusion.

2. I confirm that this assessment is my own work and is not copied from any other person’s

work (published or unpublished).

3. I confirm that I have not worked with others to complete this work.

4. I understand that plagiarism, false authorship, and collusion are academic offences and I

may be referred to the Academic Misconduct Committee if plagiarism, false authorship or

collusion is suspected.

Submission instructions

• Release and submission times are with respect to British Summer Time (BST). Please plan

accordingly.

• Please take time to write clearly and neatly. This is especially important since you will be

handing in scanned documents. If I can’t read your writing clearly, I will not be able to

mark appropriately.

• In accordance with University guidelines for this assessment, please write your name and

student ID on the first page of your submitted document.

• It is your responsibility to ensure that the requirements for a valid submission on Moodle

are met (e.g. file size; invalidity of ‘draft’ submissions). Please try to submit ahead of

time to avoid complications close to the deadline.

1. (a) Consider the one-way ANOVA model

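$y_{ij} = \mu + \alpha_i + \epsilon_{ij}, \quad i = 1, 2, 3; \; j = 1, 2.$

Assume that $\epsilon_{ij}$ are IID Normal random variables with $E(\epsilon_{ij}) = 0$ and $\mathrm{Var}(\epsilon_{ij}) = \sigma^2 > 0$ for all $i, j$.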

i) Suppose the model is used to determine the efficacy of three drugs A, B and C on the

cholesterol levels of patients. Interpret within this context each term in the model

above, and the corresponding assumptions.

ii) For the model above construct the corresponding design matrix $X$, the vector of

responses $y$, the vector of regression coefficients $\beta$, and the error vector $\epsilon$. Justify

why the least squares estimator $(X^T X)^{-1} X^T y$ of $\beta$ cannot be computed without

further constraints.

[15 marks]
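As an illustration of the rank issue raised in part ii), the sketch below builds a design matrix for a one-way layout with three groups and two replicates, assuming the usual unconstrained parameterisation (an intercept plus one indicator column per group). The column ordering and the helper names `gram` and `rank` are choices made for this example only, not notation from the Lecture Notes.

```python
from fractions import Fraction

# ASSUMPTION: columns are (intercept, group 1, group 2, group 3) and rows are
# ordered y_11, y_12, y_21, y_22, y_31, y_32, with no constraint on the effects.
X = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
]


def gram(X):
    """Return the matrix X^T X as a list of lists."""
    p = len(X[0])
    return [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]


def rank(M):
    """Rank of M by exact Gauss-Jordan elimination over the rationals."""
    A = [[Fraction(v) for v in row] for row in M]
    r = 0
    for c in range(len(A[0])):
        pivot = next((k for k in range(r, len(A)) if A[k][c] != 0), None)
        if pivot is None:
            continue
        A[r], A[pivot] = A[pivot], A[r]
        for k in range(len(A)):
            if k != r and A[k][c] != 0:
                f = A[k][c] / A[r][c]
                A[k] = [a - f * b for a, b in zip(A[k], A[r])]
        r += 1
    return r


XtX = gram(X)
# The intercept column is the sum of the three group columns, so the columns
# of X are linearly dependent and X^T X (4 x 4) cannot have full rank.
intercept_is_sum = all(row[0] == row[1] + row[2] + row[3] for row in X)

print("X^T X =", XtX)
print("rank of X^T X:", rank(XtX), "out of", len(XtX))
print("intercept column equals sum of group columns:", intercept_is_sum)
```

Under this parameterisation the rank falls short of the number of columns, which is why a constraint (for example a corner-point or sum-to-zero constraint) is needed before the inverse exists.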

(b) A farmer wanted to compare four types of wheat to find which gives the greatest yield.

Since he suspected growing conditions might vary across his field, he divided the field

into four plots and performed experiments which led to the following data on yield (in

tonnes).

          Plot 1   Plot 2   Plot 3   Plot 4     Sum
Wheat 1     6.5      6.6      6.3      5.9     25.3
Wheat 2     7.2      6.4      6.4      6.2     26.2
Wheat 3     6.3      6.1      5.9      5.9     24.2
Wheat 4     6.4      6.4      6.3      6.1     25.2
Sum        26.4     25.5     24.9     24.1    100.9

Note: $\sum_{i=1}^{4} \sum_{j=1}^{4} y_{ij}^2 = 637.85$.

i) What type of design has been used by the farmer?

ii) Explain how you would ensure this design is randomised.

iii) Write down an appropriate model for this experiment, clearly defining your notation

and explaining any assumptions you make.

iv) Calculate the ANOVA table for this data.

v) Test for the significance of wheat type and comment on your findings.

[25 marks]
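The arithmetic behind an ANOVA table for the wheat data can be sketched as follows. This is a minimal illustration that assumes the plots act as blocks and that wheat and plot effects are additive (one observation per wheat and plot combination); the variable names are invented for the example.

```python
# Yields in tonnes, one list per wheat type, ordered Plot 1 to Plot 4.
yields = {
    "Wheat 1": [6.5, 6.6, 6.3, 5.9],
    "Wheat 2": [7.2, 6.4, 6.4, 6.2],
    "Wheat 3": [6.3, 6.1, 5.9, 5.9],
    "Wheat 4": [6.4, 6.4, 6.3, 6.1],
}

rows = list(yields.values())
n_wheat, n_plot = len(rows), len(rows[0])

grand_total = sum(sum(r) for r in rows)
sum_sq = sum(v * v for r in rows for v in r)          # raw sum of squares
correction = grand_total ** 2 / (n_wheat * n_plot)    # (grand total)^2 / N

# ASSUMPTION: additive two-way layout (treatments = wheat, blocks = plots).
ss_total = sum_sq - correction
ss_wheat = sum(sum(r) ** 2 for r in rows) / n_plot - correction
ss_plot = sum(sum(col) ** 2 for col in zip(*rows)) / n_wheat - correction
ss_resid = ss_total - ss_wheat - ss_plot

df_wheat, df_plot = n_wheat - 1, n_plot - 1
df_resid = df_wheat * df_plot
f_wheat = (ss_wheat / df_wheat) / (ss_resid / df_resid)

print(f"raw sum of squares = {sum_sq:.2f}, grand total = {grand_total:.1f}")
print(f"SS wheat = {ss_wheat:.6f} on {df_wheat} df")
print(f"SS plot  = {ss_plot:.6f} on {df_plot} df")
print(f"SS resid = {ss_resid:.6f} on {df_resid} df")
print(f"SS total = {ss_total:.6f}")
print(f"F (wheat) = {f_wheat:.4f}")
```

The F ratio would then be compared with the appropriate quantile of the F distribution on (3, 9) degrees of freedom from statistical tables.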

2. (a) Show that the pdf of a normal distribution with mean $\mu \in \mathbb{R}$ and variance 1 belongs to

the one-parameter GLM family. Clearly identify $\theta$, $b(\cdot)$, $c(\cdot, \cdot)$, and $a(\cdot)$. [5 marks]

(b) Suppose $Z_i$, $i = 1, \dots, n$, are IID $N(0, 1)$ random variables. Denote by $\phi$ and $\Phi$ their

pdf and cdf (cumulative distribution function), respectively. For real numbers $\eta_i$, define

$Y_i = 1$ if $Z_i \le \eta_i$ or $Y_i = 0$ otherwise.

i) For fixed $\eta_i$, write down the joint distribution of the $Y_i$.

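ii) Consider $\eta_i = \beta_1 + \beta_2 x_i$ with $i = 1, \dots, n$, where $x_i$ are real-valued. Using $Y_i$,

write down the log-likelihood function $\ell(\beta_1, \beta_2)$. Also show that the score statistic

$U = (U_1, U_2)^T$, where $U_j = \partial \ell / \partial \beta_j$, $j = 1, 2$, is:

$$U_1 = \sum_{i=1}^{n} \left[ \frac{Y_i \, \phi(\beta_1 + \beta_2 x_i)}{\Phi(\beta_1 + \beta_2 x_i)} - \frac{(1 - Y_i) \, \phi(\beta_1 + \beta_2 x_i)}{1 - \Phi(\beta_1 + \beta_2 x_i)} \right]$$

$$U_2 = \sum_{i=1}^{n} \left[ \frac{Y_i x_i \, \phi(\beta_1 + \beta_2 x_i)}{\Phi(\beta_1 + \beta_2 x_i)} - \frac{(1 - Y_i) x_i \, \phi(\beta_1 + \beta_2 x_i)}{1 - \Phi(\beta_1 + \beta_2 x_i)} \right]$$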

iii) Verify that $E(U) = \mathbf{0}$.

iv) Why is $\Phi^{-1} : [0, 1] \to \mathbb{R}$ a valid link function for linking $E(Y_i)$ with $\eta_i$?

[20 marks]

(c) In a study examining the relationship between Alzheimer’s disease (yes=1 and no=0) and

Age on 98 people, a binary logistic regression model was used. Output from R is given

below.

i) Using Output 1: (1) interpret, in the context of the problem, the estimate of the

Age parameter, and (2) explain the values obtained for the degrees of freedom.

ii) Using Output 1, explain, using the GLM form of a Bernoulli distribution, the statement:

‘Dispersion parameter for binomial family taken to be 1’.

iii) Information on economic status (‘Lower’, ‘Middle’, ‘Higher’) of each person was

added to the model containing Age. Using Output 1 and Output 2, perform a deviance

test to ascertain if economic status has a significant relationship with the chances

of being diagnosed with Alzheimer’s.

iv) Using Output 2, predict the probability of being diagnosed with Alzheimer’s for a

person aged 48 and classified as having a ‘Lower’ economic status.

[15 marks]

Output 1:

             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  -1.62437     0.40575   -4.003  6.25e-05 ***
Age           0.03183     0.01204    2.644   0.00819 **

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 122.32 on 97 degrees of freedom

Residual deviance: 114.91 on 96 degrees of freedom

Output 2:

             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  -1.49037     0.52223   -2.854   0.00432 **
Age           0.03127     0.01247    2.507   0.01216 *
Lower        -0.70309     0.56145   -1.252   0.21047
Middle        0.37988     0.55692    0.682   0.49517

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 122.32 on 97 degrees of freedom

Residual deviance: 111.50 on 94 degrees of freedom
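The quantities needed for a comparison of the two fitted models can be read directly off Output 1 and Output 2. The sketch below shows the arithmetic only; it assumes that ‘Lower’ and ‘Middle’ enter the model as 0/1 indicator variables with ‘Higher’ as the baseline category, which the output suggests but does not state.

```python
import math

# Residual deviances and degrees of freedom copied from the two outputs.
dev_age, df_age = 114.91, 96      # Output 1: Age only
dev_full, df_full = 111.50, 94    # Output 2: Age + economic status

dev_drop = dev_age - dev_full     # drop in deviance from adding economic status
df_drop = df_age - df_full        # number of extra parameters

# For a chi-squared variable with exactly 2 degrees of freedom the upper tail
# probability has the closed form P(X > x) = exp(-x / 2).
assert df_drop == 2
p_value = math.exp(-dev_drop / 2)

# Fitted probability for Age = 48 and 'Lower' status.
# ASSUMPTION: 'Lower' and 'Middle' are 0/1 dummies, 'Higher' is the baseline.
age, lower, middle = 48, 1, 0
eta = -1.49037 + 0.03127 * age - 0.70309 * lower + 0.37988 * middle
prob = 1 / (1 + math.exp(-eta))   # inverse of the logit link

print(f"deviance drop = {dev_drop:.2f} on {df_drop} df, p-value = {p_value:.4f}")
print(f"linear predictor = {eta:.4f}, fitted probability = {prob:.4f}")
```

The closed-form tail probability is used here only because the difference in degrees of freedom happens to be 2; in general the p-value would come from chi-squared tables or software.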

3. (a) i) Give an example of an offset in a Poisson GLM.

ii) How would you test for the significance of an offset variable in a Poisson GLM?

iii) Suppose $Y_i$ are independent Poisson random variables with mean $\mu_i$, offset $t_i$, and

rate $\lambda_i$ for $i = 1, \dots, n$. With $Y_i$ as responses, consider a Poisson GLM with log link

function consisting of a single real-valued predictor $x_i$ with regression coefficient

$\beta_1$. Show that, for each $i = 1, \dots, n$, the rate parameter $\lambda_i$ changes by a factor of

$e^{\beta_1}$ when $x_i$ increases by one unit.

[15 marks]

(b) The data below is on the monthly accident counts on a major US highway for each of

the 12 months of 1970, then for each of the 12 months of 1971, and finally for the first

9 months of 1972.

1970 52 37 49 29 31 32 28 34 32 39 50 63

1971 35 22 27 27 34 23 42 30 36 56 48 40

1972 33 26 31 25 23 20 25 20 36

Output from R showing results from fitting a GLM modelling number of accidents with

appropriately defined predictors year and month is provided below.

Call:

glm(formula = y~year + month, family = poisson)

Coefficients:

             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)   3.81969     0.09896   38.600   < 2e-16 ***
Year1971     -0.12516     0.06694   -1.870  0.061521 .
Year1972     -0.28794     0.08267   -3.483  0.000496 ***
month2       -0.34484     0.14176   -2.433  0.014994 *
month3       -0.11466     0.13296   -0.862  0.388459
month4       -0.39304     0.14380   -2.733  0.006271 **
month5       -0.31015     0.14034   -2.210  0.027108 *
month6       -0.47000     0.14719   -3.193  0.001408 **
month7       -0.23361     0.13732   -1.701  0.088889 .
month8       -0.35667     0.14226   -2.507  0.012168 *
month9       -0.14310     0.13397   -1.068  0.285444
month10       0.10167     0.13903    0.731  0.464628
month11       0.13276     0.13788    0.963  0.335639
month12       0.18252     0.13607    1.341  0.179812

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 101.143 on 32 degrees of freedom

Residual deviance: 27.273 on 19 degrees of freedom

Number of Fisher Scoring iterations: 3

i) Write down the mathematical model fitted along with assumptions.

ii) Based on the output, is it fair to state that the average number of accidents appears

to have decreased from 1970 to 1972? Justify your answer.

iii) The Transport Authority wishes to check if the number of accidents tends to be

higher from September to December when compared to January. What would be

your recommendation? Justify accordingly.

iv) Construct a 95% confidence interval for the coefficient of Year1972 in the model

in i), and corroborate the conclusion obtained from the p-value corresponding to

Year1972 in the output.

v) What is your prediction for the number of accidents in October 1972?

[25 marks]
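The interval and prediction asked for in parts iv) and v) involve only a few numbers from the R output. The following sketch shows one way to carry out the arithmetic; it assumes a Wald-type interval with the normal quantile 1.96, and that 1970 and January (month 1) are the baseline levels of the two factors, as the absence of those levels from the coefficient table suggests.

```python
import math

# Estimate and standard error for Year1972, copied from the R output.
est_1972, se_1972 = -0.28794, 0.08267
z_975 = 1.96  # approximate 97.5% quantile of the standard normal distribution

# Wald-type 95% confidence interval on the log scale.
ci_lower = est_1972 - z_975 * se_1972
ci_upper = est_1972 + z_975 * se_1972
excludes_zero = ci_upper < 0 or ci_lower > 0

# The same interval expressed as a multiplicative effect on the mean count.
ratio_lower, ratio_upper = math.exp(ci_lower), math.exp(ci_upper)

# Predicted mean count for October 1972 under the log link.
# ASSUMPTION: 1970 and January are the baseline levels, so the linear
# predictor is intercept + Year1972 + month10.
intercept, month10 = 3.81969, 0.10167
eta_oct_1972 = intercept + est_1972 + month10
pred_oct_1972 = math.exp(eta_oct_1972)

print(f"95% CI for Year1972: ({ci_lower:.4f}, {ci_upper:.4f})")
print(f"interval excludes zero: {excludes_zero}")
print(f"as rate ratios: ({ratio_lower:.3f}, {ratio_upper:.3f})")
print(f"predicted accidents, October 1972: {pred_oct_1972:.2f}")
```

Whether the interval contains zero can then be set against the p-value reported for Year1972 in the output.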

MATH3029-E1 END