MT5764: Advanced Data Analysis

Major Course Assignment

Housekeeping

• This project is about modelling Covid-19 data. I hope you will find this an opportunity to demonstrate the skills that you have developed in this module while simultaneously aiding in the effort to answer questions of global interest. However, if you find this a distressing topic to work on, please contact me directly before Tuesday, 14th April 2020 to arrange for a different dataset to complete this assignment.

• This major course assignment replaces your exam and thus comprises 60% of the overall module mark.

• This is an individual project. The submitted coursework should reflect your own individual work. Suspected cases of copying will be taken very seriously, so please adhere to the University’s guidelines on good academic practice. If you have any uncertainties or questions about this, please contact me.

• I recommend you attempt every part of the assignment; even if you do not finish everything, marks are likely to be awarded for incomplete tasks/code. Remember I cannot allocate marks to a blank sheet of paper, so help me to help you.

Submission

• Write a succinct report that includes a clear and detailed description of how you have answered each task in the assignment, justifying each decision taken along the way and referring to the corresponding code.

• Only include model output summaries and well-labelled plots that you describe and refer to in the write-up. You will be penalised for including superfluous outputs and/or code that you do not discuss in the report or that do not attempt to answer the task. Please include just your student ID at the start of your report. Do not include your name anywhere in your report.

• Please comment and annotate your final code, and name functions and variables sensibly. Make sure you only submit the code that was used to answer the specific tasks. Your code needs to be succinct and comprehensible. Marks will be deducted if I cannot follow what you have done.

• You can use R and/or SAS to answer each task in this assignment. For example, if you find the modelling more convenient in SAS but prefer R’s graphical tools (or vice versa), then feel free to use a mix, making it clear in your report. Save your final, well-annotated code as single scripts named using your student ID, e.g. ID_12345.R (replace 12345 with your student ID). Please include just your student ID at the top of your script. Do not include your name anywhere in your scripts.

• You are free to write your report using whatever software you are comfortable with, for example R Markdown, Jupyter, LaTeX or Word. However, you must convert your final report into a single PDF, saved using your student number (e.g. ID_12345.pdf).

• Compress your report (e.g. ID_12345.pdf) and code (e.g. ID_12345.R and/or ID_12345.sas) into a single zip file (.zip), saved using your student ID (e.g. ID_12345.zip) and upload to Moodle.

• To be clear, you are required to upload to Moodle a single zip file (e.g. ID_12345.zip), which contains a single PDF of your final report and one or two final scripts (R/SAS)¹.

• The deadline is Friday, 15th May 2020, 23:59 (UK time). Please do not leave it to the last minute to upload your work.

• The School has a lateness policy. The standard policy is an initial penalty of 15% of the maximum available mark, then a further 5% per 8-hour period, or part thereof, for work submitted late without good reason.

¹ If you are writing your report using interactive notebooks, such as R Markdown or Jupyter, then you do not need to re-upload your code, as long as it is all included and commented within your notebook.
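
The packaging step described in the submission bullets can be sketched as follows. This is only an illustration in Python, assuming the report and scripts already sit in one folder and follow the ID_12345 naming pattern; the function name `package_submission` is hypothetical, and any zip tool produces the same result.

```python
import zipfile
from pathlib import Path


def package_submission(student_id: str, folder: Path) -> Path:
    """Bundle ID_<student_id>.pdf plus any matching .R/.sas scripts into one zip."""
    stem = f"ID_{student_id}"
    report = folder / f"{stem}.pdf"
    if not report.is_file():
        raise FileNotFoundError(f"missing report: {report.name}")

    # Scripts are optional in this sketch (a notebook-based report may have none).
    members = [report]
    for suffix in (".R", ".sas"):
        script = folder / f"{stem}{suffix}"
        if script.is_file():
            members.append(script)

    archive = folder / f"{stem}.zip"
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in members:
            # arcname drops the folder prefix so the zip has a flat layout.
            zf.write(path, arcname=path.name)
    return archive
```

Calling `package_submission("12345", Path("."))` would write ID_12345.zip next to the report, ready to upload.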

Assignment

Data

The assignment involves in-depth statistical analysis of the following two Covid-19 datasets that you will need to download from Moodle. The main data source is Johns Hopkins University’s repository (which in turn pooled data from various other sources), coupled with some country-level statistics.

1. CovidCases.csv - The number of Covid-19 cases as of the day this assignment was set and a few country-level metrics.

              Country Deaths Confirmed PopDensity MedianAge UrbanPop Bed  Lung HealthExp        GDP
1         Afghanistan     14       444         60        18       25 0.5 37.62       184   481.2432
2             Albania     22       400        105        36       63 2.9 11.67       774  5357.5704
3             Algeria    205      1572         18        29       73 1.9  8.77      1031  3940.1799
4 Antigua and Barbuda      2        19        223        34       26 3.8 11.76      1105 17236.9778
5           Argentina     63      1715         17        32       93 5.0 29.27      1390  9856.4304
6             Armenia      9       881        104        35       63 4.2 23.86       883  4536.9212

• Country: Data from different states/regions/counties were pooled under a single Country name.

• Deaths: The number of deaths due to Covid-19 (persons).

• Confirmed: The number of confirmed Covid-19 cases (persons).

• PopDensity: Population density (persons/km²).

• MedianAge: Median age of the population (years).

• UrbanPop: The percentage of the population that live in urban areas (%).

• Bed: Hospital beds per 1,000 people (beds/1000 persons).

• Lung: Death rate per 100,000 people due to lung disease (deaths/100,000 persons).

• HealthExp: Total health expenditure per capita in US dollars ($/person).

• GDP: The nominal gross domestic product per capita (a measure of a country’s economy) in US dollars ($/person).
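
The tasks for this dataset model the number of deaths per confirmed case (the case fatality rate). As a rough illustration of that quantity only, the sketch below computes the raw ratio Deaths / Confirmed for the six preview rows shown above. It uses Python's standard library rather than R or SAS, the function name `case_fatality_rates` is hypothetical, and it is not a substitute for the modelling the tasks ask for.

```python
import csv
import io

# First six rows of CovidCases.csv as printed in the preview above
# (only the columns needed here).
PREVIEW = """Country,Deaths,Confirmed
Afghanistan,14,444
Albania,22,400
Algeria,205,1572
Antigua and Barbuda,2,19
Argentina,63,1715
Armenia,9,881
"""


def case_fatality_rates(handle):
    """Return {country: deaths / confirmed} for every row with confirmed > 0."""
    rates = {}
    for row in csv.DictReader(handle):
        deaths = int(row["Deaths"])
        confirmed = int(row["Confirmed"])
        if confirmed > 0:  # avoid division by zero
            rates[row["Country"]] = deaths / confirmed
    return rates


rates = case_fatality_rates(io.StringIO(PREVIEW))
for country, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country:<20} {rate:.3f}")
```

Ratios built on very small counts, such as 2 deaths out of 19 cases for Antigua and Barbuda, move a lot with a single extra death, which is worth keeping in mind when comparing countries.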

2. CovidConfirmedTime.csv - The number of confirmed cases of Covid-19 over time, for the 30 worst affected countries, excluding China².

    Country Day Confirmed
1 Australia   0       107
2 Australia   1       128
3 Australia   2       128
4 Australia   3       200
5 Australia   4       250
6 Australia   5       297

• Country: Data from different states/regions/counties were pooled under a single Country name.

• Day: Days since the 100th confirmed case.

• Confirmed: The number of confirmed Covid-19 cases up to and including that day.
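
To make the Day variable concrete, here is a minimal Python sketch of how a cumulative series could be re-indexed to "days since the 100th confirmed case". It assumes Day 0 is the first day on which the cumulative count reaches at least 100, which is consistent with the Australia preview (Day 0 has 107 cases) but is not stated explicitly in the data description. The function name and the pre-threshold values are hypothetical.

```python
def days_since_threshold(cumulative, threshold=100):
    """Re-index a cumulative case series so Day 0 is the first day at or above threshold."""
    for start, count in enumerate(cumulative):
        if count >= threshold:
            return [(day, c) for day, c in enumerate(cumulative[start:])]
    return []  # threshold never reached


# Hypothetical raw series: the values before 107 are invented for illustration;
# 107 onwards matches the Australia rows in the preview.
raw = [63, 76, 91, 107, 128, 128, 200, 250, 297]
aligned = days_since_threshold(raw)
print(aligned)
```

Aligning every country on the same threshold puts outbreaks that started on different calendar dates onto a common time axis.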

² Unfortunately, data for China from the early part of the outbreak are not available from the Johns Hopkins University repository.

Tasks for CovidCases.csv

1. Fit a generalised linear model (GLM) or quasi-likelihood model (whichever you deem the most pertinent in this case), using an appropriate error structure, to model the number of deaths per confirmed case (known as the case fatality rate). Use all the predictors available in the dataset (i.e. fit a full model). Justify your decisions along the way. Show and interpret your final model output; in particular, comment on the effect size of each predictor. [4 marks]

2. Refit the model identified in task 1, but now only consider countries that have recorded 10 or more deaths. Show and interpret your final model output relative to the model in task 1. Do you think it is more sensible to fit a model to this subset of the data if we were interested in performing inference on the factors associated with the case fatality rate due to Covid-19? Justify your answer. [3 marks]

3. Assess the assumptions of the model fitted in task 2 using appropriate model diagnostic tests and plots. Provide a clear explanation and interpretation for each test and plot used. [4 marks]

4. Starting with the full model identified in task 2, perform an all-possible-subsets model selection using an appropriate information criterion (justify your choice). Show the top 5 models and interpret the results. [3 marks]

5. Use data from countries that have recorded 10 or more deaths to fit a LASSO model. Consider all the available predictors and use 10-fold cross-validation to estimate the regularisation parameter λ. Plot how the regression coefficients (label them) and residual deviance change as a function of log(λ). On the plot, clearly highlight the value for λ that minimises the cross-validation (CV) error (quantified by the residual deviance). Show and interpret both plots and the final fitted model (taken to be the one that minimises the CV error). [5 marks]

6. Use data from countries that have recorded 10 or more deaths to fit a penalised regression spline. Include a smooth term for each predictor. Set the value for k (the dimension of the basis used to represent the smooth term) to be the same for all covariates. Compare the partial residual plots (show the residuals and confidence bands) for each predictor for a model with k=5 and k=10. Show and discuss the fitted models. [5 marks]
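
For orientation, one common way to write a model for deaths out of confirmed cases is sketched below. The tasks deliberately leave the choice of error structure and link open, so the binomial/quasi-binomial form, the logit link and the symbols $p_i$, $\beta_j$, $x_{ij}$ and $\phi$ are assumptions for illustration, not the prescribed answer. Here $x_{i1}, \dots, x_{i7}$ stand for the seven country-level predictors listed in the data description.

```latex
\begin{align*}
\text{Deaths}_i &\sim \operatorname{Binomial}(\text{Confirmed}_i,\; p_i), \\
\operatorname{logit}(p_i) &= \log\frac{p_i}{1 - p_i}
  = \beta_0 + \sum_{j=1}^{7} \beta_j x_{ij}, \\
\operatorname{Var}(\text{Deaths}_i) &= \phi\,\text{Confirmed}_i\, p_i (1 - p_i)
  \qquad (\phi = 1 \text{ for the binomial, estimated for the quasi-binomial}), \\
\hat{\beta}(\lambda) &= \arg\min_{\beta}
  \Bigl\{ -\ell(\beta) + \lambda \sum_{j=1}^{7} \lvert \beta_j \rvert \Bigr\}
  \qquad \text{(LASSO, with } \ell \text{ the log-likelihood)}.
\end{align*}
```

Software packages differ in how they scale the log-likelihood term in the LASSO objective, so values of $\lambda$ are not directly comparable across implementations.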

Tasks for CovidConfirmedTime.csv

7. Explore the dataset using any appropriate plots. [2 marks]

8. Use generalised estimating equations (GEEs) to fit a generalised linear model with the number of confirmed Covid-19 cases as outcome and day as a single explanatory variable. Use an appropriate error structure and a within-group correlation matrix to accommodate observations from the same country (justify your choice). Show and interpret the fitted model. [5 marks]

9. Plot the trajectory for the number of confirmed Covid-19 cases over time for the “average” country and compare that to the observed trajectory for the UK and Germany. Are these countries recording cases at a faster, slower or similar pace to the “average” country? Use CovidCases.csv to comment on whether the case fatality rates for these two countries are associated with the rate at which they are acquiring new cases. Justify your answers. [3 marks]

10. Assess the GEE model fitted in task 8 using appropriate model diagnostic tests and plots. Provide a clear explanation and interpretation for each plot. [4 marks]

11. Fit a mixed model with the number of confirmed Covid-19 cases as outcome and day as a single explanatory variable, but allowing for each country to have its own intercept and slope. Use an appropriate error structure and a within-group correlation matrix to accommodate observations from the same country (justify your choice). Display and interpret the model output. Hint: If you run into convergence issues you might want to model log(Confirmed) instead of Confirmed. [5 marks]

12. Extract the estimated slopes for each country and plot them as a histogram. Pick the three countries whose slopes differ the most from that of the “average” country. For these three countries, compare (graphically) the fitted models to what was observed, and to the model fit for the “average” country. Comment on the results. [3 marks]

13. Refit the model in task 11, but this time assume a common intercept (i.e. only allow for random slopes). Repeat task 12, then compare and comment on the two sets of results. Hint: If you run into convergence issues you might want to model log(Confirmed) instead of Confirmed. [3 marks]

14. Comment on the validity of the models fitted in this section (using the CovidConfirmedTime.csv dataset) once countries have passed the peak of the outbreak. [1 mark]
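
The hint in tasks 11 and 13 suggests modelling log(Confirmed). To show what a slope on that scale means, the sketch below fits an ordinary least-squares line to log(Confirmed) against Day for the six Australia rows in the preview. It is a single-country illustration in plain Python, not the GEE or mixed model the tasks require; the function name `fit_log_linear` and the doubling-time summary are additions for illustration.

```python
import math

# Australia rows from the CovidConfirmedTime.csv preview: (Day, Confirmed).
AUSTRALIA = [(0, 107), (1, 128), (2, 128), (3, 200), (4, 250), (5, 297)]


def fit_log_linear(rows):
    """Ordinary least squares of log(Confirmed) on Day; returns (intercept, slope)."""
    days = [d for d, _ in rows]
    logs = [math.log(c) for _, c in rows]
    n = len(rows)
    mean_d = sum(days) / n
    mean_y = sum(logs) / n
    sxx = sum((d - mean_d) ** 2 for d in days)
    sxy = sum((d - mean_d) * (y - mean_y) for d, y in zip(days, logs))
    slope = sxy / sxx
    return mean_y - slope * mean_d, slope


intercept, slope = fit_log_linear(AUSTRALIA)
doubling_time = math.log(2) / slope  # days for cases to double under this fit
print(f"slope={slope:.3f} per day, doubling time={doubling_time:.1f} days")
```

A slope b on the log scale corresponds to cases multiplying by exp(b) each day, so country-specific slopes can be read as country-specific growth rates.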
