MT5764: Advanced Data Analysis
Major Course Assignment
Housekeeping
• This project is about modelling Covid-19 data. I hope you will find this an opportunity to demonstrate
the skills that you have developed in this module while simultaneously aiding in the effort to answer
questions of global interest. However, if you find this a distressing topic to work on, please contact me
directly before Tuesday, 14th April 2020 to arrange for a different dataset to complete this assignment.
• This major course assignment replaces your exam and thus comprises 60% of the overall module
mark.
• This is an individual project. The submitted coursework should reflect the work of you as an
individual. Suspected cases of copying will be taken very seriously, so please adhere to the University’s
guidelines on good academic practice. If you have any uncertainties or questions about this, please
contact me.
• I recommend you attempt every part of the assignment; even if you do not finish everything, marks are
likely to be awarded for incomplete tasks/code. Remember I cannot allocate marks to a blank sheet of
paper, so help me to help you.
Submission
• Write a succinct report that includes a clear and detailed description of how you have answered each
task in the assignment, justifying each decision taken along the way and referring to the corresponding
code.
• Only include model output summaries and well-labelled plots that you describe and refer to in the
write-up. You will be penalised for including superfluous output or code that is not discussed
in the report or does not attempt to answer the task. Please include just your student ID at the start
of your report. Do not include your name anywhere in your report.
• Please comment and annotate your final code, and name functions and variables sensibly. Make sure
you only submit the code that was used to answer the specific tasks. Your code needs to be succinct
and comprehensible. Marks will be deducted if I cannot follow what you have done.
• You can use R and/or SAS to answer each task in this assignment. For example, if you find the
modelling more convenient in SAS but prefer R’s graphical tools (or vice-versa), then feel free to use a
mix, making it clear in your report. Save your final, well-annotated code as a single script per language, named
with your student ID, e.g. ID_12345.R (replace 12345 with your student ID). Please include just your student ID at
the top of your script. Do not include your name anywhere in your scripts.
• You are free to write your report using whatever software you are comfortable with. For example, R
Markdown, Jupyter, LaTeX or Word. However, you must convert your final report into a single PDF,
saved using your student number (e.g. ID_12345.pdf).
• Compress your report (e.g. ID_12345.pdf) and code (e.g. ID_12345.R and/or ID_12345.sas) into a
single zip file (.zip), saved using your student ID (e.g. ID_12345.zip) and upload to Moodle.
• To be clear, you are required to upload to Moodle a single zip file (e.g. ID_12345.zip), which contains
a single PDF of your final report and one or two final scripts (R/SAS)¹.
• Deadline is Friday, 15th May 2020, 23:59 (UK time). Please do not leave it to the last minute to
upload your work.
• The School has a lateness policy. The standard policy is an initial penalty of 15% of the maximum
available mark, then a further 5% per 8-hour period, or part thereof, for work submitted late without
good reason.
¹ If you are writing your report using interactive notebooks, such as R Markdown or Jupyter, then you do not need to
re-upload your code, as long as it is all included and commented within your notebook.
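The packaging step described in the submission list can be sketched in a few lines of Python. This is only an illustration: the helper name package_submission is hypothetical, the ID 12345 is the placeholder used in the examples above, and the demonstration writes empty stand-in files to a temporary folder rather than touching real coursework.

```python
import tempfile
import zipfile
from pathlib import Path


def package_submission(student_id, folder):
    """Zip ID_<id>.pdf plus any ID_<id>.R / ID_<id>.sas found in `folder`.

    Hypothetical helper: the file-name pattern follows the examples in the
    brief; nothing here is an official submission tool.
    """
    folder = Path(folder)
    stem = f"ID_{student_id}"
    report = folder / f"{stem}.pdf"
    if not report.exists():
        raise FileNotFoundError(f"missing report: {report.name}")
    # Include whichever of the R and SAS scripts are present.
    scripts = [folder / f"{stem}{ext}" for ext in (".R", ".sas")]
    members = [report] + [s for s in scripts if s.exists()]
    archive = folder / f"{stem}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for member in members:
            # Store bare file names so the archive has no folder structure.
            zf.write(member, arcname=member.name)
    return archive


# Demonstration with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("ID_12345.pdf", "ID_12345.R"):
        (Path(tmp) / name).write_bytes(b"")
    archive = package_submission("12345", tmp)
    with zipfile.ZipFile(archive) as zf:
        packaged = sorted(zf.namelist())

print(packaged)
```

Listing the archive contents before uploading is a cheap way to confirm that it holds exactly one PDF and one or two scripts.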
Assignment
Data
The assignment involves in-depth statistical analysis of the following two Covid-19 datasets, which you will
need to download from Moodle. The main data source is Johns Hopkins University’s repository (which in turn
pools data from various other sources), coupled with some country-level statistics.
1. CovidCases.csv - The number of Covid-19 cases as of the day this assignment was set and a few
country-level metrics.
  Country             Deaths Confirmed PopDensity MedianAge UrbanPop Bed  Lung HealthExp        GDP
1 Afghanistan             14       444         60        18       25 0.5 37.62       184   481.2432
2 Albania                 22       400        105        36       63 2.9 11.67       774  5357.5704
3 Algeria                205      1572         18        29       73 1.9  8.77      1031  3940.1799
4 Antigua and Barbuda      2        19        223        34       26 3.8 11.76      1105 17236.9778
5 Argentina               63      1715         17        32       93 5.0 29.27      1390  9856.4304
6 Armenia                  9       881        104        35       63 4.2 23.86       883  4536.9212
• Country: Data from different states/regions/counties were pooled under a single Country name.
• Deaths: The number of deaths due to Covid-19 (persons).
• Confirmed: The number of confirmed Covid-19 cases (persons).
• PopDensity: Population density (persons/km²).
• MedianAge: Median age of the population (years).
• UrbanPop: The percentage of the population that live in urban areas (%).
• Bed: Hospital beds per 1,000 people (beds/1000 persons).
• Lung: Death rate per 100,000 people due to lung disease (deaths/100,000 persons).
• HealthExp: Total health expenditure per capita in US dollars ($/person).
• GDP: The nominal gross domestic product per capita (a measure of a country’s economy) in US dollars
($/person).
2. CovidConfirmedTime.csv - The number of confirmed cases of Covid-19 over time, for the 30 worst
affected countries, excluding China².
Country Day Confirmed
1 Australia 0 107
2 Australia 1 128
3 Australia 2 128
4 Australia 3 200
5 Australia 4 250
6 Australia 5 297
• Country: Data from different states/regions/counties were pooled under a single Country name.
• Day: Days since the 100th confirmed case.
• Confirmed: The number of confirmed Covid-19 cases up to and including that day.
² Unfortunately, data for China from the early part of the outbreak are not available from the Johns Hopkins University
repository.
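To make the Day variable concrete, the sketch below re-indexes a cumulative case series so that day 0 is the first day on which the count reaches 100. It assumes "the 100th confirmed case" means the first cumulative count at or above 100; the function name days_since_threshold is hypothetical, and the three pre-threshold values are invented for illustration (only the values from 107 onwards come from the Australia preview).

```python
def days_since_threshold(cumulative, threshold=100):
    """Return (day, count) pairs, with day 0 the first count >= threshold.

    Assumption: the threshold day is the first day whose cumulative count
    is at or above `threshold`. Returns [] if it is never reached.
    """
    for start, count in enumerate(cumulative):
        if count >= threshold:
            return list(enumerate(cumulative[start:]))
    return []


# First three values are hypothetical pre-threshold counts;
# the remaining six are the Australia rows from the preview.
series = [63, 76, 91, 107, 128, 128, 200, 250, 297]
aligned = days_since_threshold(series)

for day, confirmed in aligned:
    print(day, confirmed)
```

Aligning every country on the same threshold puts their trajectories on a common time axis, which is what allows them to be compared later.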
Tasks for CovidCases.csv
1. Fit a generalised linear model (GLM) or quasi-likelihood model (whichever you deem the most pertinent
in this case), using an appropriate error structure, to model the number of deaths per confirmed case
(known as the case fatality rate). Use all the predictors available in the dataset (i.e. fit a full model).
Justify your decisions along the way. Show and interpret your final model output, in particular comment
on the effect size of each predictor. [4 marks]
2. Refit the model identified in task 1, but now only consider countries that have recorded 10 or more
deaths. Show and interpret your final model output relative to the model in task 1. Do you think it is
more sensible to fit a model to this subset of the data if we were interested in performing inference on
the factors associated with the case fatality rate due to Covid-19? Justify your answer. [3 marks]
3. Assess the assumptions of the model fitted in task 2 using appropriate model diagnostic tests and
plots. Provide a clear explanation and interpretation for each test and plot used. [4 marks]
4. Starting with the full model identified in task 2, perform an all-possible-subsets model selection using
an appropriate information criterion (justify your choice). Show the top 5 models and interpret the
results. [3 marks]
5. Use data from countries that have recorded 10 or more deaths to fit a LASSO model. Consider all the
available predictors and use 10-fold cross-validation to estimate the regularisation parameter λ. Plot
how the regression coefficients (label them) and residual deviance change as a function of log(λ). On
the plot, clearly highlight the value of λ that minimises the cross-validation (CV) error (quantified by
the residual deviance). Show and interpret both plots and the final fitted model (taken to be the one
that minimises the CV error). [5 marks]
6. Use data from countries that have recorded 10 or more deaths to fit a penalised regression spline.
Include a smooth term for each predictor. Set the value for k (the dimension of the basis used to
represent the smooth term) to be the same for all covariates. Compare the partial residual plots (show
the residuals and confidence bands) for each predictor for a model with k=5 and k=10. Show and discuss
the fitted models. [5 marks]
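As a small illustration of the error-structure question raised in task 1, the sketch below fits a binomial model with a logit link to the six preview rows of CovidCases.csv, using MedianAge as the only predictor, and then computes the Pearson dispersion estimate. It is a toy example under stated assumptions, not the full model the task asks for: the function name fit_binomial_logit is hypothetical, the Newton-Raphson fit is hand-rolled in pure Python, and six rows are far too few to support any conclusion about the predictors.

```python
import math

# (Country, Deaths, Confirmed, MedianAge) copied from the CovidCases.csv preview.
rows = [
    ("Afghanistan", 14, 444, 18),
    ("Albania", 22, 400, 36),
    ("Algeria", 205, 1572, 29),
    ("Antigua and Barbuda", 2, 19, 34),
    ("Argentina", 63, 1715, 32),
    ("Armenia", 9, 881, 35),
]
deaths = [r[1] for r in rows]
confirmed = [r[2] for r in rows]
mean_age = sum(r[3] for r in rows) / len(rows)
age_centred = [r[3] - mean_age for r in rows]  # centring helps numerical stability


def fit_binomial_logit(y, n, x, max_iter=50, tol=1e-10):
    """Newton-Raphson for logit(p_i) = b0 + b1 * x_i, y_i ~ Binomial(n_i, p_i)."""
    pooled = sum(y) / sum(n)
    b0, b1 = math.log(pooled / (1 - pooled)), 0.0  # start at the pooled rate
    for _ in range(max_iter):
        p = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [ni * pi * (1 - pi) for ni, pi in zip(n, p)]  # binomial variances
        r = [yi - ni * pi for yi, ni, pi in zip(y, n, p)]  # raw residuals
        # Score vector and 2x2 Fisher information.
        u0, u1 = sum(r), sum(ri * xi for ri, xi in zip(r, x))
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


intercept, slope = fit_binomial_logit(deaths, confirmed, age_centred)
p_hat = [1 / (1 + math.exp(-(intercept + slope * x))) for x in age_centred]
fitted = [n * p for n, p in zip(confirmed, p_hat)]

# Pearson chi-square divided by residual degrees of freedom (6 rows - 2 parameters).
pearson = sum((y - mu) ** 2 / (mu * (1 - p))
              for y, mu, p in zip(deaths, fitted, p_hat))
dispersion = pearson / (len(rows) - 2)

print(f"slope per year of median age: {slope:.4f}")
print(f"Pearson dispersion estimate:  {dispersion:.1f}")
```

A dispersion estimate well above 1 is the usual signal that a plain binomial variance is too small and that a quasi-likelihood fit may be more defensible. In R, the analogous fits would typically use glm() with family = binomial or family = quasibinomial.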
Tasks for CovidConfirmedTime.csv
7. Explore the dataset using any appropriate plots. [2 marks]
8. Use generalised estimating equations (GEEs) to fit a generalised linear model with the number of
confirmed Covid-19 cases as outcome and day as a single explanatory variable. Use an appropriate
error structure and a within-group correlation matrix to accommodate observations from the same
country (justify your choice). Show and interpret the fitted model. [5 marks]
9. Plot the trajectory for the number of confirmed Covid-19 cases over time for the “average” country
and compare that to the observed trajectory for the UK and Germany. Are these countries recording
cases at a faster, slower or similar pace to the “average” country? Use CovidCases.csv to comment
on whether the case fatality rates for these two countries are associated with the rate at which they are
acquiring new cases. Justify your answers. [3 marks]
10. Assess the GEE model fitted in task 8 using appropriate model diagnostic tests and plots. Provide a
clear explanation and interpretation for each plot. [4 marks]
11. Fit a mixed model with the number of confirmed Covid-19 cases as outcome and day as a single
explanatory variable, but allowing for each country to have its own intercept and slope. Use an
appropriate error structure and a within-group correlation matrix to accommodate observations from
the same country (justify your choice). Display and interpret the model output. Hint: If you run into
convergence issues you might want to model log(Confirmed) instead of Confirmed. [5 marks]
12. Extract the estimated slopes for each country and plot them as a histogram. Pick the top three countries
whose slope differs the most from the “average” country. For these three countries, compare (graphically)
the fitted models to what was observed, and to the model fit for the “average” country. Comment on
the results. [3 marks]
13. Refit the model in task 11, but this time assume a common intercept (i.e. only allow for random slopes).
Repeat task 12. and compare and comment on the two sets of results. Hint: If you run into convergence
issues you might want to model log(Confirmed) instead of Confirmed. [3 marks]
14. Comment on the validity of the models fitted in this section (using the CovidConfirmedTime.csv
dataset) once countries have passed the peak of the outbreak. [1 mark]
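The hint about modelling log(Confirmed) can be illustrated with a single-country sketch. The code below fits a straight line to log(Confirmed) against Day by ordinary least squares, using only the six Australia rows from the preview, and converts the slope into a doubling time. It is not a GEE or a mixed model: it ignores correlation between successive days and says nothing about other countries. The function name log_linear_fit is hypothetical.

```python
import math

# (Day, Confirmed) pairs copied from the Australia rows in the preview.
australia = [(0, 107), (1, 128), (2, 128), (3, 200), (4, 250), (5, 297)]


def log_linear_fit(points):
    """Least-squares fit of log(count) = intercept + slope * day."""
    days = [d for d, _ in points]
    logs = [math.log(c) for _, c in points]
    mean_d = sum(days) / len(days)
    mean_l = sum(logs) / len(logs)
    sxx = sum((d - mean_d) ** 2 for d in days)
    sxy = sum((d - mean_d) * (l - mean_l) for d, l in zip(days, logs))
    slope = sxy / sxx
    return mean_l - slope * mean_d, slope


intercept, slope = log_linear_fit(australia)
# Under exponential growth the count doubles every log(2) / slope days.
doubling_time = math.log(2) / slope

print(f"growth rate on the log scale: {slope:.3f} per day")
print(f"implied doubling time:        {doubling_time:.1f} days")
```

A constant slope on the log scale corresponds to exponential growth in the cumulative count, which is why the validity question in task 14 arises once a country has passed its peak and the cumulative curve flattens.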