MATH6152讲解、辅导R程序语言、讲解R编程设计、讲解Statistical Computing讲解R语言编程|辅导Python编程

MATH6152: Statistical Computing (R) Coursework
Assignment
This coursework assignment is worth 50% of the overall mark for the module.
The deadline is 1200 on Friday 25th October 2019 and your completed work must be submitted
electronically via Blackboard.
Your submission must consist of exactly one script containing R code to answer the two questions below.
Your script must have the filename .R. For example, if your student ID is 12345678, then
your script must have the filename 12345678.R.
You must informatively comment your code including indicating where each question and task begins.
Your entire script should be reprodicible and run without any errors.
Some tasks in the questions require you to create R objects. These should be named exactly as requested in
the question.
You must not load any add-on packages, e.g. using library or require.
If you are required to import data from a file, assume that this file is in your working directory. This means
you should not call the function setwd.
At the end of each question, there are some hints that you may find useful.
The standard Faculty rules on late submissions apply: for every weekday, or part of a weekday, that the
coursework is late, the mark will be reduced by a factor of 10%. No credit will be given for work which is
more than one week late.
All coursework must be carried out independently. You are reminded of the University’s Academic Integrity
Policy.
In the interests of fairness, queries which relate directly to the coursework must be made via the MATH6152:
Statistical Computing (R) Coursework Assignment Discussion Forum on Blackboard. This ensures
that everybody has access to the same information.
Total marks available: 100
1
Question 1: Random walk Metropolis-Hastings algorithm (50
marks)
In this question, you will write a function to simulate a sample from a given distribution using a random
walk Metropolis-Hastings algorithm.
Often in a statistical analysis, it is essential to be able to simulate (or generate) a sample from a given
continuous probability distribution. For example, R functions rnorm, runif and rgamma can simulate samples
from the normal, uniform and gamma distributions, respectively. However, sometimes we encounter a
continuous distribution which does not have a known form and for which an in-built R function does not
exist. In this case, we can use, for example, the following random walk Metropolis-Hastings algorithm to
simulate a sample.
Suppose we wish to simulate a sample of size M from the probability distribution of the continuous random
variable X œ (≠Œ,Œ). We are given the function g(x) Ø 0 such that the probability density function of X is
f(x) = g(x)
s Œ≠Œ g(x)dx.
The steps of the random walk Metropolis-Hastings algorithm are as follows.
1. Let x0 be the starting value.
2. For i = 1,...,M complete the following steps.
(a) Simulate a proposal
yi ≥ N (xi≠1, exp(„)),
i.e. a normal distribution with mean xi≠1 and variance exp(„).
(b) Calculate acceptance probability
–i = min ;
(c) Simulate ui ≥ uniform(0, 1). If –i Ø ui, then accept the proposal and set xi = yi; otherwise reject
the proposal and set xi = xi≠1.
The chain of values x1,...,xM represents an approximate (and not independent) sample from the probability
distribution of X (irrespective of the value of „). The proportion of proposals which are accepted is known
as the acceptance rate.
Complete the following tasks.
(A) Write a function called rwmet that has exactly two arguments:
• M: the sample size, M;
• phi: the quantity „ controlling the variance of the proposal distribution in step 2(a) of the
algorithm;
that will simulate a sample of size M from a probability distribution with
g(x) = exp (◊x) exp (≠ exp (◊x)),
where ◊ is the last digit of your student ID using the random-walk Metropolis-Hastings algorithm.
(If the last digit of your student ID is 0, then ◊ = 0.5). Set the starting value in the random walk
Metropolis-Hastings algorithm to be x0 = 1. [20 marks]
2
(B) We can use a sample simulated from the distribution of X to approximate the expectation of X.
This is simply done by approximating E(X) by the sample mean. It is a random (or Monte Carlo)
approximation in that if another sample is generated a dierent
approximation of E(X) will be produced.
We want to ensure that it is a good approximation, i.e. it has low variance. Use your function to
generate N = 100 samples of size M = 10000 with „ = ≠4. Calculate the variance of the N means of
your samples. [5 marks]
(C) In general, if „ is increased then the average acceptance probability will decrease. Conversely, if „
is decreased then the average acceptance probability will increase. Theoretical results suggest that
under certain conditions the optimal acceptance rate for a random walk Metropolis-Hastings algorithm
is 0.234. Create a plot of acceptance rate against an equally-spaced sequence of C = 100 dierent
values of „ ranging from -5 to 10. For each „, the accptance rate should be based on a sample of size
M = 10000. Based on your plot, approximately what value of „ is optimal? [20 marks]
(D) Use your function to generate N = 100 samples of size M = 10000 with „ equal to the optimal value
chosen in Task (C). Calculate the variance of the N means of your samples. Comment on the dierence
between the variance calculated here and in Task (B). [5 marks]
Hints:
Question 2: Iterated weighted least squares (50 marks)
In this question, you will write a function to find the maximum likelihood estimates for a logistic regression
model using iterated weighted least squares (IWLS).
Under a logistic regression model, the ith response has
yi ≥ Bernoulli(µi), (1)
for i = 1,...,n. In (1), the mean µi is linked to the linear predictor, ÷i, via
g(µi) = log 3 µi
where — = (—1,..., —p) is a p ◊ 1 vector of unknown parameters and xi = (xi1,...,xip)
T is a p ◊ 1 vector of
explanatory variables for the i response.
IWLS involves iterating the following weighted least squares estimate for j = 1, 2,... until convergence
where z(j≠1) is an n ◊ 1 vector with ith value given by
and W(j≠1) is an n ◊ n diagonal matrix with ith diagonal value given by. (3)
Note that, in (2) and (3).
Convergence occurs when
max
k=1,...,p |—ˆ(j≠1)
k ≠ —ˆ(j)
k | < ‘,
where ‘ > 0 is a pre-specified tolerance and —ˆ(j)
k is the kth element of —ˆ(j)
. At the point of convergence, —ˆ(j)
is returned as the maximum likelihood estimates of —. Note that —(j) does not mean — to the power of j, it
means that —(j) is one vector in a sequence —(0), —(1),..., —(j)
,.... The same applies for µ(j)
i and W(j)
ii .
Complete the following tasks.
(A) Write a function called iwls that given a vector of binary responses (0s and 1s), and a model matrix
will return the maximum likelihood estimates of — for a logistic regression model. The function should
have exactly four arguments:
4
• X: the n ◊ p model matrix X;
• y: the n ◊ 1 vector of responses y = (y1,...,yn);
• itmax: the maximum number of iterations of the IWLS algorithm, i.e. the highest value that j
can take;
• eps: the tolerance, ‘, used to assess convergence.
Set itmax and eps to have default values of 25 and 10≠9, respectively. The starting value —(0) should
be the corresponding least squares estimate for —. [35 marks]
(B) The data stored in the file diab.txt is originally from the US National Institute of Diabetes and
Digestive and Kidney Diseases and contains measurements from n = 768 female individuals. The
objective of the formation of the data is to predict whether or not a patient has diabetes based on a
series of eight measurements. The variables in the dataset are described in Table 1.
Table 1: Variables in the diab.txt file.
R name Symbol Description
preg z1 Number of times pregnant
gluc z2 Plasma glucose concentration a 2 hours in an oral glucose tolerance test
bp z3 Diastolic blood pressure
skin z4 Triceps skin fold thickness
insulin z5 2-Hour serum insulin
bmi z6 Body mass index
ped z7 Diabetes pedigree function
age z8 Age
out y Binary response variable (0 = no diabetes, 1 = diabetes)
Read in the data in the file diab.txt. Consider a logistic regression model with yi as the response (see
Table 1) and the following linear predictor:÷i = ÿk=1
—kxik,
where, if k = 1, then xik = 1 and if k > 1 then xik is given by the value of zk≠1 for the ith person (see
Table 1). In other words, the model has a linear predictor with an intercept, and a term for each of the
explanatory variables (the z’s) in Table 1.
Using your iwls function, find the maximum likelihood estimates of — = (—1,..., —9)T for the above
model. Use the default values for itmax and eps. Fit the same model with the in-built glm function
and comment on any dierences
between the estimates found by your iwls function and by glm.
[15 marks]
Hints:
• A diagonal matrix is a matrix with non-zero elements on the leading diagonal and zeros elsewhere.
• The R function as.vector can convert an R matrix into an R vector.
• The R function as.matrix can convert an R data frame into an R matrix.