MTHM502 Introduction to Data Science and Statistical Modelling
Assignment
Please make sure that the submitted work is your own. This is NOT a group assignment,
so approaches and solutions should not be discussed with other students. Plagiarism and
collusion with other students are examples of academic misconduct and will be reported. More
information on academic honesty can be found here.
1. The colour of the human eye is determined by a pair of genes. If both of these genes code the colour
blue, then the given person will have blue eyes. If at least one of the genes codes the colour brown,
then the person will have brown eyes. That is, if we denote by ‘A’ the gene coding the colour brown,
and by ‘a’ the gene coding the colour blue, then we have the following
Gene pair   Eye colour
AA          Brown
Aa          Brown
aA          Brown
aa          Blue
A child inherits one gene from each of their parents. That is, one gene is chosen randomly (with
equal probability) from the gene-pair of their father, and one gene is chosen randomly (with equal
probability) from the gene-pair of their mother. Below are two examples, where the entries of the
tables show the possible gene-pairs of the children. Note that each of these gene-pairs has equal
probability.
Example 1:
                  Father's genes
                  A       a
Mother's     A    AA      Aa
genes        a    aA      aa
Example 2:
                  Father's genes
                  A       A
Mother's     A    AA      AA
genes        a    aA      aA
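The inheritance rule can be checked by enumeration. The following is a minimal sketch in R that rebuilds the table of Example 1, assuming both parents carry the gene-pair Aa. The variable names (father, mother, children) are illustrative and not part of the assignment.

```r
# Gene pairs of the two parents in Example 1 (assumed: both parents are Aa).
father <- c("A", "a")
mother <- c("A", "a")

# Every combination of one gene from the mother and one from the father;
# each row is equally likely.
children <- expand.grid(mother = mother, father = father,
                        stringsAsFactors = FALSE)
children$pair <- paste0(children$mother, children$father)

# Brown if at least one gene is "A", blue otherwise.
children$eyes <- ifelse(children$mother == "A" | children$father == "A",
                        "Brown", "Blue")
children$prob <- 1 / nrow(children)
print(children)
```

Changing the father vector to c("A", "A") gives the gene-pairs of Example 2.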
Assume that Aaron and both of his parents have brown eyes, but Aaron’s sister has blue eyes.
(a) [3 marks] What is the probability that Aaron has a blue eye gene?
(b) [6 marks] Assume that Aaron’s wife has blue eyes. What is the probability that their first child
will have blue eyes?
(c) [10 marks] Suppose that Aaron and his wife’s first child ended up having brown eyes (and not
blue). How does this information change the probability that Aaron has a blue eye gene? What
is the probability that their second child will have brown eyes too?
2. Assume that a new Conservative Party leadership election has been triggered in the UK at a time when
there are 361 Conservative MPs in Parliament. Two of these MPs, M and B, join the leadership
contest, where the aim is to get the majority support of the remaining 359 Conservative MPs.
We further assume that on the day the leadership contest is announced 184 of these MPs support
M, and the remaining 175 MPs support B in becoming the next party leader. The announcement
is followed by an election campaign, during which MPs can decide to change their allegiance. In
particular, we know that on any given day, there is a probability of 0.005 that an MP who has been
supporting M will become a B supporter by the end of the day, while the probability that an MP who
has been supporting B will become an M supporter by the end of the day is 0.004. Each MP makes
their decision independently of the other MPs, and independently of the decision they made the day before.
(a) [4 marks] Introduce the following random variables:
\[
X_i^{(1)} =
\begin{cases}
1, & \text{B supporter number } i \text{ still supports B at the end of day 1},\\
0, & \text{B supporter number } i \text{ changes to an M supporter at the end of day 1},
\end{cases}
\]
for $i = 1, \ldots, 175$; and
\[
X_i^{(2)} =
\begin{cases}
1, & \text{M supporter number } i \text{ changes to a B supporter at the end of day 1},\\
0, & \text{M supporter number } i \text{ still supports M at the end of day 1},
\end{cases}
\]
for $i = 1, \ldots, 184$.
Using these random variables express the number of B supporters at the end of the first day, then
use your formula to find the expected number of B supporters at the end of the first day. Justify
every step of your argument.
(b) [3 marks] Define random variables $\tilde{X}_i^{(1)}$, $i = 1, \ldots, 175$, and $\tilde{X}_i^{(2)}$, $i = 1, \ldots, 184$, whose sum gives
you the number of M supporters at the end of the first day. What is the expected number of M
supporters at the end of the first day?
(c) [6 marks] R: The election campaign is set to last for 2 weeks. This means that each MP would
vote according to the allegiance they have at the end of day 14, that is, the candidate they would
vote for is the one they are supporting after the first 14 days of the campaign. Using simulation
find the probability that in this election B would hold the majority of the votes among the 359
MPs.
(d) [3 marks] R: Now suppose that the election had to be postponed, and with the new date, candidates
now have a 60-day campaign period (as opposed to 14 days). Adjust your code from
part 2c to find the probability that B will win the delayed election. How does this probability
compare to the one computed in part 2c?
3. Observations Y1, Y2, . . . , Yn are assumed to be independent and identically distributed samples from a
data model following a Rayleigh distribution, with probability density function:
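\[
f(y; \theta) = \frac{y}{\theta} e^{-y^2/(2\theta)}
\]
for $\theta > 0$ and $0 < y < \infty$.
The mean of this distribution is
\[
\mu = \sqrt{\frac{\pi\theta}{2}},
\]
and the variance is
\[
\sigma^2 = \frac{\theta(4 - \pi)}{2}.
\]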
(Note that here π is not a parameter, it is the usual mathematical constant i.e. 3.14...)
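The stated mean and variance can be sanity-checked by simulation. The sketch below assumes the Rayleigh density $f(y; \theta) = (y/\theta) e^{-y^2/(2\theta)}$, whose distribution function is $F(y) = 1 - e^{-y^2/(2\theta)}$, and draws samples by inverting it. The value of theta, the number of draws and the seed are arbitrary illustrative choices, not values from the assignment.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
theta <- 2                       # hypothetical parameter value
n_sim <- 1e5                     # hypothetical number of draws

# Inverse-CDF sampling: F(y) = 1 - exp(-y^2 / (2 * theta)),
# so Y = sqrt(-2 * theta * log(U)) with U ~ Uniform(0, 1).
y <- sqrt(-2 * theta * log(runif(n_sim)))

# Theoretical moments as stated in the question.
mean_theory <- sqrt(pi * theta / 2)
var_theory  <- theta * (4 - pi) / 2

# Compare simulated and theoretical values.
print(c(simulated = mean(y), theoretical = mean_theory))
print(c(simulated = var(y),  theoretical = var_theory))
```

With this many draws, the simulated and theoretical values should agree to roughly two decimal places.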
(a) [2 marks] Find the method of moments estimator $\hat{\theta}$ of $\theta$.
(b) [5 marks] Is your estimator $\hat{\theta}$ unbiased? If not, then suggest an adjustment to this estimator that
would make it unbiased and report your final unbiased estimator. Hint: If $E(\hat{\theta}) = c\theta$, then the
estimator $\frac{1}{c}\hat{\theta}$ is unbiased. Also, remember that we can express second moments using the formula
of the variance.
(c) [4 marks] An alternative estimator is $\tilde{\theta} = \frac{1}{2n}\sum_{i=1}^{n} Y_i^2$. Is this estimator unbiased? If not, suggest
an adjustment that makes it unbiased. See hints given in part 3b.
(d) [5 marks] Using the fact that the random variable $X = Y^2$ is exponentially distributed with rate
$\frac{1}{2\theta}$, assess whether the estimator $\tilde{\theta}$ from part 3c is consistent.
(e) [6 marks] We have 150 samples from a Rayleigh distribution with sample mean 3.2. Using an
appropriate point estimator of θ, suggest a suitable estimate of the variance, and use this variance
estimate to construct an approximate 95% confidence interval for the mean of the distribution.
(You can use R to find the relevant quantiles).
4. Consider the data set Y1, Y2, . . . , Yn that is assumed to have arisen from the data model with probability
density function
\[
f(y; \theta) =
\begin{cases}
k(1 - y)y^{\theta+1}, & 0 < y < 1,\\
0, & \text{otherwise},
\end{cases}
\]
where $\theta > 0$.
(a) [4 marks] Find the constant k that makes the above function a probability density function.
(b) [6 marks] Show that the maximum likelihood estimator $\hat{\theta}$ of $\theta$ is given by the solution to the
equation:
\[
\hat{\theta}^2 \sum_{i=1}^{n} \log(Y_i) + \hat{\theta}\left[5\sum_{i=1}^{n} \log(Y_i) + 2n\right] + 6\sum_{i=1}^{n} \log(Y_i) + 5n = 0.
\]
(c) [5 marks] R: Let $y_1, \ldots, y_{30}$ below be 30 samples from this distribution:
0.573 0.770 0.652 0.827 0.821 0.789
0.898 0.718 0.382 0.668 0.647 0.477
0.661 0.380 0.870 0.794 0.783 0.732
0.629 0.777 0.600 0.724 0.553 0.693
0.687 0.935 0.494 0.411 0.530 0.478
To produce a maximum likelihood estimate for θ based on these data, use the polyroot function
of R.
Hint: polyroot finds the roots of a polynomial. Its argument is the vector of polynomial coefficients
in increasing order. For example, to find the roots of the polynomial $p(x) = x^2 + 2x - 3$ we can
use
rt <- polyroot(c(-3,2,1))
Even though both roots that you will get are real, polyroot gives these roots in complex form
(don’t worry about what this means). You can use the Re() function to extract the real part of
complex numbers. That is if the outcome of the polyroot function is stored in the variable rt,
then we can use the following to get the desired roots.
rt_real <- Re(rt)
rt_real
## [1] 1 -3
Note that this code lists all the roots of a polynomial. You will have to check which one of these
is a local maximum.
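As a further illustration of the hint, the sketch below keeps only the roots whose imaginary part is numerically negligible and substitutes them back into the polynomial to confirm that they are roots. It uses the example polynomial from the hint, not the equation from part 4b. The tolerance tol and the helper p are hypothetical choices made for this illustration.

```r
# Coefficients in increasing order: p(x) = -3 + 2x + x^2 (the example above).
coefs <- c(-3, 2, 1)
rt <- polyroot(coefs)

# Treat a root as real when its imaginary part is numerically negligible.
tol <- 1e-8                      # hypothetical tolerance
rt_real <- Re(rt[abs(Im(rt)) < tol])

# Helper (illustrative): evaluate the polynomial at each value of x.
p <- function(x) {
  sapply(x, function(xi) sum(coefs * xi^(seq_along(coefs) - 1)))
}

# Values should be numerically close to 0 at each real root.
print(p(rt_real))
```

The same filtering step applies to any coefficient vector passed to polyroot. Deciding which root corresponds to a local maximum still requires a separate check.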
(d) [3 marks] R: Produce a plot of the fitted probability density function using the estimate of θ
obtained from 4c.
5. The file ‘ozone.csv’, available on the course ELE page, contains information on ozone levels recorded
over 111 days from May to September 1973 in New York. The variables measured were:
ozone        ozone levels, in parts per billion (ppb)
radiation    in langleys
temperature  in degrees Fahrenheit
wind         in miles per hour (mph)
Read these data into R and answer the following questions.
(a) [6 marks] Carry out exploratory data analysis, and produce a matrix scatterplot of the dataset.
Comment on your findings and what these plots suggest about the likely relationships between
the response variable (ozone) and the other variables.
(b) [9 marks] Fit a multiple regression of ozone as the response variable, against radiation,
temperature and wind as the explanatory variables (use all three, when fitting the model).
Comment on the summary of the model. What do these coefficients suggest about the relationship
between ozone and the other variables? Are these findings consistent with your earlier descriptive
plots? Also include suitable residual plots, commenting as appropriate.
(c) [10 marks] A colleague suggests you implement the following model,
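\[
\log(\text{ozone}_i) = \beta_0 + \beta_1 \log(\text{radiation}_i) + \beta_2 \log(\text{temperature}_i) + \beta_3 \log(\text{wind}_i) + \varepsilon_i, \quad \text{where } \varepsilon_i \sim N(0, \sigma^2).
\]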
Fit this new model to the data to obtain estimates for the regression coefficients. Produce a plot of
the residuals against the fitted values, and a Q-Q plot of the residuals. Comment on the outputs
from the modelling (comparing it to the previously fitted model), paying particular attention to
the interpretation of the coefficients. Express the impact of the explanatory variables on the ozone
levels, with the latter expressed on the original (untransformed) scale.
Total for paper = 100 marks.
The submitted work should be your own work! The questions apart from Q2(c), Q2(d),
Q4(c), Q4(d) and Q5 are theoretical exercises, and should be solved using results we covered
in lectures. Make sure you justify each step of the theoretical reasoning by clearly stating the
theorem/property you are using (marks will be awarded for these). Also make sure that you
add comments to each section of your R code, explaining what you’re doing. All the relevant
R output (computed probabilities, plots, etc) should be included in your submission! A pdf
document with your R code, R output and the solutions to the theoretical exercises should be
submitted through EBART by Noon (12pm), 2nd December. Note that late submissions will
be penalised.