辅导MATH6166、辅导R编程设计

MATH6166 Statistical Computing for Data Scientist — Coursework
Instructions for Coursework
1. This coursework assignment is worth 100% of the overall mark for the module MATH6166.
2. The deadline is 1500 on Thursday 17 November 2022, and your completed work must be submitted
electronically via Blackboard.
3. Your submission must consist of exactly one PDF file, which includes all the plots/graphics asked
in the questions, together with one R script file containing all the R code you used to obtain your
results for all questions. Missing R codes will be penalised. In the PDF, you must label clearly with
which questions the figures are associated. In the R script, you must use comments to make it clear
to which questions the R codes are answering. Please ensure the submitted files have the correct file
format; otherwise it cannot be marked. Note an R markdown file is a different format to an R script
file.
4. Please name your PDF file as .pdf. For example, if your student ID is 12345678,
then your report must have the file name 12345678.pdf.
5. Your R script file must have the file name .R. For example, if your student ID is
12345678, then your script must have the file name 12345678.R.
6. Your entire R script file must be reproducible and run without any errors.
7. You must present elegant and informative plots in the PDF file, including the use of appropriate plotting
area, points/lines sizes and colours, as well as including meaningful labels, titles and legends.
8. You must informatively comment your R code indicating where each question and task begins. When
writing an R function, the 1) function and 2) argument names must be exactly the same as those
specified in the questions; otherwise a heavy penalty will be applied.
Please note that uppercase and lowercase of the same alphabet are not the same character, so
if the question asks to write a function/argument called, apple, but a function/argument called
applE is presented in your answer, than that would be wrong.
9. You must not load any add-on packages, i.e., using functions library or require etc.
10. The standard Faculty rules on late submissions apply: for every weekday, or part of a weekday, that
the coursework is late, the mark will be reduced by a factor of 10%. No credit will be given for work
which is more than one week late.
11. All coursework must be carried out independently. You are reminded of the University’s Academic
Integrity Policy, see https://www.southampton.ac.uk/quality/assessment/academic_integrity.page.
12. In the interests of fairness, queries which relate directly to the coursework must be made via the
MATH6166 Discussion Forum on Blackboard. This ensures that everybody has access to the same
information.
Total marks available: 100
1
Question 1: Wild Life Population Density
[30 marks] An ecologist is interested in the population density of a bird species at various sites across three
different habitat types. The data collected by the ecologist is stored in the data file ecoStudy.txt, which
has two variables:
density: the population density recorded at each site (count per km2),
habitat: the habitat type of the site.
(a) [6 marks] Import the dataset from the ecoStudy.txt file into the R workspace.
(i) Find the minimum, first quartile, median, mean, third quartile, and maximum values of the density
variable.
(ii) Find the number of observations in each habitat type.
Report your answer as comments below your R commands for each part of this question in the R script.
Where the answer is not an integer, round to 3 decimal places.
(b) [6 marks] Create a new variable called logDensity in the data.frame object containing the data
imported from ecoStudy.txt, where logDensity is defined by
log(density + 1). (1)
(i) Create a side-by-side box plot of density for all habitat types.
(ii) Create a side-by-side box plot of logDensity for all habitat types.
(iii) Use the appropriate R command(s) to arrange the graphs from (i) & (ii) into a lattice of two rows
and one column.
(c) [8 marks] We would like to test whether the population mean density is the same across all habitat
types. A common approach is to apply the F -test in an one-way analysis of variance (ANOVA). The
F -test statistic, Fd1,d2 , is given by ∑K
i=1Ni(yˉi ? yˉ)2/(K ? 1)
[
∑K
i=1
∑Ni
j=1 (yij ? yˉi)2]/(N ?K)
, (2)
where yij is the density value of the jth observation in the ith habitat type, yˉi is the sample mean
density of the ith habitat type and yˉ is the overall sample mean density. The term K represents the
number of habitat types and Ni is the number observations in the ith habitat type.
The degrees of freedom d1 and d2 of the F -statistic are respectively defined by K? 1 and N ?K, where
N is the total number of observations in the dataset.
(i) Write your own R commands to calculate the F -test statistic for testing the hypothesis that
the the population mean density is same across all habitat types.
Important: for the mathematical calculations, you may only use (but do not need to include all)
the operators +, -, *, /, ?, %% and the functions mean() and sum().
(ii) Use the pf() function to find the p-value of the F -test.
Hint: the p-value is the probability of observing an F -statistic equal or larger than the one
calculated in part (i).
Report your answer (to 3 significant figures) as comments below your R commands for each part of this
question in the R script.
(d) [4 marks] Repeat part (c); however, this time you must use one of the apply functions. Apply functions
include apply(), sapply(), tapply etc. You need to choose the appropriate function to perform the
task correctly. Report your answer (to 3 significant figures) as comments below your R commands for
this question in the R script.
2
(e) [6 marks] Calculate the residuals of the F -test and create a Q-Q plot to visually examine whether
the residuals are normally distributed. The residual value of the jth observation of the ith habitat is
yij ? yˉi, where yij and yˉi are defined in part (c). Make sure to include a reference line in your Q-Q
plot. In the R script, as comments below the R commands for this question, write a brief description
on whether the normality assumption has been satisfied.
Question 2: Mixture Models
[16 marks]
The probability density function of a gamma mixture model with two components is given by
fX(x) = p
xα1?1e?β1xβα11
Γ(α1)
+ (1? p)x
α2?1e?β2xβα22
Γ(α2)
, (3)
where x is the observed value of the random variable X, which follows the above gamma mixture model. The
parameters α1, α2, β1, and β2 only take positive values, while the parameter p is a probability.
(a) [6 marks] Write a function calcLogMix2Gamma that calculates the natural logarithm of the joint
probability density for any number of independent random variables X1, ..., Xn assumed to be distributed
according to the gamma mixture model in equation (3). The calcLogMix2Gamma function have the
following features.
Arguments:
(1) x is a numeric vector of positive values only, representing the observed values of X1, ..., Xn, where
n can be any positive integer, which means x can be of any length.
(2) alpha is a positive numeric vector of length two, where the first and second elements respectively
correspond to the values of α1 and α2 in equation (3).
(3) beta is a positive numeric vector of length two, where the first and second elements respectively
represent the values of β1 and β2 in equation (3).
(4) p is a single numeric value between 0 and 1 inclusive, representing p in equation (3).
Computation:
The calcLogMix2Gamma function calculates the natural logarithm of the joint probability density
of x, given alpha, beta and p according equation (3).
Return:
A single numeric value as calculated in Computation.
Important: You must write you own code to perform the specified computation, i.e., you must
not use any existing functions for calculating the density of gamma distributions or mixture models;
otherwise a heavy penalty will be applied.
Hint: the functions gamma() or lgamma() maybe used.
(b) [6 marks] To enable the calculations in part (a), the calcLogMix2Gamma function needs to ensure objects
passed the arguments satisfy the following requirements.
(i) x is a numeric vector containing positive values only.
(ii) alpha and beta are both positive numeric vectors of length two.
(iii) p is a single numeric value between 0 and 1 inclusive.
Add additional R commands at the beginning of your calcLogMix2Gamma function in part (a) to check
that valid inputs have been provided before any calculations. If invalid inputs are passed to this function,
it will stop prematurely by using the appropriate R function and print out an informative error message.
3
(c) [4 marks] Let x be the observed values of random variables (X1, ..., X10), each distributed according to
the model in equation (3). Use the calcLogMix2Gamma function to calculate the natural logarithm of
the joint probability density for the observed values
x = (0.06, 2.32, 4.81, 0.02, 2.33, 2.18, 0.83, 2.45, 2.10, 3.27), (4)
given the parameter values (α1, α2) = (0.5, 6), (β1, β2) = (0.5, 2.5) and p = 0.35. Report your answer
(to 3 decimal places) as comments below your R commands for this question in the R script.
Question 3: Riemann Integration
[31 marks] Figure 1 illustrates the function below:
g(x) = sin
(
1
2α log
(
Γ(x+ β)kpαqβ
Γ(α)kΓ(β)k
))
, for ?∞ < x <∞, (5)
where α, β, p, q and k are the parameters of function g(x), while α > 0 and β > 0.
2 4 6 8 10
(b)
Figure 1: The curve represented by function g(x) in equation (5): In panel (a), the area shaded in grey is the
area between g(x) and the x-axis within the x-interval [a, b], while the shaded region in panel (b) is the area
between g(x) and the x-axis within the x-interval [a′, b′]
.
When a curve g(x) is above the x-axis for the x-interval [a, b] as in Figure 1 (a), we can calculate the area
between g(x) and the x-axis within [a, b] using the middle Riemann sum to approximate the integration∫ b
(6)
with subintervals defined by a = x0 < ... < xN = b.
(a) [6 marks] Write an R function smoothCurve that calculates the mathematical function g(x) defined in
equation (5).
Arguments:
4
(1) x is a numeric value or a numeric vector.
(2) alpha is a single positive numeric value.
(3) beta is a single positive numeric value.
(4) p is a single numeric value.
(5) q is a single numeric value.
(6) k is a single numeric value.
Computation:
This function calculates the value(s) of g(x) given by equation (5).
Return:
If x is a numeric value, this function returns the numeric value calculated in Computation, or
if x is a numeric vector, the function returns the numeric vector calculated in Computation.
(b) [5 marks] Write a function called calcMidRiemannLoop that calculates the area between g(x) in equation
(5) and the x-axis for any given x-interval [a, b] by using the middle Riemann sum.
Arguments:
(1) xVec is a numeric vector with elements corresponding to x0, ..., xN in equation (6).
Note: You can assume that for marking, the x-intercepts will not fall within any of the subintervals;
however, the subintervals may have different lengths. The x-intercepts are the values of x that
give f(x) = 0.
(2) —(6) alpha, beta, p, q and k have the exact same specifications as in part (a).
Computation:
This function calculates the area between the x-axis and g(x) in equation (5) according to the
middle Riemann sum with subintervals defined by xVec.
Hint: An area is always positive.
Return:
A numeric value which is the area calculated in Computation.
Important: You must use a loop to answer this question; otherwise a heavy penalty will be applied.
(c) [4 marks] Write a function called calcMidRiemann which has the exact same Arguments, Computation
and Return specifications as in part (b), except you must not use a loop for this part.
(d) [4 marks] For the calcMidRiemann function to work the function need to check that the valid inputs
are passed to the argument, specifically, its argument xVec must be a numeric vector with values in
increasing order.
Add additional R commands at the beginning of your calcMidRiemann function in part (c) to check
that valid inputs have been provided before any calculations. If invalid inputs are passed to this function,
it will stop prematurely by using the appropriate R function and print out an informative error message.
(e) [3 marks] Use the calcMidRiemann to calculate the area between x = 2 and x = 8.5 with the sequence
a = x0 = 2.0 < 2.01 < ... < 8.49 < 8.5 = xN = b, given the parameter values α = 2.1, β = 0.5, p = 3,
q = 6 and k = 2. Report your answer (to 3 decimal places) as comments below your R commands for
this question in the R script.
(f) [5 marks] Write a function called calcMidRiemannAreas, which calculates the areas between the g(x)
curve and the x-axis for multiple intervals on the x-axis. This function has the following specifications.
Arguments:
5
(1) xSeqList is a list of numeric vectors, where each vector corresponds to an increasing numeric
sequence x0, ..., xN , and the lengths of the vectors do not need to be the same.
(2) —(6) alpha, beta, p, q and k have the exact same specifications as in part (a).
Computation:
For each numeric sequence (defined by each vector in xSeqList), this function calculates the
area between the x-axis and g(x) in equation (5) according to the middle Riemann sum with
subintervals defined by each vector of xSeqList.
Return:
A numeric vector containing the areas as calculated in Computation.
(g) [4 marks] Use the calcMidRiemannAreas to calculate the areas within the x-intervals defined by the
following sequences
3.50 < 3.51 < ... < 7.99 < 8.00
2.0 < 2.1 < ... < 4.2 < 4.3
1.07 < 1.071 < ... < 9.011 < 9.012
given the parameter values α = 1.9, β = 0.75, p = 2.15, q = 7 and k = 1.65. Report your answer (to 3
decimal places) as comments below your R commands for this question in the R script.
Question 4: Hidden Markov Model
[23 marks] Consider a sequence of weather events X1, X2, ..., Xn, where each Xi can be one of the three
possible weather states: sunny, cloudy or rainy on the ith day for all i in {1, ..., n}. Assume that the
probability of the weather on the ith day, Xi only depends on the weather on the previous day Xi?1 for all i
in {2, ..., n}. This means that
p(Xi|X1, ..., Xi?1) = p(Xi|Xi?1), (7)
where the conditional probability p(Xi = b|Xi?1 = a) is termed the transition probability from weather state
a to whether state b. Thus, we can define the transition probability matrix PT as
PT =
pss psc psrpcs pcc pcr
prs prc prr , (8)
where the subscripts s, c and r represent sunny, cloudy and rainy, respectively. For example, pcs = p(Xi =
sunny|Xi?1 = cloudy) for the all i in {2, ..., n}. The initial probabilities for the weather state of X1 is given
by pi = (pis, pic, pir), so for example, p(X1 = sunny) = pis.
The probability of a sequence of weather events p(X1, X2, ..., Xn) is then given by
p(X1, X2, ..., Xn) = p(X1)
n∏
i=2
p(Xi|Xi?1). (9)
For example, the probability of observing cloudy on day 1, rainy on day 2 and sunny on day 3 is therefore
p(X1 = cloudy, X2 = rainy, X3 = sunny)
=p(X3 = sunny|X2 = rainy)p(X2 = rainy|X1 = cloudy)p(X1 = cloudy)
=prspcrpic.
(a) [8 marks] Write a function called weatherSeqProb, which has the following specifications.
Arguments:
6
(1) weatherSeq is a character vector, where each element must one of the three character "s", "c" or
"r".
(2) trProbs is a numeric matrix object with three rows and three columns, where the values are the
transition probabilities arranged according to equation (11).
(3) initProbs is a numeric vector with three elements, where the first, second and third elements
correspond to the initial probabilities pis, pic and pir.
Computation:
– Calculate the natural logarithm of the probability of observing the sequence of weather specify
by weatherSeq given the transition probability matrix trProbs and initial state probabilities
initProbs.
Return:
– A numeric value, which is the natural log probability calculated in Computation.
(b) [3 marks] Calculate the natural logarithm of the probability for observing
X1 = cloudy, X2 = sunny, X3 = cloudy, X4 = rainy, X5 = sunny and X6 = sunny, (10)
given pi = (0.45, 0.25, 0.3) and
PT =
??0.50 0.40 0.100.33 0.35 0.32
0.30 0.30 0.40
?? . (11)
Report your answer (to 3 decimal places) as comments below your R commands for this question in the
R script.
(c) [8 marks] Our hypothetical friend chooses their jacket (black or white) depending on weather with some
randomness. The conditional probability matrix for the jacket chosen given the weather is defined by
the matrix
PE =psB psWpcB pcW
prB prW, (12)
where the subscripts B and W stand for black and white (jackets), respectively. For example, pcW
is probability that our friend wears a White jacket on a cloudy day. Note that in the subscripts, the
weather states are in lowercase, while the colour states are in uppercase!
Let Yi represent the jacket colour on the ith day. The joint probability of a sequence of events weather
and colour events p(X1, X2, ..., Xn, Y1, Y2, ..., Yn) is then given by
p(X1, X2, ..., Xn, Y1, Y2, ..., Yn) = p(Y1|X1)p(X1)
n∏
i=2
p(Yi|Xi)p(Xi|Xi?1). (13)
For example, the joint probability of observing cloudy weather and white jacket on day 1, rainy weather
and black jacket on day 2 and sunny weather and black jacket on day 3 is
p(X1 = cloudy, X2 = rainy, X3 = sunny, Y1 = white, Y2 = black, Y3 = black)
= p(Y3 = black|X3 = sunny)p(X3 = sunny|X2 = rainy)
× p(Y2 = black|X3 = rainy)p(X2 = rainy|X1 = cloudy)p(Y1 = white|X1 = cloudy)p(X1 = cloudy)
= psBprsprBpcrpcWpic.
Write a function called weatherColourProbs with the following specifications.
Arguments:
(1) colourSeq is a character vector, wherein each element is restricted to be either "B" or "W".
7
(2) emitProbs is a numeric matrix object with three rows and two columns, wherein the values are
the probabilities of the jacket colour conditioned on weather as arranged according to equation
(12).
(3) —(5) weatherSeq, trProbs and initProbs are specified exactly the same as in part (a).
Computation:
– Calculate the natural logarithm of the joint probability of observing the sequence of weather
specified by weatherSeq and the sequence of jacket colour specified by colourSeq, given the
probability arguments emitProbs, trProbs and initProbs.
Return:
– A numeric value, which is the natural log probability calculated in Computation.
Important: In Computation, you must not repeat the codes in weatherSeqProb here.
(d) [4 marks] Calculate the natural logarithm of the joint probability for observing
X = {X1, X2, ..., X8}
= {rainy, sunny, cloudy, rainy, cloudy, rainy, sunny, sunny}
Y = {Y1, Y2, ..., Y8}
= {black,white,white,black,black,white,white,white}
with
pi = (0.35, 0.45, 0.2), PT =
0.55 0.25 0.200.25 0.35 0.40
0.20 0.15 0.65 and PE =0.20 0.800.55 0.45
0.90 0.10 .
Report your answer (to 3 decimal places) as comments below your R commands for this question in the
R script.