首页 >
> 详细

MATH6166 Statistical Computing for Data Scientist — Coursework

Instructions for Coursework

1. This coursework assignment is worth 100% of the overall mark for the module MATH6166.

2. The deadline is 1500 on Thursday 17 November 2022, and your completed work must be submitted

electronically via Blackboard.

3. Your submission must consist of exactly one PDF file, which includes all the plots/graphics asked

in the questions, together with one R script file containing all the R code you used to obtain your

results for all questions. Missing R codes will be penalised. In the PDF, you must label clearly with

which questions the figures are associated. In the R script, you must use comments to make it clear

to which questions the R codes are answering. Please ensure the submitted files have the correct file

format; otherwise it cannot be marked. Note an R markdown file is a different format to an R script

file.

4. Please name your PDF file as .pdf. For example, if your student ID is 12345678,

then your report must have the file name 12345678.pdf.

5. Your R script file must have the file name .R. For example, if your student ID is

12345678, then your script must have the file name 12345678.R.

6. Your entire R script file must be reproducible and run without any errors.

7. You must present elegant and informative plots in the PDF file, including the use of appropriate plotting

area, points/lines sizes and colours, as well as including meaningful labels, titles and legends.

8. You must informatively comment your R code indicating where each question and task begins. When

writing an R function, the 1) function and 2) argument names must be exactly the same as those

specified in the questions; otherwise a heavy penalty will be applied.

Please note that uppercase and lowercase of the same alphabet are not the same character, so

if the question asks to write a function/argument called, apple, but a function/argument called

applE is presented in your answer, than that would be wrong.

9. You must not load any add-on packages, i.e., using functions library or require etc.

10. The standard Faculty rules on late submissions apply: for every weekday, or part of a weekday, that

the coursework is late, the mark will be reduced by a factor of 10%. No credit will be given for work

which is more than one week late.

11. All coursework must be carried out independently. You are reminded of the University’s Academic

Integrity Policy, see https://www.southampton.ac.uk/quality/assessment/academic_integrity.page.

12. In the interests of fairness, queries which relate directly to the coursework must be made via the

MATH6166 Discussion Forum on Blackboard. This ensures that everybody has access to the same

information.

Total marks available: 100

1

Question 1: Wild Life Population Density

[30 marks] An ecologist is interested in the population density of a bird species at various sites across three

different habitat types. The data collected by the ecologist is stored in the data file ecoStudy.txt, which

has two variables:

density: the population density recorded at each site (count per km2),

habitat: the habitat type of the site.

(a) [6 marks] Import the dataset from the ecoStudy.txt file into the R workspace.

(i) Find the minimum, first quartile, median, mean, third quartile, and maximum values of the density

variable.

(ii) Find the number of observations in each habitat type.

Report your answer as comments below your R commands for each part of this question in the R script.

Where the answer is not an integer, round to 3 decimal places.

(b) [6 marks] Create a new variable called logDensity in the data.frame object containing the data

imported from ecoStudy.txt, where logDensity is defined by

log(density + 1). (1)

(i) Create a side-by-side box plot of density for all habitat types.

(ii) Create a side-by-side box plot of logDensity for all habitat types.

(iii) Use the appropriate R command(s) to arrange the graphs from (i) & (ii) into a lattice of two rows

and one column.

(c) [8 marks] We would like to test whether the population mean density is the same across all habitat

types. A common approach is to apply the F -test in an one-way analysis of variance (ANOVA). The

F -test statistic, Fd1,d2 , is given by ∑K

i=1Ni(yˉi ? yˉ)2/(K ? 1)

[

∑K

i=1

∑Ni

j=1 (yij ? yˉi)2]/(N ?K)

, (2)

where yij is the density value of the jth observation in the ith habitat type, yˉi is the sample mean

density of the ith habitat type and yˉ is the overall sample mean density. The term K represents the

number of habitat types and Ni is the number observations in the ith habitat type.

The degrees of freedom d1 and d2 of the F -statistic are respectively defined by K? 1 and N ?K, where

N is the total number of observations in the dataset.

(i) Write your own R commands to calculate the F -test statistic for testing the hypothesis that

the the population mean density is same across all habitat types.

Important: for the mathematical calculations, you may only use (but do not need to include all)

the operators +, -, *, /, ?, %% and the functions mean() and sum().

(ii) Use the pf() function to find the p-value of the F -test.

Hint: the p-value is the probability of observing an F -statistic equal or larger than the one

calculated in part (i).

Report your answer (to 3 significant figures) as comments below your R commands for each part of this

question in the R script.

(d) [4 marks] Repeat part (c); however, this time you must use one of the apply functions. Apply functions

include apply(), sapply(), tapply etc. You need to choose the appropriate function to perform the

task correctly. Report your answer (to 3 significant figures) as comments below your R commands for

this question in the R script.

2

(e) [6 marks] Calculate the residuals of the F -test and create a Q-Q plot to visually examine whether

the residuals are normally distributed. The residual value of the jth observation of the ith habitat is

yij ? yˉi, where yij and yˉi are defined in part (c). Make sure to include a reference line in your Q-Q

plot. In the R script, as comments below the R commands for this question, write a brief description

on whether the normality assumption has been satisfied.

Question 2: Mixture Models

[16 marks]

The probability density function of a gamma mixture model with two components is given by

fX(x) = p

xα1?1e?β1xβα11

Γ(α1)

+ (1? p)x

α2?1e?β2xβα22

Γ(α2)

, (3)

where x is the observed value of the random variable X, which follows the above gamma mixture model. The

parameters α1, α2, β1, and β2 only take positive values, while the parameter p is a probability.

(a) [6 marks] Write a function calcLogMix2Gamma that calculates the natural logarithm of the joint

probability density for any number of independent random variables X1, ..., Xn assumed to be distributed

according to the gamma mixture model in equation (3). The calcLogMix2Gamma function have the

following features.

Arguments:

(1) x is a numeric vector of positive values only, representing the observed values of X1, ..., Xn, where

n can be any positive integer, which means x can be of any length.

(2) alpha is a positive numeric vector of length two, where the first and second elements respectively

correspond to the values of α1 and α2 in equation (3).

(3) beta is a positive numeric vector of length two, where the first and second elements respectively

represent the values of β1 and β2 in equation (3).

(4) p is a single numeric value between 0 and 1 inclusive, representing p in equation (3).

Computation:

The calcLogMix2Gamma function calculates the natural logarithm of the joint probability density

of x, given alpha, beta and p according equation (3).

Return:

A single numeric value as calculated in Computation.

Important: You must write you own code to perform the specified computation, i.e., you must

not use any existing functions for calculating the density of gamma distributions or mixture models;

otherwise a heavy penalty will be applied.

Hint: the functions gamma() or lgamma() maybe used.

(b) [6 marks] To enable the calculations in part (a), the calcLogMix2Gamma function needs to ensure objects

passed the arguments satisfy the following requirements.

(i) x is a numeric vector containing positive values only.

(ii) alpha and beta are both positive numeric vectors of length two.

(iii) p is a single numeric value between 0 and 1 inclusive.

Add additional R commands at the beginning of your calcLogMix2Gamma function in part (a) to check

that valid inputs have been provided before any calculations. If invalid inputs are passed to this function,

it will stop prematurely by using the appropriate R function and print out an informative error message.

3

(c) [4 marks] Let x be the observed values of random variables (X1, ..., X10), each distributed according to

the model in equation (3). Use the calcLogMix2Gamma function to calculate the natural logarithm of

the joint probability density for the observed values

x = (0.06, 2.32, 4.81, 0.02, 2.33, 2.18, 0.83, 2.45, 2.10, 3.27), (4)

given the parameter values (α1, α2) = (0.5, 6), (β1, β2) = (0.5, 2.5) and p = 0.35. Report your answer

(to 3 decimal places) as comments below your R commands for this question in the R script.

Question 3: Riemann Integration

[31 marks] Figure 1 illustrates the function below:

g(x) = sin

(

1

2α log

(

Γ(x+ β)kpαqβ

Γ(α)kΓ(β)k

))

, for ?∞ < x <∞, (5)

where α, β, p, q and k are the parameters of function g(x), while α > 0 and β > 0.

2 4 6 8 10

(b)

Figure 1: The curve represented by function g(x) in equation (5): In panel (a), the area shaded in grey is the

area between g(x) and the x-axis within the x-interval [a, b], while the shaded region in panel (b) is the area

between g(x) and the x-axis within the x-interval [a′, b′]

.

When a curve g(x) is above the x-axis for the x-interval [a, b] as in Figure 1 (a), we can calculate the area

between g(x) and the x-axis within [a, b] using the middle Riemann sum to approximate the integration∫ b

(6)

with subintervals defined by a = x0 < ... < xN = b.

(a) [6 marks] Write an R function smoothCurve that calculates the mathematical function g(x) defined in

equation (5).

Arguments:

4

(1) x is a numeric value or a numeric vector.

(2) alpha is a single positive numeric value.

(3) beta is a single positive numeric value.

(4) p is a single numeric value.

(5) q is a single numeric value.

(6) k is a single numeric value.

Computation:

This function calculates the value(s) of g(x) given by equation (5).

Return:

If x is a numeric value, this function returns the numeric value calculated in Computation, or

if x is a numeric vector, the function returns the numeric vector calculated in Computation.

(b) [5 marks] Write a function called calcMidRiemannLoop that calculates the area between g(x) in equation

(5) and the x-axis for any given x-interval [a, b] by using the middle Riemann sum.

Arguments:

(1) xVec is a numeric vector with elements corresponding to x0, ..., xN in equation (6).

Note: You can assume that for marking, the x-intercepts will not fall within any of the subintervals;

however, the subintervals may have different lengths. The x-intercepts are the values of x that

give f(x) = 0.

(2) —(6) alpha, beta, p, q and k have the exact same specifications as in part (a).

Computation:

This function calculates the area between the x-axis and g(x) in equation (5) according to the

middle Riemann sum with subintervals defined by xVec.

Hint: An area is always positive.

Return:

A numeric value which is the area calculated in Computation.

Important: You must use a loop to answer this question; otherwise a heavy penalty will be applied.

(c) [4 marks] Write a function called calcMidRiemann which has the exact same Arguments, Computation

and Return specifications as in part (b), except you must not use a loop for this part.

(d) [4 marks] For the calcMidRiemann function to work the function need to check that the valid inputs

are passed to the argument, specifically, its argument xVec must be a numeric vector with values in

increasing order.

Add additional R commands at the beginning of your calcMidRiemann function in part (c) to check

that valid inputs have been provided before any calculations. If invalid inputs are passed to this function,

it will stop prematurely by using the appropriate R function and print out an informative error message.

(e) [3 marks] Use the calcMidRiemann to calculate the area between x = 2 and x = 8.5 with the sequence

a = x0 = 2.0 < 2.01 < ... < 8.49 < 8.5 = xN = b, given the parameter values α = 2.1, β = 0.5, p = 3,

q = 6 and k = 2. Report your answer (to 3 decimal places) as comments below your R commands for

this question in the R script.

(f) [5 marks] Write a function called calcMidRiemannAreas, which calculates the areas between the g(x)

curve and the x-axis for multiple intervals on the x-axis. This function has the following specifications.

Arguments:

5

(1) xSeqList is a list of numeric vectors, where each vector corresponds to an increasing numeric

sequence x0, ..., xN , and the lengths of the vectors do not need to be the same.

(2) —(6) alpha, beta, p, q and k have the exact same specifications as in part (a).

Computation:

For each numeric sequence (defined by each vector in xSeqList), this function calculates the

area between the x-axis and g(x) in equation (5) according to the middle Riemann sum with

subintervals defined by each vector of xSeqList.

Return:

A numeric vector containing the areas as calculated in Computation.

(g) [4 marks] Use the calcMidRiemannAreas to calculate the areas within the x-intervals defined by the

following sequences

3.50 < 3.51 < ... < 7.99 < 8.00

2.0 < 2.1 < ... < 4.2 < 4.3

1.07 < 1.071 < ... < 9.011 < 9.012

given the parameter values α = 1.9, β = 0.75, p = 2.15, q = 7 and k = 1.65. Report your answer (to 3

decimal places) as comments below your R commands for this question in the R script.

Question 4: Hidden Markov Model

[23 marks] Consider a sequence of weather events X1, X2, ..., Xn, where each Xi can be one of the three

possible weather states: sunny, cloudy or rainy on the ith day for all i in {1, ..., n}. Assume that the

probability of the weather on the ith day, Xi only depends on the weather on the previous day Xi?1 for all i

in {2, ..., n}. This means that

p(Xi|X1, ..., Xi?1) = p(Xi|Xi?1), (7)

where the conditional probability p(Xi = b|Xi?1 = a) is termed the transition probability from weather state

a to whether state b. Thus, we can define the transition probability matrix PT as

PT =

pss psc psrpcs pcc pcr

prs prc prr , (8)

where the subscripts s, c and r represent sunny, cloudy and rainy, respectively. For example, pcs = p(Xi =

sunny|Xi?1 = cloudy) for the all i in {2, ..., n}. The initial probabilities for the weather state of X1 is given

by pi = (pis, pic, pir), so for example, p(X1 = sunny) = pis.

The probability of a sequence of weather events p(X1, X2, ..., Xn) is then given by

p(X1, X2, ..., Xn) = p(X1)

n∏

i=2

p(Xi|Xi?1). (9)

For example, the probability of observing cloudy on day 1, rainy on day 2 and sunny on day 3 is therefore

p(X1 = cloudy, X2 = rainy, X3 = sunny)

=p(X3 = sunny|X2 = rainy)p(X2 = rainy|X1 = cloudy)p(X1 = cloudy)

=prspcrpic.

(a) [8 marks] Write a function called weatherSeqProb, which has the following specifications.

Arguments:

6

(1) weatherSeq is a character vector, where each element must one of the three character "s", "c" or

"r".

(2) trProbs is a numeric matrix object with three rows and three columns, where the values are the

transition probabilities arranged according to equation (11).

(3) initProbs is a numeric vector with three elements, where the first, second and third elements

correspond to the initial probabilities pis, pic and pir.

Computation:

– Calculate the natural logarithm of the probability of observing the sequence of weather specify

by weatherSeq given the transition probability matrix trProbs and initial state probabilities

initProbs.

Return:

– A numeric value, which is the natural log probability calculated in Computation.

(b) [3 marks] Calculate the natural logarithm of the probability for observing

X1 = cloudy, X2 = sunny, X3 = cloudy, X4 = rainy, X5 = sunny and X6 = sunny, (10)

given pi = (0.45, 0.25, 0.3) and

PT =

??0.50 0.40 0.100.33 0.35 0.32

0.30 0.30 0.40

?? . (11)

Report your answer (to 3 decimal places) as comments below your R commands for this question in the

R script.

(c) [8 marks] Our hypothetical friend chooses their jacket (black or white) depending on weather with some

randomness. The conditional probability matrix for the jacket chosen given the weather is defined by

the matrix

PE =psB psWpcB pcW

prB prW, (12)

where the subscripts B and W stand for black and white (jackets), respectively. For example, pcW

is probability that our friend wears a White jacket on a cloudy day. Note that in the subscripts, the

weather states are in lowercase, while the colour states are in uppercase!

Let Yi represent the jacket colour on the ith day. The joint probability of a sequence of events weather

and colour events p(X1, X2, ..., Xn, Y1, Y2, ..., Yn) is then given by

p(X1, X2, ..., Xn, Y1, Y2, ..., Yn) = p(Y1|X1)p(X1)

n∏

i=2

p(Yi|Xi)p(Xi|Xi?1). (13)

For example, the joint probability of observing cloudy weather and white jacket on day 1, rainy weather

and black jacket on day 2 and sunny weather and black jacket on day 3 is

p(X1 = cloudy, X2 = rainy, X3 = sunny, Y1 = white, Y2 = black, Y3 = black)

= p(Y3 = black|X3 = sunny)p(X3 = sunny|X2 = rainy)

× p(Y2 = black|X3 = rainy)p(X2 = rainy|X1 = cloudy)p(Y1 = white|X1 = cloudy)p(X1 = cloudy)

= psBprsprBpcrpcWpic.

Write a function called weatherColourProbs with the following specifications.

Arguments:

(1) colourSeq is a character vector, wherein each element is restricted to be either "B" or "W".

7

(2) emitProbs is a numeric matrix object with three rows and two columns, wherein the values are

the probabilities of the jacket colour conditioned on weather as arranged according to equation

(12).

(3) —(5) weatherSeq, trProbs and initProbs are specified exactly the same as in part (a).

Computation:

– Calculate the natural logarithm of the joint probability of observing the sequence of weather

specified by weatherSeq and the sequence of jacket colour specified by colourSeq, given the

probability arguments emitProbs, trProbs and initProbs.

Return:

– A numeric value, which is the natural log probability calculated in Computation.

Important: In Computation, you must not repeat the codes in weatherSeqProb here.

(d) [4 marks] Calculate the natural logarithm of the joint probability for observing

X = {X1, X2, ..., X8}

= {rainy, sunny, cloudy, rainy, cloudy, rainy, sunny, sunny}

Y = {Y1, Y2, ..., Y8}

= {black,white,white,black,black,white,white,white}

with

pi = (0.35, 0.45, 0.2), PT =

0.55 0.25 0.200.25 0.35 0.40

0.20 0.15 0.65 and PE =0.20 0.800.55 0.45

0.90 0.10 .

Report your answer (to 3 decimal places) as comments below your R commands for this question in the

R script.

联系我们

- QQ：99515681
- 邮箱：99515681@qq.com
- 工作时间：8:00-21:00
- 微信：codinghelp

- 代写comp9012、辅导python程序语言 2023-01-23
- Comp9414: Artificial Intelligence Assi... 2023-01-05
- Comp9444 Assignment 1 Neural Networks ... 2023-01-05
- Final Assignment - Apply All Your Skil... 2023-01-05
- Data7202 Statistical Methods For Data ... 2023-01-04
- Comp307/Aiml420 Assignment 4: Planning... 2023-01-04
- Comp3170 Assignment 3 Moonlit Forest 2023-01-04
- Cs 179: Introduction To Graphical Mode... 2023-01-04
- Computer Vision Assignment 3: Deep L... 2023-01-03
- C++ Implementation Of Graph Algorithms 2023-01-03
- Computer Vision Assignment 3: Deep Le... 2023-01-03
- Cs 179: Introduction To Graphical Mode... 2023-01-03
- Comp Sci 3306, Comp Sci 7306 Assignmen... 2023-01-03
- Comp Sci 3306, Comp Sci 7306 Assignmen... 2023-01-03
- C++编程设计辅导、辅导program程序 2022-12-22
- Mthm002代做、辅导java/Python编程 2022-12-22
- Buss6002代做、辅导python编程 2022-12-22
- Cmpen 431代写、辅导c/C++语言编程 2022-12-20
- 辅导program、辅导java/C++编程 2022-12-19
- 代写engf0004、Matlab编程设计辅导 2022-12-19