STAT2203/7203 (S2-2022): Assignment 03
Due: 21-October-2022 @16:59
1. [5 marks each]
We have seen how to simulate from a distribution using the inverse-transform method;
see §5.8 of the course notes as well as slide 8/14 of Lecture4-3. Another method to
simulate random variables from a given distribution is using rejection sampling. This
question concerns a particular application of rejection sampling.
Benford’s law is a distribution on the integers {1, 2, . . . , 9} with probability mass function
$$f_D(d) = \log_{10}\left(\frac{d+1}{d}\right), \quad d \in \{1, 2, \dots, 9\}.$$
We would like to be able to simulate the random variable D from this distribution.
Suppose X has probability mass function
$$f_X(x) = \frac{1}{9}, \quad x \in \{1, 2, \dots, 9\}.$$
Conditional on X = x, the random variable Y has a Bernoulli(fD(x)/ log10(2))
distribution.
(a) Verify that fD(x)/ log10(2) ≤ 1 for all x ∈ {1, 2, . . . , 9}.
(b) What is the joint probability mass function of (X, Y )?
(c) Determine P(Y = 1).
(d) Determine the conditional probability mass function of X given Y = 1.
(e) This suggests we can simulate a random variable with probability mass function fD
using the following algorithm
Y = 0
While (Y = 0) {
    Simulate X from a uniform distribution on {1,2,...,9}
    Simulate Y from a Bernoulli distribution with success
    probability fD(X)/log10(2)
}
Return X
In each loop a new pair of random variables (X, Y ) is simulated, independent of
all previously simulated random variables. Implement this algorithm in R (or any
programming language of your choice). You will need to use a while loop. In R,
the general form is
while (cond) {
expressions
}
where cond is a length one logical vector.
(f) What is the distribution of the number of pairs of random variables (X, Y ) that
need to be simulated to simulate a single random variable from Benford’s law?
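The algorithm in part (e) can be sketched as follows. This is a minimal illustration in Python rather than R (the question allows any language), using only the standard library. The function names `benford_pmf` and `simulate_benford` are hypothetical, chosen here for illustration, and returning the number of simulated pairs alongside X is an addition meant to help with exploring part (f) empirically.

```python
import math
import random


def benford_pmf(d: int) -> float:
    """Benford probability f_D(d) = log10((d + 1) / d) for d in 1..9."""
    return math.log10((d + 1) / d)


def simulate_benford(rng: random.Random) -> tuple[int, int]:
    """Draw one Benford variate by rejection sampling.

    Returns the accepted value X and the number of (X, Y) pairs simulated.
    """
    pairs = 0
    while True:
        pairs += 1
        x = rng.randint(1, 9)  # X uniform on {1, ..., 9}
        accept_prob = benford_pmf(x) / math.log10(2)  # at most 1, see part (a)
        y = 1 if rng.random() < accept_prob else 0  # Y given X = x is Bernoulli
        if y == 1:
            return x, pairs


if __name__ == "__main__":
    rng = random.Random(2022)  # arbitrary seed, for reproducibility only
    draws = [simulate_benford(rng)[0] for _ in range(10_000)]
    # Compare empirical frequencies with the Benford probabilities.
    for d in range(1, 10):
        print(d, round(draws.count(d) / len(draws), 3), round(benford_pmf(d), 3))
```

Passing the generator `rng` explicitly keeps the simulation reproducible; the loop mirrors the `while` structure of the pseudocode, with the exit condition Y = 1 written as an early return.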
2. [5 marks each]
For the following questions, work out your answers ‘by hand’. You may still use R (or
any other programming language) to obtain probabilities and quantiles from the
appropriate distributions and to calculate your final answers.
A study investigated whether psychotherapy combined with limited administration of
Methylenedioxymethamphetamine (MDMA) can reduce symptoms of post-traumatic stress
disorder. Severity of symptoms was measured via the CAPS-IV score, with higher scores
indicating more severe symptoms. Forty-eight patients were recruited to the study, with
twenty-four patients randomly allocated to each of the two dosage levels (Low –
40 mg, High – 125 mg). The primary outcome was the reduction in CAPS-IV score one
month after the end of treatment.
The forty-eight patients at the commencement of the study had an average CAPS-IV
score of 81.35 with a sample standard deviation of 17.54. At the end of treatment, the
High dose group experienced an average drop in CAPS-IV score of 24.2 with a sample
standard deviation of 23.1. The Low dose group experienced an average drop in
CAPS-IV score of 12.7 with a sample standard deviation of 19.4.
(a) Determine a 95% confidence interval for the population mean CAPS-IV score of
patients at the commencement of the study.
(b) Does the data provide evidence that the high dose MDMA treatment is associated
with a decrease in mean CAPS-IV score? State the null and alternative hypotheses,
and determine the appropriate test statistic and p-value. What do you conclude?
(c) Researchers would like to determine if patients experience a greater decrease in
CAPS-IV score with the high dose MDMA treatment than with the low dose MDMA
treatment. State the null and alternative hypotheses, and determine the appropriate
test statistic and p-value. What do you conclude?
(d) A secondary outcome was whether the patient experienced a drop of 20% or more
in CAPS-IV score. In the high dose treatment group 11 patients experienced such
a drop in CAPS-IV score. Construct a 95% confidence interval for the population
proportion of patients that would experience a 20% or more drop in CAPS-IV score
with the high dose treatment.
(e) In addition to the 11 patients in the high dose treatment group who experienced a
20% or more drop in CAPS-IV score, 6 patients in the low dose treatment group
also experienced a 20% or more drop in CAPS-IV score. The researchers would
like to test if the population proportion of patients that would experience a 20%
or more drop in CAPS-IV score is greater with a high dose treatment than the
low dose treatment. State the null and alternative hypotheses, and determine the
appropriate test statistic and p-value. What do you conclude?
(f) Are the assumptions/approximations you used for the analysis in part (e) valid?
Justify your answer.
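For parts (d) and (e), the large-sample calculations can be checked numerically. The sketch below assumes the usual Wald interval for a single proportion and the pooled two-proportion z-test; the course may prescribe a different variant, so treat it as a check on hand calculations rather than the required method. The function names are hypothetical, and only the Python standard library is used.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def wald_interval(successes: int, n: int, level: float = 0.95) -> tuple[float, float]:
    """Large-sample (Wald) confidence interval for a single proportion."""
    p_hat = successes / n
    z = STD_NORMAL.inv_cdf(1 - (1 - level) / 2)  # e.g. the 0.975 quantile
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Pooled z statistic and one-sided p-value for H1: p1 > p2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common proportion under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 1 - STD_NORMAL.cdf(z)


if __name__ == "__main__":
    # Counts taken from the question: 11 of 24 (High) and 6 of 24 (Low).
    print(wald_interval(11, 24))
    print(two_proportion_z(11, 24, 6, 24))
```

Both functions rely on a normal approximation, which is exactly what part (f) asks you to scrutinise for samples of this size.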
3. [7 marks each]
This question concerns the analysis of simple random samples from two populations;
the first population having a $N(\mu_1, \sigma_1^2)$ distribution and the second population having a
$N(\mu_2, \sigma_2^2)$ distribution. All parameters are unknown.
(a) A researcher wishes to compare the means of the two populations. Based on two
simple random samples of equal size from these populations, the researcher
constructs the 95% exact confidence intervals for each of the population means. The
researcher then makes the following claim “. . . as the 95% confidence intervals for
the means do not overlap, we can conclude there is moderate evidence suggesting
that the true means are different (p < 0.05)”. Justify the researcher’s claim.
(b) Even if the 95% confidence intervals of population means overlap, it is still possible
that the p-value from testing
$H_0 : \mu_1 = \mu_2$ against $H_1 : \mu_1 \neq \mu_2$
is less than 0.05. Provide example summary statistics (sample means, sample
standard deviations and sample sizes) for which the confidence intervals of the
population means overlap but the p-value from the test of the above hypotheses is less
than 0.05. The overlap of the intervals must be more than just the end points of
the intervals matching. You must show your summary statistics have the required
property by constructing the confidence intervals of the means and carrying out the
hypothesis test.
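The relationship between interval overlap and the two-sample test in parts (a) and (b) can be sketched as follows. This is an outline under added assumptions rather than a full solution: it takes equal sample sizes n, writes s_1 and s_2 for the sample standard deviations, and uses the standard error sqrt((s_1^2 + s_2^2)/n), which the pooled and Welch two-sample t statistics share when the sample sizes are equal. The symbol \nu, introduced here, stands for the degrees of freedom of whichever test is used.

```latex
% Individual 95% intervals: \bar{x}_i \pm t_{n-1,\,0.975}\, s_i / \sqrt{n}, \quad i = 1, 2.
% They fail to overlap exactly when
\[
  |\bar{x}_1 - \bar{x}_2| > t_{n-1,\,0.975}\,\frac{s_1 + s_2}{\sqrt{n}} .
\]
% The two-sided test of H_0 : \mu_1 = \mu_2 rejects at the 5% level when
\[
  |\bar{x}_1 - \bar{x}_2| > t_{\nu,\,0.975}\,\sqrt{\frac{s_1^2 + s_2^2}{n}} .
\]
% Because \sqrt{s_1^2 + s_2^2} \le s_1 + s_2, and t_{\nu,\,0.975} \le t_{n-1,\,0.975}
% whenever \nu \ge n - 1, the second threshold is never larger than the first.
```

Non-overlap therefore implies rejection, but the converse fails: differences that fall between the two thresholds give overlapping intervals together with a p-value below 0.05, which is the gap part (b) asks you to find.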
4. [2 marks each]
Exposure to ground level ozone (O3) is believed to impair airway function in healthy
individuals. To investigate this, researchers recruited 60 individuals (34 males and 26
females) and had them exercise for one hour on a cycle ergometer while breathing 0.30
parts per million of ozone. The Forced Expiratory Volume (FEV) and Forced Vital
Capacity (FVC) of each subject were measured before and after the test and the change
recorded as a percentage.
The file ozone.csv contains the following variables:
• FVC – Percentage change in Forced Vital Capacity
• FEV – Percentage change in Forced Expiratory Volume
(a) Run linear regression for Change in FEV% against Change in FVC% using R (or
any programming language of your choice) and give the summary output. Produce
diagnostic plots, namely scatterplot of residuals against fitted values and the normal
quantile plot of residuals, for the linear regression fit. Give these captions and figure
numbers and refer to them as needed in later questions.
(b) List the assumptions of the linear regression model. For each, explain whether or
not there is evidence that this assumption is violated, based on the diagnostic plots.
(c) A researcher suggests that the linear regression model is not appropriate because the
Change in FVC % does not have a normal distribution. Are they correct? Justify
your answer.
For the following parts you may assume that the model assumptions hold.
(d) Report a 99% confidence interval for the slope of the linear regression model.
(e) Provide both a 90% prediction interval for the change in FEV% for a healthy
individual with a change in FVC of 10% and a 90% confidence interval for the
mean change in FEV% for a healthy individual with a change in FVC of 10%.
Briefly explain the difference between the two intervals.
(f) Researchers believe that the changes in FEV% and FVC% are both due to a
decline in inspiratory capacity so that the intercept of the regression line should
be zero. Is the result of the regression analysis consistent with this belief? State
the null and alternative hypotheses, and report the appropriate test statistic and
p-value. What do you conclude?
(g) Explain the meaning of the R-squared value in the regression output. [1 mark]
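The least squares fit behind parts (a) and (g) can be sketched as below, to show how the slope, the intercept, and R-squared are computed from the data. This is a minimal standard-library Python illustration, not a replacement for the full summary output and diagnostic plots the question asks for. The function names are hypothetical, and `read_ozone` assumes that ozone.csv has a header row with columns named FVC and FEV, as the variable list suggests.

```python
import csv
from math import fsum


def fit_simple_ols(x: list[float], y: list[float]) -> dict[str, float]:
    """Least squares fit of y = b0 + b1 * x, together with R-squared."""
    n = len(x)
    x_bar, y_bar = fsum(x) / n, fsum(y) / n
    sxx = fsum((xi - x_bar) ** 2 for xi in x)
    sxy = fsum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    syy = fsum((yi - y_bar) ** 2 for yi in y)  # total sum of squares
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares around the fitted line.
    sse = fsum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return {"intercept": intercept, "slope": slope, "r_squared": 1 - sse / syy}


def read_ozone(path: str = "ozone.csv") -> tuple[list[float], list[float]]:
    """Read the FVC and FEV columns; assumes a header row with those names."""
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    return [float(r["FVC"]) for r in rows], [float(r["FEV"]) for r in rows]


if __name__ == "__main__":
    fvc, fev = read_ozone()
    print(fit_simple_ols(fvc, fev))  # FEV is the response, FVC the predictor
```

R-squared is computed here as one minus the ratio of the residual sum of squares to the total sum of squares, that is, the proportion of the variation in the response accounted for by the fitted line.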
5. [6 marks each]
Consider the simple linear regression model as discussed in class where the observations
are modeled as $Y_i \overset{\text{iid}}{\sim} N(\beta_0 + \beta_1 x_i, \sigma^2)$, $i = 1, \dots, n$. Consider the least squares estimators,
$\hat{\beta}_0$ and $\hat{\beta}_1$, for the respective coefficients $\beta_0$ and $\beta_1$.
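For reference, the least squares estimators have the standard closed forms below. The notation \bar{x} and \bar{Y} for the sample means is assumed here; the course notes may write these in a different but equivalent form.

```latex
\[
  \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
  \qquad
  \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x},
\]
% where \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i and \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i.
```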
This assignment counts for 20% of the total mark for the course.
Although not mandatory, it would be greatly appreciated if you could type up your
work, e.g., in LaTeX.
Show all your work and attach your code and all the plots (if there is a programming
question).
Combine your solutions and all additional files, such as your code and numerical
results, into one single PDF file.
Please submit your single PDF file on Blackboard.