Instructions
Answer each question in the space provided. You can write in pen or pencil. Marks are
indicated next to each question. The total mark for the exam is 100.
Part A (45 marks in total)
Question A.1 (1+1+1+1+2+1+1=8 marks)
Consider the following set of numbers: -25, 2, 3, 8, 10, 14, 18, 21, 32. For each of the questions
below, state your answer, showing working if necessary.
(a) What is the median?
(b) What is the 1st quartile?
(c) What is the 3rd quartile?
(d) What is the interquartile range?
(e) Hence sketch a box-plot. Lay it out horizontally below. Be sure to mark the values of
the various parts.
Marks / 6 Page 3 of 43
(f) You are told the mean of the numbers is 9.222 and the mean of their squares is 309.666.
What is the sample standard deviation?
(g) If you only knew the mean and sample standard deviation of the sample, what does
Chebyshev’s inequality tell you?
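The link between the two given moments and the sample standard deviation in part (f) can be sketched as follows. This is a minimal illustration using hypothetical data rather than the numbers in this question; the function name `sample_std_from_moments` is an assumption made for the example. It applies s² = (n / (n − 1)) × (mean of squares − mean²) and compares the result with the standard library's direct computation.

```python
import math
import statistics


def sample_std_from_moments(n, mean, mean_of_squares):
    """Sample standard deviation (n - 1 denominator) from the first two sample moments."""
    population_variance = mean_of_squares - mean ** 2
    return math.sqrt(population_variance * n / (n - 1))


# Hypothetical data, not the exam's numbers.
data = [1.0, 4.0, 4.0, 7.0, 9.0]
n = len(data)
mean = sum(data) / n
mean_of_squares = sum(v * v for v in data) / n

from_moments = sample_std_from_moments(n, mean, mean_of_squares)
direct = statistics.stdev(data)  # uses the n - 1 denominator as well
print(from_moments, direct)
```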
Question A.2 (4+2+2+4=12 marks)
Throughout this question, show your working and leave your answer in a clear form. Of those
reporting to a medical clinic, 2% have medical condition Z. It is assumed that this figure of
2% is also the base rate across the population. There is a test for condition Z such that, for
those patients who have condition Z, 85% will test positive; and for those patients who do
not have condition Z, 25% will test positive.
(a) If a patient tests positive, what is the probability that the patient has condition Z?
After some consideration, it is decided that the test gives too many false positives, and it
is decided to modify the test as follows. The new test is simply to administer the original
test twice, where it is assumed that these two tests give results that are independent of one
another. A patient will be considered to have tested positive on the new test precisely in
those cases where both administrations of the original test return a positive result.
(b) If a patient has condition Z, what is the probability that the patient will test positive
on the new test?
(c) If a patient does not have condition Z, what is the probability that the patient will test
positive on the new test?
(d) If a patient returns a positive result on this new test, what is the probability that the
patient has condition Z?
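The "both administrations positive" rule of the new test can be sketched in a few lines. This is only an illustration under the stated independence assumption (taken to hold within each group of patients), and it uses a hypothetical single-test positive rate of 0.6 rather than the figures in this question; the function names are assumptions made for the example. It compares the product rule with a small simulation.

```python
import random


def both_positive_rate(single_rate):
    """P(both independent administrations are positive), given the single-test rate."""
    return single_rate * single_rate


def simulate_both_positive(single_rate, trials, rng):
    """Monte Carlo estimate of the same probability."""
    hits = 0
    for _ in range(trials):
        first = rng.random() < single_rate
        second = rng.random() < single_rate
        if first and second:
            hits += 1
    return hits / trials


rng = random.Random(0)
hypothetical_rate = 0.6  # illustrative value, not taken from the exam
exact = both_positive_rate(hypothetical_rate)
estimate = simulate_both_positive(hypothetical_rate, 100_000, rng)
print(exact, estimate)
```

The same function applies separately to patients with and without the condition, each with its own single-test rate.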
Question A.3 (2+3+3+2=10 marks)
Consider the probability density function given at the right, defined by
(c) Calculate V [(X + 1)(Y + 1)].
Question A.5 (3+3+3=9 marks)
Consider the probability density function given by a mixture of two Gaussians with identical
standard deviation σ, as
p(x|ρ, µ1, µ2, σ) = ρN(x|µ1, σ) + (1− ρ)N(x|µ2, σ)
where N(·|·) is the probability density function of a Gaussian. Thus the expected value of
function f(x) under this distribution is given by
Eρ,µ1,µ2,σ [f(x)] = ρEN(µ1,σ) [f(x)] + (1− ρ)EN(µ2,σ) [f(x)]
where the two expected values on the right hand side are done using Gaussian distributions.
(a) What is the mean of x for the mixture of two Gaussians?
(b) What is the mean of x² for the mixture of two Gaussians?
(c) What is the variance for the mixture of two Gaussians?
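The expectation identity stated at the start of this question can be checked numerically. The sketch below is an illustration only: the parameter values are hypothetical, the helper names are assumptions made for the example, and the test function f(x) = cos(x) is chosen because its Gaussian expectation has the known closed form cos(µ)·exp(−σ²/2). It compares the weighted sum of the two Gaussian expectations with a Monte Carlo estimate drawn from the mixture.

```python
import math
import random


def gaussian_mean_of_cos(mu, sigma):
    """E[cos(x)] for x ~ N(mu, sigma), with sigma the standard deviation."""
    return math.cos(mu) * math.exp(-sigma ** 2 / 2)


def mixture_mean_of_cos(rho, mu1, mu2, sigma):
    """Weighted sum of the two component expectations."""
    return rho * gaussian_mean_of_cos(mu1, sigma) + (1 - rho) * gaussian_mean_of_cos(mu2, sigma)


def sample_mixture(rng, rho, mu1, mu2, sigma):
    """Pick component 1 with probability rho, then draw from that Gaussian."""
    mu = mu1 if rng.random() < rho else mu2
    return rng.gauss(mu, sigma)


# Hypothetical parameter values chosen for illustration.
rho, mu1, mu2, sigma = 0.3, -1.0, 2.0, 0.5
rng = random.Random(1)
n = 200_000
monte_carlo = sum(math.cos(sample_mixture(rng, rho, mu1, mu2, sigma)) for _ in range(n)) / n
exact = mixture_mean_of_cos(rho, mu1, mu2, sigma)
print(exact, monte_carlo)
```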
Part B (25 marks in total)
Question B.1 (3+2+3=8 marks)
You have data x distributed as Poisson with rate λ = 16, so x ∼ Pois(16).
(a) Show how to use the central limit theorem to get an approximate value for p(10 ≤ x ≤
20). Compute the approximate value, noting that the Z tables are only accurate to 2 decimal
places.
(b) You have a sample of 10 values from this distribution, and compute its mean x̄. What
is an approximate distribution for x̄?
(c) What are 95% confidence intervals for the mean x̄, according to this approximation?
Question B.2 (2+5=7 marks)
IQ is considered to have a mean of 100 and a standard deviation of 15. You expect that
students in your masters class will have a higher mean.
(a) Given a sample of size 10, compute a one-sided 95% confidence interval in the form
(−∞, I] for where the measured mean should lie.
(b) You get data from 10 students with the form [104, 120, 100, 112, 133, 138, 111, 118, 114, 118].
Note that the mean of the sample is 116.8 and the mean of the squares of the sample is 13765.8.
Test the null hypothesis that the students’ IQ has mean 100. Without assuming you know
the standard deviation, give the test statistic and the p-value for this data. Note the tables
of statistics given at the back of the exam will not allow you to look up the p-value precisely.
Question B.3 (2+2+4+2=10 marks)
You obtain paired data (X, Y) with values x = [4.59, 4.60, 6.32, 4.85, 3.27, 5.92, 1.92, 6.90, 4.82, 5.39]
and y = [2.89, 2.46, 3.28, 2.34, 2.11, 3.56, 1.77, 3.29, 2.46, 2.60]. The various sample means (using
the above data) are:
mean of x = 4.859
mean of y = 2.677
mean of x² = 25.516
mean of y² = 7.460
mean of xy = 13.670
(a) What is the correlation coefficient between X and Y ? What does this tell you about
X and Y ?
(b) Fit a simple linear model to this data in the form
Ŷ = β0 + β1 X
What are your estimates for β0 and β1?
(c) What are the standard errors for β0 and β1?
(d) Test the hypothesis that β1 = 0. What is your test statistic and its p-value? What is
the outcome of the test?
Part C (30 marks in total)
Question C.1 (2+2+2=6 marks)
You have a data set supplied as real-valued pairs (X,Y ) and you wish to regress X onto Y .
You have 2 models:
A: a degree-4 polynomial
ŷ = ∑_{i=0}^{4} a_i x^i
B: a degree-20 polynomial
ŷ = ∑_{i=0}^{20} a_i x^i
(a) Describe how the bias of models A and B differ.
(b) Describe how the variance of models A and B differ.
(c) If you had 100 data points in your sample, which of the two models would you recommend?
Justify your answer.
Question C.2 (5+3+2+2=12 marks)
(a) You wish to build a naïve Bayes classifier regressing Booleans A, B and C onto the
Boolean X. Someone has already counted the data for you to create frequency tables below:
       A=0   A=1   B=0   B=1   C=0   C=1
X=0     10    40    30    20    15    35
X=1     30    20     5    45    40    10
Construct probability tables as needed to specify the estimated naïve Bayes classifier for the
task. Then give the formula for the classifier and describe how it would be used.
(b) Consider the probabilities p(A=0|X=0) and p(B=0|X=1). Compute their standard
errors, making any assumptions as needed. What can you say about the resulting estimates?
(c) Which would be better, the naïve Bayesian classifier or the logistic regression classifier
for this data set? Justify your answer.
(d) The first step of the k-means algorithm is to initialise the centroids. Describe a way
this could be done, and why it is OK to use it.
Question C.3 (6 marks)
Consider the probability density function given below, defined by
This is two semi-circles side-by-side of radius 1/2, then scaled by 4/π to get a PDF.
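The description of the density can be made concrete with a short sketch. The figure is not reproduced here, so the placement of the semi-circles is an assumption: the code below centres them at 0.5 and 1.5, giving support [0, 2]. The function name `density` is also an assumption for the example. It evaluates the scaled density and checks numerically that it integrates to approximately 1.

```python
import math

SCALE = 4 / math.pi
RADIUS = 0.5
CENTRES = (0.5, 1.5)  # assumed positions; the original figure is not available


def density(x):
    """Height of the scaled semi-circle covering x, or 0 outside the assumed support."""
    for centre in CENTRES:
        d = x - centre
        if abs(d) <= RADIUS:
            return SCALE * math.sqrt(max(0.0, RADIUS ** 2 - d ** 2))
    return 0.0


# Midpoint-rule check that the density integrates to about 1 over [0, 2].
n = 20_000
h = 2.0 / n
total = sum(density((i + 0.5) * h) for i in range(n)) * h
peak = density(0.5)
print(total, peak)
```

Each semi-circle has area π/8, so the pair has area π/4 and the factor 4/π normalises it; the peak height is (1/2)(4/π) = 2/π.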
(a) Devise pseudo-code for a rejection sampler for this distribution. Note the maximum
value is marked at 2/π.
Question C.4 (5+1=6 marks)
You wish to build a decision tree to predict a three-valued variable X. The first two features to
test are Booleans A and B. Someone has already counted the data for you to create frequency
tables below:
       A=0   A=1   B=0   B=1
X=0     10    40    30    20
X=1     30    20     5    45
X=2     30    20    45     5
(a) Compute and report the quality measure for the attributes A and B using the information
gain metric.
(b) Hence state which attribute is recommended for use at the root of the tree.
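The information gain metric named in part (a) can be sketched as follows. This is an illustration on hypothetical two-class counts, not the table in this question, and it assumes the usual entropy-based definition with logarithms to base 2; the function names are assumptions made for the example.

```python
import math


def entropy(counts):
    """Shannon entropy in bits of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(split_counts):
    """split_counts[v] lists the class counts among rows where the attribute equals v."""
    class_totals = [sum(column) for column in zip(*split_counts)]
    n = sum(class_totals)
    remainder = sum(sum(branch) / n * entropy(branch) for branch in split_counts)
    return entropy(class_totals) - remainder


# Hypothetical splits: one separates the classes perfectly, one not at all.
perfect_split = [[8, 0], [0, 8]]
useless_split = [[4, 4], [4, 4]]
gain_perfect = information_gain(perfect_split)
gain_useless = information_gain(useless_split)
print(gain_perfect, gain_useless)
```

A larger gain indicates an attribute whose branches are purer with respect to the class variable.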
Blank page for additional answers if needed.