STAT5003
Week 13
Review and Final Exam
Presented by
Dr. Justin Wishart
Exam format
– Two-hour written exam
– 20 Multiple Choice questions
– Questions may have one or two correct answers; you must select exactly the
correct answer(s) to receive the mark
– Some short answer questions
– Two longer answer questions
Topics covered
– Everything in the lectures/tutorials from Weeks 1 to 12 (except
any topic that was marked as not examinable)
– Writing R code is not tested, but there could be questions on
interpreting R outputs
– You should understand how the algorithms work and be able to
sketch out the key steps in pseudocode
Methods we have learnt
– Regression
– Multiple linear regression
– Clustering
– Hierarchical clustering
– K-means clustering
– Classification
– Logistic regression
– LDA
– KNN
– SVM
– Random Forest
– Decision trees
– Boosted trees (Adaboost, XGBoost, GBM)
Multiple Regression
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon
– Find the coefficients that minimise the residual sum of squares.
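A minimal R sketch of this idea (not from the lecture slides); the simulated data and the "true" coefficients below are invented for illustration.

    # Simulate data and fit a multiple regression by ordinary least squares
    set.seed(1)
    n  <- 100
    x1 <- rnorm(n); x2 <- rnorm(n)
    y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n)   # illustrative "true" model
    fit <- lm(y ~ x1 + x2)                     # lm() minimises the residual sum of squares
    coef(fit)                                  # estimated coefficients
    sum(residuals(fit)^2)                      # the minimised residual sum of squares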
Local regression (smoothing)
A typical model in this case is
Y = f(X) + \varepsilon
– The function f is assumed to be some smooth (differentiable) function.
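One way to see this in R is loess(); a rough sketch on simulated data, where the span value is an arbitrary illustrative choice.

    # Local regression (loess) on simulated data; span controls the smoothing
    set.seed(2)
    x <- sort(runif(200, 0, 10))
    y <- sin(x) + rnorm(200, sd = 0.3)          # y = f(x) + noise with a smooth f
    fit <- loess(y ~ x, span = 0.3)             # smaller span gives a wigglier fit
    plot(x, y)
    lines(x, predict(fit), col = "red")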
Density estimation
– Maximum Likelihood approach
– Reformulate as
L(\theta) = f(x_1, x_2, \dots, x_n \mid \theta): the probability of observing x_1, x_2, \dots, x_n given the parameter(s) \theta
= \prod_{i=1}^{n} f(x_i \mid \theta)
\Rightarrow \ln L(\theta) = \sum_{i=1}^{n} \ln f(x_i \mid \theta)
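A small R sketch of the idea, maximising the log-likelihood of an exponential sample numerically; the data and the rate are invented for illustration.

    # Maximum likelihood for an Exponential(rate) model via the log-likelihood
    set.seed(3)
    x <- rexp(500, rate = 2)
    loglik <- function(lambda) sum(dexp(x, rate = lambda, log = TRUE))  # sum of ln f(x_i | lambda)
    opt <- optimize(loglik, interval = c(0.01, 10), maximum = TRUE)
    opt$maximum      # numerical MLE of the rate
    1 / mean(x)      # closed-form MLE for comparison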
Kernel density estimation
– Smooths the data with a chosen hyperparameter (bandwidth)
to estimate the density.
\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)
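In R, density() computes this estimate; the sketch below also evaluates the formula by hand at one point, with an arbitrary bandwidth chosen for illustration.

    # Kernel density estimate; bw plays the role of the bandwidth h
    set.seed(4)
    x <- c(rnorm(200, mean = 0), rnorm(100, mean = 4))
    plot(density(x, bw = 0.5))                 # Gaussian kernel by default
    # The formula above evaluated manually at a single point x0
    x0 <- 1; h <- 0.5
    mean(dnorm((x0 - x) / h)) / h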
Hierarchical Clustering
– Bottom-up (agglomerative) clustering approach.
– Each point starts as its own cluster.
– Clusters are formed by repeatedly merging the closest clusters.
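A possible R sketch on the built-in iris measurements; complete linkage and the cut into three clusters are illustrative choices.

    # Agglomerative (bottom-up) hierarchical clustering
    d  <- dist(iris[, 1:4])                    # pairwise distances
    hc <- hclust(d, method = "complete")       # repeatedly merge the closest clusters
    plot(hc)                                   # dendrogram
    cutree(hc, k = 3)                          # cut the tree into 3 clusters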
K-means algorithm
– 1. Randomly allocate the data to K clusters.
– 2. Compute the cluster centres.
– 3. Reassign each point to its closest centre.
– 4. Repeat steps 2 and 3 until the assignments no longer change.
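A short R sketch of the algorithm using the built-in kmeans(); three centres is an illustrative choice for the iris data.

    # K-means clustering with 3 centres
    set.seed(6)
    km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)
    km$centers                                  # final cluster centres
    table(km$cluster, iris$Species)             # compare clusters with the known species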
Principal Components Analysis (PCA)
– Find linear combinations of the variables that maximise the variability.
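A brief R sketch with prcomp() on the iris measurements; scaling the variables first is a common (but not compulsory) choice.

    # PCA on standardised variables
    pca <- prcomp(iris[, 1:4], scale. = TRUE)
    summary(pca)          # proportion of variance explained by each component
    head(pca$x[, 1:2])    # scores on the first two principal components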
PCA and t-SNE
[Figure: PCA projection and t-SNE embedding shown side by side]
Logistic Regression
Logistic regression model:
\log\!\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p = \eta
p(X) = \Pr(Y = 1 \mid X) = h(\eta) = \frac{1}{1 + e^{-\eta}}
[Figure: logistic (sigmoid) curve with probability on the vertical axis (0 to 1) against X on the horizontal axis (-5 to 5)]
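A minimal R sketch with glm(); the simulated data and the "true" coefficients are invented for illustration.

    # Logistic regression on simulated binary data
    set.seed(8)
    x <- rnorm(200)
    p <- 1 / (1 + exp(-(-1 + 2 * x)))           # illustrative "true" probabilities
    y <- rbinom(200, size = 1, prob = p)
    fit <- glm(y ~ x, family = binomial)
    coef(fit)                                                      # estimates of beta_0 and beta_1
    predict(fit, newdata = data.frame(x = 0), type = "response")   # fitted probability at x = 0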
Linear Discriminant Analysis (LDA)
\pi_k: probability of coming from class k (prior probability)
f_k(x): density function for X given that X is an observation from class k
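A possible R sketch, assuming the MASS package is installed; iris is used purely as a convenient example.

    # Linear discriminant analysis
    library(MASS)
    fit  <- lda(Species ~ ., data = iris)        # estimates the priors pi_k and class densities f_k(x)
    pred <- predict(fit, iris)
    table(pred$class, iris$Species)              # confusion matrix on the training data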
Cross validation
– Fitting a model to the entire dataset can overfit the data and not
perform well on new data.
– Split the data into training and test sets to alleviate this and find
the right bias/variance trade-off.
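A rough sketch of 5-fold cross-validation written by hand in R; the model and the mtcars data are illustrative stand-ins.

    # 5-fold cross-validation for a simple regression model
    set.seed(10)
    folds <- sample(rep(1:5, length.out = nrow(mtcars)))     # random fold labels
    cv_mse <- sapply(1:5, function(k) {
      fit <- lm(mpg ~ wt + hp, data = mtcars[folds != k, ])  # train on 4 folds
      mean((mtcars$mpg[folds == k] -
              predict(fit, mtcars[folds == k, ]))^2)         # test on the held-out fold
    })
    mean(cv_mse)    # cross-validated estimate of the test error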
Bootstrap
– Simulate related datasets by resampling the observed data with replacement and
examine the statistical performance across the resampled datasets.
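A small R sketch of the idea, estimating the standard error of a sample median; the data are simulated for illustration.

    # Bootstrap standard error of the sample median
    set.seed(11)
    x <- rnorm(100)
    boot_medians <- replicate(2000, median(sample(x, replace = TRUE)))  # resample with replacement
    sd(boot_medians)    # bootstrap estimate of the standard error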
Support Vector Machines (SVM)
– Find the best hyperplane or boundary to separate data into
classes.
– Image taken from
https://en.wikipedia.org/wiki/Support_vector_machine
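A short R sketch, assuming the e1071 package is installed; the linear kernel and cost value are illustrative choices.

    # Support vector machine with a linear kernel
    library(e1071)
    fit <- svm(Species ~ ., data = iris, kernel = "linear", cost = 1)
    table(predict(fit, iris), iris$Species)    # training confusion matrix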
Missing Data
– Remove missing data (complete cases)
– Single Imputation
– Multiple imputation
– Expert knowledge of reasons for missing data.
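A toy R sketch contrasting complete cases with single (mean) imputation; the numbers are invented.

    # Complete cases versus simple mean imputation
    x <- c(1.2, NA, 3.4, 2.8, NA, 4.1)
    x[!is.na(x)]                                   # complete cases only
    x_imp <- x
    x_imp[is.na(x_imp)] <- mean(x, na.rm = TRUE)   # single imputation with the mean
    x_imp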
Basic decision trees
– Partition the feature space into rectangular regions that minimise the
deviation of the outcome within each region.
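A minimal R sketch, assuming the rpart package is installed; mtcars is used only as a convenient example.

    # Regression tree: recursive binary splits into rectangular regions
    library(rpart)
    fit <- rpart(mpg ~ wt + hp, data = mtcars)
    fit                           # printed splits and the mean outcome in each region
    predict(fit, mtcars[1:3, ])   # predictions are the region means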
Bagging trees and random forests
– Use the bootstrap to fit trees to resampled datasets and
average their results.
– \hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)
– Random forests additionally sample a random subset of the predictors at
each split to further decorrelate the trees.
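A brief R sketch, assuming the randomForest package is installed; the number of trees and the mtry value are illustrative.

    # Random forest (bagged trees with random predictor subsets at each split)
    library(randomForest)
    set.seed(15)
    fit <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
    fit                      # includes the out-of-bag error estimate
    importance(fit)          # variable importance across the ensemble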
Boosting
– Fit tree to residuals and learn slowly
– Slowly improve the fit in areas where the model doesn’t
perform well.
– Some boosting algorithms discussed
– AdaBoost
– Stochastic gradient boosting
– XGBoost
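The sketch below illustrates the "fit trees to the residuals and learn slowly" idea by hand, using rpart for the weak learners; the learning rate, tree depth and number of rounds are arbitrary illustrative choices, not the exact algorithms listed above.

    # Hand-rolled boosting of small regression trees
    library(rpart)
    y    <- mtcars$mpg
    pred <- rep(mean(y), nrow(mtcars))    # start from the overall mean
    lr   <- 0.1                           # learning rate (shrinkage)
    for (b in 1:100) {
      r    <- y - pred                                          # current residuals
      fit  <- rpart(r ~ wt + hp, data = mtcars, maxdepth = 2)   # weak learner fitted to residuals
      pred <- pred + lr * predict(fit, mtcars)                  # improve the fit slowly
    }
    mean((y - pred)^2)    # training error after boosting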
Feature Selection
– Filter selection via fold changes.
– Best subset selection.
– Forward selection.
– Backward selection.
– Choose model that minimises test error
– Directly via test set
– Indirectly via penalised criterion.
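A compact R sketch of forward selection with a penalised criterion (AIC), using step() from base R; mtcars is an illustrative dataset.

    # Forward selection guided by AIC
    null_fit <- lm(mpg ~ 1, data = mtcars)        # intercept-only model
    full_fit <- lm(mpg ~ ., data = mtcars)        # model with all candidate predictors
    fwd <- step(null_fit, scope = formula(full_fit),
                direction = "forward", trace = 0)
    formula(fwd)    # predictors chosen by the penalised criterion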
Ridge Regression and Lasso
– Constrained optimisation techniques that minimise the residual sum of squares
subject to different constraints (penalties) on the coefficients.
– The lasso has the extra benefit of performing feature selection, since some
coefficients are shrunk exactly to zero.
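A short R sketch, assuming the glmnet package is installed; alpha = 0 gives ridge and alpha = 1 gives the lasso.

    # Ridge and lasso with the penalty chosen by cross-validation
    library(glmnet)
    x <- as.matrix(mtcars[, -1]); y <- mtcars$mpg
    ridge <- cv.glmnet(x, y, alpha = 0)
    lasso <- cv.glmnet(x, y, alpha = 1)
    coef(lasso, s = "lambda.min")    # lasso sets some coefficients exactly to zero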
Monte Carlo Methods
– Repeated simulation to estimate the full distribution and
summary values.
– Exploits law of large numbers.
– Can sample from f if the inverse of the distribution function F exists; then,
with U uniform on (0, 1), we can generate X = F^{-1}(U).
– Acceptance rejection method to handle more difficult
distributions.
\theta = \int g(x) \, f(x) \, dx \approx \frac{1}{n} \sum_{i=1}^{n} g(x_i)
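A minimal R sketch combining inverse-transform sampling with a Monte Carlo average; the Exponential(2) target and the function g(x) = x^2 are illustrative choices.

    # Inverse-transform sampling and a Monte Carlo estimate of E[g(X)]
    set.seed(19)
    u <- runif(100000)
    x <- -log(1 - u) / 2     # X = F^{-1}(U) for an Exponential(rate = 2)
    mean(x^2)                # Monte Carlo estimate of E[X^2]
    2 / 2^2                  # exact value (0.5) for comparison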
Markov Chain Monte Carlo
– Widely used for fitting Bayesian models.
– Simulates a process (random variable that changes over time)
– Simulate new point based off the current point.
– Can estimate even more complex distributions than plain Monte
Carlo methods.
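A rough R sketch of a random-walk Metropolis sampler; the standard normal target and the proposal standard deviation are illustrative choices, not a specific algorithm from the lectures.

    # Random-walk Metropolis for a standard normal target
    set.seed(20)
    n <- 10000; x <- numeric(n); x[1] <- 0
    for (t in 2:n) {
      prop  <- x[t - 1] + rnorm(1, sd = 1)                 # propose a new point near the current one
      ratio <- dnorm(prop) / dnorm(x[t - 1])               # target density ratio
      x[t]  <- if (runif(1) < ratio) prop else x[t - 1]    # accept or keep the current point
    }
    mean(x); sd(x)    # should be close to 0 and 1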
Methods and metrics to evaluate models
– Sensitivity and specificity
– Accuracy
– Residual sum of squares (for regression)
– ROC curves and AUC
– K-fold cross-validation
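A toy R sketch computing accuracy, sensitivity and specificity from a confusion matrix; the labels below are invented.

    # Confusion-matrix based metrics (positive class coded as 1)
    truth <- factor(c(1, 1, 0, 0, 1, 0, 1, 0), levels = c(0, 1))
    pred  <- factor(c(1, 0, 0, 0, 1, 1, 1, 0), levels = c(0, 1))
    tab <- table(pred, truth)
    sum(diag(tab)) / sum(tab)         # accuracy
    tab["1", "1"] / sum(tab[, "1"])   # sensitivity = TP / (TP + FN)
    tab["0", "0"] / sum(tab[, "0"])   # specificity = TN / (TN + FP)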
Example multiple choice question
Which of the following method(s) is/are unsupervised learning
methods?
A. K-means clustering
B. Logistic regression
C. Random forest
D. Support vector machines
Example short answer question
a. Explain how the parameters are estimated in simple least
squares regression.
b. Explain a scenario where simple linear regression is not
appropriate.
c. Compute the predicted weight for a person who is 160 cm tall
and compute the residual of the first person in the table below.
Sample   X: Height (cm)   Y: Weight (kg)
1        160              60
2        170.2            77
3        172              62

\widehat{\text{Weight}} = 50.412 + 0.0634 \times \text{Height}
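For part (c), a quick check in R using the coefficients exactly as printed above (treating x as height in cm):

    b0 <- 50.412; b1 <- 0.0634
    b0 + b1 * 160           # predicted weight (kg) at 160 cm
    60 - (b0 + b1 * 160)    # residual for the first person: observed minus predicted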
Example long answer question
– Describe the Markov Chain Monte Carlo procedure. You may
use pseudocode as part of your answer.