
The University of Sydney Page 1 
STAT5003 
Week 13 
Review and Final Exam 
Presented by 
Dr. Justin Wishart 
The University of Sydney Page 2 
Exam format 
– Two hour written exam 
– 20 Multiple Choice questions 
– Questions can have one or two correct answers. You need to select the 
exact correct answer(s) to get a mark 
– Some short answer questions 
– Two longer answer questions 
The University of Sydney Page 3 
Topics covered 
– Everything in the lectures/tutorials from Weeks 1 to 12 (except 
any topic that was marked as not examinable) 
– Writing R code is not tested, but there could be questions on 
interpreting R outputs 
– You should understand how the algorithms work and be able to 
sketch out the key steps in pseudo code 
The University of Sydney Page 4 
Methods we have learnt 
– Regression 
  – Multivariate linear regression 
– Clustering 
  – Hierarchical clustering 
  – K-means clustering 
– Classification 
  – Logistic regression 
  – LDA 
  – KNN 
  – SVM 
  – Random forest 
  – Decision trees 
  – Boosted trees (AdaBoost, XGBoost, GBM) 
The University of Sydney Page 5 
Multiple Regression 
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$
– Find the coefficients that minimise the residual sum of squares.
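A minimal R sketch (the mtcars variables are chosen purely for illustration):

```r
# Fit a multiple linear regression of fuel economy on weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)          # coefficients, standard errors, R-squared
sum(residuals(fit)^2) # the residual sum of squares that lm() minimises
```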
The University of Sydney Page 6 
Local regression (smoothing) 
A typical model in this case is
$Y = f(X) + \varepsilon$
– The function f is some smooth (differentiable) function.
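For example, loess() fits a local regression in R (the data here are simulated):

```r
# Local regression (loess) smooth of a noisy signal
set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)
fit <- loess(y ~ x, span = 0.3)   # span controls how smooth the fit is
plot(x, y)
lines(x, predict(fit), col = "red", lwd = 2)
```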
The University of Sydney Page 7 
Density estimation 
– Maximum Likelihood approach 
– Reformulate as
$L(\theta) = f(x_1, x_2, \dots, x_n \mid \theta)$: the probability of observing $x_1, x_2, \dots, x_n$ given parameter(s) $\theta$
$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta) \;\Rightarrow\; \ln L(\theta) = \sum_{i=1}^{n} \ln f(x_i \mid \theta)$
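A minimal R sketch of the idea, maximising the log-likelihood numerically for a normal sample (parameterising the standard deviation on the log scale is just a convenience to keep it positive):

```r
# Maximum likelihood for a normal sample via numerical optimisation
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)
negloglik <- function(par) {
  # par[2] is log(sd), so the standard deviation stays positive
  -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))
}
est <- optim(c(0, 0), negloglik)$par
c(mean = est[1], sd = exp(est[2]))   # close to the sample mean and sd
```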
The University of Sydney Page 8 
Kernel density estimation 
– Smooths the data with a chosen hyperparameter (bandwidth) 
to estimate the density. 
$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$
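In R, density() implements this, with bw playing the role of the bandwidth h:

```r
# Kernel density estimates with two bandwidths on a built-in dataset
x <- faithful$eruptions
plot(density(x, bw = 0.1), main = "KDE of eruption times")
lines(density(x, bw = 0.5), col = "red")  # larger bandwidth = smoother estimate
```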
The University of Sydney Page 9 
Hierarchical Clustering 
– Bottom-up clustering approach. 
– Each point is its own cluster 
– Clusters are formed by repeatedly 
merging the closest pairs 
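A minimal R sketch using the built-in iris measurements:

```r
# Agglomerative clustering: every point starts as its own cluster,
# then the closest clusters are merged step by step
d  <- dist(iris[, 1:4])              # pairwise Euclidean distances
hc <- hclust(d, method = "complete")
plot(hc)                             # dendrogram of the merges
cutree(hc, k = 3)                    # cut the tree into 3 clusters
```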
The University of Sydney Page 10 
K-means algorithm 
– 1. Data randomly allocated to k clusters. 
– 2. Cluster centres computed. 
– 3. Each point matched to its closest 
centre. 
– Repeat steps 2 and 3 until the assignments stop changing. 
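In R (the nstart argument reruns the random allocation several times and keeps the best result):

```r
# K-means with k = 3 on the iris measurements
set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
km$centers                        # final cluster centres
table(km$cluster, iris$Species)   # compare clusters with the known species
```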
The University of Sydney Page 11 
Principal Components Analysis (PCA) 
– Find linear combinations of the variables that maximise the 
variability. 
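A minimal R sketch:

```r
# PCA on the scaled iris measurements
pc <- prcomp(iris[, 1:4], scale. = TRUE)  # scale so variables are comparable
summary(pc)   # proportion of variance explained by each component
biplot(pc)    # observations plus variable loadings
```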
The University of Sydney Page 12 
PCA and t-SNE 
[Figure: the same dataset embedded in two dimensions by PCA (left) and t-SNE (right)]
The University of Sydney Page 13 
Logistic Regression 
Logistic regression model:
$\eta = \log\frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p = \beta^\top x$
$p = \Pr(Y = 1 \mid x) = h(\beta^\top x) = \frac{1}{1 + e^{-\beta^\top x}}$
[Figure: the logistic (sigmoid) curve mapping $\beta^\top x \in [-5, 5]$ to a probability, crossing 0.5 at 0]
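In R this corresponds to glm() with a binomial family (the mtcars variables are illustrative):

```r
# Probability of a manual transmission (am = 1) from weight and horsepower
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(fit)                                              # the beta coefficients
predict(fit, newdata = mtcars[1:3, ], type = "response")  # fitted probabilities
```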
The University of Sydney Page 14 
Linear Discriminant Analysis (LDA) 
$\pi_k$: probability of coming from class k (prior probability) 
$f_k(x)$: density function for X given that X is an observation from 
class k 
Bayes' theorem combines these into the posterior $\Pr(Y = k \mid X = x) = \pi_k f_k(x) \big/ \sum_{l=1}^{K} \pi_l f_l(x)$
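A minimal sketch with MASS::lda() (MASS ships with standard R installations):

```r
library(MASS)
fit <- lda(Species ~ ., data = iris)
fit$prior                         # estimated class priors pi_k
pred <- predict(fit, iris)
table(pred$class, iris$Species)   # confusion matrix on the training data
```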
The University of Sydney Page 15 
Cross validation 
– Fitting a model to the entire dataset can overfit the data and 
perform poorly on new data 
– Split the data into training and test sets to alleviate this and find 
the right bias/variance trade-off. 
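A minimal hold-out sketch in R:

```r
# Simple hold-out split: fit on training data, evaluate on the test set
set.seed(1)
idx   <- sample(nrow(mtcars), size = floor(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]
fit   <- lm(mpg ~ wt + hp, data = train)
mean((test$mpg - predict(fit, test))^2)  # test-set mean squared error
```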
The University of Sydney Page 16 
Bootstrap 
– Simulate related data (sampling with replacement) and 
examine statistical performance across all the resampled datasets. 
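For example, bootstrapping the standard error of a median:

```r
# Bootstrap standard error and percentile interval for the sample median
set.seed(1)
x <- rnorm(100)
boot_medians <- replicate(2000, median(sample(x, replace = TRUE)))
sd(boot_medians)                          # bootstrap standard error
quantile(boot_medians, c(0.025, 0.975))   # percentile 95% interval
```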
The University of Sydney Page 17 
Support Vector Machines (SVM) 
– Find the best hyperplane or boundary to separate data into 
classes. 
– Image taken from 
https://en.wikipedia.org/wiki/Support_vector_machine 
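A minimal sketch, assuming the e1071 package is installed:

```r
library(e1071)
fit <- svm(Species ~ ., data = iris, kernel = "linear", cost = 1)
table(predict(fit, iris), iris$Species)   # training confusion matrix
```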
The University of Sydney Page 18 
Missing Data 
– Remove missing data (complete cases) 
– Single Imputation 
– Multiple imputation 
– Expert knowledge of reasons for missing data. 
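A sketch contrasting complete cases with a single mean imputation (the airquality data are just an example, and mean imputation is shown only as the simplest case):

```r
df <- airquality                  # built-in data with NAs in Ozone
nrow(df); nrow(na.omit(df))       # full vs complete-case sample size
df$Ozone[is.na(df$Ozone)] <- mean(df$Ozone, na.rm = TRUE)  # single (mean) imputation
```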
The University of Sydney Page 19 
Basic decision trees 
– Partition space into rectangular regions that minimise outcome 
deviation. 
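A minimal sketch with rpart (a recommended package bundled with R):

```r
library(rpart)
fit <- rpart(mpg ~ wt + hp, data = mtcars)  # recursive binary partitioning
plot(fit); text(fit)                        # draw the fitted tree
```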
The University of Sydney Page 20 
Bagging trees and random forests 
– Use bootstrap technique to 
create resampled trees and 
average the result. 
– $\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$
– Random forests additionally sample a random subset of 
predictors at each split to decorrelate the trees and improve the model. 
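A minimal sketch, assuming the randomForest package is installed:

```r
library(randomForest)
set.seed(1)
fit <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
fit$err.rate[500, "OOB"]   # out-of-bag error after all 500 trees
```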
The University of Sydney Page 21 
Boosting 
– Fit tree to residuals and learn slowly 
– Slowly improve the fit in areas where the model doesn’t 
perform well. 
– Some boosting algorithms discussed 
– AdaBoost 
– Stochastic gradient boosting 
– XGBoost 
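A minimal stochastic-gradient-boosting sketch, assuming the gbm package is installed (the small shrinkage value is the "learn slowly" part):

```r
library(gbm)
set.seed(1)
fit <- gbm(mpg ~ wt + hp, data = mtcars, distribution = "gaussian",
           n.trees = 1000, shrinkage = 0.01)   # small shrinkage: learn slowly
summary(fit)   # relative influence of each predictor
```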
The University of Sydney Page 22 
Feature Selection 
– Filter selection via fold changes. 
– Best subset selection. 
– Forward selection. 
– Backward selection. 
– Choose model that minimises test error 
– Directly via test set 
– Indirectly via penalised criterion. 
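For example, backward selection against the AIC penalised criterion with base R's step():

```r
full <- lm(mpg ~ ., data = mtcars)
best <- step(full, direction = "backward", trace = 0)  # drop terms while AIC improves
formula(best)   # the selected model
```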
The University of Sydney Page 23 
Ridge Regression and Lasso 
– Constrained optimisation techniques that minimise the residual sum of 
squares subject to different constraints (an $\ell_2$ penalty for ridge, $\ell_1$ for the lasso). 
– The lasso has the extra benefit of performing feature selection as a free 
bonus. 
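A minimal sketch, assuming the glmnet package is installed (alpha = 0 gives ridge, alpha = 1 the lasso):

```r
library(glmnet)
x  <- as.matrix(mtcars[, -1])          # predictor matrix
y  <- mtcars$mpg
cv <- cv.glmnet(x, y, alpha = 1)       # lasso with cross-validated lambda
coef(cv, s = "lambda.min")             # some coefficients shrunk exactly to zero
```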
The University of Sydney Page 24 
Monte Carlo Methods 
– Repeated simulation to estimate the full distribution and 
summary values. 
– Exploits law of large numbers. 
– Can sample from $f$ if the inverse of the CDF $F$ exists: draw $U \sim \mathrm{Uniform}(0, 1)$, then 
generate $X$ as $X = F^{-1}(U)$. 
– Acceptance-rejection method to handle more difficult 
distributions. 
$\mu = \int g(x) f(x)\,dx \approx \frac{1}{n} \sum_{i=1}^{n} g(x_i)$, with $x_1, \dots, x_n$ drawn from $f$
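A minimal inverse-transform example in R, sampling from an Exp(2) distribution:

```r
set.seed(1)
u <- runif(1e5)
x <- -log(1 - u) / 2   # F^{-1}(u) for Exp(rate = 2)
mean(x)                # approximates E[X] = 1/2 by the law of large numbers
```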
The University of Sydney Page 25 
Markov Chain Monte Carlo 
– Widely used in Bayesian modelling. 
– Simulates a stochastic process (a random variable that changes over time) 
– Simulate each new point based off the current point. 
– Can estimate even more complex distributions than plain Monte 
Carlo methods. 
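A minimal random-walk Metropolis sketch (targeting a standard normal, so the answer can be checked):

```r
set.seed(1)
n <- 10000
x <- numeric(n)   # chain starts at x[1] = 0
for (i in 2:n) {
  prop <- x[i - 1] + rnorm(1)                      # propose near the current point
  if (runif(1) < dnorm(prop) / dnorm(x[i - 1])) {  # Metropolis acceptance ratio
    x[i] <- prop
  } else {
    x[i] <- x[i - 1]                               # reject: stay at current point
  }
}
c(mean(x), sd(x))   # roughly 0 and 1 once the chain has mixed
```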
The University of Sydney Page 26 
Methods and metrics to evaluate models 
– Sensitivity and specificity 
– Accuracy 
– Residual sum of squares (for regression) 
– ROC curves and AUC 
– K-fold cross-validation 
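A sketch computing the first three metrics from a toy confusion matrix:

```r
pred  <- c(1, 1, 0, 0, 1, 0, 1, 0)   # toy predicted labels
truth <- c(1, 0, 0, 0, 1, 1, 1, 0)   # toy true labels
tp <- sum(pred == 1 & truth == 1); fn <- sum(pred == 0 & truth == 1)
tn <- sum(pred == 0 & truth == 0); fp <- sum(pred == 1 & truth == 0)
c(sensitivity = tp / (tp + fn),      # true positive rate
  specificity = tn / (tn + fp),      # true negative rate
  accuracy    = (tp + tn) / length(truth))
```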
The University of Sydney Page 27 
Example multiple choice question 
Which of the following method(s) is/are unsupervised learning 
methods? 
A. K means clustering 
B. Logistic regression 
C. Random forest 
D. Support vector machines 
The University of Sydney Page 28 
Example short answer question 
a. Explain how the parameters are estimated in simple least 
squares regression. 
b. Explain a scenario where simple linear regression is not 
appropriate. 
c. Compute the predicted weight for a person who is 160 cm tall 
and compute the residual for the first person in the table below. 
Sample   X: Height (cm)   Y: Weight (kg)
1        160              60
2        170.2            77
3        172              62
$\hat{Y} = 50.412 + 0.0634X$
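A worked check using the fitted line above: $\hat{Y}(160) = 50.412 + 0.0634 \times 160 = 60.556$ kg, so the residual for the first person is $e_1 = 60 - 60.556 = -0.556$ kg.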
The University of Sydney Page 29 
Example long answer question 
– Describe the Markov Chain Monte Carlo procedure. You may 
use pseudo code as part of your answer. 