Tutoring for DATA7202 Assignment 3: Python Programming Walkthrough

Statistical Methods for Data Science 
DATA7202 
Semester 1, 2020 
Assignment 3 (Weight: 20%) 
Assignment 3 is due on 2 June, 2020, 2:00pm. 
There are five questions below. For questions 1 and 3, you should present your 
analysis of data using Python, Matlab, or R, as a short report, clearly answering the 
objectives and justifying the modeling (and hence statistical analysis) choices you make, 
as well as discussing your conclusions. Do not include excessive amounts of output in 
your reports, though you can append additional output (with explanation) to your report 
as an appendix. 
1. (10%) Consider the function 
$f(x) = 3x + x^2 - 200\cos(x)$, $1 \le x \le 8$. 
Write a Crude Monte Carlo algorithm for the estimation of 
$\ell = \int_1^8 f(x)\,dx$ 
using a sample size of $N = 10000$. Report the 95% confidence interval. Compare the 
obtained estimate with the true value $\ell$. 
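A minimal sketch of the crude Monte Carlo idea in question 1, assuming the target is the integral of f over [1, 8] and that X is drawn uniformly on that interval. The function names, the seed, and the normal-approximation interval (1.96 standard errors) are illustrative choices, not part of the assignment.

```python
import numpy as np

def f(x):
    return 3 * x + x**2 - 200 * np.cos(x)

def crude_monte_carlo(n=10_000, a=1.0, b=8.0, seed=0):
    """Estimate the integral of f over [a, b] from n uniform draws.

    Returns the point estimate and an approximate 95% confidence interval.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(a, b, size=n)
    y = (b - a) * f(x)                      # E[y] equals the integral
    est = y.mean()
    half_width = 1.96 * y.std(ddof=1) / np.sqrt(n)
    return est, (est - half_width, est + half_width)

def true_value(a=1.0, b=8.0):
    """Exact integral from the antiderivative 1.5 x^2 + x^3 / 3 - 200 sin(x)."""
    F = lambda x: 1.5 * x**2 + x**3 / 3 - 200 * np.sin(x)
    return F(b) - F(a)
```

Calling `crude_monte_carlo()` and comparing the result with `true_value()` (about 235.26) shows whether the interval covers the exact integral for a given seed.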
2. (10%) Consider the following variant of the cross-validation procedure. 
(i) Using the available data, find a subset of “good” predictors that show 
correlation with the response variable. 
(ii) Using these predictors, construct a model (for regression or classification). 
(iii) Use cross-validation to estimate the model prediction error. 
Is this a good method? Do you expect to obtain the true prediction error? Explain 
your answer. 
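One way to explore question 2 empirically is to run the procedure on data where the response is pure noise, so no model can truly predict it. The sketch below assumes scikit-learn is available; the sample size, number of predictors, and number of selected predictors are arbitrary illustrative values. It compares screening predictors on the full data before cross-validation with repeating the screening inside each training fold.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

def compare_cv_schemes(n=50, p=2000, k=10, seed=0):
    """Return (cv_mse_screen_first, cv_mse_screen_inside_folds) on noise data."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)   # independent of X, so the true MSE is at least Var(y) = 1
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)

    # Variant from the question: screening sees every response,
    # including the ones that are later held out.
    X_screened = SelectKBest(f_regression, k=k).fit_transform(X, y)
    screen_first = -cross_val_score(
        LinearRegression(), X_screened, y, cv=cv,
        scoring="neg_mean_squared_error").mean()

    # Alternative: screening is refitted on each training fold only.
    pipe = make_pipeline(SelectKBest(f_regression, k=k), LinearRegression())
    screen_inside = -cross_val_score(
        pipe, X, y, cv=cv, scoring="neg_mean_squared_error").mean()
    return screen_first, screen_inside
```

If the first number comes out well below the second, the held-out folds have leaked into the predictor selection, which is the effect the question asks you to explain.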
3. Consider the Hitters data-set (given in Hitters.csv). Our objective is to predict a 
hitter’s salary via linear models. 
(a) (5%) Load the data-set and replace all categorical values with numbers. (You 
can use the LabelEncoder object in Python). 
(b) (5%) Fit linear regression and report 10-Fold Cross-Validation mean squared 
error. 
(c) (10%) Apply Principal Component Regression (PCR) with every possible number 
of principal components. Using 10-Fold Cross-Validation, plot the mean 
squared error as a function of the number of components and determine the 
optimal number of components. 
(d) (10%) Apply the Lasso method and plot the 10-Fold Cross-Validation mean 
squared error as a function of λ. Determine the best λ and the corresponding 
mean squared error. 
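The four parts of question 3 can be sketched with pandas and scikit-learn as below. Several things here are assumptions rather than part of the assignment: the response column is taken to be named `Salary`, rows with missing values are dropped, the file holds only predictors and the response, predictors are standardized before PCA and the Lasso, and the λ grid (called `alpha` in scikit-learn) is an arbitrary log-spaced range. The helper names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

def encode_categoricals(df):
    """(a) Replace every non-numeric column with integer codes."""
    out = df.copy()
    for col in out.select_dtypes(exclude="number").columns:
        out[col] = LabelEncoder().fit_transform(out[col].astype(str))
    return out

def cv_mse(model, X, y, cv):
    """(b) Mean squared error averaged over the cross-validation folds."""
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    return -scores.mean()

def pcr_cv_curve(X, y, cv):
    """(c) CV MSE of PCR for 1, 2, ..., p principal components."""
    return np.array([
        cv_mse(make_pipeline(StandardScaler(), PCA(n_components=m),
                             LinearRegression()), X, y, cv)
        for m in range(1, X.shape[1] + 1)
    ])

def lasso_cv_curve(X, y, cv, lambdas):
    """(d) CV MSE of the Lasso for each penalty value in lambdas."""
    return np.array([
        cv_mse(make_pipeline(StandardScaler(),
                             Lasso(alpha=lam, max_iter=100_000)), X, y, cv)
        for lam in lambdas
    ])

def analyse(path="Hitters.csv", target="Salary"):
    """Run all four steps; 'Salary' as the target name is an assumption."""
    df = encode_categoricals(pd.read_csv(path).dropna())
    X = df.drop(columns=target).to_numpy(float)
    y = df[target].to_numpy(float)
    cv = KFold(n_splits=10, shuffle=True, random_state=0)
    lambdas = np.logspace(-2, 3, 30)         # illustrative grid
    pcr = pcr_cv_curve(X, y, cv)
    lasso = lasso_cv_curve(X, y, cv, lambdas)
    return {
        "ols_mse": cv_mse(LinearRegression(), X, y, cv),
        "pcr_mse": pcr, "pcr_best_m": int(pcr.argmin()) + 1,
        "lasso_mse": lasso, "lasso_best_lambda": lambdas[lasso.argmin()],
    }
```

The arrays under `pcr_mse` and `lasso_mse` can be passed to any plotting library to draw the two requested curves. Putting the scaler and PCA inside the pipeline means they are refitted on each training fold, so the held-out fold does not influence them.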
4. (10%) Specify a method to generate a random variable from the discrete pdf 
$$f(x) = \begin{cases} \dfrac{1}{n+1}, & x = 0, 1, 2, \ldots, n, \\ 0, & \text{otherwise.} \end{cases}$$ 
Discuss the time complexity of your method in terms of n, e.g. is it O(n), O(ln(n)), 
etc. Give a short explanation (at most 2 sentences) for your answer. 
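One candidate method for question 4 is the inverse-transform method, which for a uniform distribution on {0, ..., n} reduces to taking the floor of (n+1)U with U uniform on [0, 1). The sketch below is an illustration under that choice; the function name and the `rng` parameter are hypothetical.

```python
import random

def sample_discrete_uniform(n, rng=random):
    """Draw X with P(X = x) = 1/(n+1) for x in {0, 1, ..., n}."""
    u = rng.random()                  # U ~ Uniform[0, 1)
    x = int((n + 1) * u)              # floor((n+1) U)
    return min(x, n)                  # defensive guard against float rounding
```

This uses one uniform draw, one multiplication, and one truncation, so the work does not grow with n if arithmetic on numbers of that size is treated as constant time. A sequential search through the cumulative probabilities would instead take O(n) steps in the worst case.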
5. Answer the following questions. 
(a) (10%) Let $X$ be a random variable and consider the estimation of the 
probability $\ell_\gamma = P(X > \gamma)$ for some large $\gamma \in \mathbb{R}$. The Crude Monte Carlo (CMC) 
estimator of $\ell_\gamma$ is 
$$\hat{\ell}_\gamma = \frac{1}{N} \sum_{i=1}^{N} Z_i, \qquad (1)$$ 
where, for $i = 1, \ldots, N$, $Z_i = \mathbb{1}\{X_i > \gamma\}$ is the indicator random variable and 
$X_1, \ldots, X_N$ are iid copies of $X$. Find the squared coefficient of variation $\mathrm{CV}^2$ of 
$Z$. (Recall that $\mathrm{CV}^2 = \mathrm{Var}(Z)/(E[Z])^2$.) 
(b) (10%) Find the relative error of the estimator $\hat{\ell}_\gamma$ in terms of $N$ and $\ell_\gamma$. 
(c) (20%) The estimator (1) of $\ell_\gamma = E(Z)$ is said to be logarithmically efficient if 