
DATA7202 Assignment 1

Statistical Methods for Data Science

DATA7202

Semester 1, 2022

Assignment 1 (Weight: 25%)

Assignment 1 is due on 31 Mar 2022 (16:00).

Please answer the questions below. For theoretical questions, you should present rigorous proofs and appropriate explanations. Your report should be visually appealing, and all questions should be answered in the order of their appearance. For programming questions, you should present your analysis of the data using Python, Matlab, or R as a short report, clearly answering the objectives, justifying the modeling (and hence statistical analysis) choices you make, and discussing your conclusions. Do not include excessive amounts of output in your report. All code should be copied into the appendix, and the sources should be packaged separately and submitted on Blackboard in a zipped folder with the name:

"student_last_name.student_first_name.student_id.zip".

For example, suppose that the student name is John Smith and the student ID is 123456789. Then the zipped file name will be John.Smith.123456789.zip.

1. [15 Marks] Repeat the advertisement exercise with the following changes.

(a) The data is generated via the following data generation mechanism: $X_i \sim \mathrm{Gamma}(1, 1)$ for $i \in \{1, 2, 3\}$; here $\mathrm{Gamma}(1, 1)$ stands for the continuous Gamma distribution with both scale and shape parameters equal to 1.

(b) In addition, the model for $Y$ is as follows:

$$Y = 0.5X_1 + 3X_2 + 5X_3 + 5X_2X_3 + 2X_1X_2X_3 + W, \tag{1}$$

where $W \sim N(0, \sigma^2)$ with $\sigma = 2$.

Similar to the original example, generate train and test sets of size N = 1000. Fit the linear regression and the random forest models to the data. For the linear regression, make an inference about the coefficients; specifically, comment on the contributions of the different advertisement types to sales. Use the linear model and the RF (with 500 trees) to make predictions on the test set, and report the corresponding mean squared errors.

When constructing the datasets, please use seeds “1” and “2” for the train and the test sets, respectively.
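The data generation mechanism above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the course's reference code: the function names `generate_dataset` and `mean_squared_error` are hypothetical, and the random streams produced by `random.Random(seed)` will not match those of NumPy or R for the same seed values. Model fitting is deliberately left out.

```python
import random


def generate_dataset(n, seed, sigma=2.0):
    """Draw n rows (x1, x2, x3, y) following the mechanism in Question 1."""
    rng = random.Random(seed)  # dedicated generator, independent of global state
    rows = []
    for _ in range(n):
        # gammavariate(alpha, beta): alpha is the shape, beta is the scale
        x1, x2, x3 = (rng.gammavariate(1.0, 1.0) for _ in range(3))
        # gauss takes the standard deviation (sigma), not the variance
        w = rng.gauss(0.0, sigma)
        y = 0.5 * x1 + 3 * x2 + 5 * x3 + 5 * x2 * x3 + 2 * x1 * x2 * x3 + w
        rows.append((x1, x2, x3, y))
    return rows


def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


# Seeds 1 and 2 for the train and test sets, as the question requests.
train = generate_dataset(1000, seed=1)
test = generate_dataset(1000, seed=2)
```

Using a separate generator object per dataset keeps the train and test draws reproducible regardless of any other random calls made in the same session.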

2. [10 Marks] Consider the following variant of the cross-validation procedure.

(i) Using the available data, find a subset of “good” predictors that show correlation with the response variable.

(ii) Using these predictors, construct a model (for regression or classification).

(iii) Use cross-validation to estimate the model prediction error.

Is this a good method? Do you expect to obtain the true prediction error? Explain your answer. Please note that no coding is required here and a one-paragraph general answer is sufficient.


3. [5 Marks] Suppose that we observe $X_1, \ldots, X_n \sim F$. We model $F$ as a Gamma distribution with shape parameter $\alpha > 0$ and rate parameter $\beta > 0$. For this problem, determine the hypothesis class

$$H = \{f(x, \theta);\ \theta \in \Theta\},$$

and state explicitly what $\theta$ and $\Theta$ are.
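Note that Question 1 describes the Gamma distribution by shape and scale, whereas this question uses shape and rate (the rate is the reciprocal of the scale). As a small illustration of the rate parameterisation, the standard Gamma density can be evaluated as below; the function name `gamma_pdf` is hypothetical and the sketch uses only the standard library.

```python
import math


def gamma_pdf(x, shape, rate):
    """Density of a Gamma(shape, rate) distribution at x (rate parameterisation)."""
    if shape <= 0 or rate <= 0:
        raise ValueError("shape and rate must be strictly positive")
    if x <= 0:
        return 0.0  # the support of the distribution is x > 0
    # rate^shape * x^(shape - 1) * exp(-rate * x) / Gamma(shape)
    return rate ** shape * x ** (shape - 1) * math.exp(-rate * x) / math.gamma(shape)
```

With shape 1 and rate 1 this reduces to the standard exponential density, which is the Gamma(1, 1) case used in Question 1.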

4. [15 Marks] Let $H$ be a class of binary classifiers over a set $Z$. Let $D$ be an unknown distribution over $X$, and let $g$ be a target hypothesis in $H$. Show that the expected value of $\mathrm{Loss}_T(g)$ over the choice of $T$ equals $\mathrm{Loss}_D(g)$, namely,

$$\mathbb{E}_T\, \mathrm{Loss}_T(g) = \mathrm{Loss}_D(g).$$

5. [15 Marks (see details below)] Consider the following dataset.

x1    y
0     1
1     2
2     3
3     2
4     1

Now, suppose that we would like to consider two models,

Model1: $y = \beta_0 + \varepsilon$,

and

Model2: $y = \beta_1 x_1 + \varepsilon$,

where $\varepsilon \sim N(0, 1)$. That is, we consider two linear models: Model1 is the constant model, and Model2 is a regular linear model without the intercept.

(a) [5 Marks] Fit these models to the data and write the corresponding coefficients. Namely, fill in the following table:

Model     β0    β1
Model1          0
Model2

(b) [5 Marks] Consider the squared error loss, the absolute error loss, and the $L_{1.5}$ loss. Find the average loss for each model. Namely, fill in the following table:

Model     squared error loss    absolute error loss    L1.5 loss
Model1
Model2

(c) [5 Marks] Draw a conclusion from the obtained results.
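The two fits and the three losses can be sketched with a few helper functions. This sketch rests on two assumptions that the question does not state explicitly: the models are fitted by least squares (which coincides with maximum likelihood under the normal error model), and the $L_{1.5}$ loss is read as $|y - \hat{y}|^{1.5}$. All function names are hypothetical.

```python
def fit_constant(y):
    """Least-squares estimate of beta0 in y = beta0 + eps: the sample mean."""
    return sum(y) / len(y)


def fit_through_origin(x, y):
    """Least-squares estimate of beta1 in y = beta1 * x + eps (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def average_loss(y_true, y_pred, power):
    """Mean of |y - y_hat| ** power over the sample.

    power=2 is the squared error loss, power=1 the absolute error loss,
    and power=1.5 the L1.5 loss (under the assumption stated above).
    """
    return sum(abs(a - b) ** power for a, b in zip(y_true, y_pred)) / len(y_true)
```

To fill in the tables, each fitted coefficient is turned into a list of predictions (a constant list for Model1, `beta1 * x` for Model2) and passed to `average_loss` with the three values of `power`.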

6. [30 Marks (see details below)] Consider the Hitters dataset (given in Hitters.csv). Our objective is to predict a hitter’s salary via linear models.

(a) [5 Marks] Load the dataset and replace all categorical values with numbers. (You can use the LabelEncoder object in Python.)

(b) [5 Marks] Generally, it is better to use OneHotEncoder when dealing with categorical variables. Justify the usage of LabelEncoder in (a).


(c) [20 Marks] Fit a linear regression model and report the 10-fold cross-validation mean squared error.
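The mechanics of a k-fold cross-validation estimate of the mean squared error can be sketched as below. This is a generic, standard-library illustration rather than a solution for the Hitters data: `k_fold_indices` and `cross_val_mse` are hypothetical names, and `fit` and `predict` stand for whatever model-fitting and prediction routines are used.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and split into k folds whose sizes differ by at most one."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_mse(X, y, fit, predict, k=10, seed=0):
    """Average of the per-fold test mean squared errors.

    fit(X_train, y_train) returns a model; predict(model, x) returns a
    prediction for a single row x.
    """
    fold_mse = []
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        # The model only ever sees the k - 1 training folds.
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        errors = [(y[i] - predict(model, X[i])) ** 2 for i in test_idx]
        fold_mse.append(sum(errors) / len(errors))
    return sum(fold_mse) / k
```

In scikit-learn, the same loop is available through `cross_val_score` with `cv=10` and `scoring="neg_mean_squared_error"`; note that it returns negated MSE values, so the sign has to be flipped before reporting.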

7. [10 Marks] Consider the function

$$f(x) = 3 + x^2 - 2\sin(x), \quad 1 \leq x \leq 8.$$

Write a Crude Monte Carlo algorithm for the estimation of

$$\ell = \int_1^8 f(x)\, dx,$$

using a sample size of $N = 10000$. Deliver the 95% confidence interval. Compare the obtained estimate with the true value $\ell$.
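A generic Crude Monte Carlo estimator with a normal-approximation confidence interval can be sketched as follows. The function name `crude_monte_carlo` is hypothetical, the critical value 1.96 assumes the central limit theorem applies at this sample size, and the toy integrand is deliberately not the $f$ of this question.

```python
import math
import random


def crude_monte_carlo(f, a, b, n=10_000, seed=0):
    """Estimate the integral of f over [a, b] and return (estimate, 95% CI)."""
    rng = random.Random(seed)
    # With U ~ Uniform(a, b), (b - a) * f(U) is an unbiased estimate of the integral.
    samples = [(b - a) * f(rng.uniform(a, b)) for _ in range(n)]
    estimate = sum(samples) / n
    # Sample variance of the individual draws (n - 1 in the denominator).
    variance = sum((s - estimate) ** 2 for s in samples) / (n - 1)
    half_width = 1.96 * math.sqrt(variance / n)
    return estimate, (estimate - half_width, estimate + half_width)


# Toy integrand: the exact integral of x**2 over [0, 1] is 1/3.
estimate, (lower, upper) = crude_monte_carlo(lambda x: x ** 2, 0.0, 1.0)
```

The interval width shrinks at the rate $1/\sqrt{N}$, so quadrupling the sample size roughly halves it.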
