QUIZ 1: BIG DATA

Please generate a single PDF file using R Markdown.

You may either knit directly to PDF or create an HTML document and convert it to PDF.

Once completed, submit the PDF via Turnitin on the course webpage.

Caution: Do not set a seed. If you do, no credit will be given for this quiz. The same penalty applies if you do not use R Markdown to generate a single document. When a word limit is specified (e.g., 50 words), do not exceed it; otherwise, no credit will be given. You may count words at https://wordcounter.net/.

Total: 10 marks (1 mark each).

1. Import the dataset Carseats from the R package ISLR2. You can view information about the data by typing

> ??ISLR2::Carseats

This will display a one-line description of the dataset, along with the sample size and number of variables. Reprint the one-line description in your answer (1 mark).

2. In the command window, type

> data("Carseats")

Explain in 20 words what you see in the Environment tab of RStudio. Specifically, how many observations and variables are in the Carseats dataset? Hint: You may need to click the dataset name to view details.

3. Create a new binary variable, HighSales, indicating whether Sales is at or above its median value. Try:

> HighSales <- Carseats$Sales >= median(Carseats$Sales)

Explain in 20 words what changed in the Environment tab of RStudio.

4. Remove HighSales, which you created in Q3, using rm(HighSales), and run the following code.

> Carseats$HighSales <- Carseats$Sales >= median(Carseats$Sales)

Explain in 20 words what the code above produces, referring to the Environment tab of RStudio.

5. Additionally, each observation is assigned to the training set with 75% probability and to the test set with 25% probability. The following three lines of R code perform this sample split.

> train <- runif(N) <= 0.75 # N is the number of observations; you should find N.

> Carseats.train <- Carseats[train, ] # training sample

> Carseats.test <- Carseats[!train, ] # test sample

Use the R function length() to print the numbers of observations in the training and test samples.
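
As a minimal sketch of the split described in Q5, the block below uses the built-in iris data as a stand-in for Carseats (an assumption, so that it runs without ISLR2); the names dat, dat.train, dat.test, n.train, and n.test are illustrative only. Note that length() applied to a data frame counts columns, so the sketch applies it to a single column to count observations.

```r
# Stand-in data: iris ships with base R (assumption: used in place of Carseats)
dat <- iris
N <- nrow(dat)                       # number of observations

# Each row is assigned to the training set with probability 0.75
train <- runif(N) <= 0.75            # logical vector of length N

dat.train <- dat[train, ]            # training sample
dat.test  <- dat[!train, ]           # test sample

# length() on a data frame returns the number of columns, not rows,
# so it is applied to one column here to count observations
n.train <- length(dat.train$Sepal.Length)
n.test  <- length(dat.test$Sepal.Length)

print(c(train = n.train, test = n.test))
```

Because no seed is set, the two counts change from run to run, but they always sum to N.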

6. Let X1 := Price. Select two additional predictors, denoted (X2, X3) (do not select Sales or HighSales). Using the training sample from Q5, regress Sales on each Xj, j = 1, 2, 3, separately. Display the results in a figure with three horizontally arranged plots, similar to the example below. Ensure that the axes are labeled with appropriate variable names rather than R symbols or generic labels like Xj.

7. Using the training sample from Q5, regress Sales on (X1, X2, X3), the three predictors selected in Q6. Predict Sales at the median values of (X1, X2, X3) and compute a 95% confidence interval. Then, repeat the prediction to compute a 95% prediction interval.

Hint: median(A) computes the median of variable A.
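
The difference between the two intervals in Q7 can be sketched with predict() and its interval argument. The block below is only an illustration on the built-in iris data, fitted on the full data rather than a training sample (both are assumptions made for brevity); the names fit, new, conf.int, and pred.int are hypothetical.

```r
# Stand-in regression on iris (assumption: replaces Sales ~ X1 + X2 + X3)
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)

# One new observation with every predictor at its median value
new <- data.frame(
  Sepal.Width  = median(iris$Sepal.Width),
  Petal.Length = median(iris$Petal.Length),
  Petal.Width  = median(iris$Petal.Width)
)

# 95% confidence interval for the mean response at 'new'
conf.int <- predict(fit, newdata = new, interval = "confidence", level = 0.95)

# 95% prediction interval for a single new response at 'new'
pred.int <- predict(fit, newdata = new, interval = "prediction", level = 0.95)

print(conf.int)
print(pred.int)
```

Both calls return the same point prediction; the prediction interval is wider because it also accounts for the noise in an individual observation.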

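8. Using the predictions from Q7, compute the mean squared error (MSE) for the test sample created in Q5. The test MSE is given as

$$\mathrm{MSE}_{\mathrm{test}} = \frac{1}{N_{\mathrm{test}}} \sum_{i=1}^{N_{\mathrm{test}}} \left(\mathrm{Sales}_i - \widehat{\mathrm{Sales}}_i\right)^2,$$

where $N_{\mathrm{test}}$ and $\widehat{\mathrm{Sales}}_i$ are, respectively, the sample size of the test sample and the predicted value of Sales for observation i in the test sample.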

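Hint: The mean squared error (MSE) on the training sample is

$$\mathrm{MSE}_{\mathrm{train}} = \frac{1}{N_{\mathrm{train}}} \sum_{i=1}^{N_{\mathrm{train}}} \left(\mathrm{Sales}_i - \widehat{\mathrm{Sales}}_i\right)^2,$$

where $N_{\mathrm{train}}$ is the sample size of the training sample and $\widehat{\mathrm{Sales}}_i$ is the predicted value of Sales for observation i in the training sample. There are several ways of computing $\mathrm{MSE}_{\mathrm{train}}$ in R, including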

> mean(summary(fit)$residuals^2)

> mean((Carseats.train$Sales - predict(fit, Carseats.train))^2)
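
The two training-MSE expressions in the hint, and the analogous test-MSE computation, can be sketched as follows. This is a minimal illustration on the built-in iris data, not on Carseats (an assumption so that the block runs without ISLR2); dat, fit, and the mse.* names are hypothetical.

```r
# Stand-in data and random 75/25 split (assumption: iris replaces Carseats)
dat <- iris
N <- nrow(dat)
train <- runif(N) <= 0.75
dat.train <- dat[train, ]
dat.test  <- dat[!train, ]

# Linear regression fitted on the training sample only
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
          data = dat.train)

# Training MSE, computed in the two ways given in the hint
mse.train.a <- mean(summary(fit)$residuals^2)
mse.train.b <- mean((dat.train$Sepal.Length - predict(fit, dat.train))^2)

# Test MSE: same formula, but with predictions on the held-out rows
mse.test <- mean((dat.test$Sepal.Length - predict(fit, dat.test))^2)

print(c(train.a = mse.train.a, train.b = mse.train.b, test = mse.test))
```

The two training values agree up to floating-point error, while the test value differs because those rows were not used to fit the model.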

9. Compute the training error rate of a logistic regression for the qualitative variable HighSales. Using the training set from Q5, fit a logistic model to predict HighSales using the predictors (X1, X2, X3) selected above. Let = 1 if