
Applied Stats and Data Analysis EN.553.413-613, Spring 2024 Feb 21, 2024 Exam 1

Question 1 (18 pts). The following TRUE/FALSE questions concern the Simple Linear Regression model $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, with $E(\varepsilon_i) = 0$, $\mathrm{Var}(\varepsilon_i) = \sigma^2$, and $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$.

(a) TRUE or FALSE. For the least squares estimates $b_0$, $b_1$ we require the errors to be normally distributed.
(b) TRUE or FALSE. The estimated mean of the response variable at $X_i$ is defined as $b_0 + b_1 X_i$.
(c) TRUE or FALSE. One of the Gauss-Markov conditions is $\sum_{i=1}^{n} e_i = 0$.
(d) TRUE or FALSE. Plotting $e_i^2$ vs $\hat{Y}_i$ is one of the diagnostic plots.
(e) TRUE or FALSE. A QQ plot of the $Y_i$'s is one of the diagnostic plots.
(f) TRUE or FALSE. Low $R^2$ means that $X$ and $Y$ are not related.
(g) TRUE or FALSE. The $s^2$ is an estimate of the variance of $Y_i$.
(h) TRUE or FALSE. The coefficient of simple determination $R^2$ measures the proportion of the explained variation in $Y$ over the unexplained variation in $Y$.
(i) TRUE or FALSE. In the Correlation model of the regression, the $X_i$'s are random variables.

Question 2 (18 pts). Let $X, Y, Z \sim$ iid $N(0, 1)$, i.e. they are independent, identically distributed standard normal random variables. For the following random variables state whether they follow a normal distribution, a $t$-distribution, a $\chi^2$ distribution, an $F$ distribution, or none of the above. State relevant parameters (e.g. degrees of freedom, and means and variances for normal RVs).

(a) $3Y - Z$
(b) $X + Y + Z$
(c) $X^2 + Y^2 + Z^2$
(d) $\dfrac{X^2 + Y^2}{2Z^2}$
(e) $\dfrac{X^2}{\sqrt{Y^2 + Z^2}}$
(f) $\dfrac{(X + Y)^2}{2}$
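A simulation can serve as a sanity check on answers to the first three parts. The sketch below relies on two standard facts: a linear combination of independent normals is normal, with variance equal to the sum of the squared coefficients times the component variances, and a sum of $k$ independent squared standard normals is $\chi^2(k)$ with mean $k$ and variance $2k$. The seed, the number of draws, and all variable names are illustrative choices, not part of the exam.

```python
import random
import statistics

random.seed(0)   # fixed seed so the sketch is reproducible (arbitrary choice)
N = 100_000      # number of simulated draws (arbitrary choice)

a_vals, b_vals, c_vals = [], [], []
for _ in range(N):
    # Three independent standard normal draws.
    x, y, z = (random.gauss(0.0, 1.0) for _ in range(3))
    a_vals.append(3 * y - z)                # linear combination of normals
    b_vals.append(x + y + z)                # sum of three standard normals
    c_vals.append(x * x + y * y + z * z)    # sum of three squared standard normals

var_a = statistics.pvariance(a_vals)
var_b = statistics.pvariance(b_vals)
mean_c = statistics.fmean(c_vals)
var_c = statistics.pvariance(c_vals)

print(f"Var(3Y - Z)         ~ {var_a:.3f}  (theory: 3^2 + 1^2 = 10)")
print(f"Var(X + Y + Z)      ~ {var_b:.3f}  (theory: 1 + 1 + 1 = 3)")
print(f"E(X^2 + Y^2 + Z^2)  ~ {mean_c:.3f}  (theory: chi-square(3) mean = 3)")
print(f"Var(X^2 + Y^2 + Z^2) ~ {var_c:.3f}  (theory: chi-square(3) variance = 6)")
```

Matching moments does not prove a distributional claim, but a mismatch is a quick signal that a proposed answer is wrong.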

Question 3 (20 pts). Suppose a data set $\{(X_i, Y_i) : 1 \le i \le n\}$ is fit to a linear model of the form $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, where the $\varepsilon_i$ are independent, mean zero, and normal with common variance $\sigma^2$. Here we treat $Y$ as the response variable and $X$ as the predictor variable. The output of the lm function is given. Some values are hidden by ‘XXXXX’. We provide you with an additional value: $\bar{X} = 1.11$.

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   1.9412     0.4593   4.226 0.000508 ***
    x             0.7042     0.3697   1.905 0.072911 .
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.9221 on 18 degrees of freedom
    Multiple R-squared: 0.1678, Adjusted R-squared: -----
    F-statistic: XXXXX on XX and XX DF, p-value: XXXXXX

(a) (2 points) How many data points are there (what is $n$, the sample size)? What is the estimated mean of the response variable $Y$ at $X_h = 2$ for this dataset?
(b) (3 points) Based on all of this output, do you reject $H_0 : \beta_1 = 0$ in favour of $H_a : \beta_1 \neq 0$ at significance level $\alpha = 0.05$? What does the test tell us about the relationship between $X$ and $Y$?
(c) (3 points) Based on all of this output, do you reject $H_0 : \beta_1 = 0$ vs $H_a : \beta_1 > 0$ at significance level $\alpha = 0.05$? Briefly explain why, or why not.
(d) (4 points) The degrees of freedom, the p-value and the value of the F statistic are hidden. Is it possible to reconstruct all of them based on the data shown? Recover as many values as you can.
(e) (4 points) Based on the data above find SSTO, SSR and SSE. Hint: the residual standard error may be useful here.
(f) (4 points) Find the 95% confidence interval for the mean of the response function at $X_h = 2$. Write your answer in the form $A \pm B \cdot t(C, D)$ and specify the values $A, B, C, D$ as precisely as you can (i.e. find the values of as many terms as you can).
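The hidden quantities in the lm output can be recovered with standard simple linear regression identities: $n = \text{df}_E + 2$, $F = t^2$ on $(1, n-2)$ degrees of freedom, $SSE = s^2 (n-2)$, $R^2 = 1 - SSE/SSTO$, $s^2\{b_1\} = MSE / S_{xx}$, and $s^2\{\hat{Y}_h\} = MSE\,(1/n + (X_h - \bar{X})^2 / S_{xx})$. The following is a rough sketch of that arithmetic using only the numbers printed in the output; the variable names are illustrative, and results carry the rounding of the printed values.

```python
import math

# Values read from the lm output quoted in Question 3.
b0, b1 = 1.9412, 0.7042     # coefficient estimates
se_b1 = 0.3697              # standard error of the slope
t_b1 = 1.905                # t value of the slope
resid_se = 0.9221           # residual standard error s
df_error = 18               # error degrees of freedom, n - 2
r_squared = 0.1678          # multiple R-squared
x_bar = 1.11                # sample mean of X, given in the question
x_h = 2.0                   # point at which the mean response is estimated

n = df_error + 2                      # two parameters are estimated in SLR
y_hat_h = b0 + b1 * x_h               # estimated mean response at X_h
f_stat = t_b1 ** 2                    # F = t^2, with (1, n - 2) degrees of freedom
mse = resid_se ** 2                   # MSE = s^2
sse = mse * df_error                  # SSE = MSE * (n - 2)
ssto = sse / (1 - r_squared)          # from R^2 = 1 - SSE / SSTO
ssr = ssto - sse                      # SSTO = SSR + SSE
sxx = mse / se_b1 ** 2                # from s^2{b1} = MSE / Sxx
se_mean_h = math.sqrt(mse * (1 / n + (x_h - x_bar) ** 2 / sxx))

print(f"n = {n}, fitted mean at X_h = 2: {y_hat_h:.4f}")
print(f"F = {f_stat:.3f} on 1 and {df_error} DF")
print(f"SSE = {sse:.3f}, SSTO = {ssto:.3f}, SSR = {ssr:.3f}")
print(f"CI for the mean at X_h: {y_hat_h:.4f} +/- {se_mean_h:.4f} * t(0.975, {df_error})")
```

Because $F = t^2$ in simple linear regression, the p-value of the F test equals the two-sided p-value already printed for the slope, so no table lookup is needed for it.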

Question 4 (14 pts). Consider the following diagnostic plots for two models (Model 1 and Model 2). Two simple linear regression models $Y = \beta_0 + \beta_1 X + \varepsilon$ are fitted to two different datasets of $(X, Y)$ observations, one for each model. For each model 3 diagnostic plots are shown: a plot of $Y_i$ vs $X_i$, a plot of the semi-studentized residuals $e_i^*$ versus the fitted values $\hat{Y}_i$, and a QQ-plot of the semi-studentized residuals $e_i^*$.

(a) (5 points) What main issue, if any, do you diagnose with Model 1? Why? Which plot was the most useful in diagnosing this problem? Be as specific in describing the issue as you can.
(b) (5 points) What main issue, if any, do you diagnose with Model 2? Why? Which plot was the most useful in diagnosing this problem? Be as specific in describing the issue as you can.
(c) (4 points) This question is unrelated to the above plots. Explain in what cases a transformation of the predictor variable $X$ is more appropriate than a transformation of the response variable $Y$.

Question 5 (20 points). For a dataset of $n = 200$ observations, a simple linear regression model $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ is fit. The following estimates are obtained: $b_0 = 2$, $b_1 = 1$. We have listed additional information here.

(a) (2 points) What is the estimated variance $s^2$ of the error term based on the data above?
(b) (3 points) Find a 90% confidence interval for $\beta_1$. Write it in the form $A \pm B \cdot t(C, D)$ and compute the values of $A, B, C, D$ if possible.
(c) (4 points) Find the joint confidence intervals with confidence at least 90% for $\beta_0, \beta_1$ in the form $A_i \pm B_i \cdot t(C_i, D_i)$. Compute the values of $A_i, B_i, C_i, D_i$ if possible. Without any computation, how does the interval for $\beta_1$ in this part compare to the one in part (b)?
(d) (4 points) Find the joint confidence intervals using the Bonferroni procedure with confidence at least 90% for the mean of the response variable at $X_h = 2$ and $X_{h'} = 0$. Find them in the form $A_i \pm B_i \cdot t(C_i, D_i)$.
(e) (4 points) Set up a General Linear Test for the data provided: specify the reduced and full models, compute the value of the F-statistic, and specify its distribution under the null hypothesis.
(f) (3 points) An Aspiring Data Scientist (ADS) noticed that one of the observed data points $(X_i, Y_i) = (2, 15)$ lies outside of the 99% Working-Hotelling band (we assume everything was computed correctly). They claim it is an issue. Briefly justify whether their concern is correct or not.
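The only parts of the $t(C, D)$ multipliers that can be pinned down from $n = 200$ alone are the probability level and the degrees of freedom. For a single two-sided interval with confidence $1 - \alpha$ the level is $1 - \alpha/2$; under the Bonferroni procedure with $g$ simultaneous intervals it becomes $1 - \alpha/(2g)$; in both cases the degrees of freedom are $n - 2$. A small sketch of that bookkeeping follows, where `t_quantile_level` is a hypothetical helper name introduced only for illustration.

```python
def t_quantile_level(confidence, g=1):
    """Probability level p in t(p; df) for g two-sided intervals.

    With g = 1 this is the usual 1 - alpha/2; with g > 1 it is the
    Bonferroni level 1 - alpha/(2g) for family confidence `confidence`.
    """
    alpha = 1 - confidence
    return 1 - alpha / (2 * g)

n = 200
df = n - 2                                 # error degrees of freedom in SLR

single_level = t_quantile_level(0.90)       # one 90% interval for beta_1
joint_level = t_quantile_level(0.90, g=2)   # two joint intervals, family confidence >= 90%

print(f"single interval: t({single_level:.3f}, {df})")
print(f"Bonferroni, g=2: t({joint_level:.3f}, {df})")
```

The Bonferroni level is larger than the single-interval level, so the corresponding multiplier is larger and each joint interval is wider than the individual one at the same nominal confidence.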

Question 6 (20 points). Suppose $Y_i$ follows the model $Y_i = \beta X_i + \varepsilon_i$, where the $\varepsilon_i$ are independent, identically distributed $N(0, \sigma^2)$. Note, there is no intercept term. You observe a collection $\{(X_i, Y_i)\}$ of data from this model, $i = 1, \ldots, n$.

(a) (5 points) Write the objective function to be minimized and the equations that need to be solved to get the least squares estimate of $\beta$.
(b) (5 points) Solve the equation in (a) and express the answer as a linear combination of the $Y_i$'s.
(c) (5 points) What is the distribution of $b$? Find the mean and variance. Justify your steps.
(d) (5 points) Write the log-likelihood that needs to be maximized to obtain the estimate of $\beta$. DO NOT MAXIMIZE IT.

(a) Function to be minimized
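For the no-intercept model, the standard least squares criterion is $Q(\beta) = \sum_{i=1}^{n} (Y_i - \beta X_i)^2$, and setting its derivative to zero gives the normal equation $\sum_{i=1}^{n} X_i (Y_i - b X_i) = 0$, so $b = \sum X_i Y_i / \sum X_i^2 = \sum k_i Y_i$ with $k_i = X_i / \sum_j X_j^2$. The sketch below checks these identities numerically. The four data points are made up for illustration and do not come from the exam, and the function names are hypothetical.

```python
def slope_through_origin(xs, ys):
    """Least squares slope for the model Y = beta * X + error (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / sxx

def sse(beta, xs, ys):
    """Objective Q(beta): sum of squared deviations from the line through the origin."""
    return sum((y - beta * x) ** 2 for x, y in zip(xs, ys))

# Made-up illustrative data (an assumption, not exam data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

b = slope_through_origin(xs, ys)

# The same estimate written as a linear combination of the Y_i.
sxx = sum(x * x for x in xs)
weights = [x / sxx for x in xs]
b_linear = sum(k * y for k, y in zip(weights, ys))

# Normal equation: residuals are orthogonal to the predictor.
normal_eq = sum(x * (y - b * x) for x, y in zip(xs, ys))

print(f"b = {b:.4f}, as a linear combination: {b_linear:.4f}")
print(f"sum of X_i * e_i = {normal_eq:.2e}")
print(f"Q(b) = {sse(b, xs, ys):.4f}, Q(b + 0.01) = {sse(b + 0.01, xs, ys):.4f}")
```

Writing $b$ as $\sum k_i Y_i$ with fixed weights is what makes the mean and variance of $b$ straightforward to derive, since the $Y_i$ are independent normals.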

