STA302/1001 - Final Assignment

Due Wednesday April 22 by 11:59PM EST on Crowdmark

Student Name:

Student Email:

Instructions:

This final assignment must be completed individually. Any sharing or discussion of questions or answers with other students will be considered an academic integrity offence. To ensure that all students understand the consequences of violating academic integrity, you must upload the attached academic integrity acknowledgement (on page 2), signed both at the beginning and at the completion of this assignment. In the event of a suspected integrity violation, this will serve as evidence that the student knowingly committed an act of academic misconduct. Please ensure that you read and understand what constitutes academic misconduct, as well as the consequences of committing it.

Assignments must be submitted electronically through Crowdmark. Each student will receive a personalized link to view the assignment (this is also where you will submit your assignment when finished). If you do not receive this email from Crowdmark, check your spam/junk folder. Instructions for uploading completed assignments can be found here: https://crowdmark.com/help/completing-and-submitting-an-assignment/. Note that Crowdmark accepts only PDF, PNG or JPG files. Each question has its own upload location, so make sure you submit every page in the right place.

The assignment is divided into five questions. Each question needs to be uploaded under the correct section in Crowdmark, otherwise it may be overlooked when graded. For questions that require hand calculations or proofs/derivations, you must show all your work. You may submit handwritten answers for these questions, but they must be legible and neat. For questions involving R, you must provide an appendix that contains all the R code used to complete the question. We need to be able to verify that your answers can actually be produced from your code. Please do not include R code or unnecessary R output in your solutions; these belong in the appendix.

Because this assignment replaces the final exam and keeps the due date of the original final exam, NO EXTENSIONS WILL BE GRANTED. If you have not submitted by April 22 at 11:59PM EST, you will receive a grade of zero. To ensure that you submit on time, please start the submission process early, especially if you have unreliable internet access.

Academic Integrity Acknowledgement Form

Academic integrity is a fundamental value of learning and scholarship at the UofT. Participating honestly, respectfully, responsibly, and fairly in this academic community ensures that your UofT degree is valued and respected as a true signifier of your individual academic achievement.

Prior to beginning this final assignment, you must attest that you will follow the Code of Behaviour on Academic Matters and will not commit academic misconduct in the completion of this assessment. Affirm your agreement to this by completing the following Statement:

By signing this Statement, I, ____________________, agree to abide fully by the Code of Behaviour on Academic Matters. I will not commit academic misconduct and am aware of the penalties that may be imposed if I commit an academic offence.

The University of Toronto's Code of Behaviour on Academic Matters outlines the behaviours that constitute academic misconduct, the processes for addressing academic offences, and the penalties that may be imposed. You are expected to be familiar with the contents of this document.

Potential offences include, but are not limited to:

• Using someone else's ideas or words without appropriate acknowledgement (this includes internet sources and textbooks).
• Submitting your own work in more than one course without the permission of the instructor.
• Making up sources or facts.
• Obtaining or providing unauthorized assistance on any assignment (this includes working in groups on assignments that are supposed to be individual work).
• Looking at someone else's answers, or working together to answer questions.
• Letting someone else look at your answers.
• Misrepresenting your identity or having someone else complete your test or exam.

All suspected cases of academic dishonesty will be investigated following the procedures outlined in the Code of Behaviour on Academic Matters.

Please sign the Statement below to complete your assessment.

By signing this Statement, I, ____________________, am attesting to the fact that I have abided fully by the Code of Behaviour on Academic Matters. I have not committed academic misconduct, and am aware of the penalties that may be imposed if I have committed an academic offence.

Question 1 (12 points) - This question must be done by hand (but may be typed for submission)

Consider a study design in which we have collected multiple response measurements at each value of the predictor. Suppose we have $n_i$ observed responses at each value $x_i$ of the predictor, indexed by $i = 1, \dots, m$, and let $y_{ij}$ denote the $j$-th observation on the response, $j = 1, \dots, n_i$, for the $i$-th value of the predictor. This means we have $m$ unique predictor values, and $n_i$ response measurements for the $i$-th of these values. In this situation, it is possible to construct a test for how poorly the regression line captures the relationship between the response and the predictor.

(a) (4 points) Consider the traditional variance decomposition of a simple regression model: SST = SSReg + RSS. Show that we can further decompose the residual sum of squares into

• the pure error (i.e. deviations of the individual responses from the average response at each unique value of the predictor), denoted by SSPure, and

• the lack of fit error (i.e. deviations of the average response at each x value from the regression line), denoted by SSLack.

(b) (1 point) Determine the degrees of freedom for the pure error and the lack of fit error.

(c) (3 points) Determine the expected values of the mean squares of the pure error (MSPure) and the lack of fit error (MSLack). You may assume that model assumptions are satisfied.

(d) (2 points) The test statistic for this test is

$$F = \frac{\text{MSLack}}{\text{MSPure}}.$$

Explain why this should follow an F distribution.

(e) (2 points) Based on the test statistic in (d) and the expected values in (c), explain why a large value of the test statistic implies that the true regression function is not linear, and thus the fit of our regression model is poor.
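The quantities in this question can be checked numerically before attempting the algebra. The following is a minimal sketch in Python (used here only for illustration; any language would do) on made-up data with replicated responses at each predictor value. The data and all variable names are hypothetical. It only confirms that the residual sum of squares splits into the two pieces described in (a); it is not a substitute for the derivation.

```python
# Hypothetical data: m = 4 distinct predictor values, each with replicated responses.
data = {
    1.0: [2.1, 1.9, 2.3],
    2.0: [3.8, 4.1],
    3.0: [6.2, 5.9, 6.1],
    4.0: [7.0, 7.4],
}

# Flatten to one (x, y) pair per observation.
xs = [x for x, group in data.items() for _ in group]
ys = [y for group in data.values() for y in group]
n = len(ys)

# Ordinary least squares fit of y on x.
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residual sum of squares around the fitted line.
rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

ss_pure = 0.0  # responses around their own group mean
ss_lack = 0.0  # group means around the fitted line, weighted by group size
for x, group in data.items():
    group_mean = sum(group) / len(group)
    ss_pure += sum((y - group_mean) ** 2 for y in group)
    ss_lack += len(group) * (group_mean - (b0 + b1 * x)) ** 2

print(f"RSS             = {rss:.6f}")
print(f"SSPure + SSLack = {ss_pure + ss_lack:.6f}")
```

The two printed values agree up to floating-point rounding, whatever replicated data are supplied.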

Question 2 (15 points) - This question must be done by hand (but may be typed for submission)

A study was run to compare the effect of three different drugs on reducing the pain caused by a particular condition. The drugs are labelled A, B, and C, and the response of interest is a pain scale rating (integer-valued), where higher values imply more pain. The goal of the study was to determine whether there is a difference in the average pain rating among the three drug treatments. We can answer this question using multiple linear regression methods. The data can be found below:

Drug A   4 5 4 3 2 4 3 4 4
Drug B   6 8 4 5 4 6 5 8 6
Drug C   6 7 6 6 7 5 6 5 5

(a) (2 points) Show that we can represent the three treatments/drugs in the form of two indicator variables. Why don't we require the use of a third indicator variable?

(b) (2 points) Find the $X'X$ and $X'Y$ matrices for these data.

(c) (3 points) Estimate the regression coefficients for a multiple linear regression model relating the pain response Y to the three drugs, X.

(d) (3 points) Show that the above regression model can be re-expressed as

$$y_{ij} = \mu + \tau_i + \epsilon_{ij}$$

where $\mu$ is the overall average pain rating, $\tau_i$ is the average pain rating for drug $i$, $\epsilon_{ij}$ is the random error in the pain rating for individual $j$ and drug $i$, and $y_{ij}$ is the pain rating for individual $j$ on drug $i$.

(e) (5 points) Perform an appropriate hypothesis test using your model from (c) to determine whether the average pain ratings for each drug are equal (i.e. $\tau_i = 0$ for all $i$). Use a significance level $\alpha = 0.05$ and the residual standard error of 1.089.
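To see how indicator coding feeds into the normal equations, here is a minimal Python sketch on a small made-up data set. It deliberately does not use the drug data above. The group labels, the responses, and the helper `solve` are all hypothetical. One group serves as the baseline, the other two get an indicator column each, and the coefficients come from solving $X'X\beta = X'Y$.

```python
# Hypothetical responses for three treatment groups (not the assignment data).
groups = {"P": [1.0, 2.0, 3.0], "Q": [4.0, 6.0], "R": [7.0, 8.0, 9.0]}

# Design matrix rows: intercept, indicator for Q, indicator for R; P is the baseline.
X, y = [], []
for label, values in groups.items():
    for v in values:
        X.append([1.0, 1.0 if label == "Q" else 0.0, 1.0 if label == "R" else 0.0])
        y.append(v)

p = 3
xtx = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]


def solve(a, b):
    """Solve the linear system a * beta = b by Gauss-Jordan elimination."""
    k = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(k):
        # Partial pivoting: bring the largest remaining entry into the pivot row.
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]


beta = solve(xtx, xty)
means = {g: sum(v) / len(v) for g, v in groups.items()}

print("X'X =", xtx)
print("X'Y =", xty)
print("beta =", beta)
print("group means =", means)
```

In this toy example the intercept equals the baseline group's mean, and each indicator coefficient equals the difference between that group's mean and the baseline mean.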

Question 3 (8 points) - This question must be done by hand (but may be typed for submission)

For each of the parts below, please provide a concise (up to three sentences) but detailed explanation of the concept in question. Make sure you answer in your own words.

(a) (2 points) Suppose we have the following correlations between a response variable and two predictor variables. Explain which predictor the forward selection method would add to the model first. Would the method then add the second predictor variable? Why or why not?

        Y       X1      X2
Y       1       0.93    -0.99
X1      0.93    1       0.985
X2      -0.99   0.985   1

(b) (2 points) Explain how violations in the model assumptions affect the ANOVA test of overall significance in simple linear regression.

(c) (2 points) In the event that condition 1 or 2 fails, explain why we are unable to use the specific patterns seen in the residual plots to tell us in what way the model assumptions are violated.

(d) (2 points) Explain why, when you have response measurements that are means or medians, using a weight equal to the number of observations used to create that value can correct for violations of constant variance.
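The weighting scheme in part (d) can be explored with a small numerical experiment. The sketch below is written in Python on made-up data. The data and the helper `wls_line` are hypothetical. It compares three fits: ordinary least squares on every raw observation, weighted least squares on the group means with weights equal to the group sizes, and unweighted least squares on the group means. It illustrates the behaviour only and is not the written explanation the question asks for.

```python
# Hypothetical raw observations, grouped by predictor value (group sizes differ).
raw = {1.0: [2.0, 2.4, 1.8, 2.2], 2.0: [3.9, 4.3], 3.0: [6.1, 5.7, 6.0], 4.0: [8.2]}


def wls_line(x, y, w):
    """Weighted least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Fit 1: ordinary least squares on every raw observation (all weights equal to 1).
x_raw = [x for x, group in raw.items() for _ in group]
y_raw = [y for group in raw.values() for y in group]
fit_raw = wls_line(x_raw, y_raw, [1.0] * len(y_raw))

# Fit 2: weighted least squares on the group means, weight = group size.
x_grp = list(raw.keys())
y_grp = [sum(group) / len(group) for group in raw.values()]
n_grp = [float(len(group)) for group in raw.values()]
fit_wls = wls_line(x_grp, y_grp, n_grp)

# Fit 3: unweighted least squares on the group means, for contrast.
fit_unw = wls_line(x_grp, y_grp, [1.0] * len(y_grp))

print("raw data, OLS       :", fit_raw)
print("group means, weights:", fit_wls)
print("group means, no wts :", fit_unw)
```

The first two fits coincide up to rounding, so weighting each mean by its group size recovers the line that the underlying individual observations would have produced. The unweighted fit on the means treats a mean of four observations and a single observation as equally reliable, and will generally give a different line.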

Question 4 (10 points) - This question must be completed using R

Consider the New York City menu dataset, which can be found on the assignment page on Quercus or attached with this question.

(a) (1 point) Fit a multiple linear regression model to predict Price from the variables Food, Decor, and East. Extract the residuals from this model and save them. What do they represent in the context of this model?

(b) (1 point) Fit a multiple linear regression model to predict Service from Decor, Food and East. Extract the residuals from this model and save them. What do they represent in the context of this model?

(c) (1 point) What can we say about the predictors based on the model from (b)?

(d) (2 points) Plot the residuals saved from part (a) against the residuals saved from part (b). Add a line representing the simple linear regression relationship between these two sets of residuals. What relationship do you see between the two sets of residuals?

(e) (3 points) Compare the relationship in your plot from (d) to a multiple linear regression model predicting Price from the variables Food, Decor, Service and East. What similarities do you see? What does the plot represent and how does it achieve this?

(f) (2 points) How else might this plot be used for diagnostic purposes?
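The residual-on-residual construction in parts (a), (b) and (d) can be tried out on a toy example first. The sketch below is in Python with made-up numbers, whereas the question itself must be completed in R on the menu dataset. The variables `x1`, `x2`, `y` and the helpers `solve`, `design`, `ols` and `residuals` are all hypothetical. Here `y` plays the role of the response, `x1` stands in for the predictors already in the model, and `x2` is the predictor being added. Only one "other" predictor is used, for brevity.

```python
def solve(a, b):
    """Solve the linear system a * beta = b by Gauss-Jordan elimination."""
    k = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]


def design(columns):
    """Design matrix rows: an intercept followed by the given predictor columns."""
    return [[1.0] + list(vals) for vals in zip(*columns)]


def ols(columns, y):
    """Least squares coefficients of y on an intercept plus the given columns."""
    rows = design(columns)
    p = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    return solve(xtx, xty)


def residuals(columns, y):
    """Residuals from the least squares fit of y on the given columns."""
    beta = ols(columns, y)
    return [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(design(columns), y)]


# Hypothetical data.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 9.0]
y = [3.1, 3.9, 7.2, 7.8, 11.1, 11.7, 15.2, 17.0]

full = ols([x1, x2], y)      # coefficients: intercept, x1, x2
e_y = residuals([x1], y)     # part of y not explained by x1
e_x2 = residuals([x1], x2)   # part of x2 not explained by x1

# Both residual vectors have mean zero, so the fitted line passes through the origin.
slope = sum(a * b for a, b in zip(e_x2, e_y)) / sum(a * a for a in e_x2)

print("coefficient of x2 in the full model:", full[2])
print("slope of residuals on residuals    :", slope)
```

The two printed numbers agree up to rounding, which is a useful check to repeat in R when comparing the plot from (d) with the full model in (e).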

Question 5 (20 points) - This question must be completed using R

For this question, you will be using the housing.proper.csv dataset, which can be found on the assignment page on Quercus or attached to this question on Crowdmark. These data consist of the median value of owner-occupied homes (Y) in suburbs of Boston, along with a number of neighbourhood characteristics. The dataset contains 506 observations on 13 covariates. A real estate developer has asked you to build the best possible model to predict the median value of homes in a new subdivision being built. The model must also be interpretable, so that the developer can justify its use to shareholders. The possible predictors for this model include:

• X1 = per capita crime rate by town

• X2 = proportion of residential land zoned for lots over 25000 square feet.

• X3 = proportion of non-retail business acres per town

• X4 = Charles River indicator variable (1 = near river, 0 = far from river)

• X5 = nitric oxide concentration (parts per 10 million)

• X6 = average number of rooms per dwelling

• X7 = proportion of owner occupied units built prior to 1940

• X8 = weighted distance to five Boston employment centres

• X9 = index of accessibility to radial highways

• X10 = full-value property-tax rate

• X11 = pupil-teacher ratio by town

• X12 = 1000(B − 0.63)^2, where B is the proportion of African Americans by town

• X13 = a numeric vector of percentage values of lower status population

You may use any technique shown in class to arrive at your final model, but you must justify every decision you make. You will be asked to interpret your final model, explain how you arrived at this model and defend why you think this is the best possible model. You may use up to 5 plots in your explanations, and each plot must have a reason for being presented. Please do not include too much R output (ideally fewer than 5 outputs), as all your decisions and model diagnostics should be discussed in the text rather than presented with R output. The discussion of your model should be no longer than 500 words. All R code should be placed at the end in an appendix so we can verify your final model and the steps you took to arrive there. Your report, including plots and output, should be no longer than 3 pages, with the appendix attached after. Do not overload your appendix with code or output that is not relevant to the creation of your final model.

Rubric for question 5

Each characteristic is scored as Insufficient (1 point), Adequate (3 points), or Excellent (5 points).

Presentation: Is the report easy to follow, and are graphical/output components used correctly?

• Insufficient (1 point): Justification is far too long or is missing a number of key decisions, may have serious grammatical errors and/or logic of analysis is difficult to follow; plots are not at all referenced and/or do not support the decisions or discussion of model.

• Adequate (3 points): Justification is a bit long-winded and/or is missing a few key steps, may have some grammar errors and/or logic of analysis a bit hard to follow; some plots are not referenced and/or are not quite helpful for the discussion.

• Excellent (5 points): Justification is concise yet thorough, grammatically correct and easy to follow; plots presented are referenced and valuable to the report, and a clear picture of the analysis is given.

Model Building: Were variables chosen in a reasonable way?

• Insufficient (1 point): No or minimal statistical methods were used to select variables in the model and/or were used incorrectly and/or justification for choices not given, or not stated.

• Adequate (3 points): Only statistical reasoning has been used to select variables without reference to context; methods employed have been used correctly, justification for methods may be lacking.

• Excellent (5 points): Both statistical and contextual reasoning has been used to determine variables included in model, and have been used correctly with sufficient justification.

Model Diagnostics: Does the model have the correct properties?

• Insufficient (1 point): None or a minimal amount of diagnostics were made on the final model, and/or no mention of appropriateness of model provided, and/or no corrections made/justifications given.

• Adequate (3 points): Some diagnostics have been performed but some are missing, and/or if problems were detected, little/no justification was given for how they were handled.

• Excellent (5 points): All diagnostics have been performed and model violations have been documented; any other problems with the model have been dealt with and justified appropriately.

Final Model: Is the model meaningful and useful?

• Insufficient (1 point): Model is overly complicated which makes interpretation difficult and/or interpretation of model is incorrect or model is too simplistic and important predictors are missing and/or model not validated and/or limitations of model not stated.

• Adequate (3 points): Model has too many/too few predictors, although relationship to response still reasonable and interpretable, and/or interpretation has some flaws, model has been validated but limitations of model not reported.

• Excellent (5 points): Model shows reasonable relationship to response, is easily interpretable, and has been interpreted correctly and validated, and further limitations of model documented.
