STA221H1 The Practice of Statistics II (LEC0101) Winter 2025
Assignment Instructions
Due Date: March 14th, 2025 at 11:59PM on Crowdmark
Instructions
This is an individual assignment, and you are expected to work on it independently. You are more than welcome to discuss ideas, code, concepts, etc., regarding this assignment with your classmates, but please do not share your code or your written text with your peers. All code and written work must be your own. Please note that this assignment is fairly open-ended, so the content of most of your work should not match that of your peers.
Synopsis of the Assignment
In this assignment, you will use the provided dataset and R to apply the techniques that we have learned in this course so far.
Submission Format and Instructions
Your final submission will be in PDF format and must show (1) R code, (2) R output, and (3) your written answers. Here are some suggested ways you can create your final submission:
• Use Microsoft Word to type out your answers. Screenshot your R output and place these images throughout the document. For the R code, either copy/paste as text or screenshot.
• Use an app like Notability, OneNote, etc., where you can write/type your answers and include screenshots of your R code and output.
• Use RMarkdown and knit to a PDF. Alternatively, you can knit to an HTML file and then save it as a PDF.
How you create the final file is up to you, as long as it is clear and organized. You don’t want the TA to be frustrated while marking your work!
You will submit your solutions on Crowdmark. There will be a different upload box for each question, so it is recommended that you place each question on a separate page.
Late or Missing Submissions
If the assignment is not submitted by the due date, it will be subject to a late penalty of 20% per day. No extensions will be provided for the assignment.
Alternatively, if the assignment is missed due to an illness or personal emergency, please fill out the form listed in the syllabus.
Question 1 (5 marks)
The dataset that you will be using is a sample of Uber and Lyft rides in Boston, MA between Nov. 26, 2018 and Dec. 12, 2018. The response variable is ‘price’. It is a condensed version of the dataset found here: https://www.kaggle.com/datasets/brllrb/uber-and-lyft-dataset-boston-ma/data. Use the dataset posted on Quercus rather than downloading from this website, as the two are different.
Download the dataset from Quercus and read it into R. Write 4-6 sentences introducing your dataset. You should be explaining your chosen dataset to someone who isn’t familiar with it. You should introduce the general topic area and any important variables. (Think of this as an introduction to a report where you analyze this dataset.)
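As a rough illustration of reading a CSV file into R and taking a first look at it, the sketch below may help. It does not use the assignment dataset: the built-in mtcars data is written to a temporary file as a stand-in, and the names csv_path and dat are hypothetical. For the assignment, the path would point to the file downloaded from Quercus instead.

```r
# Illustration only: mtcars (shipped with R) stands in for the course dataset.
# csv_path is a hypothetical temporary file, not the Quercus file.
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Read the CSV file into a data frame.
dat <- read.csv(csv_path)

# First look: variable names and types, the first few rows,
# and the number of missing values in each column.
str(dat)
head(dat)
colSums(is.na(dat))
```

Looking at the variable types and missing values first can make it easier to decide which variables are worth mentioning in a written introduction.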
Question 2 (5 marks)
Fit a multiple linear regression model in R using the data. It is up to you to decide the predictor variables.
Clearly state what your response and predictor variables are. Show the standard R output using the summary() function.
It is not necessary to use a model selection algorithm.
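A minimal sketch of fitting a multiple linear regression and showing the summary() output is given below. It uses the built-in mtcars data as a stand-in, with mpg as the response and wt and hp as arbitrarily chosen predictors; none of these names come from the assignment dataset.

```r
# Illustration only: mtcars (shipped with R) stands in for the course dataset.
data(mtcars)

# Response: mpg. Predictors: wt and hp (chosen arbitrarily for this sketch).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Standard output: coefficient estimates, standard errors, t-tests,
# R-squared, and the overall F-test.
fit_summary <- summary(fit)
print(fit_summary)
```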
Question 3 (15 marks)
Check the assumptions using the residuals. Show the appropriate plots. For each of the 4 assumptions, you should state how well the assumption is satisfied, referencing applicable plots as necessary.
For the purposes of this assignment, even if the assumptions are not met, please proceed with the rest of the assignment.
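The sketch below shows one way to produce residual plots that are often used when assessing regression assumptions. It again assumes the built-in mtcars data and an arbitrary model as a stand-in; which plots are appropriate for each assumption should follow what was covered in the course.

```r
# Illustration only: mtcars and this model stand in for your own data and model.
data(mtcars)
fit <- lm(mpg ~ wt + hp, data = mtcars)

res <- resid(fit)           # residuals
fitted_vals <- fitted(fit)  # fitted (predicted) values

# Residuals versus fitted values, with a reference line at zero.
plot(fitted_vals, res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted values")
abline(h = 0, lty = 2)

# Normal quantile plot of the residuals.
qqnorm(res)
qqline(res)

# Histogram of the residuals.
hist(res, xlab = "Residuals", main = "Histogram of residuals")
```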
Question 4 (6 marks)
For TWO of your predictors, provide an interpretation of the slope. Provide an interpretation of the intercept. Does the intercept have a meaningful interpretation in practice?
Question 5 (10 marks)
State the hypotheses and conclusions for the ANOVA F-test and for the t-test of each slope.
Additionally, interpret the coefficient of determination. (For ease of marking, please show the standard R output using the summary() function in your Question 5 answer – the same one from Question 2.)
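The quantities needed here all appear in the summary() output, and they can also be extracted individually, as in the hedged sketch below. The model and the mtcars data are stand-ins, and the object names (coef_table, f_pvalue, r_squared) are hypothetical.

```r
# Illustration only: mtcars and this model stand in for your own data and model.
data(mtcars)
fit <- lm(mpg ~ wt + hp, data = mtcars)
fit_summary <- summary(fit)

# Coefficient table: estimate, standard error, t statistic, and p-value
# for the intercept and each slope.
coef_table <- coef(fit_summary)
print(coef_table)

# Overall F statistic with its numerator and denominator degrees of freedom.
fstat <- fit_summary$fstatistic
f_pvalue <- unname(pf(fstat["value"], fstat["numdf"], fstat["dendf"],
                      lower.tail = FALSE))
print(fstat)
print(f_pvalue)

# Coefficient of determination (R-squared).
r_squared <- fit_summary$r.squared
print(r_squared)
```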
Question 6 (6 marks)
For a combination of predictor variable values of your choice, calculate the confidence interval and prediction interval. Provide an interpretation for each of the intervals. You are encouraged to choose a combination of predictor values that makes sense and would have a meaningful interpretation.
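As a sketch of how the two intervals can be computed with predict(), consider the example below. It assumes the built-in mtcars data, an arbitrary model, and a hypothetical set of predictor values (new_obs); a 95% level is used here only as an example.

```r
# Illustration only: mtcars and this model stand in for your own data and model.
data(mtcars)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical predictor values; the column names must match the model's predictors.
new_obs <- data.frame(wt = 3, hp = 110)

# Confidence interval for the mean response at these predictor values.
conf_int <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)

# Prediction interval for a single new observation at these predictor values.
pred_int <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)

print(conf_int)
print(pred_int)
```

Both intervals are centred at the same fitted value, and the prediction interval is the wider of the two because it also accounts for the variability of an individual observation around the mean.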
Question 7 (10 marks)
Fit another linear regression model with the same response variable. Choose a model where either the second model’s predictors are a subset of the first model’s predictors OR the first model’s predictors are a subset of the second model’s predictors.
Use 2 methods to determine which model is preferable. Show and explain the results of both of these methods. Do the 2 methods agree?
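The sketch below compares two nested models using a partial F-test and adjusted R-squared. These are only two possible methods, chosen here as an assumption; the course may have covered others. The mtcars data and both models are stand-ins.

```r
# Illustration only: mtcars and these models stand in for your own data and models.
data(mtcars)
reduced <- lm(mpg ~ wt, data = mtcars)       # predictors are a subset of the full model's
full    <- lm(mpg ~ wt + hp, data = mtcars)

# Method 1 (assumed choice): partial F-test comparing the nested models.
anova_table <- anova(reduced, full)
print(anova_table)

# Method 2 (assumed choice): adjusted R-squared for each model.
adj_r2 <- c(reduced = summary(reduced)$adj.r.squared,
            full    = summary(full)$adj.r.squared)
print(adj_r2)
```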