STA475 Assignment #3 (Fall 2024)
Instructions
Due date: Friday November 22 at 11:59pm
• No-questions-asked grace period until Monday November 25 at 5pm
• No late submissions will be accepted after this time
Where to submit:
• Crowdmark: Submit your answers to each question in the correct space on Crowdmark
• MarkUs (link: https://markus.teach.cs.toronto.edu/markus/courses/25): Submit the .qmd file with your answers to Q1 to MarkUs. Please just save a copy of the questions and enter your answers in the space provided. You don’t have to use the .qmd file to answer the other questions if you don’t want to.
• You can submit as many times as you like before the deadline
• Email submissions will NOT be accepted
Other notes:
• For some questions, you will need to use R; please include all code and relevant output in the pdf submission (make sure the code is visible in the pdf and doesn’t run out of the margins). If the question asks you to answer a question based on R output, make sure your answer is easy for your TA to find (i.e. start with your answer, then have the code and output afterwards for reference). Your TA should be able to understand your answer without looking at your code and output, but may refer to these to better understand what you did.
• While you may discuss questions with your classmates, you MUST submit independent work that you did yourself. Students submitting identical solutions (e.g. identical sentences, derivation steps, or chunks of code) will be investigated for violations of academic integrity.
• If you believe you’ve found a typo or error in the assignment, please email sta475@utoronto.ca so I can look into it and get back to the class as quickly as possible.
Question 1 [12 points] To answer the questions below, refer to the article titled “Breastfeeding Rates and Related Factors at 1 Year Postpartum in Women with Gestational Diabetes Initially Recruited for a Diabetes Prevention Program”. In Table 2, two sets of results are included (unadjusted and adjusted models). Find the relevant section of the article related to this table to understand the difference between these. For each of the following quantities (i) write the regression equation for the corresponding model, carefully defining any notation you introduce, (ii) fill in the steps to build up your interpretation (the steps are given at the end of this question for your reference), and (iii) write a clear and complete interpretation of the estimated coefficient in the context of the data.
(a) The coefficient for mothers who had breastfeeding troubles (unadjusted model)
(b) The coefficient for mothers who had breastfeeding troubles (adjusted model)
Steps for building up an interpretation
Step 1 (Identify): Identify the target for interpretation.
Step 2 (Think): Is it “better” to have larger or smaller values of the target?
Step 3 (Identify): Identify the comparison of interest.
Step 4 (Analyze): Determine the direction of the effect.
Step 5 (Write): Fill in the basic sentence frame:
The ____ for individuals with ____ is ____ times that of individuals with ____.
Step 5b (Rewrite, if applicable): Rewrite Step 5 so that the value reported is bigger than 1 (if it isn’t already).
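Step 5b relies on the fact that reversing the direction of a comparison inverts the ratio. A hypothetical illustration (the value 0.5 is made up and does not come from the article; A and B are placeholder group labels):

```latex
\[
\frac{\text{target}_{A}}{\text{target}_{B}} = 0.5
\quad\Longleftrightarrow\quad
\frac{\text{target}_{B}}{\text{target}_{A}} = \frac{1}{0.5} = 2
\]
```

In words, "the target for A is 0.5 times that of B" can be restated as "the target for B is 2 times that of A".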
Question 2 [16 points] The data in leukemia.csv contains survival times for patients with leukemia, measured in weeks from diagnosis. Values of two predictors are also recorded: white blood cell counts (wbc) at diagnosis and AG, a binary predictor that indicates whether a test related to white blood cells was positive (1) or negative (0).
leuk <- read_csv("leukemia.csv")
Rows: 33 Columns: 4
-- Column specification --------------------------------------------------------
Delimiter: ","
dbl (4): time, status, AG, wbc
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(leuk)
Rows: 33
Columns: 4
$ time   <dbl> 65, 140, 100, 134, 16, 106, 121, 4, 39, 121, 56, 26, 22, 1, 1, ~
$ status <dbl> 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
$ AG     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, ~
$ wbc    <dbl> 2.30, 0.75, 4.30, 2.60, 6.00, 10.50, 10.00, 17.00, 5.40, 7.00, ~
(a) Use graphical methods to assess how the survival time is affected by wbc, log(wbc), and AG. Hint: Consider using K-M plots, scatterplots, or other visualizations as appropriate. For each visualization in your answer, write 1-2 sentences summarizing your observations regarding the association between the predictor and survival times.
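As a starting point for the K-M plots mentioned in the hint, here is a minimal sketch using the survival package. It assumes that status = 1 marks an observed event and status = 0 a censored time (the handout does not state the coding), and the function name plot_km_by_ag is hypothetical.

```r
library(survival)

# Hypothetical helper: draws Kaplan-Meier curves separately for AG = 0 and AG = 1.
# ASSUMPTION: status == 1 means the event was observed, status == 0 means censored.
plot_km_by_ag <- function(leuk) {
  # One Kaplan-Meier curve per level of AG
  km_ag <- survfit(Surv(time, status) ~ AG, data = leuk)

  # Strata are ordered by the sorted values of AG, so line type 1 is AG = 0
  plot(km_ag, lty = 1:2,
       xlab = "Weeks from diagnosis",
       ylab = "Estimated survival probability")
  legend("topright", legend = c("AG = 0", "AG = 1"), lty = 1:2)

  # Return the fitted object invisibly so it can be inspected further
  invisible(km_ag)
}
```

With the data loaded as above, plot_km_by_ag(leuk) would draw the two curves. A continuous predictor such as wbc needs a different display (for example a scatterplot, or K-M curves after grouping), which is left to you.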
(b) Fit a parametric regression model to these data with appropriate predictors. Write the model equation for the model you fit, as well as the summary of model estimates from R. Be sure to define any covariates you introduce.
(c) Carry out residual analysis to assess the fit of the model you chose in the previous
part. If you find that the fit is not suitable based on the residuals, you should fit a
different model in (b) and repeat the residual analysis - the goal here is to end up with a model which is suitable for these data.
(d) Based on your model from (b) and (c), interpret the estimated regression
coefficients and how the survival time depends on the covariates (you don’t need to interpret the intercept).
(e) Now you’ll fit a Cox Proportional Hazards model to these data (you choose what predictors to include) and carry out residual analysis to check if the fit is adequate (if it isn’t, try to build another model that has better fit).
(f) Interpret the estimates for your fitted Cox PH model from (e).
Question 3 [15 points] The file bladder.csv contains data from a tumour recurrence study for patients with bladder cancer. Individuals had 0 to 3 recurrences during the 60 month (5-year) follow-up window, but for this analysis we will focus on the time to first recurrence, measured from entry into the study. Define the response time based on follow-up time (futime) and first recurrence time (in months) as follows: if the patient experienced a recurrence, the response is the first recurrence time; if the patient experienced no recurrences, they are considered ‘censored’ at the end of the follow-up time. The covariates include:
• treatment group: placebo (1) vs drug thiotepa (2)
• size: size of the largest initial tumour, measured in cm
• number: number of tumours at initial diagnosis
(a) Before you begin your analysis, perform the following modifications to the bladder data. (i) Define a new response variable called time according to the description above. (ii) Define a status variable to indicate which subjects experienced a recurrence and which were censored. (iii) Change the values of Group from 1/2 to more descriptive values. Print the first 10 rows of the modified bladder dataset in your solutions, to demonstrate that you’ve made the changes appropriately (you should leave all previous variables in the dataset, to facilitate this verification).
bladder <- read_csv("bladder.csv")
Rows: 86 Columns: 5
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (1): recurrence_time1
dbl (4): Group, futime, number, size
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
(b) In this question, you will fit Cox regression models to investigate the effect of these covariates on the time to first recurrence. Your investigation should include the following
sections: (A) Initial graphical exploration & commentary, (B) Model fitting and model comparison - you should fit at least two models and compare them using hypothesis
test(s), (C) Residual analysis - use deviance residuals only for this investigation, and (D) Summary of final results and conclusions, including interpreting the estimates in the context of the problem.
Be sure to write a few sentences describing and summarizing each part of your analysis process (in addition to providing R code and relevant output).
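For parts (B) and (C), the mechanics of comparing two nested Cox models and extracting deviance residuals can be sketched as follows. This is only an illustration under assumptions: it presumes the time and status variables from part (a) already exist, the two model formulas are arbitrary choices rather than recommended ones, and the function name compare_cox_models is hypothetical.

```r
library(survival)

# Hypothetical helper: fits two nested Cox models and compares them.
# ASSUMPTION: dat already contains time, status, Group, size and number.
compare_cox_models <- function(dat) {
  # Smaller model: treatment group only (an arbitrary illustrative choice)
  fit_small <- coxph(Surv(time, status) ~ Group, data = dat)

  # Larger model: adds size and number
  fit_large <- coxph(Surv(time, status) ~ Group + size + number, data = dat)

  # Likelihood ratio test of the smaller model against the larger one
  lrt <- anova(fit_small, fit_large)

  # Deviance residuals from the larger model, one per subject
  dev_res <- residuals(fit_large, type = "deviance")

  list(lrt = lrt, deviance_residuals = dev_res)
}
```

The returned deviance residuals can then be plotted against subject index or against each covariate to look for outliers or patterns.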
Question 4 [5 points]
In this question, you’ll watch a brief interview with David Cox (first 12 minutes of https://www.youtube.com/watch?v=TiHCNRUiLKc&t=731s).
(a) Write 100-200 words briefly describing his motivation for developing this new model.
(b) If you used any information outside of the above video in your response,
please provide a citation for this information. If you used no other information, please write “I did not use any other information to produce my response.”
(c) If you used an LLM tool to generate any part of your response, please
indicate how you used it here and include a transcript of the conversation you
had. Reasonable use of an LLM is allowed to improve your answer, but the ideas should be your own.
Question 5 [10 points]
The following paragraph is a somewhat incorrect description of the Cox Proportional Hazards model. Identify **all** errors in the passage, and for each one, provide a justification (please use bullet points to clearly identify each error, followed by an explanation of why it is incorrect).
The Cox Proportional Hazards model is widely used in survival analysis due to its flexibility and robust theoretical foundation. Its popularity stems from the ability to handle multiple covariates while making minimal assumptions about the underlying data structure. A defining feature of this model is that it assumes that the baseline hazard function follows a Weibull distribution, which helps with computational efficiency. Like other survival models, it estimates the instantaneous risk of an event occurring, and the relationship between covariates and survival is expressed through the hazard ratio. The hazard ratios between different groups must remain constant over time - this is called the proportional odds assumption. As a fully parametric model, it provides direct estimates of survival times and allows us to explicitly interpret the magnitude of the baseline hazard at any time point.