BIOS5001 / PUBH6001 Introduction to Biostatistics
2024 – 2025 (Term 1)
Assignment 3
1. (viral.sav) The SPSS dataset viral.sav contains 6 variables measured on 24 HIV positive subjects:
Age = age of patient in years
Risk = 1 if the patient's risk factor was MSM, or 2 if the risk factor was heterosexual
Days = days from symptom onset until blood sample was taken
CD4 = CD4 cell count in 10^6 per liter
Viral = Blood viral load
Lgviral = Log10(viral load)
Your goal is to find the best linear regression model for predicting blood viral load (the outcome variable), expressed as either Viral or Lgviral, using the other 4 variables as potential predictors.
(a) From the six scatterplots of the two potential outcome variables vs. the three quantitative predictor variables, decide which of the outcome variables (Viral or Lgviral) would be more appropriate for linear regression analysis (6 marks). On what characteristic of the scatterplots did you base your choice (10 marks)?
(b) Use backward elimination to determine a model for predicting the blood viral load of a future patient and show the “Coefficients” table in the SPSS/PSPP output (10 marks). Write down the equation of your final model (8 marks).
(c) From your final model in (b), what would be the fitted value (8 marks) and the residual (8 marks) for the first subject in the dataset? This subject was 28 years old, had a CD4 count of 361, had 24 days between symptom onset and sampling, and had the “MSM” risk factor. The observed blood viral load is 186208.71 or, equivalently, the log10 of the observed blood viral load is 5.27. (Please write down the calculation steps.)
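Note on the calculation steps in (c): the fitted value is the intercept plus each retained coefficient multiplied by the subject's value, and the residual is the observed value minus the fitted value. The Python sketch below only illustrates this arithmetic. The intercept and coefficients are hypothetical placeholders, not estimates from viral.sav, and the choice of the log10 scale and of all four predictors is likewise made only for illustration; the same steps apply to whichever outcome and predictors your own final model uses.

```python
import math

# HYPOTHETICAL values for illustration only; they are NOT estimated from
# viral.sav. Replace them with the "Coefficients" table of your final model.
intercept = 5.5
coefficients = {"Age": 0.01, "Risk": -0.1, "Days": 0.005, "CD4": -0.002}

# First subject as described in (c); Risk = 1 codes the MSM risk factor.
subject = {"Age": 28, "Risk": 1, "Days": 24, "CD4": 361}

def fitted_value(intercept, coefficients, subject):
    """Intercept plus the sum of coefficient * predictor value."""
    return intercept + sum(b * subject[name] for name, b in coefficients.items())

# The two observed values quoted in (c) are the same measurement on two scales.
observed_viral = 186208.71
observed_lgviral = math.log10(observed_viral)

fitted = fitted_value(intercept, coefficients, subject)
residual = observed_lgviral - fitted  # residual = observed - fitted

print(f"log10(observed viral load) = {observed_lgviral:.2f}")
print(f"fitted value (hypothetical model) = {fitted:.3f}")
print(f"residual (hypothetical model) = {residual:.3f}")
```

If your final model drops a predictor, remove it from the dictionary; if it uses Viral rather than Lgviral as the outcome, compare the fitted value with 186208.71 instead.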
2. (disease.sav) The SPSS dataset disease.sav contains 3 variables measured on 200 patients with Disease A:
status = 1 if the patient died, or 0 if the patient survived
agemid = midpoint of the age group to which the patient belonged
gender = 0 for females and 1 for males
(a) Please conduct a simple (univariate) logistic regression, with status as the outcome variable and gender as the predictor variable (female as reference level) and show the “Variables in the Equation” table in the SPSS/PSPP output (10 marks). What is the odds ratio (males:females) for mortality (4 marks)? Is there a statistically significant difference in mortality between males and females (please explain using the p-value from your output) (6 marks)?
(b) Please conduct a multiple logistic regression with status as the outcome variable and both gender (female as reference level) and agemid as predictors and show the “Variables in the Equation” table in the SPSS/PSPP output (10 marks). What is the adjusted odds ratio (males:females) for mortality (4 marks)? Is the difference in mortality between males and females statistically significant adjusted for other variables in the model (please explain using the p-value from your output) (6 marks)? How would you explain the change in the coefficient of gender between the two models (10 marks)?
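Note on reading the “Variables in the Equation” table: the odds ratio for a predictor is the exponential of its coefficient B (SPSS reports it as Exp(B)), and an approximate 95% confidence interval is obtained by exponentiating B plus or minus 1.96 standard errors. The Python sketch below shows this conversion. The values of B and its standard error are hypothetical placeholders, not results from disease.sav.

```python
import math

def odds_ratio(b):
    """Odds ratio for a one-unit increase in the predictor: exp(B)."""
    return math.exp(b)

def odds_ratio_ci(b, se, z=1.96):
    """Approximate 95% Wald confidence interval for the odds ratio."""
    return math.exp(b - z * se), math.exp(b + z * se)

# HYPOTHETICAL coefficient and standard error for gender (male vs. female);
# replace them with the values from your own output.
b_gender = 0.5
se_gender = 0.3

or_gender = odds_ratio(b_gender)
ci_low, ci_high = odds_ratio_ci(b_gender, se_gender)

print(f"odds ratio (males:females) = {or_gender:.3f}")
print(f"approximate 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

A coefficient of 0 corresponds to an odds ratio of 1, that is, no difference in the odds of death between the two groups.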