STAT5002: Introduction to Statistics - Semester 1, 2022
Submission Due Date: Friday, 27th May, 2022 before 11:59 pm (Sydney time)
Instructions:
1. You are required to type up your entire assignment, including any equations. If you are using Word,
you should use the equation editor for any maths notation.
2. Copy and paste relevant R code and outputs while discussing your answer in the text. Do not put
all R code and outputs at the end of the document.
3. Answer all questions in the given order; i.e., 1(a), 1(b), etc. Keep your answers clear, brief, and concise.
4. Convert your assignment to PDF and upload it to the Turnitin assignment
box on Canvas.
5. Data used in this assignment are in the spreadsheet ADataset.xlsx.
6. You MUST write up solutions on your own. Do not discuss the assignment with your classmates.
Students caught cheating will automatically receive a mark of 0 and are subject to disciplinary action.
7. This assignment carries a weight of 8% towards your final mark for STAT5002.
1. Assume that the marks in the following subjects are normally distributed:
Subject        Mean (μ)   Standard deviation (σ)
Statistics        50              12
Economics         65              10
Mathematics       76               8
(a) Douglas obtained a final mark of 68 in Statistics, 73 in Economics, and 71 in Mathematics. In
which subject did he perform best compared to the rest of the class?
(b) Maria’s z scores were 1.2 in Statistics, -0.5 in Economics, and -1.5 in Mathematics. Calculate Maria’s
mark in the subject where she performed worst compared to the rest of the class.
(c) Examiners often use z scores to scale marks via a new mean and a new standard deviation.
The new marks are then directly comparable. Calculate the scaled marks with a new mean of
100 and a new standard deviation of 20 in each subject for Douglas and Maria.
(d) Refer to part (c). Who had the best overall performance?
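The z-score scaling described in part (c) can be sketched as follows. This is a minimal illustration, not a model answer: the function names are hypothetical, and the student mark of 62 is made up for the example (only the Statistics mean and standard deviation come from the table above).

```python
def z_score(mark, mean, sd):
    """Standardise a raw mark: how many standard deviations it lies from the mean."""
    return (mark - mean) / sd


def mark_from_z(z, mean, sd):
    """Invert the z-score: recover a mark on a scale with the given mean and sd."""
    return mean + z * sd


def scale_mark(mark, mean, sd, new_mean=100.0, new_sd=20.0):
    """Rescale a raw mark to a new mean and standard deviation via its z-score."""
    return mark_from_z(z_score(mark, mean, sd), new_mean, new_sd)


# Hypothetical student with a raw mark of 62 in Statistics (mean 50, sd 12).
example_z = z_score(62, 50, 12)
example_scaled = scale_mark(62, 50, 12)
print(example_z, example_scaled)
```

Because every subject is mapped onto the same mean and standard deviation, scaled marks from different subjects can be compared or averaged directly.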
2. It has long been known that brain weight scales with body weight across large groups of animals.
The data were collected on n = 24 mammals and are found in the Q2 sheet of the Excel spreadsheet.
Let X be the body weights (kg) and Y be the brain weights (g).
(a) Produce a scatter plot of "Brain weights (g)" versus "Body weights (kg)". Make sure you label
your axes properly and that your graph has an appropriate title. Briefly describe the nature
of the relationship between these two variables. Are there any outliers? If yes, can we remove
them? Why or why not?
(b) You would like to build a linear regression model to predict brain weight (g) from body weight
(kg). Which model (linear-linear, log-linear, linear-log, or log-log) fits the data best? Provide
visual evidence to support your argument. Write down the model of your choice.
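For reference, the four candidate models in part (b) can be written out as below. This assumes one common naming convention, in which the first word describes the response Y and the second the predictor X; that convention is an assumption, so check it against the course notes. The coefficients β0, β1 and the error term ε are generic notation, not taken from the assignment.

```latex
\begin{align*}
\text{linear-linear:} \quad & Y = \beta_0 + \beta_1 X + \varepsilon \\
\text{log-linear:}    \quad & \log Y = \beta_0 + \beta_1 X + \varepsilon \\
\text{linear-log:}    \quad & Y = \beta_0 + \beta_1 \log X + \varepsilon \\
\text{log-log:}       \quad & \log Y = \beta_0 + \beta_1 \log X + \varepsilon
\end{align*}
```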
3. The dataset Q3 contains the following information on a sample of n = 36 severely depressed
individuals.
Variable   Description
Eff        Measure of the effectiveness of the treatment
Age        Age (years)
Tmt        Treatment received (A, B or C)
(a) Produce a scatter plot of Eff versus Age. What does it show?
(b) Run a regression of Eff on Age. Write down the fitted regression equation.
(c) Produce another scatter plot of Eff versus Age but this time with colour coding and different
regression lines for each of the three treatments. Does the treatment appear to interact with
age in explaining the response? Explain why or why not.
(d) Code up dummy variables for treatments A and B as well as an interaction between Age and
each of treatments A and B. Attach the R code to show how you create the dummies and
interaction terms. Why don’t we need a dummy variable for treatment C?
(e) Using Age, the dummies, and the interactions as predictors, perform backward elimination
to obtain the best model according to the AIC. Write down the final estimated regression
equation. What percentage of the total variation in Eff is explained by the model?
(f) Use the partial F test to determine which model [the one in part (b) or the one in part (e)]
fits the data better. Include mention of H0 and H1, the observed value of the test statistic, the
p-value, the decision, and conclusion.
(g) Predict the effectiveness of treatments for the following people:
Patient   Age   Treatment
Peter      20   A
Anna       56   B
Louis      69   C
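As a notational aid for parts (d) to (f), the starting model with dummies and interactions, and the general form of the partial F statistic, can be sketched as below. The symbols are assumptions introduced here, not taken from the assignment: D_A and D_B denote the treatment dummies, SSE_R and SSE_F the residual sums of squares of the reduced and full models, q the number of extra predictors in the full model, and p the total number of predictors in the full model.

```latex
\begin{align*}
\text{Eff} &= \beta_0 + \beta_1\,\text{Age} + \beta_2 D_A + \beta_3 D_B
             + \beta_4\,(\text{Age} \times D_A) + \beta_5\,(\text{Age} \times D_B) + \varepsilon, \\
D_A &= \begin{cases} 1 & \text{if Tmt} = A \\ 0 & \text{otherwise} \end{cases}
\qquad
D_B = \begin{cases} 1 & \text{if Tmt} = B \\ 0 & \text{otherwise} \end{cases} \\[6pt]
F &= \frac{(\text{SSE}_R - \text{SSE}_F)/q}{\text{SSE}_F/(n - p - 1)}
\;\sim\; F_{q,\; n - p - 1} \quad \text{under } H_0
\end{align*}
```

The model chosen by backward elimination may drop some of these terms, in which case p and q should be counted from the final model rather than from the starting one.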
4. As part of the 2020 College Alcohol Study, students who drank alcohol in their senior year were
asked if drinking ever resulted in missing a class. The data are given in the following table:
                              Drinking Status
Missed a class   Nonbinger   Occasional binger   Frequent binger   Total
No                      41                  18                11      70
Yes                      4                   8                18      30
Total                   45                  26                29     100
(a) At the 0.05 level of significance, is there evidence of a significant association between missing
a class and drinking status? Include mention of H0 and H1, the observed value of the test
statistic, the p-value, the decision, and conclusion.
(b) What is the conditional distribution of drinking status?
(c) What are the odds that a nonbinger missed a class?
(d) What are the odds that a frequent binger missed a class?
(e) What is the odds ratio of missing a class for nonbingers versus frequent bingers?
(f) Fit a logistic regression of whether a senior student never missed a class on drinking status. Treat
frequent bingers as the base group. Write down the fitted regression equation.
(g) Refer to part (f). What is the odds ratio of missing a class for nonbingers versus frequent
bingers? Is it the same as your calculation in part (e)? Explain why or why not.
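The quantities used in Question 4 (odds, odds ratio, and the chi-square statistic for a contingency table) can be sketched as follows. This is a minimal illustration with hypothetical function names and a made-up 2 x 2 table, not the data from the table above. It computes the test statistic and degrees of freedom only; the p-value would still need to be obtained from the chi-square distribution.

```python
def odds(events, non_events):
    """Odds of an event: count with the event divided by count without it."""
    return events / non_events


def odds_ratio(a_events, a_non_events, b_events, b_non_events):
    """Odds of the event in group A relative to the odds in group B."""
    return odds(a_events, a_non_events) / odds(b_events, b_non_events)


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical counts: rows are outcome (No, Yes), columns are groups (G1, G2).
hypothetical_table = [[40, 20],
                      [10, 20]]
stat, df = chi_square_statistic(hypothetical_table)
ratio = odds_ratio(10, 40, 20, 20)  # odds of "Yes" in G1 relative to G2
print(stat, df, ratio)
```

An odds ratio below 1 means the event is less likely, in odds terms, in the first group than in the second; swapping the groups or the outcome gives the reciprocal.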