STAT7055 Introductory Statistics for Business and Finance

Research School of Finance, Actuarial Studies and Statistics

ASSIGNMENT
Semester 2, 2021

© 2021 ANU

INSTRUCTIONS TO STUDENTS

Due Date
• The assignment is due at 6:00pm on Friday October 22. Note that this is different to the time and date stated in the class summary.
• Late submission of the assignment is not permitted. An assignment submitted without an extension after the due date will receive a mark of 0.

Obtaining your Assignment
• There are different versions of the assignment and each student will be assigned a particular version of the assignment.
• Therefore, you must log in to Wattle with your own ANU credentials and download your assignment directly from Wattle.

Writing your Assignment
• The assignment is an individual piece of assessment and must be completed on your own.
• You will be required to write a report in an R Markdown document that contains R code, R output and written text. An example of an R Markdown document, which you can use as a template, has been provided on Wattle.
• When answering the assignment questions in your report, you will need to include all your R code and R output that you used to calculate any answers and you must also write your answers in proper sentences. For example, if you are required to calculate a sample mean, then you would include your R code and R output for calculating the sample mean and you would also write a proper sentence in the report such as “The sample mean is equal to ...”.
• Make sure to be clear and concise in your answers.

• A good way to approach writing your report is to imagine that you are a statistical consultant and that a client has asked you to do some statistical analyses. When presenting the results of your analyses to the client, you wouldn’t just give them pages of R code or pages of R output. Rather, you should give them a proper report which clearly outlines and explains the results of the analyses and which also includes the R code and R output used to produce the results.
• Once you have finished writing your report in your R Markdown document, you will need to render the document by pressing the Knit button in RStudio to create a HTML file of your report.
• Further to the above point, it is good practice to regularly Knit your R Markdown document as you write your report. This is useful for checking that it’s rendering properly.

Submitting your Assignment
• Submission of the assignment will be through Wattle via Turnitin.
• A Turnitin link with further details regarding assignment submission will be provided on Wattle.
• For submission you will need to submit two files: the R Markdown file of your report (i.e., a “.Rmd” file) and the rendered HTML file of your report produced by pressing the Knit button in RStudio (i.e., a “.html” file).
• Please name your two files as “uNNNNNNN.Rmd” and “uNNNNNNN.html”, where uNNNNNNN is your student number.
• No other file types can be submitted, e.g., “.R”, “.docx”, “.RData”, “.zip”, etc., files will not be accepted. In particular, do not submit any compressed files.

Other Important Details
• You may only use built-in functions available in base R and you are not permitted to use functions in any additional R packages (e.g., ggplot2).
• You must use the appropriate R functions (and not the statistical tables) to calculate critical values or p-values for the normal, t and F distributions.
• Round all final numeric answers to 4 decimal places. However, as you will be using R, keep all decimals during all intermediate steps to ensure the accuracy of your final numeric answer.
• Please use the help function if you want to learn more about a particular R function, e.g., enter help(mean) in the R console to learn more about the mean function.
• For questions that require writing mathematical symbols, you are welcome to use shorthand notation, provided you make the meaning clear (e.g., using “Mu” for µ, or “!=” for ≠).
• Answers need to be written in the text of the R Markdown document and not in the comments of code chunks.
• Do not print out the entire data sets in your R Markdown document, as this will only take up unnecessary space.
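The requirements above on critical values, p-values and rounding might look like the following in practice. This is only a minimal sketch using base R: the significance level, test statistic and degrees of freedom are hypothetical values chosen for illustration and are not taken from the assignment data.

```r
# Hypothetical values for illustration only (not from the assignment data)
alpha  <- 0.05   # assumed significance level
z_stat <- 1.85   # assumed test statistic from a z-test
t_df   <- 24     # assumed degrees of freedom for a t-test

# Critical values from base R quantile functions instead of statistical tables
z_crit <- qnorm(1 - alpha / 2)              # two-sided normal critical value
t_crit <- qt(1 - alpha / 2, df = t_df)      # two-sided t critical value
f_crit <- qf(1 - alpha, df1 = 3, df2 = 40)  # upper-tail F critical value (assumed df)

# Two-sided p-value for the assumed z statistic
p_value <- 2 * pnorm(abs(z_stat), lower.tail = FALSE)

# Keep full precision in the stored objects; round only when reporting
round(c(z_crit = z_crit, t_crit = t_crit, f_crit = f_crit, p_value = p_value), 4)
```

Storing the unrounded values and calling round only in the final line keeps all decimals during the intermediate steps, so only the reported figures are rounded to 4 decimal places.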

Question 1 [14 marks]

A streaming service provider would like to better understand the weekday viewing habits of their subscribers. In each year from 2015 to 2019, the provider conducted a short survey where they randomly selected 200 subscribers and asked each subscriber the following two questions:
1. The weekday on which they used the streaming service the most.
2. Their age.
Note that each year, a new random sample of subscribers was selected for the survey. The data are stored in the file AssignmentData.RData in the data frame Q1.df. The data frame contains two columns for each year, one for the weekday on which the subscriber used the streaming service the most and one for the subscriber’s age. For example, for the year 2015 there are two columns, Weekday2015 and Age2015.

For parts (a), (b) and (c), you will be analysing data from the 2015 survey.

(a) [4 marks] Create a boxplot and a histogram of the subscribers’ ages for the 2015 survey. Make sure to give each plot a proper descriptive title and label the x-axis of the histogram appropriately (do not just use the default title or labels). Based on these plots, describe the distribution of the subscribers’ ages for the 2015 survey. Be specific in your description, making sure to mention any interesting and/or important aspects of the distribution.

(b) [3 marks] Test whether the population proportion of subscribers in 2015 that are at least 25 years old is less than 0.4. Clearly state your hypotheses, making sure to define any parameters, and use a significance level of α = 1%. Do not use any R functions that are designed to perform hypothesis tests.

(c) [3 marks] Create a bar chart describing the weekdays on which subscribers used the streaming service the most for the 2015 survey. Think carefully about how to appropriately present this data in the bar chart. Make sure to give the bar chart a proper descriptive title and label the x-axis appropriately (do not just use the default title or labels). Determine the mode of this data.

(d) [4 marks] Test whether the population proportion of subscribers for which Monday or Tuesday was the weekday on which they used the streaming service the most is the same in 2015 and 2016. Clearly state your hypotheses, making sure to define any parameters, and use a significance level of α = 5%. Do not use any R functions that are designed to perform hypothesis tests.

Question 2 [25 marks]

A statistical organisation conducted a survey about general happiness in the community. A random sample of people, taken from the population of all Australians, was selected to participate in the survey. The survey ran from January to May and involved each participant completing a questionnaire at the end of each month. For each month, based on the answers to the questionnaire, a score between 0 and 100 was calculated for each participant, with higher scores indicating greater happiness. The data are stored in the file AssignmentData.RData in the data frame Q2.df. Each row corresponds to a participant, and their happiness scores for the five months are given in the columns named by the month (Jan, Feb, Mar, Apr and May). Also included in the data set is each participant’s home state (State).

For parts (a) and (b), you will be analysing the happiness scores from January.

(a) [4 marks] Test whether the population variance of happiness scores is the same for the ACT and NSW. Clearly state your hypotheses and use a significance level of α = 5%. Do not use any R functions that are designed to perform hypothesis tests.

(b) [4 marks] Calculate the 80% confidence interval for the difference in mean happiness score between the ACT and NSW. Give a clear interpretation of the confidence interval. Do not use any R functions that are designed to calculate confidence intervals or perform hypothesis tests.

(c) [4 marks] Considering only people who call QLD their home state, test whether the mean happiness score for January is less than the mean happiness score for February. Clearly state your hypotheses and use a significance level of α = 2.5%. Do not use any R functions that are designed to perform hypothesis tests.

You will now conduct a one-way ANOVA on the happiness scores from January with home state as the factor.

(d) [2 marks] Calculate the sum of squares for error for the one-way ANOVA. Do not use any R functions that are designed to perform, analyse or interpret an ANOVA.

(e) [3 marks] Test whether the mean happiness score is the same for all home states represented in the data. Clearly state your hypotheses and use a significance level of α = 5%. Do not use any R functions that are designed to perform hypothesis tests or to perform, analyse or interpret an ANOVA.

(f) [8 marks] Discuss whether the assumptions for a one-way ANOVA hold for this data. You do not need to conduct any hypothesis tests, but make sure to provide clear justifications for your answer.

Question 3 [31 marks]

A top-ranked medical school has a 5-year postgraduate medical program designed for students seeking to become fully-qualified doctors. As part of the application process, applicants to the program are required to take an entrance exam. Applicants who are successfully admitted to the program will also take a general exam at the end of each year of the program. A random sample of 400 students that completed the program was selected and each student’s scores in each of the six exams (Entrance, Year1, Year2, Year3, Year4 and Year5) were collated into a data set. In addition to the exam scores, the data set includes a variable (Biology) that indicates whether each student majored in biology in their undergraduate degree (1 if they did and 0 otherwise). The data are stored in the file AssignmentData.RData in the data frame Q3.df.

The director of the postgraduate medical program would like to investigate any relationships that may exist between the entrance exam scores and the scores in the exams taken during the program. For this question, you will be analysing the scores for the entrance exam (Entrance) and the exam taken at the end of year 1 of the program (Year1).

(a) [3 marks] Create a scatter plot of the end of year 1 exam scores against the entrance exam scores. Make sure to give your plot an appropriate title and appropriate labels for the x and y axes. Describe the relationship between these two variables.

Answer parts (b) to (d) without using the lm function or any other R function designed to fit, analyse or interpret regression models.

(b) [4 marks] Fit a simple linear regression model with the end of year 1 exam score as the dependent variable and the entrance exam score as the independent variable. Write down the estimated regression model.

(c) [6 marks] Discuss whether the assumptions for a simple linear regression model hold for the model you fitted in part (b), making sure to provide clear justifications for your answer. Note that you will need to use R to manually calculate the residuals.

(d) [4 marks] For the model you fitted in part (b), test whether the coefficient parameter for entrance exam score is less than −0.2. Clearly state your hypotheses and use a significance level of α = 5%. Do not use any R functions that are designed to perform hypothesis tests.

The remaining parts of this question can be answered using the lm function.

(e) [2 marks] Create the scatter plot of part (a) again, but now colour all the points corresponding to students who majored in biology in their undergraduate degree in red and all the other points in black. Describe the relationship between the two variables for students who majored in biology in their undergraduate degree.

(f) [3 marks] Considering only the students who majored in biology in their undergraduate degree, fit a simple linear regression model with the end of year 1 exam score as the dependent variable and the entrance exam score as the independent variable. Write down the estimated regression model. You may find it useful to create a new data frame that only contains the relevant data.

(g) [5 marks] Discuss whether the assumptions for a simple linear regression model hold for the model you fitted in part (f), making sure to provide clear justifications for your answer.

(h) [4 marks] Using the model fitted in part (f), calculate a 95% prediction interval for the end of year 1 exam score for a student that scored 60 on their entrance exam. Comment on the accuracy of this prediction interval. Do not use any R functions that are designed to calculate any predictions, confidence intervals or prediction intervals.

END OF ASSIGNMENT