
MATH1041 Statistics for Life and Social Sciences

Term 2, 2024

MATH1041 Assignment

Data: Together with this document, you should have received your unique dataset in an e-mail sent to your oﬃcial university email address. The data (i.e., your dataset) is available in a text ﬁle with the name 5474908.txt. If you have not received your dataset (double check your UNSW email inbox and the spam folder), please contact your lecturer.

Submission due date: Tuesday 23rd July (Week 9) before 11:59 PM (Sydney time, AEST). Note a late penalty of 5% of the maximal possible mark per day will apply. No assignment will be accepted more than ﬁve days after the deadline.

Your submission must contain your full name and student zID at the top of the ﬁrst page of your assignment. Submit your assignment through Turnitin via Moodle. See the “Assessments Hub” section on Moodle for further information regarding online submission.

Please submit a neatly typed assignment either as a Microsoft Word document (.doc or .docx) or as a PDF document (.pdf) created, for instance, using Google Docs, LaTeX, RMarkdown or similar tools. See the information and help about the assignment in the assessment section on Moodle. For your convenience, a Microsoft Word template already formatted for this assignment can be downloaded from Moodle.

Verify your assignment has been submitted correctly by downloading the submission receipt and clicking on the link to check it displays correctly in the Turnitin viewer. If not, it is your responsibility to make the necessary amendment.

Typesetting (*)		/2
Q1		/14
Q2		/16
Q3		/16
Q4		/6
Q5		/19

Total		/73

(*) See the next pages and the “Assessments Hub” on Moodle for details, help and explanations about the assignment and typesetting.

Assignment format

Keep in mind this assignment is not only about assessing your Statistical skills; it is also about giving you feedback on your Mathematical writing skills. The assignment must be typeset correctly and provide complete but concise explanations in complete English sentences and paragraphs. Think of this as practice for a document you might produce in your future studies or career that includes mathematical explanations.

Here are some more details that may assist you:

• Regarding the overall assignment structure, please answer all questions in the given order (that is, 1.a., 1.b., ... etc). Do not re-write the assignment questions again, only their label (write “3.e.” for instance when you start question 3.e.). Keep your answers brief, clear and concise.

DO NOT reproduce the cover sheet, i.e., the ﬁrst 5 pages of the pdf ﬁle sent to you, in your assignment.

• Start your answer to each Question (1, 2, etc.) on a new page. Sub-parts of a Question (such as Question 3.d., 3.e.) should continue on the same page.

• If you do include any R-code, it should be typed in a diﬀerent colour and/or font.

• You are required to type up your entire assignment (in Microsoft Word, Google Docs, LaTeX, Overleaf or RMarkdown), including any equations. The only exceptions are the plots produced by RStudio, which you can save as figures (use “export” in the bottom right window in RStudio) and then paste into your assignment. Nothing can be handwritten and then scanned. As a UNSW student, you can download Microsoft Word for free, see: https://www.myit.unsw.edu.au/software-students.

• As in any properly typeset document containing mathematical symbols, you must use an equation editor for all maths symbols. For instance, you should write “$X$ is normal” rather than “X is normal” (notice how the ‘X’ looks different?), and you should write “$t_{\text{obs}} = 1.23$” rather than “tobs = 1.23”.

The marking scheme for this criterion is the following: Are mathematical symbols typeset using the equation editor? 2 marks for ‘almost always’, 1 mark for ‘sometimes’, 0 marks for ‘rarely’.

Help about Microsoft equation editor can be found in a document called Microsoft Word Equation editor help for MATH1041 located on Moodle in the Assignment (20%) section within the Assessments Hub section of the MATH1041 Moodle page.

• You should add some working out for the questions involving calculations; do not just give the ﬁnal answer. Note that you may get partial marks for clear explanations and a correct method even if you get the wrong answer. However, try to keep your solutions brief and concise. Depending on what the question is asking, your working out could consist of RStudio commands, a formula, or perhaps the main steps explaining how you arrived at your answer. You do not need to add all your R-code.

• Keeping your results to 3 or 4 significant figures should be fine. If a calculation has multiple steps, do not round any numbers until you have reached the final step. To help you do calculations correctly in RStudio without rounding, store values as variables rather than copying the output number into a further calculation. For example, if you are constructing a confidence interval and need to calculate t*, you should write the code: tstar <- qt(0.975, df = 10) and then use the variable tstar in calculating your confidence interval, rather than pasting in the number 2.228139.

• There is no requirement for font size and line spacing but please make sure your assignment is readable — do not make the font size too small or the spacing too compact.

• If the question asks you to produce a graph/plot, you should always include that graph in your answer, unless otherwise speciﬁed.
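The advice above about storing intermediate values as variables can be sketched as follows. This is only an illustration: the summary values xbar, s and n are made up and are not taken from any dataset; only the call qt(0.975, df = 10) comes from the text above.

```r
# Hypothetical summary values, used only to illustrate the workflow
xbar <- 50.2   # sample mean (made up)
s    <- 4.7    # sample standard deviation (made up)
n    <- 11     # sample size (made up), so df = n - 1 = 10

# Keep the critical value at full precision in a variable
tstar  <- qt(0.975, df = n - 1)

# Reuse the stored variable instead of pasting a rounded number
margin <- tstar * s / sqrt(n)
ci     <- c(xbar - margin, xbar + margin)

# Round only when reporting the final result
signif(ci, 4)
```

Because tstar is reused directly, no rounding error is carried from one step of the calculation to the next.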

Scenario

During 2022, when COVID-19 was present, Dr Lafayellis lectured BIOL273, a course for second-year Biology students at Kensington University in Sydney. They asked the students in the course to fill in a survey:

(The variables are in alphabetical order. What you see on the right of this page is an excerpt of the answers to the survey.)

• accomodation: What are your living arrangements?

1 = I live with my parents/family, 2 = I live with roommates, 3 = I live alone

• age: Your age in years?

• anxiety: Dr Lafayellis had put in the survey a series of statements like ‘I overreact to situations’ or ‘I find it hard to calm down’ or ‘even if something bad happens to me, I rarely feel fear or anxiety’ and asked the students to choose between ‘very false for me’, ‘somewhat false for me’, ‘somewhat true for me’ and ‘very true for me’. They used these answers to calculate an anxiety score for each student, which is a number between 1 and 5, where 5 corresponds to extremely anxious.

• ATAR: What was your ATAR? If you did not get an ATAR, because you studied at a Greek or Chinese high school for instance, then enter NA for ‘Not Applicable’.

Background information: At the end of high school, students in NSW (New South Wales) take the HSC (High School Certiﬁcate) at the end of which they are granted an ATAR (Australian Tertiary Admissions Rank), which is a number between 0 and 100. The ATAR determines a student’s entry into Australian universities.

• Engl1st: Is English your ﬁrst language? 1 = yes, 2 = no.

• gender: What is your gender?

F = Female, M = Male, O = Other or ‘Prefer not to say’.

• highschool: What type of high school did you attend?

1 = Australian public school, 2 = Australian private school, 3 = Australian selective school and 4 = non-Australian high school (that is a high school which does not follow an Australian curriculum, like a high school in China for example).

• job: Do you have a job in parallel to university?

1 = No; 2 = Yes, part-time job, less than 20 hours per week; 3 = Yes, part-time job, 20-34 hours per week; 4 = Yes, full-time job, 35 hours per week or more.

• politorient: Which Australian political party has your preference? 1 = Liberal, 2 = Labor, 3 = Other.

• SES: How would you describe your family’s socioeconomic status on a scale of 1 to 5 where 1 = High and 5 = Low?

• WAM: From the Kensington University database, Dr Lafayellis had found the Weighted Average Mark (WAM) of each student, which is a number between 0 and 100 calculated using the ﬁnal marks obtained by that student in all the previous courses they have taken at Kensington University.

At that point, Dr Lafayellis had a table with one row per student and columns corresponding to the variables described above (they deleted the column with the names for anonymity purposes).

Some students did not answer, and among the students who did answer, a few left some fields blank. These show as NA in the dataset.

Two of Dr Lafayellis’ BIOL273 students, Dani and Xiao, are also taking a Statistics course called MATH1041, and they ask Dr Lafayellis for the results of the survey. Dr Lafayellis provides them with the dataset 5474908.txt, which you have received as a separate text file attachment by email. Your job is to assist Dani and Xiao by analysing this dataset. An excerpt of it was shown on the right of the previous page.

Reading the data into RStudio

The data are in a text ﬁle with the name 5474908.txt. This ﬁle was sent to you by e-mail (see page 1).

The first step is to read the data into RStudio. The data format is similar to what you have already seen in the Weekly Mobius lessons. Follow the instructions given in section R1.4 “How to import a text file into RStudio” of the RStudio “How-To-Manual” available on Moodle. Once you have imported the data, you are ready to start your analysis!

To make sure everything is all right, please check the values calculated from your ﬁle 5474908.txt match the values given below.

Note you may need to add the argument na.rm = TRUE to remove non-available (i.e., missing) values.

## mean anxiety = 2.096931, mean age = 21.58, mean WAM = 73.08899

They match? It means you imported the data correctly in RStudio. You are ready to start! (Note that print(x, digits = 9) would allow you to set to 9 the number of signiﬁcant digits to display for the variable x.)
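The import-and-check step described above might look like the sketch below. It is not the procedure from the How-To-Manual: it assumes a whitespace-separated text file with a header row, and it builds a tiny stand-in file with made-up values so that the code runs anywhere. With your own data you would point read.table() at your file instead.

```r
# Build a tiny stand-in file (made-up values, assumed layout: header row,
# whitespace-separated columns, missing values written as NA)
path <- tempfile(fileext = ".txt")
writeLines(c("age anxiety WAM",
             "20 2.1 71.5",
             "22 NA 68.0",
             "21 1.8 NA"), path)

# Read the file; the string NA is treated as a missing value by default
survey <- read.table(path, header = TRUE)

# Without na.rm = TRUE, a single missing value makes the mean NA
mean(survey$WAM)

# With na.rm = TRUE, missing values are dropped before averaging
mean(survey$WAM, na.rm = TRUE)

# Show more significant digits when comparing against a reference value
print(mean(survey$anxiety, na.rm = TRUE), digits = 9)
```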

The Analysis Tasks

The questions you need to answer in your assignment submission are given below.

Q1. Dani and Xiao decide to start by exploring the data.

1.a. Is it an observational study or is it an experiment? Provide a brief justiﬁcation for your answer.

1.b. Check the sample mean of the ATAR of the students in your dataset is 83.2751807 (we are giving you this value as one last check you are using the correct dataset, correctly imported).

If this is the case, calculate the standard deviation of the ATAR of the students in your dataset.

1.c. Dani is impatient to compare their ATAR to the ATAR of their BIOL273 classmates, and without waiting or thinking any further, Dani gets started on calculations.

Dani got an ATAR of 86. Calculate Dani’s z-score (the standardised value corresponding to their ATAR) and explain in a full sentence what this number is for Dani’s ATAR in relation to the ATAR of their classmates.
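For reference, a z-score is the observed value minus the sample mean, divided by the sample standard deviation. A minimal sketch on a made-up vector (not the assignment data) is given here; the names atar_toy and x_obs are hypothetical.

```r
# Made-up values with one missing entry, for illustration only
atar_toy <- c(78, 91, NA, 85, 88)
x_obs    <- 86   # the value to standardise

# Store the summaries as variables and drop missing values explicitly
m <- mean(atar_toy, na.rm = TRUE)
s <- sd(atar_toy, na.rm = TRUE)

# z-score: number of standard deviations x_obs lies above (+) or below (-) the mean
z <- (x_obs - m) / s
z
```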

1.d. Dr Lafayellis, who is keeping an eye on what Dani and Xiao are doing with the data, looks at Dani’s ATAR calculations and says: “Calculating the standard deviation and your z-score should not have been your first step.” They smile and add: “In fact, I know for sure you can discard some of the values in the data because they are wrong.”

Write, in no more than one sentence, what Dani should have done before computing any numerical summary to check whether any of the data are nonsensical and should be removed.

(N.B. You are not required to do it, just say what Dani should have done).

1.e. Xiao lives with their parents and Dani wonders if this is the exception or the rule.

To ﬁnd out, using the data provided in the ﬁle 5474908.txt which was shared with you, Dani runs the following R-code and obtains the graph below.

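# Cross-tabulate living arrangements (rows) by first language (columns)
counts <- table(survey$accomodation, survey$Engl1st)
# Grouped bar chart: one group of bars per first-language category
barplot(counts,
        main = "Living arrangements by First language",
        names.arg = c("English as \n First language", "English as \n Second language"),
        col = c("royalblue", "indianred", "orange"),
        beside = TRUE)
# Legend matching the bar colours to the living arrangements
legend(5, 55,
       legend = c("live w. parents", "live w. roommates", "live alone"),
       fill = c("royalblue", "indianred", "orange"),
       cex = 1)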

Dani looks at the graphical summary produced by R and exclaims: “My graph shows the most common living arrangement for Kensington University students, whether English is their ﬁrst language or not, was to live with their parents”.

Xiao says: “No, what you have done does not prove that. It may be true, but your graph does not prove it, no matter what your graph looks like.”

Explain, in no more than three sentences, why Xiao is correct.

1.f. Xiao suspects there is a relationship between gender and political preferences. Create a properly labelled graphical summary of the political orientation of students by gender. The graph should be self contained: someone who would see just the graph should be able to understand it without any extra information. Hence, please include a legend for the graph.

1.g. Describe at least two diﬀerences or similarities in the political preferences among students depending on their gender using your graph from Part 1f.

Your answer should be no more than two sentences.

With the markers in mind, in your assignment, please start every question on a new page.

Q2 WAM by type of high school attended

In this question, we study the WAM by type of high schools the students attended. Note we are still in the exploration phase, so you must still use the dataset which was given to you in its original form. Do NOT modify it at this stage.

2.a. What is the explanatory variable? Is it a categorical or quantitative variable?

2.b. What is the response variable? Is it a categorical or quantitative variable?

2.c. Produce properly labelled comparative (i.e., parallel/side by side) boxplots for WAM by type of school. Your graph should be self contained: someone who would see just the graph should be able to understand it without any extra information.
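As a rough illustration of what self-contained comparative boxplots involve, the sketch below uses simulated data with only two school types; the data frame toy, its values and the labels are made up. The main idea is to turn the numeric codes into a labelled factor so that the axis shows names rather than numbers.

```r
# Simulated data, for illustration only
set.seed(1)
toy <- data.frame(
  WAM        = c(rnorm(20, mean = 72, sd = 6), rnorm(20, mean = 75, sd = 5)),
  highschool = rep(c(1, 2), each = 20)
)

# Replace numeric codes by readable labels
toy$school <- factor(toy$highschool,
                     levels = c(1, 2),
                     labels = c("Public", "Private"))

# Side-by-side boxplots: response on the left of ~, grouping variable on the right
bp <- boxplot(WAM ~ school, data = toy,
              main = "WAM by type of high school (simulated data)",
              xlab = "Type of high school attended",
              ylab = "Weighted Average Mark (WAM)")
```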

2.d. Describe any diﬀerences or similarities in the distribution of WAM among students depending on the type of school they attended, using your comparative boxplots from Part 2c. Include in your answer comments on shape, location and spread. Indicate what aspect of the boxplots each of your conclusion(s) is based on (we need to know what exactly you looked at). Your answer should be no more than four sentences.

2.e. By looking at the comparative (i.e., parallel/side by side) boxplots from Part 2c for WAM by type of school, Xiao notices on average, students from Australian high schools (i.e., high schools which teach the curriculum from New South Wales or Victoria or Queensland . . . etc) seem to be doing better in terms of WAM than students who did their high school in a non-Australian high school (meaning they teach a foreign curriculum).

Identify a confounding variable for the exploration of the WAM by type of school using comparative (i.e., parallel/side by side) boxplots from Part 2c, and explain why it is confounding (make sure you explain why the variable you are mentioning satisfies all the required conditions to be considered a confounding variable according to the definition of a confounding variable).

Your answer should be no more than four sentences.


Q3 Relationship between ATAR and WAM

Admission to Kensington University, as for all universities in New South Wales, is in most cases based on a student’s ATAR. Xiao wonders if the ATAR a student gets at the end of high school is a good predictor of the WAM that student will get later on at university. On the one hand, Xiao thinks it may be, because both the ATAR and the WAM measure academic performance. On the other hand, they suspect that essential differences between high school and university may come into play, as may the fact that the ATAR and the WAM could be calculated using different subjects (some students studied mostly humanities, other students studied mostly STEM subjects). Therefore, rather than speculate, Xiao decides to investigate based on the data at their disposal. As usual, the first step is to explore the data.

3.a. Produce an appropriate graphical summary to visualise the relationship between ATAR and WAM (the goal at this early stage is to explore the data in your dataset so use the raw data you were given to produce your graph. Do NOT alter it in any way). DO NOT include this graph in your assignment, it is just meant to help you answer the question below. The plot does not attract marks.

• Do you notice some problematic observations in the plot you just produced in Part 3a? If so, comment on this in your report.

• Explain why it is justiﬁed to remove these problematic values from the dataset.

Now that you have done the exploration in Part 3a, remove the problematic values from your dataset.

To make sure everything is all right, please check the values calculated from your ﬁle 5474908.txt after removing the problematic rows match the values given below:

• mean WAM after the rows with problematic ATAR values have been removed is equal to 73.3320816

• standard deviation of WAM after the rows with problematic ATAR values have been removed is equal to 6.4430928

They match? It means you are ready to go on!
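Mechanically, removing rows can be done by building a logical condition and storing the result under a new name, so that the original data frame stays available. The sketch below uses made-up values; the rule applied (a reported ATAR must lie between 0 and 100, as in the background information) is only an assumption for illustration, and which rows are actually problematic is for you to judge from your own plot.

```r
# Made-up data, for illustration only
toy <- data.frame(ATAR = c(85.5, 920, NA, 77.0, -3),
                  WAM  = c(70, 65, 72, 80, 58))

# Assumed rule: flag reported ATAR values outside the range 0 to 100
# (rows with a missing ATAR are not flagged here)
bad <- !is.na(toy$ATAR) & (toy$ATAR < 0 | toy$ATAR > 100)

# Keep the original data frame untouched and store the filtered copy separately
cleaned <- toy[!bad, ]

nrow(toy)       # number of rows before removal
nrow(cleaned)   # number of rows after removal
```

Keeping both toy and cleaned makes it easy to switch back to the original data later.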

3.b. Now that you have removed the problematic values, produce an appropriate graphical summary to visualise the relationship between ATAR and WAM. Include this plot in your assignment, properly labelled.

3.c. Summarize the key features of your plot from Part 3b.

Your answer should be no more than three sentences, though one sentence should suffice.

3.d. Calculate $r^2$, the square of the Pearson coefficient of correlation between ATAR and WAM.

3.e. In plain English, interpret the numerical value of $r^2$ you calculated in Part 3d. Your answer should be no more than one sentence.
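One possible way to obtain the squared Pearson correlation in R when some values are missing is sketched below on made-up vectors; x_toy and y_toy are hypothetical names. The argument use = "complete.obs" tells cor() to use only the pairs in which both values are present.

```r
# Made-up paired values with missing entries, for illustration only
x_toy <- c(70, 80, 90, NA, 85, 75)
y_toy <- c(65, 72, 78, 70, NA, 66)

# Pearson correlation computed on complete pairs only
r  <- cor(x_toy, y_toy, use = "complete.obs")

# Square it, keeping r unrounded in a variable
r2 <- r^2
r2
```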

3.f. When Dani took the HSC, they got an ATAR of 86. Their current WAM is 73. Dani wonders how the WAM predicted from their ATAR using the least-squares regression line compares to the WAM they actually have.

Using the least-squares regression line, calculate the predicted WAM for Dani based on their ATAR.

Tip: Use the R function lsfit() (Least Squares ﬁt). You MUST include in your answer the value of the intercept and slope for the least-squares regression line.

3.g. Using the least-squares regression line calculated in Part 3f, calculate the residual corresponding to Dani’s ATAR and WAM.

3.h. Describe what the residual calculated in Part 3g means in a sentence in plain English (no technical terms).
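The sketch below shows how lsfit() output can be turned into a prediction and a residual. The data vectors are made up, so the numbers it produces say nothing about the assignment dataset; only the values 86 and 73 are taken from the question. A residual is the observed value minus the predicted value.

```r
# Made-up data without missing values, for illustration only
atar_toy <- c(70, 75, 80, 85, 90, 95)
wam_toy  <- c(62, 68, 69, 74, 76, 83)

# Least-squares fit of wam_toy (response) on atar_toy (explanatory)
fit <- lsfit(atar_toy, wam_toy)

# The first coefficient is the intercept, the second is the slope
b0 <- unname(fit$coefficients[1])
b1 <- unname(fit$coefficients[2])

# Prediction at x = 86 and residual for an observed y = 73
x_new <- 86
y_obs <- 73
y_hat <- b0 + b1 * x_new
res   <- y_obs - y_hat   # observed minus predicted

c(intercept = b0, slope = b1, predicted = y_hat, residual = res)
```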

Q4. INFERENCE: Two-sided conﬁdence interval for the population WAM

In this question, we will produce a two-sided confidence interval for the population WAM. Unlike the ATAR values we dealt with in Q3, the WAM values have no issues, so there is no reason to remove rows from the original dataset 5474908.txt. Please switch back to the original dataset you were given.

Below are a few values to help you check you are using the correct dataset:

## mean anxiety = 2.096931, mean age = 21.58, mean WAM = 73.08899

4.a. Produce a 92% two-sided confidence interval for $\mu_W$, the true mean WAM. For this question you may assume it is appropriate to use a t-distribution. Make sure you write down all the required steps to calculate this interval.
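The steps of a two-sided t-based interval at an arbitrary confidence level can be sketched as a small helper. The function t_interval() and the vector wam_toy are hypothetical and the values are made up. For a confidence level C, the critical value is the 1 - (1 - C)/2 quantile of the t-distribution with n - 1 degrees of freedom, which for C = 0.92 is the 0.96 quantile.

```r
# Hypothetical helper: two-sided t confidence interval for a mean
t_interval <- function(x, level) {
  x      <- x[!is.na(x)]                         # drop missing values
  n      <- length(x)                            # number of observed values
  tstar  <- qt(1 - (1 - level) / 2, df = n - 1)  # critical value
  margin <- tstar * sd(x) / sqrt(n)              # margin of error
  c(lower = mean(x) - margin, upper = mean(x) + margin)
}

# Made-up values with one missing entry, for illustration only
wam_toy <- c(71.2, 68.5, NA, 75.0, 80.3, 66.9, 73.4)
ci92 <- t_interval(wam_toy, level = 0.92)
ci92
```

The result can be cross-checked against the interval reported by t.test(wam_toy, conf.level = 0.92).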

4.b. Like all realised conﬁdence intervals, your two-sided conﬁdence interval constructed in Part 4a is of the form estimate ± margin of error.

Does the margin of error you calculated include errors due to practical problems like non-response to the survey? (Recall that not all students answered the survey.)

Your answer should be no more than one sentence.

4.c. Dani and Xiao want to mention the above two-sided confidence interval in a post on a university help forum. In their post, they write that there is a 92% probability the true average WAM of students at Kensington University is within . . . . and . . . . (the endpoints of your two-sided confidence interval).

Dani and Xiao have correctly reported the numerical values for the endpoints of the two-sided conﬁdence interval but Dr Lafayellis nevertheless wants them to modify this sentence in the post.

Give at least two reasons why the statement made by Dani and Xiao is inaccurate. (You are not required to correct the statement, you only need to explain what is wrong about it.)

Your answer should be no more than two sentences.