MATH1041 Statistics for Life and Social Science
Term 3, 2019
MATH1041 Assignment
Assignment release date: The assignment will be released to all students on Friday
the 1st of November on Moodle (see “Assessments” section).
Submission due date: Friday 15th November (Week 9) before 2pm (Sydney
time).
Please submit your assignment through Turnitin via Moodle; see the “Assessments Information”
section on Moodle for further information regarding online submission. You
must submit a neatly typed assignment converted to PDF format.
Data: A data set (in text file format) will be sent to you via email at your official
university email address (see page 2 of this document for further details).
Assignment length: No more than SIX single-sided A4 pages including this cover
sheet as the first page. Also, please make sure that you include your name and zID
somewhere in the assignment.
Obtaining the data via email and reading it into RStudio
The data (that is, your data set) are available in a text file with a name similar to:
“z1234567.txt”, (where z1234567 in the text file name is replaced by your unique student
zID number). This text file has been sent to you via email at your official
university email address. PLEASE CHECK YOUR UNIVERSITY EMAILS
REGULARLY TO MAKE SURE THAT YOU HAVE OBTAINED YOUR
DATA SET. Please email Dr Jakub Stoklosa (j.stoklosa@unsw.edu.au) if you haven’t
received your data set yet.
The first step is to read the data into RStudio. The data format is simple and similar
to what you have already done in the Introduction labs. Follow the instructions given in
section R1.4 “How to import a text file into RStudio” of the RStudio “How-To-Manual”
available on Moodle. Once you’ve imported the data, you are ready to start your
analysis!
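As a rough illustration of this import step, the sketch below reads a small text file with base R's read.table(). It is not the procedure from the How-To-Manual, which may differ. The file contents are invented, and the layout (a header row with whitespace-separated columns) is an assumption, so check it against your own file.

```r
# Hypothetical stand-in for the emailed file: the values below are made up
# and written to a temporary folder so that this sketch runs on its own.
example_lines <- c(
  "Height Weight Type Polin",
  "182.4 12.1 0 Wind",
  "201.7 15.8 1 Insect",
  "175.0 9.6 0 Self"
)
path <- file.path(tempdir(), "z1234567.txt")
writeLines(example_lines, path)

# Assumption: whitespace-separated columns with a header row.
plants <- read.table(path, header = TRUE)

str(plants)    # check the variable names and types
head(plants)   # look at the first few rows
```

With your own file, path would instead point to wherever you saved the attachment from the email.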
Computing assignment format
Here are some more details that may assist you:
• Regarding the overall assignment structure, please answer all questions in the given
order (that is, 1(a), 1(b), etc.). You don’t need to re-write the assignment
questions. Keep your answers brief, clear and concise.
• You are required to type up your entire assignment (rather than scanning or taking
screenshots), including any equations. If you are using Word you should use the
equation editor for any maths notation. If you don’t have Word then please use the
School computers, or you can download Word for free; see:
https://student.unsw.edu.au/notices/office
• Please convert your assignment to PDF and submit it in that format.
• We recommend showing some working for questions that involve calculations,
but keep your solutions brief and concise (since there is a page
limit). It’s good practice for the exam and, if you get the wrong answer, your
working may still earn marks. Depending on what the question is asking,
your working could consist of RStudio commands or perhaps the main steps showing how
you arrived at your answer. You don’t need to add all of your R-code!
• Keeping your results to 2 or 3 decimal places should be fine.
• There is no requirement for font size and line spacing but obviously don’t make
things too small.
Scenario
A group of research ecologists were interested in studying the impacts of climate change
on different species of plants that grow in New South Wales, Australia. Some of these
plants are native to Australia while others are non-native (exotic).
To obtain their data, the research team decided to collect a random sample of plants
from a national park. Some measurements were then taken on each plant. The random
sample of data consists of plant height measurements (measured in centimeters), dry
weight measurements (measured in grams), whether the plant was native or non-native
to Australia and the pollination mode of the plant (this could be one of four types: wind,
water, insect and self-pollination).
The text file contains your unique data set of n observations, one per row, on 4
variables: Height which corresponds to plant height, Weight which corresponds to dry
weight of a plant, Type which corresponds to plant type (native = 0 and exotic = 1), and
Polin which corresponds to the pollination mode of the plant (Wind, Water, Insect and
Self).
Your job is to assist the research team by analysing the data set provided to you.
The Analysis Tasks
The questions you need to answer in your assignment submission are given below. Please
make sure your assignment is converted to PDF format.
1. (a) Calculate the sample mean and sample standard deviation of your plant height
(Height) measurements.
(b) Produce a normal quantile plot of your sample of plant height measurements
(see Section R2.6 “How to produce a normal quantile plot using RStudio”).
Include this plot in your submitted assignment, properly labelled.
(c) By referring to the normal quantile plot obtained in Part 1b, briefly discuss whether
the plant heights are approximately normally distributed.
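The commands involved in Question 1 can be sketched as follows. This is only an illustration on simulated heights, not the real data set, and the vector name height is hypothetical. Section R2.6 of the How-To-Manual may present the plot differently.

```r
set.seed(1)
# Simulated heights in cm (NOT real data), used only so the sketch runs.
height <- rnorm(50, mean = 190, sd = 20)

xbar <- mean(height)   # sample mean
s    <- sd(height)     # sample standard deviation (uses the n - 1 divisor)
round(c(mean = xbar, sd = s), 2)

# Normal quantile plot, with a reference line through the quartiles
qqnorm(height,
       main = "Normal quantile plot of plant height",
       ylab = "Height (cm)")
qqline(height)
```

With real data you would replace height by the Height column of your imported data frame.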
2. Let µ be the population mean plant height (in centimeters) in
the national park now (Spring, 2019). The research team decided to compare the
current plant height mean with the mean from 20 years ago using plant height data
obtained from the same national park. The known mean plant height from 20 years
ago was 190 centimeters.
(a) Test the hypothesis that µ is equal to 190 centimeters. You must summarize
all steps: state the null (H0) and alternative hypotheses (Ha) relevant to the
research objectives stated in this scenario, the value of a suitable test statistic,
the sampling distribution for this statistic, a P-value, your summary of
significance and conclusion in plain language.
(b) Some assumptions need to be made for the sampling distribution of the test
statistic (as given in Part 2a) to be valid. State these assumptions, and briefly
discuss whether these assumptions are satisfied.
(c) Produce a 95% confidence interval for µ, the mean plant height. For this question
you may assume that it is appropriate to use a t-distribution. Make sure you
write down all the required steps to calculate this interval.
(d) Does your confidence interval (constructed in Part 2c) include the value 190
centimeters?
(e) Explain whether your confidence interval (constructed in Part 2c) is consistent
with your conclusions from the hypothesis test in Part 2a.
(f) Next, produce a 90% and a 99% confidence interval for µ, the mean plant
height. Again, for this question you may assume that it is appropriate to
use a t-distribution. You don’t need to write down all the required steps to
calculate these intervals; reporting the values is fine.
(g) Briefly comment on how these confidence intervals compare with the confidence
interval you calculated in Part 2c.
(h) Other than changing the confidence level, what two other quantities could we
change to decrease the length of a confidence interval?
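The calculations behind a one-sample t-test and t-based confidence intervals can be sketched as below. The heights are simulated rather than real, the object names are hypothetical, and a two-sided alternative is assumed purely for illustration. Choosing the alternative hypothesis that matches the research objectives remains part of Question 2a.

```r
set.seed(2)
height <- rnorm(40, mean = 185, sd = 25)   # simulated heights, NOT real data

mu0  <- 190                     # mean from 20 years ago, as in the scenario
n    <- length(height)
xbar <- mean(height)
se   <- sd(height) / sqrt(n)    # standard error of the sample mean

# Test statistic, compared with a t-distribution on n - 1 degrees of freedom
t_stat  <- (xbar - mu0) / se
p_value <- 2 * pt(-abs(t_stat), df = n - 1)   # assumes a two-sided alternative

# 95% interval: xbar +/- t* times se, with t* the 0.975 quantile of t(n - 1)
t_star <- qt(0.975, df = n - 1)
ci95   <- c(xbar - t_star * se, xbar + t_star * se)

# Cross-check against the built-in function
fit <- t.test(height, mu = mu0, conf.level = 0.95)

# Intervals at other confidence levels
ci90 <- as.numeric(t.test(height, mu = mu0, conf.level = 0.90)$conf.int)
ci99 <- as.numeric(t.test(height, mu = mu0, conf.level = 0.99)$conf.int)
```

The manual steps and t.test() should give the same statistic, P-value and 95% interval, so comparing the two is a useful check on your working.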
3. The research team were also interested in studying the relationship between:
• Plant type and height
(a) Produce a comparative boxplot for plant type against height. Include this plot
in your submitted assignment, properly labelled.
(b) Describe any differences or similarities in the distribution of plant height for the
different types (native or exotic) using your comparative boxplot from Part 3a.
Include in your answer comments on shape, location, and spread.
• Plant type and pollination mode
(c) Construct an appropriate numerical summary for the plant type and pollination
mode.
(d) Briefly describe any differences or similarities of plant type and pollination
mode from your numerical summary from Part 3c.
• Plant height and weight
(e) Construct an appropriate graphical summary to visualize the relationship between
plant height and weight. Include this plot in your assignment, properly
labelled.
(f) Summarize the key features of your plot from Part 3e.
(g) Suggest an appropriate numerical summary to quantify the strength of the linear
relationship between plant weight and height. Report and briefly comment
on this value.
(h) The research team wanted to predict plant weight from plant height measurements
by fitting a linear regression model. Would you recommend the research
team do this? Explain briefly. You are not required to carry out any prediction
in this question.
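The sketch below shows R commands that could produce the kinds of summaries mentioned in Question 3. The data frame is simulated, not real, and the derived column TypeF is a hypothetical name. The summaries shown are one possible choice for each pair of variables rather than the required answer.

```r
set.seed(3)
n <- 60
# Simulated data frame using the four variable names from the scenario
plants <- data.frame(
  Height = rnorm(n, mean = 190, sd = 20),
  Type   = rep(c(0, 1), times = c(35, 25)),
  Polin  = sample(c("Wind", "Water", "Insect", "Self"), n, replace = TRUE)
)
plants$Weight <- exp(0.01 * plants$Height + rnorm(n, sd = 0.3))

# Label the 0/1 coding so that plots and tables are easier to read
plants$TypeF <- factor(plants$Type, levels = c(0, 1),
                       labels = c("Native", "Exotic"))

# Comparative boxplot of height by plant type
boxplot(Height ~ TypeF, data = plants,
        xlab = "Plant type", ylab = "Height (cm)",
        main = "Plant height by type")

# Two-way table of counts, and proportions within each plant type
counts <- table(plants$TypeF, plants$Polin)
props  <- prop.table(counts, margin = 1)
round(props, 2)

# Scatterplot and sample correlation for height and weight
plot(Weight ~ Height, data = plants,
     xlab = "Height (cm)", ylab = "Dry weight (g)",
     main = "Dry weight against height")
r <- cor(plants$Height, plants$Weight)
```

Because the data here are simulated, any pattern in these plots and tables says nothing about what your own data set will show.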
4. The research team decided to investigate the plant weight (Weight) measurement
in more detail.
(a) Produce a five number summary for the Weight measurements.
(b) Using the appropriate measure found in Part 4a, comment on the location of
the Weight measurements.
(c) Produce a histogram for the Weight measurements. Include this histogram in
your submitted assignment, properly labelled.
(d) Comment on the shape (skewness/symmetry) of your histogram from Part 4c.
(e) A common technique that can be used to remove skewness in data is known as
a log-transformation. That is, for each value in your data (denoted by x_i), you
can log-transform it as y_i = log(x_i). The function in RStudio that performs a
log-transformation on a value is log().
Produce a histogram for the log(Weight) measurements. Include this new
histogram in your submitted assignment, properly labelled.
(f) Again, comment on the shape (skewness/symmetry) of your histogram from
Part 4e.
(g) Do you think this log-transformation reduced any skewness? Explain briefly.
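The commands for Question 4 can be sketched as follows, again on simulated weights rather than the real data, with hypothetical object names. Note that fivenum() reports hinges, which can differ slightly from the quartiles given by summary().

```r
set.seed(4)
weight <- rlnorm(80, meanlog = 2.5, sdlog = 0.6)   # simulated, NOT real data

five <- fivenum(weight)   # minimum, lower hinge, median, upper hinge, maximum
five
summary(weight)           # quartile-based version, which also gives the mean

hist(weight,
     xlab = "Dry weight (g)", main = "Histogram of dry weight")

log_weight <- log(weight)   # log() is the natural logarithm by default
hist(log_weight,
     xlab = "log(dry weight)", main = "Histogram of log dry weight")
```

Since log() works element by element, applying it to the whole Weight column transforms every x_i in one step.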