STAT 231 Assignment 3: What Are You Waiting For?
Due: 11am Eastern on Friday November 4
Total marks: 50
Please review the information on Page 1 of Assignment 1 for full details on how to submit your
assignment. As a reminder, to complete this assignment you must:
• Upload your typed/computer-generated Assignment 3 Report file as a PDF to Crowdmark.
• Upload your Assignment 3 R code file as a .R file to the Assignment 3 R Code File dropbox.
As with Assignments 1 and 2, your Report must be typeset, and your R code file should generate all
the results presented in your Report.
If you are unsure how to format an answer, please check the Layout Lowdown on pages 5-8! There
are also template files available on LEARN for this assignment - you are not required to follow these
templates, but you may if you wish!
What’s this assignment about?
This assignment covers the material up to and including Chapter 4, with a focus on interval estimation
techniques. We will seek to model the tweet.gap variate, which measures the time (or ‘gap’) between
the publication of tweets. More precisely, for a particular tweet, tweet.gap gives the number of
seconds since the user’s previous tweet was published.
Data about how often a user interacts with a website, service, or product are valuable for a
variety of reasons. The regularity and reliability with which users return (sometimes referred to as
‘stickiness’) is a key metric to assess product performance, as well as for testing the effectiveness of
new features and initiatives.
In addition to providing insights into how often users post tweets, the variate tweet.gap also provides
an opportunity to explore some challenges commonly encountered in real-world data analysis. Many
of you will find that tweet.gap contains some particularly large values, as a result of users not
tweeting for several days, or even weeks. When working with real-world data it is common to
encounter unusual behaviour such as this, which can make finding a suitable statistical model difficult.
In this assignment we will explore two approaches for modelling data with unusual distributions. One
of these is to consider a subset of the data, narrowing the focus of our research question in order to
facilitate meaningful analysis. The other is data transformation, which we have used previously (such
as in taking logs of the likes variate) and will now extend to other, more complex transformation
procedures.
Before we begin
For the purposes of this assignment, the study population is defined as the set of tweets in the
primary dataset from which you downloaded your sample at the start of term.
In this analysis we will include all of the data in your Twitter dataset (that is, all five accounts).
You may find it interesting to re-run your analyses on your personal and organizational accounts
separately, while thinking about why we might expect these accounts to have different distributions
for this variate.
Because tweet.gap is measured in seconds, we will convert this to hours to make it easier to
interpret our results. You should create the variate tweet.gap.hour, just like how we created
time.of.day.hour in Assignment 1.
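The conversion can be sketched as follows. This is a minimal illustration only: the data frame below is made-up toy data standing in for your own dataset, and the name mydata is assumed to match the data frame used elsewhere in this assignment.

```r
# Toy data standing in for the Twitter dataset (values are invented).
mydata <- data.frame(tweet.gap = c(1800, 7200, 90000))  # gaps in seconds

# There are 3600 seconds in an hour.
mydata$tweet.gap.hour <- mydata$tweet.gap / 3600

mydata$tweet.gap.hour
```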
Analysis 1: Time Between Tweets and an Exponential Model
In Analyses 1 and 2 we will be exploring the distribution of tweet.gap.hour for tweets that are not
the first tweet of the day. In the following, we refer to two sets of tweets denoted Tweet Set A and
Tweet Set B as follows:
Tweet Set A: All tweets in your dataset.
Tweet Set B: Just tweets that are not the first tweet of the day. Note that these are the tweets
for which first.tweet equals 0.
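One possible way to build the two Tweet Sets and collect their summary statistics is sketched below. The data are invented toy values, and summarise.gap is a hypothetical helper name introduced only for this illustration; your own code may be organised differently.

```r
# Toy data (invented) with the two columns this step relies on.
mydata <- data.frame(
  tweet.gap.hour = c(30.0, 0.5, 2.0, 14.0, 1.5),
  first.tweet    = c(1, 0, 0, 1, 0)
)

# Tweet Set A: all tweets. Tweet Set B: tweets with first.tweet equal to 0.
set.A <- mydata$tweet.gap.hour
set.B <- mydata$tweet.gap.hour[mydata$first.tweet == 0]

# Hypothetical helper returning the summaries of interest for one set.
summarise.gap <- function(y) {
  c(n = length(y), mean = mean(y), median = median(y),
    min = min(y), max = max(y), sd = sd(y))
}

# One row per Tweet Set, rounded for display.
round(rbind(A = summarise.gap(set.A), B = summarise.gap(set.B)), 3)
```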
1a. [0.25 marks] To facilitate grading, please provide your 8-digit student ID number.
1b. [2 marks] Do you have any concerns about measurement error in the first.tweet variate?
Briefly explain why or why not.
1c. [2 marks] State the sample size, and calculate the sample mean, sample median, sample
minimum, sample maximum, and sample standard deviation of tweet.gap.hour for Tweet Set A
and Tweet Set B. Display these values in a table in your Report.
1d. [1 mark] Briefly explain why the maximum value of tweet.gap.hour for Tweet Set B should not
be greater than 24. Note: This question is not asking you to simply verify that the maximum
calculated in Analysis 1c is not larger than 24; your answer should explain why, based on how
Tweet Set B is constructed, it should not contain a value larger than 24 for any possible sample.
1e. [4 marks] Generate a relative frequency histogram and an empirical cumulative distribution
function plot of the variate tweet.gap.hour for each of Tweet Set A and Tweet Set B (that
is, you should include a total of four plots, two for each Tweet Set). All plots should feature
a suitable superimposed Exponential probability density or cumulative distribution function
curve. Hint: You may wish to use par(mfrow = c(2, 2)) so that your plots are displayed in
a single image.
1f. [7 marks] For each of Tweet Set A and Tweet Set B, discuss how well an Exponential model
fits the data. Your answer should explain what you would expect to observe if the data were
generated from an Exponential distribution, and compare this with what you observe in your
sample. You should make at least three comparisons (of what you would expect, and what you
observe) for each of Tweet Set A and Tweet Set B, and include an overall conclusion on which
of Tweet Set A and Tweet Set B the Exponential model appears to fit better.
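As a rough sketch of the kind of plots and comparisons involved, the code below uses simulated data (not real tweets) and fits the Exponential curve using the sample mean as the estimate of θ. Note that R's dexp() and pexp() are parametrized by the rate, which is 1/θ in our notation. The numerical comparisons use two standard facts about an Exponential(θ) distribution: its mean and standard deviation both equal θ, and its median equals θ log(2).

```r
set.seed(231)
y <- rexp(200, rate = 1 / 3)   # simulated gaps with mean 3 hours; NOT real data

theta.hat <- mean(y)           # estimate of the Exponential mean

par(mfrow = c(1, 2))

# Relative frequency histogram (freq = FALSE gives a density scale),
# with the fitted Exponential probability density function superimposed.
hist(y, freq = FALSE, main = "Histogram", xlab = "tweet.gap.hour")
curve(dexp(x, rate = 1 / theta.hat), add = TRUE, col = "red", lwd = 2)

# Empirical c.d.f. with the fitted Exponential c.d.f. superimposed.
plot(ecdf(y), main = "e.c.d.f.", xlab = "tweet.gap.hour")
curve(pexp(x, rate = 1 / theta.hat), add = TRUE, col = "red", lwd = 2)

# Numerical comparisons: what the model implies versus what is observed.
c(model.sd = theta.hat, sample.sd = sd(y))
c(model.median = theta.hat * log(2), sample.median = median(y))
```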
Analysis 2: Interval Estimation Using an Exponential Model
In this analysis we will use an Exponential model to describe the time between tweets that were
not the first tweet of the day. Note that, regardless of your conclusion in Analysis 1f, you should
complete Analysis 2 using Tweet Set B.
Let Y ~ Exponential(θ) denote the value of tweet.gap.hour for a randomly chosen tweet from the
study population that was not the first tweet of the day. You are reminded that in our notation
E[Y] = θ.
2a. [0.25 marks] To facilitate grading, please provide your 8-digit student ID number.
2b. [1 mark] What is the maximum likelihood estimate of θ based on your sample?
2c. [3 marks] Generate a plot of R(θ), the relative likelihood function for θ based on your sample
and the assumed Exponential(θ) model. Your plot should include a horizontal line that could
be used to identify the 15% likelihood interval for θ.
2d. [2 marks] Using uniroot() or uniroot.all(), calculate the 15% likelihood interval for θ.
Give your answer to four decimal places.
2e. [3 marks] Calculate approximate 15%, 95%, and 99% confidence intervals for θ based on a
Central Limit Theorem approximation. Your Report should include an explanation of how this
was calculated, which may be expressed algebraically or, if you wish, by including the relevant
R command(s).
2f. [2 marks] Which of the confidence intervals you calculated in Analysis 2e is most similar to the
15% likelihood interval found in Analysis 2d? Is this what you would expect? Briefly explain
why or why not.
2g. [3 marks] Write 1-2 sentences that explain what the 95% confidence interval calculated in
Analysis 2e means in the context of the study. Note: your answer should relate your interval
to the real-world question under consideration, and not simply be written in terms of θ.
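The calculations in this analysis can be sketched as follows, using simulated data in place of your sample. The function names rel.lik and clt.ci are hypothetical and introduced only for this illustration. The sketch assumes the Exponential relative likelihood R(θ) = (θ̂/θ)^n exp(n(1 − θ̂/θ)) with θ̂ equal to the sample mean, and one common form of the Central Limit Theorem interval, θ̂ ± z θ̂/√n; check both against the Course Notes before relying on them.

```r
set.seed(231)
y <- rexp(50, rate = 1 / 2)    # simulated gaps with mean 2 hours; NOT real data
n <- length(y)
theta.hat <- mean(y)           # estimate of theta under the Exponential model

# Relative likelihood function, computed via logs for numerical stability.
rel.lik <- function(theta) {
  exp(n * log(theta.hat / theta) + n * (1 - theta.hat / theta))
}

# Plot R(theta) with a horizontal line at 0.15.
curve(rel.lik, from = 0.5 * theta.hat, to = 2 * theta.hat,
      xlab = expression(theta), ylab = expression(R(theta)))
abline(h = 0.15, lty = 2)

# Endpoints of the 15% likelihood interval: roots of R(theta) - 0.15 on
# either side of theta.hat. A small tol is set because the default
# tolerance of uniroot() is too coarse for four decimal places.
g <- function(theta) rel.lik(theta) - 0.15
lower <- uniroot(g, lower = 0.5 * theta.hat, upper = theta.hat, tol = 1e-8)$root
upper <- uniroot(g, lower = theta.hat, upper = 2 * theta.hat, tol = 1e-8)$root
round(c(lower, upper), 4)

# Approximate 100p% confidence interval based on the Central Limit Theorem.
clt.ci <- function(p) {
  z <- qnorm((1 + p) / 2)
  theta.hat + c(-1, 1) * z * theta.hat / sqrt(n)
}
clt.ci(0.95)
```

The search ranges passed to uniroot() were chosen by looking at the plot; with a different sample you may need to widen them so that R(θ) − 0.15 changes sign within each range.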
Analysis 3: Time Between Tweets and a Gaussian Model
In Analyses 3 and 4 we will be exploring the distribution of tweet.gap.hour for tweets that are
the first tweet of the day. We will exclude tweets that were published more than 24 hours after the
preceding tweet (think about why we might wish to do this). You can create this subset of tweets as
follows:
> tgh.first <- mydata$tweet.gap.hour[mydata$first.tweet == 1 & mydata$tweet.gap.hour <= 24]
Note: We have called the variate tgh.first as shorthand for ‘tweet gap hour first tweets’; you are
welcome to use your own choice of naming convention!
The data in tgh.first are therefore the times between the first tweet sent on a particular day, and
the last tweet sent the preceding day. Hint: Run summary(tgh.first) and check the results make
sense based on how we have defined this variate.
We will explore various transformations of the variate in an attempt to facilitate the use of a Gaussian
model. In particular, we will consider the following three transformations, which we first define in
general terms for data y1, y2, ..., yn, recalling that y(n) denotes the maximum value in our sample.
• Square Root: si = √((y(n) − yi) + 1)
• Log: li = log((y(n) − yi) + 1)
• Reciprocal: ri = 1 / ((y(n) − yi) + 1)
You should generate three new variates as follows (where, again, you are welcome to use your own
naming conventions):
# Square Root
> tf1 <- sqrt(max(tgh.first) - tgh.first + 1)
# Log
> tf2 <- log(max(tgh.first) - tgh.first + 1)
# Reciprocal
> tf3 <- 1/(max(tgh.first) - tgh.first + 1)
We will refer to the non-transformed data as the ‘Original’ data.
3a. [0.25 marks] To facilitate grading, please provide your 8-digit student ID number.
3b. [4 marks] Generate a relative frequency histogram or an empirical cumulative distribution
function plot of the Original, Square Root, Log, and Reciprocal transformations of the variate
defined above as tgh.first. All four plots should be of the same type (that is, your Report
should contain four histograms, or four e.c.d.f. plots). All four plots should feature a suitable
superimposed Gaussian probability density or cumulative distribution function curve. Hint:
You may wish to use par(mfrow = c(2, 2)) as you did in Analysis 1e.
3c. [2 marks] Which of the Square Root, Log, or Reciprocal transformations leads to the best fit of
a Gaussian model? Briefly justify your answer in 1-2 sentences. It is sufficient to refer only to
your results in Analysis 3b, but if you wish to carry out additional analyses you are welcome
to. Note that even if you believe the original dataset exhibits the best fit, you must choose one
of the three transformation options detailed above.
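A possible way to produce the four histograms with superimposed Gaussian curves is sketched below. The values of tgh.first here are simulated purely so that the code runs on its own; replace them with your own data. Each Gaussian curve uses the sample mean and sample standard deviation of the variate being plotted.

```r
set.seed(231)
tgh.first <- runif(150, min = 4, max = 24)  # simulated stand-in; NOT real data

# The Original variate and the three transformations defined above.
tf <- list(
  "Original"    = tgh.first,
  "Square Root" = sqrt(max(tgh.first) - tgh.first + 1),
  "Log"         = log(max(tgh.first) - tgh.first + 1),
  "Reciprocal"  = 1 / (max(tgh.first) - tgh.first + 1)
)

par(mfrow = c(2, 2))
for (name in names(tf)) {
  v <- tf[[name]]
  # Relative frequency histogram on the density scale.
  hist(v, freq = FALSE, main = name, xlab = name)
  # Gaussian p.d.f. with mean and standard deviation estimated from v.
  curve(dnorm(x, mean = mean(v), sd = sd(v)), add = TRUE, col = "red", lwd = 2)
}
```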
Analysis 4: Interval Estimation Using a Gaussian Model
In our final analysis, we will use the transformed variate chosen in Analysis 3c. Let X ~ G(μ, σ)
denote the value of the transformed variate for a randomly chosen tweet from the study population.
Note that all questions in this analysis should be conducted using the transformed variate you chose
in Analysis 3c.
4a. [0.25 marks] To facilitate grading, please provide your 8-digit student ID number, and write
down the name of the transformation you chose in Analysis 3c (that is, Square Root, Log, or
Reciprocal).
4b. [1 mark] State the sample size, and calculate the sample mean and sample standard deviation
for your transformed variate.
4c. [3 marks] Calculate a 95% confidence interval or approximate confidence interval for μ based
on your sample. (You should decide which is the appropriate confidence interval to calculate.)
Your Report should include an explanation of how this was calculated, which may be expressed
algebraically or, if you wish, by including the relevant R command(s).
4d. [1 mark] Is the confidence interval you calculated in Analysis 4c exact or approximate? Briefly
justify your answer. (Note: this question concerns whether the interval is theoretically exact
or approximate; your answer should not discuss numerical matters such as rounding or
approximations used within R itself.) You may cite results in the Course Notes without proof.
4e. [3 marks] Write 1-2 sentences that explain what the interval calculated in Analysis 4c means
in the context of the study. Note: your answer should relate your interval to the real-world
question under consideration, and not simply be written in terms of μ. Note: Do not transform
your interval back to the original scale on which tweet.gap.hour is measured.
4f. [3 marks] Calculate a 95% confidence interval for σ based on your sample. Your Report should
include an explanation of how this was calculated, which may be expressed algebraically or, if
you wish, by including the relevant R command(s).
4g. [2 marks] You are told that Alex, another STAT 231 student, has a sample which contains
considerably fewer tweets than your sample. Would the interval Alex calculated in Analysis
4f be narrower, wider, or about the same width as the interval you calculated in Analysis 4f?
Justify your answer in 1-2 sentences.
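For reference, interval calculations under a Gaussian model can be sketched as below, using simulated data in place of your transformed variate. The sketch assumes that both μ and σ are unknown, and uses one standard choice of pivotal quantities: a t distribution with n − 1 degrees of freedom for μ, and a Chi-squared distribution with n − 1 degrees of freedom (with equal tail probabilities) for σ. You should still decide for yourself which interval is appropriate for your data.

```r
set.seed(231)
x <- rnorm(40, mean = 2, sd = 0.5)   # simulated stand-in; NOT real data

n    <- length(x)
xbar <- mean(x)
s    <- sd(x)

# 95% confidence interval for mu using the t distribution with n - 1 d.f.
a <- qt(0.975, df = n - 1)
mu.ci <- xbar + c(-1, 1) * a * s / sqrt(n)

# 95% confidence interval for sigma using (n - 1) S^2 / sigma^2, which has
# a Chi-squared distribution with n - 1 d.f. under the Gaussian model.
# The larger quantile gives the lower endpoint.
sigma.ci <- sqrt((n - 1) * s^2 / qchisq(c(0.975, 0.025), df = n - 1))

mu.ci
sigma.ci
```

The interval for μ can be cross-checked against t.test(x)$conf.int, which uses the same t-based calculation.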