Estimate the mean and variance of the Height variable using this sample. Are the
sample mean and the sample variance equal to the population mean and population
variance? Why/why not?
12. (5 points) Using the sample drawn before, we want to give a 95% confidence interval[1]
for the population mean of the height using the formula
$$\left[\overline{\text{height}} - t_{1-\alpha/2,\,n-1}\sqrt{\operatorname{var}(\text{height})/n},\;\; \overline{\text{height}} + t_{1-\alpha/2,\,n-1}\sqrt{\operatorname{var}(\text{height})/n}\right],$$
where $\overline{\text{height}}$ is the sample mean, var(height) is the sample variance, n is the effective
sample size and $t_{1-\alpha/2,\,n-1}$ is the quantile of order 1 − α/2 of the Student (or t)
distribution with n − 1 degrees of freedom. Here, α = 0.05. Is it possible to construct
such a 95% confidence interval here (check the assumptions of its construction)? If so,
compute this confidence interval and interpret it. Usually we do not know
the population mean, but in our case its value is known. Verify whether the population
mean lies inside the constructed confidence interval.
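The interval in item 12 can be computed term by term with qt(). The following is a minimal sketch, assuming the sample is stored in a numeric vector called height; the simulated values are placeholders and not the assignment's data. The t interval also presumes approximately normal data (or a large sample), which should be checked separately.

```r
set.seed(1)
# Placeholder sample: NOT the assignment's data, for illustration only
height <- rnorm(50, mean = 170, sd = 10)

alpha <- 0.05
n     <- length(height)                  # sample size
m     <- mean(height)                    # sample mean
se    <- sqrt(var(height) / n)           # estimated standard error of the mean
tq    <- qt(1 - alpha / 2, df = n - 1)   # t quantile of order 1 - alpha/2

ci <- c(lower = m - tq * se, upper = m + tq * se)
print(ci)

# Cross-check: t.test() reports the same interval
print(t.test(height, conf.level = 1 - alpha)$conf.int)
```

Checking whether a known population mean mu lies inside the interval is then a matter of evaluating ci["lower"] <= mu && mu <= ci["upper"].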
Exercise 2 (25 points)
1. (5 points) The data frame Cars93 (from the MASS package) holds extensive information
on 93 cars on sale in the USA in 1993 (see the help file for data and
variable description). The MASS package is already downloaded; you should load it
in your R session. One of the variables, stored as a factor, is Type. Create a new
data frame called myCars93, in which the row names are the distinct values of Type.
This data frame contains 3 columns: the mean of Min.Price, the mean of
Max.Price, and a third column holding a one- or two-character abbreviation of each
car type. Print this new data frame. It should look like this:
        mean.min.price mean.max.price abbreviation
Compact          15.69           20.7            c
Large            22.94           25.7            l
Midsize          24.11           30.3            m
Small             8.43           11.9            s
Sporty           16.86           22.0           sp
Van              16.20           22.0            v
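One way to build such a data frame is to compute the group means with tapply() and assemble them with data.frame(). This is a sketch rather than the expected solution; the abbreviations are copied from the table above, and the printed number of digits may differ from it.

```r
library(MASS)   # provides the Cars93 data frame

# Group means of the two price variables, one value per level of Type
mean.min.price <- tapply(Cars93$Min.Price, Cars93$Type, mean)
mean.max.price <- tapply(Cars93$Max.Price, Cars93$Type, mean)

myCars93 <- data.frame(
  mean.min.price = as.vector(mean.min.price),
  mean.max.price = as.vector(mean.max.price),
  abbreviation   = c("c", "l", "m", "s", "sp", "v"),  # same order as levels(Cars93$Type)
  row.names      = levels(Cars93$Type)
)
print(myCars93)
```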
2. (5 points) The normal QQ-plot is a fancy way of checking whether a distribution looks
normal. A more primitive check is the rule of thumb that approximately
68% of the data lie within 1 standard deviation of the mean, 95% within 2 standard
deviations and 99.7% within 3 standard deviations. Generate 100 random values
from the Student distribution with 25 degrees of freedom. Are your data consistent
with the normal distribution? Check this using a QQ-plot and the rule given above.
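Both checks from item 2 can be sketched in a few lines. The variable names (x, within_k) are illustrative, and the seed is fixed only to make the run reproducible.

```r
set.seed(42)
x <- rt(100, df = 25)   # 100 draws from Student's t with 25 degrees of freedom

# Normal QQ-plot with a reference line through the quartiles
qqnorm(x)
qqline(x)

# Proportion of observations within k sample standard deviations of the sample mean
within_k <- sapply(1:3, function(k) mean(abs(x - mean(x)) <= k * sd(x)))
names(within_k) <- paste0("within_", 1:3, "_sd")
print(round(within_k, 3))   # compare with the percentages in the rule of thumb
```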
[1] The meaning of the term "95% confidence level" is that, if confidence intervals are constructed across
many samples drawn from the same population and under the same conditions, the proportion of such intervals
that contain the true value of the parameter (the population mean) will match the confidence level 95%.
3. (5 points) Depending on the type of data, there are advantages to using the mean or
the median.
(a) Draw 200 random numbers from the Student distribution with two
degrees of freedom, and repeat this 100 times. Compute the mean and the median
each time, so that you obtain 100 means and 100 medians.
Plot the boxplot of all means and the boxplot of all medians in the same layout.
(b) Repeat the same computation (100 repetitions of 200 random numbers) for the
N(0, 1) distribution.
Based on these two cases, explain the advantages of using the mean or the median.
In each case, also provide a measure of the dispersion of the data.
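The simulation in (a) and (b) can be sketched with replicate(). The standard deviation and the interquartile range are used here as dispersion measures; that choice is an assumption, since the exercise leaves the measure open.

```r
set.seed(123)
n_rep <- 100   # number of repetitions
n_obs <- 200   # sample size in each repetition

# (a) Student t with 2 degrees of freedom (heavy tails)
t_samples <- replicate(n_rep, rt(n_obs, df = 2))   # n_obs x n_rep matrix
t_means   <- colMeans(t_samples)
t_medians <- apply(t_samples, 2, median)

# (b) Standard normal
z_samples <- replicate(n_rep, rnorm(n_obs))
z_means   <- colMeans(z_samples)
z_medians <- apply(z_samples, 2, median)

# Boxplots of the means and medians, one panel per distribution
op <- par(mfrow = c(1, 2))
boxplot(list(mean = t_means, median = t_medians), main = "t, 2 df")
boxplot(list(mean = z_means, median = z_medians), main = "N(0, 1)")
par(op)

# Dispersion of each estimator across the repetitions
dispersion <- rbind(
  sd  = c(t_mean = sd(t_means), t_median = sd(t_medians),
          z_mean = sd(z_means), z_median = sd(z_medians)),
  IQR = c(IQR(t_means), IQR(t_medians), IQR(z_means), IQR(z_medians))
)
print(round(dispersion, 3))
```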
4. (10 points) The data set 'experiment.csv' contains the results of a
test assessing the level of proficiency in a foreign language. The participants were
informed that they were going to watch 6 scenes showing the events of a dramatic
evening, and they were asked to tell the story immediately after each scene. Afterwards,
the responses of the participants were coded with respect to the presence
('yes/no') of some adverbs (variable 'adverb'). The variable 'group' gives the participants'
level of proficiency in the foreign language (4 groups: 'GL1', 'GL2-B',
'GL2-C', 'FL1'). The participant id is given by the variable 'speaker'. The data set
contains 6 records (one for each scene) for each participant, so the observations are
not independent. We want to study whether, on average, the groups with
different levels of proficiency differ in their use of the adverbs. Propose a statistical
test, check its assumptions and perform it at the 5% level. Comment on your result.
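One possible approach (not necessarily the intended one) is to remove the dependence by collapsing the 6 records of each speaker into a single proportion of 'yes' answers, and then to compare the groups with a Kruskal-Wallis test. The sketch below runs on simulated data with the column names given in the exercise; the number of speakers per group is an assumption.

```r
set.seed(7)
# Simulated stand-in for experiment.csv (NOT the real data): 40 speakers, 6 scenes each
groups   <- c("GL1", "GL2-B", "GL2-C", "FL1")
speakers <- data.frame(speaker = 1:40,
                       group   = factor(rep(groups, each = 10), levels = groups))
dat <- speakers[rep(seq_len(nrow(speakers)), each = 6), ]
dat$adverb <- sample(c("yes", "no"), nrow(dat), replace = TRUE)

# Collapse the 6 dependent records of each speaker into one proportion of "yes"
dat$yes <- as.numeric(dat$adverb == "yes")
per_speaker <- aggregate(yes ~ speaker + group, data = dat, FUN = mean)

# One independent observation per speaker: compare the four groups
boxplot(yes ~ group, data = per_speaker, ylab = "proportion of 'yes'")
kw <- kruskal.test(yes ~ group, data = per_speaker)
print(kw)
```

A one-way ANOVA on the per-speaker proportions is an alternative if its normality and equal-variance assumptions look reasonable.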
Exercise 3 (15 points)
1. (5 points) Write an R function to compute the sum of the maxima of pairs of independent
N(0, 1) random variables X and Y. To do this, generate n pairs (X, Y), compute
the maximum of each pair, and sum these maxima. The integer n should be the argument
of your function. Call this function for n = 100 and n = 1000.
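A compact version of such a function uses pmax() for the element-wise maximum. The function name sum_of_max is illustrative.

```r
# Sum of max(X, Y) over n independent pairs of N(0, 1) variables
sum_of_max <- function(n) {
  x <- rnorm(n)
  y <- rnorm(n)
  sum(pmax(x, y))   # pmax() takes the element-wise maximum
}

set.seed(2024)
print(sum_of_max(100))
print(sum_of_max(1000))
```

Since E[max(X, Y)] = 1/sqrt(pi) for independent standard normal variables, the result should be close to n/sqrt(pi), that is roughly 0.564 n.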
2. (5 points) Suppose we have a matrix of 1s and 0s and want to create a vector as
follows: for each row of the matrix, the corresponding element of the vector will be
either 1 or 0, depending on whether the majority of the first d elements in that row
is 1 or 0. Write a function to create this vector. The function has as arguments the
matrix and d. Call this function for the following matrix x and d = 3:
For this input, your function should return the vector 1, 1, 0.
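A vectorised sketch of such a function is shown below. The matrix x used here is hypothetical (it is not the matrix from the exercise) and was chosen so that the call also returns 1, 1, 0. Ties, which can only occur for even d, are mapped to 0 by assumption.

```r
majority <- function(x, d) {
  # Share of 1s among the first d columns of each row;
  # drop = FALSE keeps a matrix even when d = 1
  share <- rowMeans(x[, seq_len(d), drop = FALSE])
  as.integer(share > 0.5)
}

# Hypothetical 3 x 5 matrix, NOT the one from the exercise
x <- matrix(c(1, 1, 1, 0, 0,
              0, 1, 1, 1, 0,
              0, 0, 1, 0, 1),
            nrow = 3, byrow = TRUE)
print(majority(x, 3))
```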
3. (5 points) Write a function to visualize the approximation of the binomial distribution
by the normal distribution (plot the probability mass function of the binomial
distribution and add the density of the normal distribution). The approximation
generally improves as n increases (n should be at least 20) and is better when p is not close to 0
or 1. The function takes as parameters the sample size n and the success probability
p of the binomial distribution. The normal distribution will have mean np and
standard deviation $\sqrt{np(1-p)}$. Use this function for different sample sizes n
and different p to see the effect on the approximation. Take:
• n = 10, 30, 50, 100 and p = 0.25 in the same layout;
• n = 10, 30, 50, 100 and p = 0.75 in the same layout;
• n = 10, 30, 50, 100 and p = 0.5 in the same layout.
What do you observe?
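A possible implementation draws the binomial probabilities as vertical bars and overlays the normal density with curve(). The function name plot_binom_normal and the plotting details are illustrative choices.

```r
# Binomial(n, p) probability mass function with the N(np, np(1 - p)) density on top
plot_binom_normal <- function(n, p) {
  k   <- 0:n
  mu  <- n * p
  s   <- sqrt(n * p * (1 - p))
  pmf <- dbinom(k, size = n, prob = p)
  plot(k, pmf, type = "h", ylim = c(0, max(pmf, dnorm(mu, mu, s))),
       xlab = "k", ylab = "probability / density",
       main = paste0("n = ", n, ", p = ", p))
  curve(dnorm(x, mean = mu, sd = s), add = TRUE, col = "red")
  invisible(c(mean = mu, sd = s))   # parameters of the approximating normal
}

# One 2 x 2 layout per value of p, with the four sample sizes side by side
for (p in c(0.25, 0.75, 0.5)) {
  op <- par(mfrow = c(2, 2))
  for (n in c(10, 30, 50, 100)) plot_binom_normal(n, p)
  par(op)
}
```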
Exercise 4 (30 points)
Read the file study.dat into R. The file contains a header. The columns are not in fixed
width but delimited by a comma (sep=","); open the file first to check this. The data
set contains measurements obtained in an epidemiological study. Some of the variables that were
measured are:
• BMI: Body Mass Index (kg/m^2)
• SMOKING: 0=no, 1=yes
• TCHOL: Total cholesterol (mg/dl)
• FEMALE: sex, 0=man, 1=woman
• CVD: Cardiovascular death, 0=no, 1=yes, missing if the cause of death wasn't cardiovascular.
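Reading a comma-delimited file with a header can be sketched as follows. So that the call can be run without the real file, the example first writes a small stand-in file with made-up values; how missing values are coded in study.dat is not stated, so the NA coding here is an assumption.

```r
# Stand-in file with made-up values; with the real data, set path <- "study.dat"
path <- tempfile(fileext = ".dat")
writeLines(c("BMI,SMOKING,TCHOL,FEMALE,CVD",
             "24.1,0,210,1,NA",
             "27.8,1,245,0,1",
             "22.5,0,198,1,0"), path)

study <- read.table(path, header = TRUE, sep = ",")
str(study)              # check the column types
print(summary(study))   # ranges and number of missing values per column
```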
a. (5 points) Check the normality of the BMI variable using three different graphical
tools. Comment on your results.
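Three common graphical tools are a histogram with a fitted normal density, a boxplot and a normal QQ-plot; other choices are equally valid. The sketch uses simulated values in a vector bmi as a placeholder for the BMI column.

```r
set.seed(11)
bmi <- rnorm(300, mean = 26, sd = 4)   # simulated placeholder, NOT the study data

op <- par(mfrow = c(1, 3))
# 1. Histogram with the fitted normal density
hist(bmi, freq = FALSE, main = "Histogram", xlab = "BMI")
curve(dnorm(x, mean = mean(bmi), sd = sd(bmi)), add = TRUE, col = "red")
# 2. Boxplot: symmetry and outliers
boxplot(bmi, main = "Boxplot", ylab = "BMI")
# 3. Normal QQ-plot with reference line
qqnorm(bmi)
qqline(bmi)
par(op)
```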
b. (5 points) Check the normality of the BMI variable using an appropriate statistical
test (the significance level is 5%). Explain your result.
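The Shapiro-Wilk test is one appropriate choice; shapiro.test() accepts between 3 and 5000 observations. The sketch again uses a simulated placeholder vector bmi.

```r
set.seed(11)
bmi <- rnorm(300, mean = 26, sd = 4)   # simulated placeholder, NOT the study data

sw <- shapiro.test(bmi)   # H0: the data come from a normal distribution
print(sw)

reject_h0 <- sw$p.value < 0.05   # decision at the 5% significance level
print(reject_h0)
```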
c. Among the patients of this study who died of a cardiovascular disease, compare the
total cholesterol (mg/dl) of smokers and non-smokers:
1. (5 points) Draw boxplots (in the same layout) of the total cholesterol for smokers
and non-smokers to give a first insight, and comment on your result.
2. (5 points) Test for the equality of the TCHOL means in the two groups (smokers,
non-smokers) using a two-sided t-test at the 5% significance level, assuming
normality of TCHOL in both groups (the test assumes that the two
groups/samples come from two normal populations $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$,
respectively, under the null hypothesis). A parameter of the R function concerns
the equality of $\sigma_1^2$ and $\sigma_2^2$. Use the F-test to test the equality of the variances
(provided that the samples come from normal populations). Then, apply the R
function for a two-sample t-test. Comment on your results.
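The two steps can be chained with var.test() and t.test(), letting the outcome of the F-test set the var.equal argument. The data frame cvd below is simulated and only mimics the column names SMOKING and TCHOL.

```r
set.seed(3)
# Simulated placeholder for the patients who died of a cardiovascular disease
cvd <- data.frame(
  SMOKING = rep(c(0, 1), times = c(60, 40)),
  TCHOL   = c(rnorm(60, mean = 230, sd = 40), rnorm(40, mean = 240, sd = 45))
)

# Step 1: F-test of H0: sigma1^2 = sigma2^2
ft <- var.test(TCHOL ~ SMOKING, data = cvd)
print(ft)

# Step 2: two-sided two-sample t-test; the variances are pooled only
# if the F-test does not reject their equality at the 5% level
equal_var <- ft$p.value >= 0.05
tt <- t.test(TCHOL ~ SMOKING, data = cvd, var.equal = equal_var,
             alternative = "two.sided")
print(tt)
```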
3. (5 points) All these tests assume normality of the data in the two groups/samples.
The two-sample Wilcoxon or Mann-Whitney test (which is a nonparametric
test) only assumes a common continuous distribution under the null hypothesis
and tests whether the median difference is 0 versus the median difference is not 0.
Perform this test and comment on your result. Does this result agree with the
result of the t-test at the 5% level?
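In R the test is available as wilcox.test() with the same formula interface as t.test(). The sketch uses a simulated data frame cvd as a placeholder for the real subset.

```r
set.seed(3)
# Simulated placeholder, NOT the study data
cvd <- data.frame(
  SMOKING = rep(c(0, 1), times = c(60, 40)),
  TCHOL   = c(rnorm(60, mean = 230, sd = 40), rnorm(40, mean = 240, sd = 45))
)

wt <- wilcox.test(TCHOL ~ SMOKING, data = cvd)   # two-sided by default
tt <- t.test(TCHOL ~ SMOKING, data = cvd)        # Welch t-test for comparison

# Do both tests lead to the same decision at the 5% level?
decisions <- c(wilcoxon = wt$p.value < 0.05, t_test = tt$p.value < 0.05)
print(decisions)
```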
4. (5 points) It is known that the t-test is sensitive to outliers (for this reason,
the Wilcoxon test is sometimes preferred). The variable TCHOL includes some
outliers in the smokers group (see the associated boxplot). Delete the largest
outlier from the data and perform the t-test again. What do you observe?
Comment on your result.
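Removing the largest outlier and rerunning the test can be sketched as below, assuming that the largest outlier is the maximum TCHOL value among smokers. The data are simulated, with one artificial extreme value inserted on purpose.

```r
set.seed(5)
# Simulated placeholder; the value 600 is an artificial outlier among the smokers
cvd <- data.frame(
  SMOKING = rep(c(0, 1), times = c(60, 40)),
  TCHOL   = c(rnorm(60, 230, 40), rnorm(39, 240, 40), 600)
)

# Row index of the largest TCHOL value among smokers
smokers  <- which(cvd$SMOKING == 1)
drop_id  <- smokers[which.max(cvd$TCHOL[smokers])]
cvd_trim <- cvd[-drop_id, ]

# t-test (Welch, the default of t.test) before and after removing that row
t_before <- t.test(TCHOL ~ SMOKING, data = cvd)
t_after  <- t.test(TCHOL ~ SMOKING, data = cvd_trim)
print(c(before = t_before$p.value, after = t_after$p.value))
```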
Exercise 5 (40 points)
Consider the data given in the file 'restaurant.csv'. Some variables that were measured
are:
• y - Price (the price of a dinner in US$);
• x1 - Food (customer rating of the food, out of 30);
• x2 - Decor (customer rating of the decor, out of 30);
• x3 - Service (customer rating of the service, out of 30);
• x4 - East (dummy variable: 1 if the restaurant is east of Fifth Avenue, New York,
0 if it is west).
We seek a linear regression model that predicts y.
1. (5 points) Start by graphically inspecting the data. Comment.
2. (5 points) Fit the regression model having as predictors all the x variables. State the
null and the alternative hypotheses of the overall F-test. Perform the overall F-test
at the 5% level. Comment on your result.
3. (5 points) Check whether the predictor variables are statistically significant at the 5% level.
Comment on your result.
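Items 2 and 3 both read off the output of summary() for the fitted model. The sketch below uses simulated data with the variable names y, x1, ..., x4 from the exercise; the coefficients used to generate y are arbitrary and say nothing about the real data.

```r
set.seed(9)
n <- 150
# Simulated placeholder for restaurant.csv (NOT the real data)
rest <- data.frame(x1 = round(runif(n, 15, 28)),
                   x2 = round(runif(n, 10, 28)),
                   x3 = round(runif(n, 12, 27)),
                   x4 = rbinom(n, 1, 0.6))
rest$y <- -20 + 1.5 * rest$x1 + 1.9 * rest$x2 + 2 * rest$x4 + rnorm(n, sd = 5)

fit <- lm(y ~ x1 + x2 + x3 + x4, data = rest)
s   <- summary(fit)

# Overall F-test: H0: beta1 = beta2 = beta3 = beta4 = 0
#            vs.  H1: at least one beta_j differs from 0
f <- s$fstatistic   # named vector: value, numdf, dendf
p_overall <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
print(unname(p_overall))

# Individual t-tests: estimate, standard error, t value and p-value per coefficient
print(round(coef(s), 4))
```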
4. (10 points) Consider the model including only the predictors which are statistically
significant at the 5% level. Check the validity of the regression model using diagnostic
plots. If necessary, improve its goodness-of-fit in line with the assumptions of a
multiple linear regression model. Comment on your results.
5. (5 points) The quantities $h_{ii}$ are the diagonal elements of the hat matrix. One
can show that the mean of $h_{ii}$, i = 1, . . . , n, is (p + 1)/n, where p is the number of
predictors in your model and n is the number of observations. Large values of $h_{ii}$
may indicate that observation i has an important influence on the model (a data
point is influential if it strongly influences any part of a regression analysis, such as
the predicted responses, the estimated coefficients, or the hypothesis test results).
Consider the following rule of thumb: if $h_{ii}$ > 3(p + 1)/n, observation i should
be considered noteworthy. List the observations for which this rule is fulfilled and check
their influence on the model. Comment on your results.
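The rule of thumb can be applied with hatvalues(). In this sketch the data are simulated, the set of predictors kept in the model is arbitrary, and one artificial high-leverage row is inserted so that the rule flags at least one observation.

```r
set.seed(9)
n <- 150
# Simulated placeholder (NOT the real data)
rest <- data.frame(x1 = round(runif(n, 15, 28)),
                   x2 = round(runif(n, 10, 28)),
                   x4 = rbinom(n, 1, 0.6))
rest$x1[1] <- 2   # artificial high-leverage point, far from the other x1 values
rest$y <- -20 + 1.5 * rest$x1 + 1.9 * rest$x2 + 2 * rest$x4 + rnorm(n, sd = 5)

fit <- lm(y ~ x1 + x2 + x4, data = rest)
h   <- hatvalues(fit)             # diagonal elements h_ii of the hat matrix
p   <- length(coef(fit)) - 1      # number of predictors
threshold  <- 3 * (p + 1) / n
noteworthy <- which(h > threshold)
print(noteworthy)

# Influence check: refit without the flagged rows and compare the coefficients
if (length(noteworthy) > 0) {
  fit_wo <- update(fit, data = rest[-noteworthy, ])
  print(cbind(all = coef(fit), without = coef(fit_wo)))
}
```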
6. (10 points) It is possible to construct models that include different subsets of predictors.
Install the package 'leaps' and use the function with the same name to perform
an exhaustive search for the best subset of the variables x for predicting y. Use the
adjusted R^2 as the criterion to compare different models. Provide the 'best' model given
by the leaps() function (we call this the initial 'best' model). We want to check the
stability of the initial 'best' model using the following method:
• draw a sample of observations with replacement, with sample size equal to
the number of rows of the original data set; a new data set is obtained;
• run the leaps() function on the new data set and obtain a new 'best' model as
before;
• repeat the previous steps 1000 times and compute the proportion of times that
the initial 'best' model was provided by the leaps() function. Comment on your
result.
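The bootstrap procedure described in the bullets can be sketched as follows. The helper names best_subset and run_stability are hypothetical, the data are simulated, and the code only runs the search if the leaps package is installed.

```r
# install.packages("leaps")   # once, if the package is not installed yet

set.seed(10)
n <- 150
# Simulated placeholder for restaurant.csv (NOT the real data)
rest <- data.frame(x1 = round(runif(n, 15, 28)),
                   x2 = round(runif(n, 10, 28)),
                   x3 = round(runif(n, 12, 27)),
                   x4 = rbinom(n, 1, 0.6))
rest$y <- -20 + 1.5 * rest$x1 + 1.9 * rest$x2 + 2 * rest$x4 + rnorm(n, sd = 5)

# Logical vector flagging the predictors in the subset with the largest adjusted R^2
best_subset <- function(data) {
  X   <- as.matrix(data[, c("x1", "x2", "x3", "x4")])
  res <- leaps::leaps(x = X, y = data$y, method = "adjr2")
  unname(res$which[which.max(res$adjr2), ])
}

# Proportion of bootstrap samples in which the initial 'best' subset is selected again
run_stability <- function(data, B = 1000) {
  initial <- best_subset(data)
  same <- replicate(B, {
    boot <- data[sample(nrow(data), replace = TRUE), ]   # rows drawn with replacement
    identical(best_subset(boot), initial)
  })
  list(initial = initial, proportion = mean(same))
}

if (requireNamespace("leaps", quietly = TRUE)) {
  out <- run_stability(rest)
  print(out)
}
```

A proportion close to 1 suggests that the selected subset is stable under resampling; a low proportion suggests that several subsets fit almost equally well.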