431 Quiz 2: Fall 2019
Thomas E. Love
due 2019-11-18 at Noon, version 2019-11-12
Instructions
All of the links for Quiz 2 Materials are at
The Materials
To complete the Quiz you’ll need three things, all of which are linked at the URL above.
1. The 2019-431-quiz02-questions.PDF file. This contains all of the instructions, questions and potential
responses. Be sure that you see all 30 questions, and all 27 pages.
2. Five data files, named quiz_data_states.csv, quiz_hosp.csv, quiz_ra.csv, quiz_sim_nejm.csv
and quiz_statin.csv, which may be useful to you.
3. The Quiz 2 Answer Sheet which is a Google Form.
Use the PDF file to read the quiz and craft your responses (occasionally making use of the provided data
sets), and then place those responses into the Answer Sheet Google Form. When using the Answer Sheet,
please select or type in your best response (or responses, as indicated) for each question. All of your responses
must be in the Answer Sheet by the deadline.
Key Things To Remember
The deadline for completing the Answer Sheet is Noon on Monday 2019-11-18, and this is a firm deadline,
without the grace period we allow for in turning in Homework.
The questions are not arranged in any particular order, and your score is based on the number of correct
responses, so you should answer all questions. There are 30 questions, and each is worth either 3 or 4 points.
The maximum possible score on the quiz is 100 points. Questions 01, 02, 05, 06, 08, 14, 17, 22, 27 and 30 are
worth 4 points each. They are marked to indicate this.
If you wish to work on some of the quiz on the Answer Sheet and then return later, you can do this by [1]
completing the final question which asks you to type in your full name, and then [2] submitting the Answer
Sheet. You will then receive a link which allows you to return to the Answer Sheet without losing your
progress.
Occasionally, I ask you to provide a single line of code. In all cases, a single line of code can include at most
one pipe for these purposes, although you may or may not need the pipe in any particular setting. Moreover,
you need not include the library command at any time for any of your code. Assume in all questions that all
relevant packages have been loaded in R. Any reference to a logarithm refers to a natural logarithm. If you
need to set a seed, use set.seed(2019) throughout this Quiz.
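As a minimal sketch of why a fixed seed matters (the object names first_draw and second_draw are illustrative assumptions, not part of the Quiz), resetting the seed to 2019 before a random draw reproduces that draw exactly:

```r
# Illustrative only: the same seed yields the same pseudo-random numbers.
set.seed(2019)
first_draw <- rnorm(5)    # five standard Normal draws

set.seed(2019)            # reset to the same seed
second_draw <- rnorm(5)   # repeats the draws above

identical(first_draw, second_draw)  # TRUE
```

Any procedure that relies on random resampling behaves the same way, so two runs with the same seed and the same code should agree.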
You are welcome to consult the materials provided on the course website, but you are not allowed to discuss
the questions on this quiz with anyone other than Professor Love and the teaching assistants at 431-help at
case dot edu. Please submit any questions you have about the Quiz to 431-help through email. Thank you,
and good luck.
1 Question 01 (4 points)
Consider the starwars tibble that is part of the dplyr package in the tidyverse. Filter the data file to focus
on individuals who are of the Human species, who also have complete data on both their height and mass.
Then use a t-based approach to estimate an appropriate 90% confidence interval for the difference between the
mean body-mass index of Human males minus the mean body-mass index of Human females. Don’t assume
that the population variances of males and females are the same. The data provides height in centimeters
and mass in kilograms. You’ll need to calculate the body-mass index (BMI) values - the appropriate formula
to obtain BMI in our usual units of $\mathrm{kg}/\mathrm{m}^2$ is:
$$\mathrm{BMI} = \frac{10{,}000 \times \text{mass in kg}}{(\text{height in cm})^2}$$
Specify your point estimate, and then the lower and upper bound, each rounded to a single decimal place,
and be sure to specify the units of measurement.
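As a quick check of the BMI arithmetic (10,000 times mass in kg, divided by the square of height in cm), here is a minimal sketch for a single hypothetical individual. The values 70 kg and 175 cm are made up for illustration and are not taken from the starwars data.

```r
# Hypothetical individual; these values are assumptions for illustration.
mass_kg <- 70
height_cm <- 175

# BMI in kg/m^2: the factor 10,000 converts cm^2 to m^2.
bmi <- 10000 * mass_kg / height_cm^2

round(bmi, 1)  # 22.9
```

Because the arithmetic is vectorized in R, the same expression works unchanged on whole columns of heights and masses.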
2 Question 02 (4 points)
On 2019-09-25, Maggie Koerth-Baker at FiveThirtyEight published “We’ve Been Fighting the Vaping Crisis
Since 1937.” In that article, she quotes a 2019-09-06 article at the New England Journal of Medicine by
Jennifer E. Layden et al. entitled “Pulmonary Illness Related to E-Cigarette Use in Illinois and Wisconsin —
Preliminary Report.” Quoting that report:
E-cigarettes are battery-operated devices that heat a liquid and deliver an aerosolized product
to the user. . . . In July 2019, the Wisconsin Department of Health Services and the Illinois
Department of Public Health received reports of pulmonary disease associated with the use of
e-cigarettes (also called vaping) and launched a coordinated public health investigation. . . . We
defined case patients as persons who reported use of e-cigarette devices and related products in
the 90 days before symptom onset and had pulmonary infiltrates on imaging and whose illnesses
were not attributed to other causes.
The entire report is available at https://www.nejm.org/doi/full/10.1056/NEJMoa1911614. In the study, 53
case patients were identified, but some patients gave no response to the question of whether or not “they had
used THC (tetrahydrocannabinol) products in e-cigarette devices in the past 90 days.” Of the 41 patients who did respond, 33 reported
THC use. Assume those 41 subjects are a random sample of all case patients that will appear in Wisconsin
and Illinois in 2019.
Use a SAIFS procedure to estimate an appropriate 90% confidence interval for the PERCENTAGE of
case patients in Illinois and Wisconsin in 2019 that used THC in the 90 days prior to symptom onset.
Note that I’ve emphasized the word PERCENTAGE here, so as to stop you from instead presenting a
proportion. Specify your point estimate of this PERCENTAGE, and then the lower and upper bound for
your confidence interval, in each case rounded to a single decimal place.
3 Question 03
Alex, Beth, Cara and Dave independently select random samples from the same population. The sample sizes
are 200 for Alex, 400 for Beth, 125 for Cara, and 300 for Dave. Each researcher constructs a 95% confidence
interval from their data using the same statistical method. The half-widths (margins of error) for those
confidence intervals are 1.45, 1.74, 1.96 and 2.43. Match each interval’s margin of error with its researcher.
Rows:
a. Alex, who took a sample of n = 200 people.
b. Beth, who took a sample of n = 400 people.
c. Cara, who took a sample of n = 125 people.
d. Dave, who took a sample of n = 300 people.
Columns:
1. 1.45
2. 1.74
3. 1.96
4. 2.43
4 Question 04
Suppose you have a tibble with two variables. One is a factor called Exposure with levels High, Low and
Medium, arranged in that order, and the other is a quantitative outcome. You want to rearrange the order
of the Exposure variable so that you can then use it to identify for ggplot2 a way to split histograms of
outcomes up into a series of smaller plots, each containing the histogram for subjects with a particular level
of exposure (Low, then Medium, then High).
Which of the pairs of tidyverse functions identified below has Dr. Love used to accomplish such a plot?
a. fct_reorder and facet_wrap
b. fct_relevel and facet_wrap
c. fct_collapse and facet_wrap
d. fct_reorder and group_by
e. fct_collapse and group_by
5 Question 05 (4 points)
In a double-blind trial, 350 patients with active rheumatoid arthritis were randomly assigned to receive one
of two therapy types: a cheaper one, or a pricier one, and went on to participate in the trial.
The primary outcome was the change in DAS28 at 48 weeks as compared to study entry. The DAS28 is
a composite index of the number of swollen and tender joints, the erythrocyte sedimentation rate, and a
visual-analogue scale of patient-reported disease activity. A decrease in the DAS28 of 1.2 or more (so a change
of -1.2 or below) was considered to be a clinically meaningful improvement. Data are in the quiz_ra.csv file.
A student completed four analyses, shown below. Which of the following 90% confidence intervals for the
change in DAS28 at 48 weeks most appropriately compares the pricier therapy to the cheaper one?
d. Analysis D
e. Analysis E
f. Analysis F
g. Analysis G
ra <- read.csv("data/quiz_ra.csv") %>% tbl_df()
mosaic::favstats(das28_chg ~ therapy, data = ra)
  therapy   min     Q1 median     Q3  max      mean       sd   n missing
1 Cheaper -6.12 -2.955  -2.22 -1.415 0.56 -2.250857 1.208183 175       0
2 Pricier -5.56 -2.630  -2.06 -1.250 1.53 -2.027486 1.260694 175       0
ggplot(data = ra, aes(x = therapy, y = das28_chg, fill = therapy)) +
  geom_violin(alpha = 0.3) + geom_boxplot(width = 0.3, notch = TRUE) +
  theme_bw() + guides(fill = FALSE) + scale_fill_viridis_d()
5.1 Analysis D
ra %$% t.test(das28_chg ~ therapy, var.equal = TRUE) %>%
  tidy(conf.int = TRUE, conf.level = 0.90) %>%
  mutate(estimate = estimate1 - estimate2) %>%
  select(estimate, conf.low, conf.high, method)
# A tibble: 1 x 4
  estimate conf.low conf.high method
1   -0.223   -0.483    0.0362 " Two Sample t-test"
5.2 Analysis E
ra %$% t.test(das28_chg ~ therapy, paired = TRUE) %>%
  tidy(conf.int = TRUE, conf.level = 0.90) %>%
  select(estimate, conf.low, conf.high, method)
# A tibble: 1 x 4
  estimate conf.low conf.high method
1   -0.223   -0.250    -0.197 Paired t-test
5.3 Analysis F
ra %$% wilcox.test(das28_chg ~ therapy, paired = TRUE,
                   conf.int = TRUE, conf.level = 0.90) %>%
  tidy() %>%
  select(estimate, conf.low, conf.high, method)
# A tibble: 1 x 4
  estimate conf.low conf.high method
1   -0.230   -0.245    -0.215 Wilcoxon signed rank test with continuity co~
5.4 Analysis G
ra %$% wilcox.test(das28_chg ~ therapy, conf.int = TRUE, conf.level = 0.90) %>%
  tidy() %>%
  select(estimate, conf.low, conf.high, method)
# A tibble: 1 x 4
  estimate conf.low conf.high method
1   -0.240   -0.450   -0.0300 Wilcoxon rank sum test with continuity corre~
6 Question 06 (4 points)
Referring again to the study initially described in Question 05, which of the following analyses provides an
appropriate 90% confidence interval for the difference (cheaper - pricier) in the proportion of participants
who had a clinically meaningful improvement (DAS28 change of -1.2 or below) at 48 weeks?
j. Analysis J
k. Analysis K
l. Analysis L
m. Analysis M
n. None of the above.
6.1 Analysis J
ra <- read.csv("data/quiz_ra.csv") %>% tbl_df()
ra <- ra %>%
  mutate(improved = das28_chg < -1.2) %>%
  mutate(improved = fct_relevel(factor(improved), "FALSE"))
ra %>% tabyl(improved, therapy)
 improved Cheaper Pricier
    FALSE      31      41
     TRUE     144     134
twobytwo(31, 41, 144, 134, "improved", "didn't improve",
         "cheaper", "pricier")
2 by 2 table analysis:
------------------------------------------------------
Outcome   : cheaper
Comparing : improved vs. didn't improve
               cheaper pricier P(cheaper) 95% conf. interval
improved            31      41     0.4306    0.3217   0.5466
didn't improve     144     134     0.5180    0.4593   0.5762
                                   95% conf. interval
             Relative Risk:  0.8312    0.6227   1.1096
         Sample Odds Ratio:  0.7036    0.4173   1.1864
Conditional MLE Odds Ratio:  0.7043    0.4019   1.2246
    Probability difference: -0.0874   -0.2100   0.0416
             Exact P-value: 0.2339
        Asymptotic P-value: 0.1872
------------------------------------------------------
6.2 Analysis K
ra <- read.csv("data/quiz_ra.csv") %>% tbl_df()
ra <- ra %>%
  mutate(improved = das28_chg <= -1.2) %>%
  mutate(improved = fct_relevel(factor(improved), "TRUE"))
ra %>% tabyl(improved, therapy)
 improved Cheaper Pricier
     TRUE     144     134
    FALSE      31      41
twobytwo(144, 134, 31, 41, "improved", "didn't improve",
         "cheaper", "pricier")
2 by 2 table analysis:
------------------------------------------------------
Outcome   : cheaper
Comparing : improved vs. didn't improve
               cheaper pricier P(cheaper) 95% conf. interval
improved           144     134     0.5180    0.4593   0.5762
didn't improve      31      41     0.4306    0.3217   0.5466
                                   95% conf. interval
             Relative Risk:  1.2031    0.9013   1.6059
         Sample Odds Ratio:  1.4213    0.8429   2.3965
Conditional MLE Odds Ratio:  1.4198    0.8166   2.4880
    Probability difference:  0.0874   -0.0416   0.2100
             Exact P-value: 0.2339
        Asymptotic P-value: 0.1872
------------------------------------------------------
6.3 Analysis L
ra <- read.csv("data/quiz_ra.csv") %>% tbl_df()
ra <- ra %>%
  mutate(improved = das28_chg < -1.2) %>%
  mutate(improved = fct_relevel(factor(improved), "FALSE"))
ra %>% tabyl(improved, therapy)
 improved Cheaper Pricier
    FALSE      31      41
     TRUE     144     134
twobytwo(31, 41, 144, 134, conf.level = 0.90,
         "improved", "didn't improve", "cheaper", "pricier")
2 by 2 table analysis:
------------------------------------------------------
Outcome   : cheaper
Comparing : improved vs. didn't improve
               cheaper pricier P(cheaper) 90% conf. interval
improved            31      41     0.4306    0.3383   0.5279
didn't improve     144     134     0.5180    0.4687   0.5669
                                   90% conf. interval
             Relative Risk:  0.8312    0.6523   1.0592
         Sample Odds Ratio:  0.7036    0.4538   1.0908
Conditional MLE Odds Ratio:  0.7043    0.4379   1.1271
    Probability difference: -0.0874   -0.1914   0.0212
             Exact P-value: 0.2339
        Asymptotic P-value: 0.1872
------------------------------------------------------
6.4 Analysis M
ra <- read.csv("data/quiz_ra.csv") %>% tbl_df()
ra <- ra %>%
  mutate(improved = das28_chg <= -1.2) %>%
  mutate(improved = fct_relevel(factor(improved), "TRUE"))
ra %>% tabyl(improved, therapy)
 improved Cheaper Pricier
     TRUE     144     134
    FALSE      31      41
twobytwo(144, 134, 31, 41, conf.level = 0.90,
         "improved", "didn't improve", "cheaper", "pricier")
2 by 2 table analysis:
------------------------------------------------------
Outcome   : cheaper
Comparing : improved vs. didn't improve
               cheaper pricier P(cheaper) 90% conf. interval
improved           144     134     0.5180    0.4687   0.5669
didn't improve      31      41     0.4306    0.3383   0.5279
                                   90% conf. interval
             Relative Risk:  1.2031    0.9441   1.5331
         Sample Odds Ratio:  1.4213    0.9168   2.2034
Conditional MLE Odds Ratio:  1.4198    0.8872   2.2838
    Probability difference:  0.0874   -0.0212   0.1914
             Exact P-value: 0.2339
        Asymptotic P-value: 0.1872
------------------------------------------------------
7 Question 07
In response to unexpectedly low enrollment, the protocol was amended part-way through the trial described
in Question 05 to change the primary outcome from a binary outcome to a continuous outcome in order to
increase the power of the study.
Originally, the proposed primary outcome was the difference in the proportion of participants who had a
DAS28 of 3.2 or less at week 48. The original power analysis established a sample size target of 225 completed
enrollments in each therapy group, based on a two-sided 10% significance level, and a desire for 90% power.
In that initial power analysis, the proportion of participants with a DAS28 of 3.2 or less at week 48 was
assumed to be 0.27 under the less effective of the two therapies.
What value was used in the power calculation for the proportion of participants with DAS28 of 3.2 or less at
week 48 for the more effective therapy? State your answer rounded to two decimal places.
8 Question 08 (4 points)
In the trial described in Question 05, 21 of the 222 subjects originally assigned to receive the cheaper therapy
and 35 of the 219 subjects originally assigned to receive the pricier therapy experienced a serious adverse
event (which included infections, gastrointestinal, renal, urinary, cardiac or vascular disorders, as well as
surgical or medical procedures).
Suppose you wanted to determine whether or not there was a statistically detectable difference in the rates of
serious adverse events in the two therapy groups at the 5% significance level. Specify a single line of R code
that would do this, appropriately.
9 Question 09
The Pottery data are part of the carData package in R. Included are data describing the chemical composition
of ancient pottery found at four sites in Great Britain. This data set will also be used in Question 10. In this
question, we will focus on the Na (Sodium) levels, and our goal is to compare the mean Na levels across the
four sites.
anova(lm(Na ~ Site, data = carData::Pottery))
Analysis of Variance Table
Response: Na
          Df  Sum Sq  Mean Sq F value    Pr(>F)
Site       3 0.25825 0.086082  9.5026 0.0003209 ***
Residuals 22 0.19929 0.009059
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Which of the following conclusions is most appropriate, based on the output above?
a. The F test allows us to conclude that the population mean Na level in at least one of the four sites is
detectably different than the others, at a 1% significance level.
b. The F test allows us to conclude that the population mean Na level in each of the four sites is detectably
different than each of the others, at a 1% significance level.
c. The F test allows us to conclude that the population mean Na level is the same in all four sites, at a
1% significance level.
d. The F test allows us to conclude that the population mean Na level may not be the same in all sites,
but is not detectably different at the 1% level.
e. None of these conclusions are appropriate.
10 Question 10
Consider these two sets of plots, generated to describe variables from the Pottery data set within the carData
package.
[Figure: Plot 1 for Question 10]
[Figure: Plot 2 for Question 10]
And now, here are summary statistics from the mosaic::inspect function describing the variables contained
in the Pottery data set.
mosaic::inspect(carData::Pottery)
categorical variables:
  name  class levels  n missing
1 Site factor      4 26       0
                                   distribution
1 Llanedyrn (53.8%), AshleyRails (19.2%) ...
quantitative variables:
  name   class   min    Q1 median      Q3   max       mean        sd  n
1   Al numeric 10.10 11.95 13.800 17.4500 20.80 14.4923077 2.9926474 26
2   Fe numeric  0.92  1.70  5.465  6.5900  7.09  4.4676923 2.4097507 26
3   Mg numeric  0.53  0.67  3.825  4.5025  7.23  3.1415385 2.1797260 26
4   Ca numeric  0.01  0.06  0.155  0.2150  0.31  0.1465385 0.1012301 26
5   Na numeric  0.03  0.05  0.150  0.2150  0.54  0.1584615 0.1352832 26
  missing
Based on this output, and whatever other work you need to do, which of the statements below is true, about
Variable 1 (as shown in Plot 1) and Variable 2 (shown in Plot 2)?
a. Var1 is . . .
b. Var2 is . . .
Choices are:
11 Question 11
Suppose you have a data frame named mydata containing a variable called sbp, which shows the participant’s
systolic blood pressure in millimeters of mercury. Which of the following lines of code will create a new
variable badbp within the mydata data frame which takes the value TRUE when a subject has a systolic
blood pressure that is at least 120 mm Hg, and FALSE when a subject’s systolic blood pressure is less than 120 mm Hg?
a. mydata %>% badbp <- sbp >= 120
b. mydata$badbp <- ifelse(mydata$sbp >= 120, "YES", "NO")
c. badbp <- mydata %>% filter(sbp >= 120)
d. mydata %>% mutate(badbp = sbp >= 120)
e. None of these will do the job.
12 Question 12
According to Jeff Leek in The Elements of Data Analytic Style, which of the following is NOT a good reason
to create graphs for data exploration?
a. To understand properties of the data.
b. To inspect qualitative features of the data more effectively than a huge table of raw data would allow.
c. To discover new patterns or associations.
d. To consider whether transformations may be of use.
e. To look for statistical significance without first exploring the data.
13 Question 13
If the characteristics of a sample approximate the characteristics of its population in every respect, then
which of the statements below is true? (CHECK ALL THAT APPLY.)
a. The sample is random
b. The sample is accidental
c. The sample is stratified
d. The sample is systematic
e. The sample is representative
f. None of the above
Setup for Questions 14-15
For Questions 14 and 15, consider the data I have provided in the quiz_hosp.csv file. The data describe
700 simulated patients at a metropolitan hospital. Available are:
• subject.id = Subject Identification Number (not a meaningful code)
• sex = the patient’s sex (FEMALE or MALE)
• statin = does the patient have a prescription for a statin medication (YES or NO)
• insurance = the patient’s insurance type (MEDICARE, COMMERCIAL, MEDICAID, UNINSURED)
• hsgrads = the percentage of adults in the patient’s home neighborhood who have at least a high school
diploma (this measure of educational attainment is used as an indicator of the socio-economic place in
which the patient lives)
14 Question 14 (4 points)
Using the quiz_hosp data, what is the 95% confidence interval for the odds ratio, defined as the odds of
receiving a statin if you are MALE divided by the odds of receiving a statin if you are FEMALE? Show the
point and interval estimates, rounded to two decimal places. Do NOT use a Bayesian augmentation here.
15 Question 15
Perform an appropriate analysis to determine whether insurance type is associated with the education
(hsgrads) variable, ignoring all other information in the quiz_hosp data. Which of the following conclusions
is most appropriate based on your analyses, using a 5% significance level?
a. The ANOVA F test shows no detectable effect of insurance on hsgrads, so it doesn’t make sense to
compare pairs of insurance types.
b. The ANOVA F test shows a detectable effect of insurance on hsgrads, and a Tukey HSD comparison
reveals that Medicare shows detectably higher education levels than Uninsured.
c. The ANOVA F test shows a detectable effect of insurance on hsgrads, and a Tukey HSD comparison
reveals that Medicaid’s education level is detectably lower than either Medicare or Commercial.
d. The ANOVA F test shows a detectable effect of insurance on hsgrads, and a Tukey HSD comparison
reveals that Uninsured’s education level is detectably lower than Commercial or Medicare.
e. None of these conclusions is appropriate.
16 Question 16
Once a confidence interval is calculated, several design changes may be used by a researcher to make a
confidence interval wider or narrower. For each of the changes listed below, indicate the impact on the width
of the confidence interval.
Rows are
a. Increase the level of confidence.
b. Increase the sample size.
c. Increase the standard error of the estimate.
d. Use a bootstrap approach to estimate the CI.
Columns are
1. CI will become wider
2. CI will become narrower
3. CI width will not change
4. It is impossible to tell
17 Question 17 (4 points)
The data in the quiz_statin.csv file provided to you describe the results of a study of 180 patients who
have a history of high cholesterol. Patients in the study were randomly assigned to the use of a new statin
medication, or to retain their current regimen. The columns in the data set show a patient identification
code, whether or not the patient was assigned to the new statin (Yes or No) and their LDL cholesterol value
(in mg/dl) at the end of the study. You have been asked to produce a 95% confidence interval comparing the
mean LDL levels across the two statin groups (including both a point estimate and appropriate confidence
interval rounded to two decimal places), and then describe your result in context in a single English sentence.
Which of the following approaches and conclusions are reasonable in this setting? (CHECK ALL THAT
APPLY)
a. LDL levels using the new statin were 4.95 mg/dl higher with 95% CI (0.65, 9.24) mg/dl, based on an
indicator variable regression model, which replicates a two-sample t test assuming equal variances.
b. LDL levels using the new statin were 4.95 mg/dl lower with 95% CI (0.65, 9.24) mg/dl, based on an
indicator variable regression model, which replicates a two-sample t test assuming equal variances.
c. LDL levels using the new statin were 4.95 mg/dl higher with 95% CI (0.56, 9.33) mg/dl, based on a
Welch two-sample t test not assuming equal variances.
d. LDL levels using the new statin were 4.95 mg/dl lower with 95% CI (0.56, 9.33) mg/dl, based on a
Welch two-sample t test not assuming equal variances.
e. LDL levels using the new statin were 4.95 mg/dl higher with 95% CI (0.94, 9.21) mg/dl, based on a
bootstrap comparison of the population means and using the seed 2019.
f. LDL levels using the new statin were 4.95 mg/dl lower with 95% CI (0.94, 9.21) mg/dl, based on a
bootstrap comparison of the population means and using the seed 2019.
g. None of the above are appropriate, since we should be using a paired samples analysis with these data.
18 Question 18
A hospital system has about 1 million records in its electronic health record database that meet our study’s
qualifying requirements for inclusion and exclusion. We believe that about 20% of the subjects who qualify
by these criteria will need a particular blood test.
Rows are:
a. Which will provide a confidence interval with smaller width for the proportion needing the blood test,
using a Wald approach?
b. Which will provide a better confidence interval estimate for the sample proportion of eligible subjects
who need the blood test?
Columns are:
1. A random sample of 85 subjects who meet the qualifying requirements.
2. A non-random sample of 850,000 of the subjects who met the qualifying requirements in the past year.
19 Question 19
A series of 88 models were built by a team of researchers interested in systems biology. 36 of the models
showed promising results in an attempt to validate them out of sample. Define the hit rate as the percentage
of models built that show these promising results. Which of the following intervals appropriately describes
the uncertainty we have around a hit rate estimate in this setting, using a Wald confidence interval approach
with a Bayesian augmentation and permitting a 10% rate of Type I error?
a. (31.8%, 50.3%)
b. (32.2%, 50.2%)
c. 0.411 plus or minus 9 percentage points
d. (32.4%, 50.3%)
e. None of these intervals.
20 Question 20
The lab component of a core course in biology is taught at the Watchmaker’s Technical Institute by a set
of five teaching assistants, whose names, conveniently, are Amy, Beth, Carmen, Donna and Elena. On the
second quiz of the semester (each section takes the same set of quizzes) an administrator at WTI wants to
compare the mean scores across lab sections. She produces the following output in R.
Analysis of Variance Table
Response: exam2
           Df  Sum Sq Mean Sq F value  Pr(>F)
ta          4   971.5 242.868  2.7716 0.02898
Residuals 165 14458.4  87.627
Emboldened by this result, the administrator decides to compare mean exam2 scores for each possible pair of
TAs, using a Bonferroni correction. Suppose she’s not heard of pairwise.t.test() and therefore plans to
make each comparison separately with two-sample t tests. If she wants to maintain an overall α level of 0.10
for the resulting suite of pairwise comparisons using the Bonferroni correction, then what significance level
should she use for each of the individual two-sample t tests?
a. She should use a significance level of 0.10 on each test.
b. She should use 0.05 on each test.
c. She should use 0.025 on each test.
d. She should use 0.01 on each test.
e. She should use 0.001 on each test.
f. None of these answers are correct.
21 Question 21
If the administrator at the Watchmaker’s Technical Institute that we mentioned in Question 20 instead used
a Tukey HSD approach to make her comparisons, she might have obtained the following output.
Tukey multiple comparisons of exam2 means, 90% family-wise confidence level
              diff     lwr    upr ||                 diff     lwr    upr
             -----  ------  ----- ||                -----  ------  -----
Beth-Amy      1.21   -4.43   6.83 || Donna-Beth     -6.53  -12.16  -0.90
Carmen-Amy   -1.41   -7.04   4.22 || Elena-Beth     -0.24   -5.87   5.40
Donna-Amy    -5.32  -10.96   0.31 || Donna-Carmen   -3.91   -9.54   1.72
Elena-Amy     0.97   -4.66   6.60 || Elena-Carmen    2.38   -3.25   8.01
Carmen-Beth  -2.62   -8.25   3.01 || Elena-Donna     6.29    0.66  11.93
Note that when we refer in the responses below to Beth’s scores, we mean the scores of students who were in
Beth’s lab section. Which conclusion of those presented below would be most appropriate?
a. Amy’s scores are significantly higher than Carmen’s or Elena’s.
b. Beth’s scores were significantly higher than Amy’s.
c. Donna’s scores are significantly lower than Beth’s or Elena’s.
d. Elena’s scores are significantly lower than Donna’s.
e. None of these answers are correct.
22 Question 22 (4 points)
The quiz_data_states.csv file contains information on several variables related to the 50 United States
plus the District of Columbia. The available data include 102 rows of information on six columns, and those
columns are:
• code: the two-letter abbreviation for the “state” (DC = Washington DC, etc.)
• state: the “state” name
• year: 2019 or 2010, the year for which the remaining variables were obtained
• population: number of people living in the “state”
• poverty_people: number of people in the “state” living below the poverty line
• poverty_rate: % of people living in the “state” who are below the poverty line
Our eventual goal is to use the quiz_data_states data to produce an appropriate 90% confidence interval
for the change from 2010 to 2019 in poverty rate, based on an analysis of the data at the level of the 51
“states”.
Which of the following statements is most true?
a. This should be done using a paired samples analysis, and the quiz_data_states data require us to
calculate the paired differences, but are otherwise ready to plot now.
b. This should be done using a paired samples analysis, and the quiz_data_states data require us to
pivot the data to make them wider, and then calculate the paired differences and plot them.
c. This should be done using a paired samples analysis, and the quiz_data_states data require us to
pivot the data to make them longer, and then calculate the paired differences and plot them.
d. This should be done using an independent samples analysis, and the quiz_data_states data are ready
to be plotted appropriately now.
e. This should be done using an independent samples analysis, and the quiz_data_states data require
us to pivot the data to make them wider, and then plot the distributions of the two samples.
f. This should be done using an independent samples analysis, and the quiz_data_states data require
us to pivot the data to make them longer, and then plot the distributions of the two samples.
23 Question 23
Which of the following is the most appropriate way to complete the development of the confidence interval
proposed in Question 22?
a. Tukey HSD comparisons following an Analysis of Variance
b. Applying tidy() to an Indicator Variable Regression
c. Applying tidy() to an Intercept-only Regression
d. A Wilcoxon-Mann-Whitney Rank Sum Confidence Interval
e. A bootstrap on the poverty_people values across the states
24 Question 24
Use the data you have been provided in the quiz_data_states.csv file to provide a point estimate of the
change from 2010 to 2019 in the poverty rate in the United States as a whole. Provide your response as a
proportion with four decimal places. Note carefully what I am asking for (and not asking for) here.
25 Question 25
In The Signal and The Noise, Nate Silver writes repeatedly about a Bayesian way of thinking about uncertainty,
for instance in Chapters 8 and 13. Which of the following statistical methods is NOT consistent with a
Bayesian approach to thinking about variation and uncertainty? (CHECK ALL THAT APPLY)
a. Updating our forecasts as new information appears.
b. Establishing a researchable hypothesis prior to data collection.
c. Significance testing of a null hypothesis, using, say, Fisher’s exact test.
d. Combining information from multiple sources to build a model.
e. Gambling using a strategy derived from