A Power Calculation
Suppose, for illustration, that we are interested in testing the hypothesis
$H_0\colon \theta_1 = \theta_2$ vs. $H_A\colon \theta_1 \neq \theta_2$
Suppose, also for illustration, that the test statistic associated with this test has the form
It will be useful to define the notion of a rejection region R: all values of the observed test statistic
t that would lead to the rejection of H0:
$R = \{t \mid H_0 \text{ is rejected}\}$
– If $t \in R$, we reject $H_0$
– If $t \in R^c$, we do not reject $H_0$
Defining Type I and Type II error rates in terms of a rejection region is also useful:
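$\alpha = \Pr(\text{Type I Error}) = \Pr(\text{Reject } H_0 \mid H_0 \text{ is true}) = \Pr(T \in R \mid H_0 \text{ is true})$
$\beta = \Pr(\text{Type II Error}) = \Pr(\text{Do Not Reject } H_0 \mid H_0 \text{ is false}) = \Pr(T \in R^c \mid H_0 \text{ is false})$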
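As a rough illustration of these two definitions, the sketch below evaluates both error rates for one particular rejection region. It assumes, purely for illustration and not from the notes, that T is standard normal under H0, normal with mean `shift` and unit variance under one specific alternative, and that the region is two-sided, R = {t : |t| > c}. The function names and the value of `shift` are hypothetical.

```python
from statistics import NormalDist

def alpha_for_region(c: float) -> float:
    """Pr(T in R | H0 true) for R = {t : |t| > c}, assuming T ~ N(0, 1) under H0."""
    null = NormalDist(0.0, 1.0)
    return 2.0 * (1.0 - null.cdf(c))

def beta_for_region(c: float, shift: float) -> float:
    """Pr(T in R^c | H0 false), assuming T ~ N(shift, 1) under the alternative."""
    alt = NormalDist(shift, 1.0)
    return alt.cdf(c) - alt.cdf(-c)

# Critical value chosen so that the Type I error rate is 0.05 (assumed target).
c = NormalDist().inv_cdf(0.975)
alpha = alpha_for_region(c)
beta = beta_for_region(c, shift=2.5)   # hypothetical alternative
power = 1.0 - beta
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}, power = {power:.3f}")
```

Widening the rejection region (a smaller c) raises the Type I error rate and lowers the Type II error rate, which is the trade-off a power calculation has to balance.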
Permutation and Randomization Tests
All of the previous tests have made some kind of distributional assumption for the response measurements.
It would be preferable to have a test that does not rely on such distributional assumptions.
This is precisely the purpose of permutation and randomization tests.
– These tests are nonparametric and rely on resampling.
– The motivation is that if $H_0\colon \theta_1 = \theta_2$ is true, any random rearrangement of the data is equally likely to have been observed.
– With $n_1$ and $n_2$ units in each condition, there are $\binom{n_1 + n_2}{n_1}$ arrangements of the $n_1 + n_2$ observations into two groups of size $n_1$ and $n_2$, respectively.
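To get a feel for how quickly this count grows, the short sketch below evaluates the binomial coefficient for a few group sizes. The function name `n_arrangements` and the chosen sizes are illustrative only.

```python
from math import comb

def n_arrangements(n1: int, n2: int) -> int:
    """Number of ways to split n1 + n2 pooled observations into groups of size n1 and n2."""
    return comb(n1 + n2, n1)

print(n_arrangements(5, 5))      # 252
print(n_arrangements(10, 10))    # 184756
# With 100 units per condition the count is far too large to enumerate.
print(n_arrangements(100, 100))
```

Even modest sample sizes make it impractical to visit every arrangement, which is why sampling a subset of rearrangements is attractive.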
A true permutation test considers all possible rearrangements of the original data
– The test statistic t is calculated on the original data and on every one of its rearrangements
– This collection of test statistic values generates the empirical null distribution
A randomization test is carried out similarly, except that we do not consider all possible rearrangements
– We just consider a large number N of them
Randomization Test Algorithm
1. Collect response observations in each condition.
2. Calculate the test statistic t on the original data.
3. Pool all of the observations together and randomly sample (without replacement) $n_1$ observations to be assigned to “Condition 1”; the remaining $n_2$ observations are assigned to “Condition 2”. Repeat this $N$ times.
4. Calculate the test statistic $t^*_k$ on each of the “shuffled” datasets, $k = 1, 2, \ldots, N$.
5. Compare $t$ to $\{t^*_1, t^*_2, \ldots, t^*_N\}$, the empirical null distribution, and calculate the p-value:
$$\text{p-value} = \frac{\#\text{ of } t^*\text{'s that are at least as extreme as } t}{N}$$
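The five steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive implementation: it takes the test statistic to be the absolute difference in sample means (so "at least as extreme" means "at least as large"), which the notes do not specify, and the function name, default arguments and data are all hypothetical.

```python
import random
from statistics import mean

def randomization_test(y1, y2, n_shuffles=10_000, seed=0):
    """Two-sided randomization test using |mean(y1) - mean(y2)| as the statistic t."""
    rng = random.Random(seed)
    n1 = len(y1)
    t_obs = abs(mean(y1) - mean(y2))            # step 2: statistic on the original data
    pooled = list(y1) + list(y2)                # step 3: pool the observations
    n_extreme = 0
    for _ in range(n_shuffles):
        # Random rearrangement of the pooled data (sampling without replacement).
        shuffled = rng.sample(pooled, len(pooled))
        # Step 4: statistic on the shuffled dataset.
        t_star = abs(mean(shuffled[:n1]) - mean(shuffled[n1:]))
        if t_star >= t_obs:                     # step 5: at least as extreme as t
            n_extreme += 1
    return t_obs, n_extreme / n_shuffles

# Hypothetical data, made up purely for illustration.
cond1 = [9.1, 10.4, 8.7, 11.2, 9.8, 10.1, 9.5, 10.9]
cond2 = [11.8, 12.5, 10.9, 13.1, 12.2, 11.4, 12.9, 11.7]
t, p = randomization_test(cond1, cond2, n_shuffles=2000, seed=1)
print(f"t = {t:.3f}, p-value = {p:.4f}")
```

Any other statistic (a difference in medians, a ratio of variances) could be swapped in by changing the two lines that compute `t_obs` and `t_star`; the resampling logic stays the same.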
Example: Pokémon Go
Suppose that Niantic is experimenting with two different promotions within Pokémon Go:
– Condition 1: Give users nothing
– Condition 2: Give users 200 free Pokécoins
– Condition 3: Give users a 50% discount on Shop purchases
In a small pilot experiment n1 = n2 = n3 = 100 users are randomized to each condition
For each user, the amount of real money (in USD) they spend in the 30 days following the experiment
is recorded
The data summaries are:
– $\bar{y}_1 = \$10.74$, $Q_1(0.5) = \$9$
– $\bar{y}_2 = \$9.53$, $Q_2(0.5) = \$8$
– $\bar{y}_3 = \$13.41$, $Q_3(0.5) = \$10$
3 Experiments with More than Two Conditions
3.1 Anatomy of an A/B/m Test
We now consider the design and analysis of an experiment consisting of more than two experimental
conditions – or what many data scientists broadly refer to as “A/B/m Testing”.
– Canonical A/B/m test:
Figure 1: Button-Colour Experiment
Other, more tangible, examples:
– Netflix
– Etsy
Typically the goal of such an experiment is to decide which condition is optimal with respect to some
metric of interest. This could be a
– mean
– proportion
– variance
– quantile
– technically any statistic that can be calculated from sample data
From a design standpoint, such an experiment is very similar to a two-condition experiment
1. Choose a metric of interest $\theta$ which addresses the question you are trying to answer
2. Determine the response variable $y$ that must be measured on each unit in order to compute the estimate $\hat{\theta}$
3. Choose the design factor $x$ and the $m$ levels you will experiment with.
4. Choose $n_1, n_2, \ldots, n_m$ and assign units to conditions at random
5. Collect the data and estimate the metric of interest in each condition:
$\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_m$
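Step 4 of this list, random assignment with chosen group sizes, could be sketched as follows. The function name `assign_conditions`, the unit identifiers and the group sizes are hypothetical and chosen only for illustration.

```python
import random

def assign_conditions(unit_ids, group_sizes, seed=0):
    """Randomly assign units to conditions 1..m with the requested group sizes."""
    unit_ids = list(unit_ids)
    if sum(group_sizes) != len(unit_ids):
        raise ValueError("group sizes must sum to the number of units")
    rng = random.Random(seed)
    shuffled = rng.sample(unit_ids, len(unit_ids))   # random order, no repeats
    assignment, start = {}, 0
    for condition, n_j in enumerate(group_sizes, start=1):
        # The next n_j units in the shuffled order go to this condition.
        for unit in shuffled[start:start + n_j]:
            assignment[unit] = condition
        start += n_j
    return assignment

# Hypothetical experiment: 12 units, m = 3 conditions of 4 units each.
assignment = assign_conditions(range(12), [4, 4, 4], seed=42)
print(assignment)
```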
Determining which condition is optimal typically involves a series of pairwise comparisons
But it is useful to begin such an investigation with a gatekeeper test which serves to determine whether there is any difference between the $m$ experimental conditions. Formally, such a question is phrased as the following statistical hypothesis.
$$H_0\colon \theta_1 = \theta_2 = \cdots = \theta_m \quad \text{versus} \quad H_A\colon \theta_j \neq \theta_k \text{ for some } j \neq k \qquad (1)$$
3.2 Comparing Multiple Means with an F-test
We assume that our response variable follows a normal distribution and we assume that the mean of
the distribution depends on the condition in which the measurements were taken, and that the variance
is the same across all conditions.
The “gatekeeper” test for means is carried out using an F-test.
In particular, we use the F-test for overall significance in an appropriately defined linear regression model:
– The appropriately defined linear regression model in this situation is one in which the response variable depends on $m - 1$ indicator variables:
$$x_{ij} = \begin{cases} 1 & \text{if unit } i \text{ is in condition } j \\ 0 & \text{otherwise} \end{cases} \qquad \text{for } j = 1, 2, \ldots, m - 1.$$
– For a particular unit $i$, we adopt the model
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{m-1} x_{i,m-1} + \varepsilon_i$$
– In this model the $\beta$'s are unknown parameters and may be interpreted in the context of the following expectations:
$$E[Y_i \mid x_{i1} = x_{i2} = \cdots = x_{i,m-1} = 0] = \beta_0$$
$$E[Y_i \mid x_{ij} = 1] = \beta_0 + \beta_j$$
– Based on these assumptions, $H_0$ in (1) is true if and only if $\beta_1 = \beta_2 = \cdots = \beta_{m-1} = 0$. Thus testing (1) is equivalent to testing
$$H_0\colon \beta_1 = \beta_2 = \cdots = \beta_{m-1} = 0 \quad \text{vs.} \quad H_A\colon \beta_j \neq 0 \text{ for some } j$$
– This hypothesis corresponds, as noted, to the F-test for overall significance in the model.
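The equivalence follows from the two conditional expectations given for this model. Writing $\mu_j$ for the mean response in condition $j$ (a symbol assumed here for illustration), with condition $m$ acting as the reference condition in which every indicator is zero:

```latex
\begin{align*}
\mu_m &= E[Y_i \mid x_{i1} = \cdots = x_{i,m-1} = 0] = \beta_0, \\
\mu_j &= E[Y_i \mid x_{ij} = 1] = \beta_0 + \beta_j, \qquad j = 1, \ldots, m-1, \\
\beta_j &= \mu_j - \mu_m .
\end{align*}
```

Each $\beta_j$ is therefore the difference between the mean in condition $j$ and the mean in the reference condition, so all $m$ means are equal exactly when every $\beta_j$ is zero.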
In regression parlance, the test statistic is defined to be the ratio of the regression mean squares (MSR)
to the mean squared error (MSE) in a standard regression-based analysis of variance (ANOVA):
$$t = \frac{MSR}{MSE}$$
In our setting we can more intuitively think of the test statistic as comparing the response variability
between conditions to the response variability within conditions:
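One standard way to write this comparison uses the usual one-way ANOVA notation, which is assumed here rather than taken from the notes: $y_{ij}$ is the $i$th response in condition $j$, $\bar{y}_j$ is the sample mean in condition $j$, $\bar{y}$ is the overall sample mean, and $N = n_1 + n_2 + \cdots + n_m$ is the total number of units.

```latex
t = \frac{MSR}{MSE}
  = \frac{\displaystyle \sum_{j=1}^{m} n_j (\bar{y}_j - \bar{y})^2 \,/\, (m - 1)}
         {\displaystyle \sum_{j=1}^{m} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2 \,/\, (N - m)}
```

The numerator grows when the condition means are far apart, and the denominator grows when responses are noisy within a condition, so large values of $t$ are evidence against $H_0$.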
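The null distribution for this test is $F(m-1, N-m)$, where $N = n_1 + n_2 + \cdots + n_m$ is the total number of units in the experiment.
The p-value for this test is calculated by
$$\text{p-value} = P(T \geq t)$$
where $T \sim F(m-1, N-m)$.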
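As a sketch of how the observed statistic and its degrees of freedom might be computed directly from the raw responses, the following uses only the Python standard library. The function name `f_statistic` and the data are hypothetical, and the sums of squares follow the usual between-condition and within-condition decomposition.

```python
from statistics import mean

def f_statistic(groups):
    """Return (t, df1, df2) for the one-way F-test, where t = MSR / MSE."""
    m = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(y for g in groups for y in g)
    group_means = [mean(g) for g in groups]
    # Between-condition variability (regression sum of squares).
    ssr = sum(len(g) * (gm - grand_mean) ** 2 for g, gm in zip(groups, group_means))
    # Within-condition variability (error sum of squares).
    sse = sum((y - gm) ** 2 for g, gm in zip(groups, group_means) for y in g)
    df1, df2 = m - 1, n_total - m
    return (ssr / df1) / (sse / df2), df1, df2

# Hypothetical responses for m = 3 conditions, made up for illustration.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
t, df1, df2 = f_statistic(groups)
print(t, df1, df2)
```

If SciPy happens to be available, the upper-tail probability P(T ≥ t) can then be obtained with `scipy.stats.f.sf(t, df1, df2)`.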
Example: Candy Crush Boosters
– Candy Crush is experimenting with three different versions of in-game “boosters”: the lollipop hammer, the jelly fish, and the color bomb.
Figure 2: Candy Crush Experiment
– Users are randomized to one of these three conditions (n1 = 121, n2 = 135, n3 = 117) and they receive (for free) 5 boosters corresponding to their condition. Interest lies in evaluating the effect of these different boosters on the length of time a user plays the game.
– Let μj represent the average length of game play (in minutes) associated with booster condition
j = 1, 2, 3. While interest lies in finding the condition associated with the longest average length
of game play, here we first rule out the possibility that booster type does not influence the length
of game play (i.e., μ1 = μ2 = μ3).
– In order to do this we fit the linear regression model
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$
where x1 and x2 are indicator variables indicating whether a particular value of the response was
observed in the jelly fish or color bomb conditions, respectively. The lollipop hammer is therefore
the reference condition.
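The indicator coding for this example might be sketched as below. The play times are made up for illustration and are not the experiment's data, and the helper names are hypothetical. The sketch relies on the fact that, in a model containing only an intercept and these indicators, the least-squares fitted values are the condition means, so the intercept estimate is the reference-condition mean and each slope estimate is a difference from it.

```python
from statistics import mean

REFERENCE = "lollipop hammer"
INDICATOR_LEVELS = ["jelly fish", "color bomb"]   # correspond to x1 and x2

def indicators(condition):
    """Return (x1, x2) for a unit in the given booster condition."""
    return tuple(1 if condition == level else 0 for level in INDICATOR_LEVELS)

def fitted_coefficients(minutes_by_condition):
    """Least-squares estimates [b0, b1, b2] for the indicator-only model."""
    b0 = mean(minutes_by_condition[REFERENCE])
    return [b0] + [mean(minutes_by_condition[lvl]) - b0 for lvl in INDICATOR_LEVELS]

# Hypothetical play times in minutes (not the experiment's data).
data = {
    "lollipop hammer": [4.0, 6.0, 5.0],
    "jelly fish": [7.0, 9.0, 8.0],
    "color bomb": [5.0, 7.0, 6.0],
}
print(indicators("lollipop hammer"), indicators("jelly fish"), indicators("color bomb"))
print(fitted_coefficients(data))
```

A unit in the lollipop hammer condition has both indicators equal to zero, which is what makes it the reference condition.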
Optional Exercises:
Calculations: 2, 7
Proofs: 1, 5, 6, 9, 10, 14, 17, 18
R Analysis: 2, 5, 6, 8, 13(g), 17 (not g,h), 22(h), 23(a-f)