辅导Stat 27850、讲解Java，c/c++程序语言、辅导Python设计讲解Database|辅导R语言编程

Stat 27850/30850: Problem set 1
1. Consider a multiple testing scenario where we are testing the global null. Suppose that you have n p-values
P1, . . . , Pn. A proportion π of these p-values are true signals, distributed as
P1, . . . , Pπn ∼ Unif[0, τ ].
The rest of them are nulls,
Pπ·n+1, . . . , Pn ∼ Unif[0, 1].
And all the p-values are independent. For simplicity assume τ ≥ α/n.
(a) Write an exact expression for the probability of rejecting the global null, by using the Bonferroni correction
at level α.
(b) Calculate the probability of rejecting the global null using Fisher’s test, using the normal approximation
to the χ
2 distribution (note that χ
2
m has expected value m and variance 2m, and the normal approximation
is very accurate for large m due to the central limit theorem). You should show that this probability is
(approximately) equal to Φ(Φ−1
(α) −√n · π log(τ )), where Φ is the CDF of a standard normal.
(c) Now let π =1
n1/3 and τ =12. Show that the probability of rejecting the global null tends to 1 for Fisher’s
test, but tends to α for Bonferroni’s (i.e. Bonferroni’s test is no better than random guessing. It’s trivial to
see that the probability of rejection is at least α, i.e., at least as good as random guessing, so you only need
to show an upper bound, i.e., the probability of rejection is upper bounded by a quantity that is ≈ α; this
will be easiest to prove using a union bound).
(d) Finally let π =1
n2/3 and τ =1n. Show that the probability of rejecting the global null tends to 1 for
Bonferroni’s test, but tends to α for Fisher’s (i.e. Fisher’s test is no better than random guessing).
2. Next let’s examine the difference between Bonferroni’s and Fisher’s methods numerically. We’ll choose α =
0.05, n = 230
, π = 2−i
, and τ = 2−j
, for i, j ∈ {1, . . . , 30}. We are not running simulations, just doing
calculations. For each value of i & j, based on your work in the previous problem, set
Bonferroni[i,j]
to be the exact probability that Bonferroni’s method rejects the global null (with this choice of n, π, τ ), and set
Fisher[i,j]
as the same calculation for Fisher (using the normal approximation as before). Now run this code to visualize
your results:
par(mfrow=c(1,2))
image(1:30,1:30,Bonferroni,xlab="-log_2(pi)",ylab="-log_2(tau)",
main="Bonferroni",col=gray((0:10)/10))
image(1:30,1:30,Fisher,xlab="-log_2(pi)",ylab="-log_2(tau)",
main="Fisher",col=gray((0:10)/10))
These types of figures are known as “phase transition diagrams” where we see a sharp transition from a high
chance of success to a high chance of failure. The grayscale corresponds to the chance of rejection: white
= 100% chance of rejecting the global null, and black = 0% chance. You should see that higher values of π
(i.e. lower i), and lower values of τ (i.e. higher j), improve the chances for both methods. However, the regions
of π, τ values where the methods are successful, are different for the two methods. Describe and explain what
you see.
3. Gene expression / COPD & statins. In class we saw a gene expression data set where for each patient, we
observe
• A label: whether the patient takes statins, or not
• Covariates: disease/healthy, age, sex
• Gene expression levels (log transformed) for each of n = 12381 genes (X[k,i] is the log-transformed
gene expression level for person k and gene i)
Copy and paste the following code to download and organize the data:
download.file(
"ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE71nnn/GSE71220/matrix/GSE71220_series_matrix.txt.gz",
"GSE71220_series_matrix.txt.gz")
gene_expr = t(read.table("GSE71220_series_matrix.txt.gz",skip=66,nrows=12381)[,-1])
nsample = dim(gene_expr)[1]; n = dim(gene_expr)[2]
statin = read.table("GSE71220_series_matrix.txt.gz",skip=26,nrows=1)[-1]
statin = (strsplit(toString(unlist(statin)),"_")[[1]][2+3*(0:616)] == "statin")
# which patients take statins
disease = read.table("GSE71220_series_matrix.txt.gz",skip=37,nrows=1)[-1]
disease = (unlist(strsplit(strsplit(toString(unlist(disease)),": ")[[1]],","))[2*(1:nsample)] == "COPD")
# patients with COPD disease or healthy patients
age = read.table("GSE71220_series_matrix.txt.gz",skip=38,nrows=1)[-1]
age = as.numeric(unlist(strsplit(strsplit(toString(unlist(age)),": ")[[1]],","))[2*(1:nsample)])
# age of patient
sex = read.table("GSE71220_series_matrix.txt.gz",skip=39,nrows=1)[-1]
sex = (unlist(strsplit(strsplit(toString(unlist(sex)),": ")[[1]],","))[2*(1:nsample)])
# sex of patient (M or F)
Let’s look only at patients with COPD disease, and only the first 200 genes to save computation time. We will
ignore the age & sex covariates for now. We want to see if there is a true association between statin use and gene
expression levels (overall, across the n genes—not necessarily making a claim for any specific gene). Let’s test
this with the statistic
where Xi
is the vector of gene expression levels for gene #i, and Y is the binary vector indicating if the patient
takes statins.
Next, we will try the following two variants of a permutation test, to compute a p-value based on T.
(a) For each i = 1, . . . , n — permute the vector Xi at random and computerperm.
(b) Permute the vector Y at random. Then for each i = 1, . . . , n,
For each scheme, run it with 500 permutations and compute a p-value for the observed value of T.
Which scheme is a more valid way of testing the global null hypothesis (i.e., testing if there is any association
between statin use and gene expression levels)? To explore this more, let’s run the following simulation. First,
sample the vector Y (i.e., statins or no statins) randomly. Then using this simulated Y vector, run procedure (a)
to produce a p-value via a permutation test, and do the same for procedure (b) (for each procedure, you can do
a smaller number of permutations, like 50 or 100, to save time).
Now repeat this experiment a large number of times (at least 300 to see a clear trend — be aware this may take
some time to run depending on how you implement it). Create a histogram of the p-values from scheme (a), and
another for scheme (b). If the permutation test is valid, you should see (approximately) a uniform distribution,
since we are simulating the procedure with a random vector Y .
Explain what you see, and which scheme is valid or invalid and why.