首页 >
> 详细

Stat 27850/30850: Problem set 1

1. Consider a multiple testing scenario where we are testing the global null. Suppose that you have n p-values

P1, . . . , Pn. A proportion π of these p-values are true signals, distributed as

P1, . . . , Pπn ∼ Unif[0, τ ].

The rest of them are nulls,

Pπ·n+1, . . . , Pn ∼ Unif[0, 1].

And all the p-values are independent. For simplicity assume τ ≥ α/n.

(a) Write an exact expression for the probability of rejecting the global null, by using the Bonferroni correction

at level α.

(b) Calculate the probability of rejecting the global null using Fisher’s test, using the normal approximation

to the χ

2 distribution (note that χ

2

m has expected value m and variance 2m, and the normal approximation

is very accurate for large m due to the central limit theorem). You should show that this probability is

(approximately) equal to Φ(Φ−1

(α) −√n · π log(τ )), where Φ is the CDF of a standard normal.

(c) Now let π =1

n1/3 and τ =12. Show that the probability of rejecting the global null tends to 1 for Fisher’s

test, but tends to α for Bonferroni’s (i.e. Bonferroni’s test is no better than random guessing. It’s trivial to

see that the probability of rejection is at least α, i.e., at least as good as random guessing, so you only need

to show an upper bound, i.e., the probability of rejection is upper bounded by a quantity that is ≈ α; this

will be easiest to prove using a union bound).

(d) Finally let π =1

n2/3 and τ =1n. Show that the probability of rejecting the global null tends to 1 for

Bonferroni’s test, but tends to α for Fisher’s (i.e. Fisher’s test is no better than random guessing).

2. Next let’s examine the difference between Bonferroni’s and Fisher’s methods numerically. We’ll choose α =

0.05, n = 230

, π = 2−i

, and τ = 2−j

, for i, j ∈ {1, . . . , 30}. We are not running simulations, just doing

calculations. For each value of i & j, based on your work in the previous problem, set

Bonferroni[i,j]

to be the exact probability that Bonferroni’s method rejects the global null (with this choice of n, π, τ ), and set

Fisher[i,j]

as the same calculation for Fisher (using the normal approximation as before). Now run this code to visualize

your results:

par(mfrow=c(1,2))

image(1:30,1:30,Bonferroni,xlab="-log_2(pi)",ylab="-log_2(tau)",

main="Bonferroni",col=gray((0:10)/10))

image(1:30,1:30,Fisher,xlab="-log_2(pi)",ylab="-log_2(tau)",

main="Fisher",col=gray((0:10)/10))

These types of figures are known as “phase transition diagrams” where we see a sharp transition from a high

chance of success to a high chance of failure. The grayscale corresponds to the chance of rejection: white

= 100% chance of rejecting the global null, and black = 0% chance. You should see that higher values of π

(i.e. lower i), and lower values of τ (i.e. higher j), improve the chances for both methods. However, the regions

of π, τ values where the methods are successful, are different for the two methods. Describe and explain what

you see.

3. Gene expression / COPD & statins. In class we saw a gene expression data set where for each patient, we

observe

• A label: whether the patient takes statins, or not

• Covariates: disease/healthy, age, sex

• Gene expression levels (log transformed) for each of n = 12381 genes (X[k,i] is the log-transformed

gene expression level for person k and gene i)

Copy and paste the following code to download and organize the data:

download.file(

"ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE71nnn/GSE71220/matrix/GSE71220_series_matrix.txt.gz",

"GSE71220_series_matrix.txt.gz")

gene_expr = t(read.table("GSE71220_series_matrix.txt.gz",skip=66,nrows=12381)[,-1])

nsample = dim(gene_expr)[1]; n = dim(gene_expr)[2]

statin = read.table("GSE71220_series_matrix.txt.gz",skip=26,nrows=1)[-1]

statin = (strsplit(toString(unlist(statin)),"_")[[1]][2+3*(0:616)] == "statin")

# which patients take statins

disease = read.table("GSE71220_series_matrix.txt.gz",skip=37,nrows=1)[-1]

disease = (unlist(strsplit(strsplit(toString(unlist(disease)),": ")[[1]],","))[2*(1:nsample)] == "COPD")

# patients with COPD disease or healthy patients

age = read.table("GSE71220_series_matrix.txt.gz",skip=38,nrows=1)[-1]

age = as.numeric(unlist(strsplit(strsplit(toString(unlist(age)),": ")[[1]],","))[2*(1:nsample)])

# age of patient

sex = read.table("GSE71220_series_matrix.txt.gz",skip=39,nrows=1)[-1]

sex = (unlist(strsplit(strsplit(toString(unlist(sex)),": ")[[1]],","))[2*(1:nsample)])

# sex of patient (M or F)

Let’s look only at patients with COPD disease, and only the first 200 genes to save computation time. We will

ignore the age & sex covariates for now. We want to see if there is a true association between statin use and gene

expression levels (overall, across the n genes—not necessarily making a claim for any specific gene). Let’s test

this with the statistic

where Xi

is the vector of gene expression levels for gene #i, and Y is the binary vector indicating if the patient

takes statins.

Next, we will try the following two variants of a permutation test, to compute a p-value based on T.

(a) For each i = 1, . . . , n — permute the vector Xi at random and computerperm.

(b) Permute the vector Y at random. Then for each i = 1, . . . , n,

For each scheme, run it with 500 permutations and compute a p-value for the observed value of T.

Which scheme is a more valid way of testing the global null hypothesis (i.e., testing if there is any association

between statin use and gene expression levels)? To explore this more, let’s run the following simulation. First,

sample the vector Y (i.e., statins or no statins) randomly. Then using this simulated Y vector, run procedure (a)

to produce a p-value via a permutation test, and do the same for procedure (b) (for each procedure, you can do

a smaller number of permutations, like 50 or 100, to save time).

Now repeat this experiment a large number of times (at least 300 to see a clear trend — be aware this may take

some time to run depending on how you implement it). Create a histogram of the p-values from scheme (a), and

another for scheme (b). If the permutation test is valid, you should see (approximately) a uniform distribution,

since we are simulating the procedure with a random vector Y .

Explain what you see, and which scheme is valid or invalid and why.

联系我们

- QQ：99515681
- 邮箱：99515681@qq.com
- 工作时间：8:00-23:00
- 微信：codinghelp

- 代写cs3014 Google Analytics Customer Rev 2020-01-21
- 代写cmpsc121 Structs代写留学生c/C++实验... 2020-01-21
- 代写mis6326 Data Management调试存储过程作业、数据库编 2020-01-21
- 代写msci 581作业、代做marketing Analytics作业、P 2020-01-20
- Software课程作业代做、代写java，C/C++程序设计作业、Pyth 2020-01-20
- Tcss 372作业代做、代写python，Java编程语言作业、代做c/C 2020-01-20
- Emergency Facilities作业代写、代写r编程设计作业、R课程 2020-01-18
- Cis 413/513作业代做、代写data Structures作业、Ja 2020-01-18
- 代写ia626留学生作业、Python程序设计作业调试、代做data课程作业 2020-01-18
- Mat00027i作业代写、Java程序语言作业调试、Mathematica 2020-01-17
- 代做kt Model作业、代写java，Python编程设计作业、代做c/C 2020-01-17
- Data Set课程作业代做、代写r程序语言作业、Ltcret留学生作业代做 2020-01-17
- 代写rstudio留学生作业、代做r编程设计作业、代写r课程设计作业代做数据 2020-01-17
- 代写cs2250 Delimiter Matching代做数据结... 2020-01-16
- 代写cs12b Edit Distance帮写java实验作业... 2020-01-16
- 代写mins325 Filereader And Filewriter代... 2020-01-16
- 代写cosi131 Tunnels帮写java实验作业 2020-01-16
- 代写inm312 Balancebit Software代写留学... 2020-01-16
- 代写cs61b Maze Solver代写java课程设计 2020-01-16
- Program留学生作业代做、C/C++编程语言作业代写、代做java，Py 2020-01-14