
CS CM121/221

Final

Problem 1

Pseudoalignment

[30 points]

Consider the following three isoforms:

t1 = AGGATA

t2 = ATGATA

t3 = TTGATA

In this problem, assume we are only generating data from the forward strand, so you don’t need to consider the reverse strand. Why? Because I am nice, dammit.

(a) Draw a transcript de Bruijn graph with k = 3. You can pick colors if you’d like; otherwise, annotate the equivalence classes with labels 1, 2, or 3.

(b) Assume reads are of length 4 and a read can start at any position (given sufficient length, of course). What are the possible equivalence classes for possible reads from (a)? Further, generate a table with the equivalence class label, and the number of occurrences from possible reads of length 4.
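The table in (b) (and the analogous table in (d)) can be cross-checked mechanically. Below is a minimal Python sketch (the assignment does not fix a language, so Python is an assumption) that enumerates every possible read of a given length from the three transcripts and groups reads by equivalence class, using exact substring containment as a stand-in for pseudoalignment against the graph. It counts distinct reads per class; adjust if your table is meant to count read start positions instead.

```python
from collections import Counter
from itertools import chain

# The three isoforms from the problem (forward strand only).
transcripts = {"t1": "AGGATA", "t2": "ATGATA", "t3": "TTGATA"}

def reads_of_length(seq, read_len):
    """Every substring of length read_len, one per starting position."""
    return [seq[i:i + read_len] for i in range(len(seq) - read_len + 1)]

def equivalence_class_table(read_len):
    """Count, over all distinct possible reads, how many fall into each
    equivalence class (the set of transcripts compatible with the read).
    Exact substring containment stands in for pseudoalignment here."""
    all_reads = set(chain.from_iterable(
        reads_of_length(seq, read_len) for seq in transcripts.values()))
    table = Counter()
    for read in sorted(all_reads):
        members = frozenset(t for t, seq in transcripts.items() if read in seq)
        table[members] += 1
    return table

for members, count in equivalence_class_table(4).items():
    print(sorted(members), count)
```

Calling equivalence_class_table(5) gives the read-length-5 comparison asked about in (c).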

(c) Does your answer in part (b) qualitatively change if the reads are of length 5? Please explain.

(d) Draw the transcript de Bruijn graph with k = 4 and generate a table like you did in (b) for this graph.

(e) Do you think using a larger k = 4 helps disambiguate reads of length 4 in this case?

Problem 2

Expectation-Maximization Algorithm

[30 points]

Let there be two isoforms, tA and tB, with known sequences sampled by RNA-seq and unknown abundances θA and θB. Isoform tA has effective length l and isoform tB has effective length 2l.

And now, the problem:

(a) Consider three reads, (r1, r2, r3). r1 is compatible with both tA and tB, r2 is compatible only with tA, and r3 is compatible only with tB. Using the standard RSEM algorithm described in class, derive one iteration of the EM algorithm updates, assuming you start with abundances θA = θB = 1/2.

(b) Implement the EM algorithm with the following input specification:

1. a compatibility matrix with each row being a read, and each column being an isoform.

2. effective lengths of the isoforms

3. a starting proportion for each transcript.

4. an integer for number of iterations

Output: relative abundances (θ).
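Below is a minimal sketch of an EM implementation matching this input/output specification, assuming Python with NumPy. The E-step and M-step follow one common parameterization of the RSEM/kallisto-style updates (responsibilities proportional to θ_t / ℓ_t over compatible isoforms, then θ re-estimated as the mean responsibility); double-check against the exact updates from lecture.

```python
import numpy as np

def em_abundances(compat, eff_lens, theta0, n_iters):
    """EM for relative transcript abundances.

    compat   : (n_reads, n_isoforms) 0/1 compatibility matrix
    eff_lens : (n_isoforms,) effective lengths
    theta0   : (n_isoforms,) starting proportions (sums to 1)
    n_iters  : number of EM iterations
    Returns the relative abundances theta after n_iters iterations.
    """
    compat = np.asarray(compat, dtype=float)
    eff_lens = np.asarray(eff_lens, dtype=float)
    theta = np.asarray(theta0, dtype=float)

    for _ in range(n_iters):
        # E-step: responsibility of isoform t for read r is proportional
        # to compat[r, t] * theta[t] / eff_lens[t].
        weights = compat * (theta / eff_lens)
        resp = weights / weights.sum(axis=1, keepdims=True)
        # M-step: theta is the average fractional assignment over reads.
        theta = resp.sum(axis=0) / resp.shape[0]
    return theta
```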

(c) Run the EM with the configuration from (a) for 100 iterations where l = 10.

(d) Run the EM with the configuration from (a) for 100 iterations, but with effective lengths equal to each other (both are 10).
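For (c) and (d), a usage sketch assuming the em_abundances function above is in scope (the printed numbers are whatever the EM converges to and are not quoted here):

```python
import numpy as np

# r1 compatible with both, r2 only with tA, r3 only with tB.
compat = np.array([[1, 1],
                   [1, 0],
                   [0, 1]])
theta0 = np.array([0.5, 0.5])

# (c): effective lengths l = 10 for tA and 2l = 20 for tB.
print(em_abundances(compat, np.array([10.0, 20.0]), theta0, 100))

# (d): both effective lengths equal to 10.
print(em_abundances(compat, np.array([10.0, 10.0]), theta0, 100))
```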

(e) Interpret the results from (c) and (d).

Problem 3

Non-linear dimension reduction

[50 points] There is some data on BruinLearn in final/tsne.tsv that you need to load for this problem.

You don’t need to know the generating process for the data, but in case it is helpful, here it is. Rows 1 to 100 correspond to the first ‘cluster’ and rows 101 to 200 correspond to the second ‘cluster’. The data is generated by:

x_i ∼ Normal_2(µ_{k(i)}, I_2),

where k(i) = 1 for i ∈ {1, 2, . . . , 100} and k(i) = 2 for i ∈ {101, 102, . . . , 200}, with µ_1 = (0, 0) and µ_2 = (10, 10).

You are going to implement some components of t-SNE to get some intuition about how different components of the algorithm work. Please add a screenshot of your code, or a notebook printout anytime you are asked for an implementation.

Finally, everything you need to know is in the non-linear dimension reduction slides.

(a) (Page 33) Implement the p_{j|i} matrix. Do yourself a favor and make it a function because you’re going to use it quite a bit. You can make your function take a single shared σ_i² = σ². Sanity check: each row should sum to 1. Side note: if you decide to do the extra credit (see (j)), you should allow your algorithm to utilize different σ_i² values, otherwise you’re gonna have a bad time. For completeness, the equation:

p_{j|i} = exp(−‖x_i − x_j‖² / (2σ_i²)) / Σ_{k ≠ i} exp(−‖x_i − x_k‖² / (2σ_i²)),  with p_{i|i} = 0.
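A minimal NumPy sketch of one way to implement p_{j|i}. It is written for a single shared σ², but it also accepts a length-N vector of σ_i², which is what the extra credit needs.

```python
import numpy as np

def p_conditional(X, sigma2):
    """Row-stochastic matrix P with P[i, j] = p_{j|i}.

    X      : (N, d) data matrix
    sigma2 : a single shared variance, or a length-N vector of
             per-point variances sigma_i^2 (applied row-wise)
    """
    # Pairwise squared Euclidean distances ||x_i - x_j||^2.
    sq_norms = (X ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Unnormalized affinities; a point is never its own neighbor.
    aff = np.exp(-d2 / (2.0 * np.asarray(sigma2).reshape(-1, 1)))
    np.fill_diagonal(aff, 0.0)
    # Normalize each row so it sums to 1 (the sanity check in (a)).
    return aff / aff.sum(axis=1, keepdims=True)
```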

(b) (Page 33) Implement the p_{ij} matrix. Sanity check: the entire matrix should sum to 1. Again, the equation:

p_{ij} = (p_{j|i} + p_{i|j}) / (2N),

where N is the number of samples (200 in this case).
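The symmetrized matrix is then a short follow-up to the previous sketch (assumes p_conditional from (a) is in scope):

```python
def p_joint(X, sigma2):
    """Symmetrized joint probabilities p_ij = (p_{j|i} + p_{i|j}) / (2N).
    The whole matrix sums to 1 (the sanity check in (b))."""
    P = p_conditional(X, sigma2)
    return (P + P.T) / (2.0 * P.shape[0])
```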

(c) Using σ² = 1, plot the entire dataset and color the points based on their probability relative to the first data point. To be rigorous: p_{1j} is the vector of probabilities of “j picking 1 as its neighbor”. A reasonable color scale might be: w_j ∝ p_{1j} / max_k(p_{1k}). Your plot should show a change of color away from the first data point. Do the same for σ² ∈ {0.1, 10, 100}. Each plot might look something like the example figure in the original handout (not reproduced here).
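A plotting sketch for one value of σ², assuming matplotlib and the p_conditional function from (a). The file path follows the problem statement, but check the actual header/delimiter of tsne.tsv before loading.

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.loadtxt("final/tsne.tsv", delimiter="\t")  # adjust if the file has a header

sigma2 = 1.0
P = p_conditional(X, sigma2)      # from the sketch in (a)
w = P[0] / P[0].max()             # color relative to the first data point

plt.scatter(X[:, 0], X[:, 1], c=w, cmap="viridis")
plt.colorbar(label="p_1j / max_k p_1k")
plt.title(f"sigma^2 = {sigma2}")
plt.show()
```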

(d) (Page 35) Implement the q_{ij} matrix. Sanity check: the entire matrix should sum to 1. The equation:

q_{ij} = (1 + ‖y_i − y_j‖²)^{−1} / Σ_{k ≠ l} (1 + ‖y_k − y_l‖²)^{−1},  with q_{ii} = 0.
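A matching NumPy sketch for q_{ij} (Student-t kernel with one degree of freedom, normalized so the whole matrix sums to 1):

```python
import numpy as np

def q_joint(Y):
    """q_ij = (1 + ||y_i - y_j||^2)^(-1), normalized over all i != j."""
    sq_norms = (Y ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * Y @ Y.T
    num = 1.0 / (1.0 + d2)
    np.fill_diagonal(num, 0.0)
    return num / num.sum()
```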

(e) Using y_i = x_i (the original data is projected using the identity), plot the entire dataset and color the points based on their q_{1j} probability relative to the first data point. How is it different from the p_{1j} plot from (c) when σ² = 1?

(f) (Page 25) Implement the KL-divergence. Note, the contribution of {i, j} is zero when p_{ij} = 0. Please use log-sum-exp and logs where appropriate.
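A sketch of the KL-divergence that skips terms with p_{ij} = 0, as instructed, and works in logs (the small epsilon guarding log q_{ij} is an implementation choice):

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) = sum_{ij} p_ij * (log p_ij - log q_ij),
    where entries with p_ij = 0 contribute nothing."""
    mask = P > 0
    return float(np.sum(P[mask] * (np.log(P[mask]) - np.log(Q[mask] + eps))))
```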

(g) Using the real data as the low-dimensional projection, compute the KL-divergence when:

i. σ² = 0.1.

ii. σ² = 1.

iii. σ² = 100.

Any thoughts on what might be happening?
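A usage sketch for (g), assuming the data matrix X and the p_joint, q_joint, and kl_divergence functions above are in scope, and using the data itself as the low-dimensional embedding:

```python
Q = q_joint(X)  # y_i = x_i: the data is its own projection

for sigma2 in (0.1, 1.0, 100.0):
    P = p_joint(X, sigma2)
    print(sigma2, kl_divergence(P, Q))
```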

(h) Summarize your thoughts on how these hyper-parameters matter.

(i) Extra credit (3 points): Using σ² = 1, can you find a projection that reduces the KL-divergence? Note, there are plenty of linear or non-linear ones. The easiest might be to ‘move’ one cluster. Plot the projection and report the KL-divergence.
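One cheap thing to try, following the hint, is translating one cluster in the projection. A sketch (the rows shifted and the shift amount are illustrative choices, and whether this actually lowers the KL-divergence is for you to verify):

```python
Y = X.copy()
Y[100:] += 20.0   # move the second cluster further away (illustrative shift)

P = p_joint(X, 1.0)
print(kl_divergence(P, q_joint(Y)))
```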

(j) Extra credit (5 points): Implement the Perplexity and recompute the KL-divergence for the previous projections you made, using the σ_i² you get for Perplexity ∈ {5, 25, 50, 100}. To implement the Perplexity, find the value of σ_i² that approximately satisfies the equation Perplexity(P_i) = 2^{H(P_i)}, where P_i is the i-th row of the p_{j|i} matrix and H(P_i) is the Shannon entropy. Plot a histogram of your σ_i² values. How did your results change?
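A common way to satisfy the Perplexity condition is a per-point binary search on σ_i². A sketch (search bounds and iteration count are arbitrary choices); the returned vector can be passed row-wise into the p_conditional function from (a):

```python
import numpy as np

def sigma2_for_perplexity(X, target_perp, n_steps=50):
    """Binary-search a per-point sigma_i^2 so that 2^H(P_i) ~ target_perp."""
    sq_norms = (X ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    N = X.shape[0]
    sigma2 = np.ones(N)
    for i in range(N):
        lo, hi = 1e-6, 1e4
        for _ in range(n_steps):
            s2 = 0.5 * (lo + hi)
            p = np.exp(-d2[i] / (2.0 * s2))
            p[i] = 0.0
            p /= p.sum()
            H = -np.sum(p[p > 0] * np.log2(p[p > 0]))   # Shannon entropy (bits)
            if 2.0 ** H > target_perp:
                hi = s2   # too spread out: shrink sigma_i^2
            else:
                lo = s2   # too concentrated: grow sigma_i^2
        sigma2[i] = 0.5 * (lo + hi)
    return sigma2
```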



