Biological Information Statistics

1. Construct the phylogenetic tree

There are 16 possible transitions from one nucleotide to another (counting the 4 cases where the nucleotide stays the same). They can be represented as a 4×4 transition matrix.

Please calculate the log-likelihood of each of the following tree topologies. Then choose the topology of the phylogenetic tree by maximum likelihood.

Suppose the sequences are as follows:

X0: AGA

X1: AGC

X2: ACT

X3: ACG
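The log-likelihood comparison can be sketched as below. This is only an illustration under stated assumptions: the transition matrix `P` (0.7 for no change, 0.1 for each change) and the two topologies `chain` and `star` are hypothetical, since neither the matrix values nor the candidate trees are given in the text. It also assumes every node of the tree is one of the observed sequences, so the log-likelihood is the sum of log transition probabilities over all edges and sites; the root term is left out because it is identical for trees sharing the same root.

```python
import math

BASES = "ACGT"

# Hypothetical transition matrix (NOT from the assignment): probability 0.7 of
# keeping a base and 0.1 of changing to each of the other three bases.
P = {a: {b: (0.7 if a == b else 0.1) for b in BASES} for a in BASES}

# Sequences from the question.
SEQS = {"X0": "AGA", "X1": "AGC", "X2": "ACT", "X3": "ACG"}


def tree_log_likelihood(edges, seqs=SEQS, p=P):
    """Sum log P[parent_base][child_base] over every edge and every site."""
    total = 0.0
    for parent, child in edges:
        for a, b in zip(seqs[parent], seqs[child]):
            total += math.log(p[a][b])
    return total


# Two hypothetical topologies, each written as a list of (parent, child) edges.
TOPOLOGIES = {
    "chain": [("X0", "X1"), ("X1", "X2"), ("X2", "X3")],
    "star": [("X0", "X1"), ("X0", "X2"), ("X0", "X3")],
}

scores = {name: tree_log_likelihood(edges) for name, edges in TOPOLOGIES.items()}
best = max(scores, key=scores.get)  # topology with the highest log-likelihood
print(scores, best)
```

To answer the question itself, replace `P` and `TOPOLOGIES` with the matrix and trees supplied in the assignment.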

2. Please download the gene expression table from iSpace. Referring to lab 2 and lab 3:

1). Cluster the cells by cell type. (Please store the cluster label for each cell.)

2). For the two clusters that contain the most cells (summarize the cell counts in a table), detect the differentially expressed genes between the two clusters. Remember to adjust for multiple testing. (Please store the p-value and fold change for each gene; use alpha = 0.05 and fold change > 2 as thresholds.)

3). For the detected differentially expressed genes, apply gene set enrichment analysis to find the most relevant biological function.

(Please indicate the method you used in each step and store the results as “.csv” files.)
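The filtering in step 2) can be sketched as follows. This is a minimal illustration, not the method from the labs: it uses the Benjamini-Hochberg procedure as one common choice of multiple-testing adjustment, the gene names and numbers are made up, and it assumes the fold change is a positive ratio so that "fold change > 2" is applied in both directions (above 2 or below 1/2).

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_degs(genes, pvalues, fold_changes, alpha=0.05, min_fc=2.0):
    """Keep genes with adjusted p < alpha and a fold change beyond min_fc.

    Assumption: fold changes are positive ratios, so a gene passes when the
    ratio is above min_fc or below 1 / min_fc.
    """
    padj = benjamini_hochberg(pvalues)
    return [
        g
        for g, q, fc in zip(genes, padj, fold_changes)
        if q < alpha and (fc > min_fc or fc < 1.0 / min_fc)
    ]


# Made-up example values, for illustration only.
genes = ["g1", "g2", "g3", "g4"]
pvalues = [0.001, 0.02, 0.03, 0.5]
fold_changes = [4.0, 0.3, 1.5, 3.0]

degs = select_degs(genes, pvalues, fold_changes)
print(degs)
```

The adjusted p-values and fold changes can then be written out with the standard `csv` module to satisfy the storage requirement.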

3. See the following description of detecting differentially expressed genes with a linear regression model.

Given the normalized gene expression values as follows:

There are two groups (T and C). We need to determine whether the expression differences between the two conditions for a given gene are greater than expected by chance. Here we use a simple linear regression model to fit the difference between the conditions. Calculate b0 and b1 for each gene.
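A minimal sketch of the per-gene fit, assuming the group is coded as x = 0 for C and x = 1 for T (the coding is an assumption; the text does not state it). Under that coding, least squares gives b0 equal to the mean of group C and b1 equal to the difference of the group means. The expression values below are made up for illustration.

```python
def fit_two_group(expr_c, expr_t):
    """Least-squares fit of y = b0 + b1 * x, with x = 0 for C and x = 1 for T."""
    xs = [0.0] * len(expr_c) + [1.0] * len(expr_t)
    ys = list(expr_c) + list(expr_t)
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx               # slope: mean(T) - mean(C) under 0/1 coding
    b0 = mean_y - b1 * mean_x    # intercept: mean(C) under 0/1 coding
    return b0, b1


# Made-up expression values for one gene, for illustration only.
b0, b1 = fit_two_group(expr_c=[1.0, 2.0, 3.0], expr_t=[4.0, 5.0, 6.0])
print(b0, b1)
```

Running the same function on every row of the expression table yields the b1 value for each gene.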

Suppose the distribution of b1 across all genes is as follows:

Which genes are differentially expressed at the 10% level of significance (alpha = 0.1)?

4. Network reconstruction. Given the gene expression table as follows:

1). Please calculate the mutual information between each gene pair (see mutinformation() in the R package “infotheo”).

2). Apply CLR (Context Likelihood of Relatedness) to each pair.

3). Plot the inferred network using CLR = 0.1 as the threshold.
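The three steps above can be sketched in Python as an alternative to the R workflow. Several things here are assumptions: the expression values are made up and already discretized, mutual information is the plain empirical estimate in nats, and the CLR score follows the usual formulation sqrt(z_i^2 + z_j^2) with z = max(0, (MI - mean) / sd), where the mean and standard deviation are taken over each gene's MI values with the other genes. Packaged implementations may differ in such details, and plotting is left out; the sketch stops at the edge list.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete vectors."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items()
    )


def clr_scores(mi):
    """CLR score for each gene pair from a symmetric MI table (dict of dicts)."""
    genes = list(mi)
    stats = {}
    for g in genes:
        vals = [mi[g][h] for h in genes if h != g]
        mean = sum(vals) / len(vals)
        sd = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        stats[g] = (mean, sd)

    def z(g, h):
        # z-score of MI(g, h) against gene g's background, clipped at zero.
        mean, sd = stats[g]
        return max(0.0, (mi[g][h] - mean) / sd) if sd > 0 else 0.0

    return {
        (g, h): math.sqrt(z(g, h) ** 2 + z(h, g) ** 2)
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    }


# Made-up, already discretized expression values (illustration only).
expr = {
    "g1": [0, 0, 1, 1, 0, 1, 0, 1],
    "g2": [0, 0, 1, 1, 0, 1, 1, 1],
    "g3": [1, 0, 1, 0, 1, 0, 1, 0],
}
mi = {g: {h: mutual_information(expr[g], expr[h]) for h in expr} for g in expr}
scores = clr_scores(mi)
edges = [pair for pair, s in scores.items() if s > 0.1]  # threshold from step 3)
print(edges)
```

The resulting `edges` list is what a graph library would take as input for the network plot.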

5. (Open question.) Below are the COSMIC cancer mutation signatures detected with NMF (https://cancer.sanger.ac.uk/signatures/signatures_v2/). Consider X = W*H, where W is the feature matrix and H is the coefficient matrix. Applying NMF to the mutation profile X decomposes it into W (the mutation signatures) and H (the contribution of each signature in each cancer type).

Example. Signature 1.

For each patient, we can perform DNA sequencing and mutation calling to obtain the number of occurrences of each mutation type. (For the mutation types, see the x-axis of the Signature 1 figure.) In other words, each patient has a 1×96 vector that captures the mutation state.
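To make the decomposition X = W*H concrete, here is a small sketch of NMF using the classic multiplicative update rules for the squared error. It is an illustration only and not the procedure COSMIC used: the count matrix is made up (4 mutation types by 3 patients instead of 96 mutation types), the rank `k = 2` is an arbitrary choice, and the patient vectors are assumed to be stacked as columns of X, so W holds one signature per column and H one patient per column.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def nmf(x, k, iters=200, seed=0, eps=1e-9):
    """Approximate a non-negative matrix x (n x m) by w (n x k) times h (k x m)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


def squared_error(x, w, h):
    """Sum of squared differences between x and the product w * h."""
    approx = matmul(w, h)
    return sum((a - b) ** 2 for ra, rb in zip(x, approx) for a, b in zip(ra, rb))


# Made-up mutation counts: rows are mutation types, columns are patients.
x = [[5, 1, 0], [4, 1, 1], [0, 2, 6], [1, 3, 5]]
w, h = nmf(x, k=2)
print(squared_error(x, w, h))
```

Each column of `w` plays the role of a signature over mutation types, and each column of `h` gives the contribution of those signatures in one patient.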