
Coursework Description:

The aim of this assignment is to fit a non-linear time series model to a gene expression data set. Gene expression is one of the most important biological processes: information from a gene is used to synthesize a functional gene product, such as a protein. The expression of a gene can be controlled (or regulated) by one or several other genes through a gene product (protein) called a transcription factor. Understanding how genes regulate each other, i.e. gene regulation, is important for investigating complex diseases and how cells respond to environmental stimuli.

Data:

The ‘simulated’ gene expression time-series data for 5 genes are given in the CSV file (gene_data.csv). The first column contains the sampling time in minutes; the remaining 5 columns are the time-course expression data of the 5 genes $\{x_1, x_2, x_3, x_4, x_5\}$, respectively. All 5 genes are subject to additive noise (assumed independent and identically distributed (“i.i.d.”) Gaussian with zero mean) with unknown variance.

Task 1: Preliminary data analysis

You should first perform an initial exploratory data analysis, by investigating:

Time series plots

Distribution for each gene

Correlation and scatter plots (between pairs of genes) to examine their dependencies

Task 2: Dimension reduction

We would like to reduce the time dimension (for all 5 genes) to two using PCA. You can use either the eigen-decomposition method or the singular value decomposition method.
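As an illustration of the singular value decomposition route, here is a minimal Python/NumPy sketch (not the R code the coursework asks for). It assumes each gene is treated as one observation whose features are its values at the sampling times, so the data matrix has 5 rows. The function name `pca_svd` and the synthetic stand-in data are hypothetical; the real values would come from gene_data.csv.

```python
import numpy as np

def pca_svd(data, n_components=2):
    """Project the rows of `data` onto their leading principal components."""
    centred = data - data.mean(axis=0)                 # centre every column (time point)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]    # coordinates in PC space
    explained = s**2 / np.sum(s**2)                    # share of variance per component
    return scores, explained[:n_components]

# Synthetic stand-in data (assumption): 5 genes (rows) by 50 time points (columns).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 50)
genes = np.vstack([np.sin(t), np.cos(t), np.sin(t) + 0.5 * np.cos(t),
                   0.1 * t, np.sin(2.0 * t)])
genes = genes + 0.05 * rng.standard_normal(genes.shape)

scores, explained = pca_svd(genes)
print(scores.shape)   # (5, 2): one 2-D point per gene
print(explained)      # fraction of variance captured by PC1 and PC2
```

Each row of `scores` is one gene in the reduced space, so a scatter plot of the two columns with one colour per row gives the requested figure. The eigen-decomposition route should give the same coordinates up to the sign of each component.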

Plot the 5 genes in the reduced 2-dimensional space, using different markers or colours.

Task 3: Nonlinear regression – modelling gene regulation

We know that one of the genes, $x_3$, is regulated by two other genes, $x_4$ and $x_5$. However, we do not know whether this regulation is activation or repression, or whether the regulatory interaction is linear or nonlinear. Therefore, we will fit a generic nonlinear polynomial regression model (with 2 inputs) to the data, with the following exemplar structure:

$$x_3 = \alpha_0 + \beta_1 x_4 + \beta_2 x_4^2 + \beta_3 x_4^3 + \cdots + \gamma_1 x_5 + \gamma_2 x_5^2 + \gamma_3 x_5^3 + \cdots + \varepsilon$$

Here $\alpha_0$ is a bias term (denoting the basal transcription rate); $\{\beta_1, \beta_2, \beta_3, \cdots, \gamma_1, \gamma_2, \gamma_3, \cdots\}$ are the parameters of the regression model to be estimated; and $\varepsilon$ denotes additive, zero-mean Gaussian noise.

The main objective of this task is to identify the (polynomial) model structure, estimate model parameters from the training data, and use the identified model to predict the response/output signal.

To identify the nonlinear regression model structure and estimate its parameters:

• Identify the correct model structure using a model selection approach (e.g. subset selection, AIC/BIC, or exploring all possible model structures), so that the model gives a good mean square error (MSE) and the model residual/error is close to Gaussian. You can either:

i) Split the input and output dataset into two parts, one used to train the model and the other used for testing (e.g. 80% for training, 20% for testing). Apply the forward subset selection approach to select the best model structure iteratively: in each iteration, select the most significant term (the one that most reduces the MSE on the testing data) and add it to the current model.

ii) Or select the best model using the BIC or AIC goodness-of-fit criteria, by exploring all possible combinations of terms (or a chosen set of possible model structures).
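The forward subset selection in option i) might be sketched as follows, in Python/NumPy rather than the required R. The assumptions are loud ones: the two inputs are called `x4` and `x5`, the candidate terms are a bias plus powers 1 to 4 of each input, the 80/20 split is random, and the data are synthetic with a made-up ground truth. The helper names (`design_matrix`, `holdout_mse`, `forward_select`) are hypothetical.

```python
import numpy as np

def design_matrix(x4, x5, terms):
    """One column per term; ("x4", 2) means x4 squared, ("bias", 0) the constant."""
    cols = []
    for name, power in terms:
        if name == "bias":
            cols.append(np.ones_like(x4))
        else:
            cols.append((x4 if name == "x4" else x5) ** power)
    return np.column_stack(cols)

def holdout_mse(terms, train, test):
    """Least-squares fit on the training split, MSE evaluated on the test split."""
    (x4_tr, x5_tr, y_tr), (x4_te, x5_te, y_te) = train, test
    theta, *_ = np.linalg.lstsq(design_matrix(x4_tr, x5_tr, terms), y_tr, rcond=None)
    resid = y_te - design_matrix(x4_te, x5_te, terms) @ theta
    return float(np.mean(resid**2))

def forward_select(train, test, max_terms=3, max_order=4):
    """Greedily add the term that lowers the test MSE most; stop when none helps."""
    candidates = [("bias", 0)] + [(v, p) for v in ("x4", "x5")
                                  for p in range(1, max_order + 1)]
    chosen, best = [], np.inf
    while len(chosen) < max_terms:
        scored = [(holdout_mse(chosen + [c], train, test), c)
                  for c in candidates if c not in chosen]
        mse, term = min(scored)
        if mse >= best:          # no remaining candidate improves the test MSE
            break
        chosen.append(term)
        best = mse
    return chosen, best

# Synthetic data with a made-up ground truth (assumption, not the coursework data).
rng = np.random.default_rng(1)
n = 200
x4 = rng.uniform(0.0, 2.0, n)
x5 = rng.uniform(0.0, 2.0, n)
y = 0.5 + 1.5 * x4 - 0.8 * x5**2 + 0.1 * rng.standard_normal(n)

idx = rng.permutation(n)                 # random 80/20 split
tr, te = idx[:160], idx[160:]
train, test = (x4[tr], x5[tr], y[tr]), (x4[te], x5[te], y[te])

chosen, best = forward_select(train, test)
print(chosen, best)
```

Greedy selection is not guaranteed to find the best overall structure, which is one reason to compare its result with the exhaustive search of option ii).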

The underlying nonlinear polynomial model may contain a bias term, a linear term, and one or a few nonlinear (input) terms. The nonlinear terms can have a (maximum) nonlinearity up to 4th order, and the model will contain no more than 3 terms in total (including bias, linear and nonlinear terms).
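Under these constraints the space of candidate structures is small enough to enumerate. The Python/NumPy sketch below (again not the required R) assumes a candidate pool of a bias term plus powers 1 to 4 of each of two inputs, and scores every structure of up to 3 terms with the Gaussian-likelihood forms AIC = n ln(RSS/n) + 2k and BIC = n ln(RSS/n) + k ln(n), with constants dropped. The data and the names `column` and `information_criteria` are hypothetical.

```python
import itertools
import numpy as np

def column(x4, x5, term):
    """Regressor for one term; ("x5", 3) means x5 cubed, ("bias", 0) the constant."""
    name, power = term
    if name == "bias":
        return np.ones_like(x4)
    return (x4 if name == "x4" else x5) ** power

def information_criteria(x4, x5, y, terms):
    """Least-squares fit of one structure, returning (AIC, BIC) up to constants."""
    X = np.column_stack([column(x4, x5, t) for t in terms])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ theta) ** 2))
    n, k = len(y), len(terms)
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + k * np.log(n)
    return aic, bic

candidates = [("bias", 0)] + [(v, p) for v in ("x4", "x5") for p in range(1, 5)]
structures = [s for r in (1, 2, 3) for s in itertools.combinations(candidates, r)]
print(len(structures))   # 129 = 9 + 36 + 84

# Synthetic data with a made-up ground truth (assumption, not the coursework data).
rng = np.random.default_rng(2)
n = 200
x4 = rng.uniform(0.0, 2.0, n)
x5 = rng.uniform(0.0, 2.0, n)
y = 0.5 + 1.5 * x4 - 0.8 * x5**2 + 0.05 * rng.standard_normal(n)

best = min(structures, key=lambda s: information_criteria(x4, x5, y, s)[1])
print(best)              # structure with the lowest BIC
```

Replacing the index `[1]` with `[0]` in the `key` function ranks the structures by AIC instead.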

Estimate the model parameters using the least squares method. This step is embedded within the model structure identification process above, since for each candidate model structure you need to estimate its parameters in order to evaluate the model’s performance against the observed data.

Once the best model structure is selected and its parameters are estimated, estimate the parameter covariance matrix and plot the corresponding parameter uncertainty p.d.f. in 3D and/or as contours (similar to the example given in the lecture/lab notes). If the selected model has more than 2 parameters, plot all pair-wise combinations of parameters.
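One common way to obtain these quantities is sketched below in Python/NumPy (not the required R). It assumes the covariance estimate σ̂²(XᵀX)⁻¹ with σ̂² = RSS/(n − p); the lecture notes may use a different divisor. For each parameter pair it evaluates a bivariate Gaussian density on a grid, ready to be passed to a contour or surface plotting routine. The selected structure (bias, x4, x5²), the data, and the function names are hypothetical.

```python
import numpy as np

def fit_with_covariance(X, y):
    """Least-squares estimate, its covariance matrix and the noise variance."""
    n, p = X.shape
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ theta
    sigma2 = float(resid @ resid) / (n - p)          # assumed divisor: n - p
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return theta, cov, sigma2

def pair_density(theta, cov, i, j, half_width=3.0, points=61):
    """Bivariate Gaussian p.d.f. of parameters i and j on a +/- 3 s.d. grid."""
    mean = theta[[i, j]]
    sub = cov[np.ix_([i, j], [i, j])]                # 2x2 marginal covariance
    sd = np.sqrt(np.diag(sub))
    a = np.linspace(mean[0] - half_width * sd[0], mean[0] + half_width * sd[0], points)
    b = np.linspace(mean[1] - half_width * sd[1], mean[1] + half_width * sd[1], points)
    A, B = np.meshgrid(a, b)
    d = np.stack([A - mean[0], B - mean[1]], axis=-1)
    quad = np.einsum("...i,ij,...j->...", d, np.linalg.inv(sub), d)
    pdf = np.exp(-0.5 * quad) / (2.0 * np.pi * np.sqrt(np.linalg.det(sub)))
    return A, B, pdf

# Synthetic data for a hypothetical selected structure: bias, x4, x5^2.
rng = np.random.default_rng(3)
n = 200
x4 = rng.uniform(0.0, 2.0, n)
x5 = rng.uniform(0.0, 2.0, n)
X = np.column_stack([np.ones(n), x4, x5**2])
y = X @ np.array([0.5, 1.5, -0.8]) + 0.1 * rng.standard_normal(n)

theta, cov, sigma2 = fit_with_covariance(X, y)
A, B, pdf = pair_density(theta, cov, 0, 1)           # bias versus the x4 coefficient
print(cov)
```

Calling `pair_density` for the index pairs (0, 1), (0, 2) and (1, 2) covers all pair-wise combinations of a three-parameter model.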

Compute the model’s output/prediction (on the training data), compute the 95% confidence intervals, and plot them (as error bars) together with the mean values of the model prediction.
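A minimal Python/NumPy sketch of this computation (not the required R) follows. It assumes the variance of the mean prediction at each input row x is xᵀ Cov(θ̂) x and uses the normal quantile 1.96 for the 95% level; a t-quantile would be slightly wider for small samples. The model structure, the data, and the function name `prediction_band` are hypothetical.

```python
import numpy as np

def prediction_band(X, theta, cov, z=1.96):
    """Mean prediction and an approximate 95% confidence band for each row of X."""
    y_hat = X @ theta
    var = np.einsum("ij,jk,ik->i", X, cov, X)   # x_i^T Cov x_i for every row i
    half = z * np.sqrt(var)
    return y_hat, y_hat - half, y_hat + half

# Synthetic training data for a hypothetical structure: bias, x4, x5^2.
rng = np.random.default_rng(5)
n = 100
x4 = rng.uniform(0.0, 2.0, n)
x5 = rng.uniform(0.0, 2.0, n)
X = np.column_stack([np.ones(n), x4, x5**2])
y = X @ np.array([0.5, 1.5, -0.8]) + 0.1 * rng.standard_normal(n)

# Least-squares fit and parameter covariance (assumed divisor: n - p).
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ theta
cov = (resid @ resid) / (n - X.shape[1]) * np.linalg.inv(X.T @ X)

y_hat, lower, upper = prediction_band(X, theta, cov)
print(y_hat[:3], lower[:3], upper[:3])
```

The half-widths `upper - y_hat` are the error-bar lengths to draw around the mean prediction.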

Validate the model using a train-test split validation approach (the split proportions may differ from those used at the subset model selection stage) to check whether the identified model provides good predictions on the testing dataset.

Use the “Approximate Bayesian Computation (ABC)” method to compute the posterior distribution of the regression model parameters (using rejection ABC and assuming a Uniform prior). Plot the marginal posterior distribution for each parameter, and the joint posterior distribution for all pair-wise combinations of parameters.
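Rejection ABC can be sketched as follows in Python/NumPy (not the required R). Several choices here are assumptions rather than part of the brief: the Uniform prior is a box of ±1 around the least-squares estimate, the noise standard deviation is fixed at its residual estimate, the distance is the root mean square difference between simulated and observed outputs, and the tolerance is set so that the closest 1% of draws are accepted. The model structure, the data, and the name `rejection_abc` are hypothetical.

```python
import numpy as np

def rejection_abc(X, y, lower, upper, sigma, n_draws=20000, keep=0.01, seed=0):
    """Draw parameters from a uniform prior and keep those whose simulated data lie closest to y."""
    rng = np.random.default_rng(seed)
    draws = rng.uniform(lower, upper, size=(n_draws, len(lower)))        # prior samples
    sims = draws @ X.T + sigma * rng.standard_normal((n_draws, len(y)))  # simulated outputs
    dist = np.sqrt(np.mean((sims - y) ** 2, axis=1))                     # distance to data
    eps = np.quantile(dist, keep)                                        # tolerance
    return draws[dist <= eps], eps

# Synthetic data for a hypothetical structure: bias, x4, x5^2.
rng = np.random.default_rng(4)
n = 100
x4 = rng.uniform(0.0, 2.0, n)
x5 = rng.uniform(0.0, 2.0, n)
X = np.column_stack([np.ones(n), x4, x5**2])
true_theta = np.array([0.5, 1.5, -0.8])
y = X @ true_theta + 0.1 * rng.standard_normal(n)

theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma = float(np.std(y - X @ theta_ls, ddof=X.shape[1]))

accepted, eps = rejection_abc(X, y, theta_ls - 1.0, theta_ls + 1.0, sigma)

# Marginal posterior of the first parameter and joint posterior of the first two.
marginal_counts, marginal_edges = np.histogram(accepted[:, 0], bins=20)
joint_counts, edges_a, edges_b = np.histogram2d(accepted[:, 0], accepted[:, 1], bins=20)
print(len(accepted), eps, accepted.mean(axis=0))
```

The histograms are the raw material for the marginal and pair-wise joint posterior plots. A smaller `keep` gives a sharper approximation at the cost of fewer accepted samples.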

Marking Scheme

This coursework is worth 15 credits (100%). It will be marked according to the following scheme:

15% will be given for performing an initial data analysis (histogram plots, simple input-output correlation measures, time series plots, fitting a linear model, ...). If you create any R code, you must include it in the report.

10% will be given for performing dimension reduction using PCA and plotting the result.

25% will be given for writing the R code that selects the correct model structure, estimates the model’s parameters, and uses these estimates to calculate new predictions.

20% will be given for estimating the parameter estimation uncertainties (covariance matrix, plot of the corresponding parameter estimate distributions) and the model’s prediction confidence intervals (on the training input data). Again, if you create any R code, you must include it in the report.

5% will be given for performing model validation and analysing the performance of the identified nonlinear model.

5% will be given for performing Approximate Bayesian Computation to compute the (approximate) posterior distribution of the regression model parameters.

10% will be given for appropriate discussion and interpretation of the results you obtained.

10% will be given for writing the report (around 3000-4000 words) in a structured, readable form and submitting the executable R scripts. The report should be divided into sections with appropriate headings, including an introduction and a conclusion.
