EE6435 Programming homework

EE6435 Programming homework on hidden Markov model

Python. Debug may take you time.

Full mark: 70 pts

Mark your calendar. No late work will be graded.

Implementation and application of the Viterbi algorithm

The base composition of most genome sequences is not homogeneous. In particular, GC composition can

vary regionally. In the human genome, for instance, there are "CpG islands" which show a very strong

statistical signal (even stronger at the dinucleotide composition level than the mononucleotide composition

level) and which tend to mark the 5' end of many genes. Therefore the problem of objectively segmenting

a genome sequence into regions of different compositions arises naturally. One way to segment a genome

is with HMM algorithms.

Let's assume that a genome can be modeled as a two-state HMM, with two states A and B. Assume that the

parameters of this HMM are as follows. State A has an AT-biased emission distribution {0.35, 0.15, 0.15,

0.35} for the probabilities of {A,C,G,T}. State B has a GC-biased emission distribution {0.15, 0.35, 0.35,

0.15}. State A switches to state B with probability 0.001. State B switches back to state A with probability

0.01. Assume that the initial distribution for the HMM is uniform; 0.5, 0.5 for the two states.

The format of the hidden Markov model can be found in example.hmm and example.hmm.readme.

The file example.fa contains a simulated 1 Mb genome sequence, generated by this HMM. Lower case

residues were generated by state A; upper case residues by state B. Note that *.fa means a format named

“fasta”. In fasta files, the first line is a header starting with “>”. This is the standard format for describing

sequences. The line starts with “>” contains the information of the following sequence and the Viterbi

parsing is applied to the sequence starting with the next line (not the line with >).

Viterbi parsing

Implement a Viterbi parser for the HMM above using Python, including the initialization, matrix fill, and

traceback stages. Your program should output the alignment as a series of segments using the following

format.:

1 153 state A (or state 1)

154 252 state B (or state 2)

253 1651 state A

(... etc...)

Each line is a continuous segment with the same state. The format of each line is: start-position

end-position state X

Note that the start position is 1-indexed (not 0-indexed)

Run your program on the 1 Mb simulated sequence in example.fa. How many segments of the genome

are in state B? Output this information as well. (The true state path is in the file example.positions.)

Name your program as .

Your program should run like this: your-student-ID.py

Don't hard code any input parameters because we will change them during testing. For example, my

testing file may be named as "test3.txt" or "inputdata.fa". But they always have the fasta format.

When we test your program, we will run additional input files besides the one provided on

Canvas. Below please find the detailed instructions about what to submit.

1. The source codes .

2. A readme.txt file: include the complete command line to run your program, tell us

the environment and version of Python you used.

3. In your source codes, clearly comment the part about reading files, table generation,

table initialization, table filling, and the traceback part.

More details about grading.

1. Can generate correct results on 3 input files (each 15 pts, total 45 pts). The three

input files have the same format as the example.fa. It is a fasta format file and

contain multiple lines or just two lines.

e.g.

>test1

AACCGGA

>test2

AACGG

AGCT

AAACGTA

2. The format (input, output, not hardcode any inputs) follows the instructions (15 pts,

no partial credit)

3. The output of your program on example.fa. How many segments of the genome are in

state B? Output this information as well. (10 pts)

联系我们

QQ：99515681
邮箱：99515681@qq.com
工作时间：8:00-21:00
微信：codinghelp

热点文章

辅导 comm2000 creating socia... 2026-01-08
讲解 isen1000 – introductio... 2026-01-08
讲解 cme213 radix sort讲解 c... 2026-01-08
辅导 csc370 database讲解迭代 2026-01-08
讲解 ca2401 a list of colleg... 2026-01-08
讲解 nfe2140 midi scale play... 2026-01-08
讲解 ca2401 the universal li... 2026-01-08
辅导 engg7302 advanced compu... 2026-01-08
辅导 comp331/557 – class te... 2026-01-08
讲解 soft2412 comp9412 exam辅... 2026-01-08
讲解 scenario # 1 honesty讲解... 2026-01-08
讲解 002499 accounting infor... 2026-01-08
讲解 comp9313 2021t3 project... 2026-01-08
讲解 stat1201 analysis of sc... 2026-01-08
辅导 stat5611: statistical m... 2026-01-08
辅导 mth2010-mth2015 - multi... 2026-01-08
辅导 eeet2387 switched mode ... 2026-01-08
讲解 an online payment servi... 2026-01-08
讲解 textfilter辅导 r语言 2026-01-08
讲解 rutgers ece 434 linux o... 2026-01-08

热点标签

msinm014/msing014/msing014b

联系我们 - QQ: 99515681 微信：codinghelp

程序辅导网！