EE6435 programming homework 5 (clustering)

In this homework, you will apply a bottom-up hierarchical clustering algorithm to cluster SARS-CoV-2 genomes.

The similarity between all pairs of sequences is provided to you as a matrix (SCOV2_96_matrix.txt). The first line lists the names of the sequences. All the other lines contain the pairwise similarities, in the same order as the names on the first line. See the toy example below:

s1 s2 s3 s4 s5
1.0 0.9 0.98 0.89 0.78
0.9 1.0 0.89 0.87 0.65
…

In this example, there are 5 sequences, from s1 to s5. The second line gives the similarity between s1 and each of the five sequences: s1 vs s1, s1 vs s2, s1 vs s3, s1 vs s4, and s1 vs s5. Similarly, the third line contains the pairwise similarities between s2 and all five sequences.
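The file layout described above can be read with a few lines of Python. This is only a minimal sketch: the function name `parse_similarity_matrix` is hypothetical, the values are assumed to be separated by whitespace, and the 3-sequence toy data is made up for illustration (it is not taken from SCOV2_96_matrix.txt).

```python
def parse_similarity_matrix(lines):
    """Parse sequence names (first line) and a square similarity matrix (other lines).

    Assumes whitespace-separated values; blank lines are skipped.
    """
    rows = [line.split() for line in lines if line.strip()]
    names = rows[0]
    matrix = [[float(value) for value in row] for row in rows[1:]]
    # Every sequence needs one row and one column.
    if len(matrix) != len(names) or any(len(row) != len(names) for row in matrix):
        raise ValueError("matrix must be square and match the number of names")
    return names, matrix


# Made-up 3-sequence example in the same layout as the toy example above.
toy_text = """s1 s2 s3
1.0 0.9 0.98
0.9 1.0 0.89
0.98 0.89 1.0
"""
names, sim = parse_similarity_matrix(toy_text.splitlines())
print(names)
print(sim[0][2])  # similarity between s1 and s3
```

With this layout, `sim[i][j]` is the similarity between `names[i]` and `names[j]`.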

It is hard to apply k-means directly because you would need to design your own method for generating the centroid sequence. Usually, the centroid sequence should be the consensus sequence, which would take extra programming to compute. So, we will apply bottom-up hierarchical clustering instead. The clustering algorithm stops when only one cluster remains. Then, using the resulting tree, output the clusterings with 2, 3, 4, and 5 clusters.
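To illustrate the idea on a small scale, here is a rough sketch of bottom-up clustering that records every merge and then recovers k clusters from that record. It is one possible approach, not a reference solution. It assumes the similarity between two clusters is the mean of all pairwise similarities between their members (the "average similarity" named in the requirements below), and the function names and the 4-sequence matrix are made up for illustration. Since the merges are recorded in order, replaying only the first n - k of them leaves exactly k clusters, which is one way to "cut" the final tree.

```python
def average_similarity(a, b, sim):
    """Mean pairwise similarity between the members of clusters a and b."""
    return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))


def build_merge_history(sim):
    """Merge the most similar pair of clusters until one cluster remains.

    Returns the merges in order as (members_a, members_b) pairs.
    """
    clusters = [[i] for i in range(len(sim))]
    history = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = average_similarity(clusters[x], clusters[y], sim)
                if best is None or s > best[0]:
                    best = (s, x, y)
        _, x, y = best
        history.append((clusters[x], clusters[y]))
        merged = clusters[x] + clusters[y]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
        clusters.append(merged)
    return history


def cut_into_k(n, history, k):
    """Replay the first n - k merges, which leaves exactly k clusters."""
    clusters = [[i] for i in range(n)]
    for a, b in history[: n - k]:
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return [sorted(c) for c in clusters]


# Made-up matrix: sequences 0 and 1 are similar, as are sequences 2 and 3.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
history = build_merge_history(sim)
print(cut_into_k(len(sim), history, 2))  # [[0, 1], [2, 3]]
```

The sketch recomputes every cluster-to-cluster average in each round, which is simple but not the fastest option. For a matrix of 96 sequences that cost should be acceptable.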

Specific requirements about the input and output.

1. The data can be found on Canvas.

2. Use Python only. Your program should be named <ID>.py. It should take one parameter, which is the full path and name of the input file. For example:

/documents/hw5/SCOV2_96_matrix.txt

/Documents/SCOV2_96_matrix.txt

etc.

3. Mark the start and end of the clustering implementation with comments in your code. Don’t call any existing APIs.

4. For each k = 2 to 5, plot a heatmap similar to the one on page 80. Order the sequences by cluster and visualize their similarities.

5. Use the “average similarity” between clusters.

6. Submit your code and a report in PDF format. The report should contain the following parts:

a. Instructions to run your program (similar to a README).

b. A description of how you generate different numbers of clusters from the final tree.

c. For each k = 2 to 5, the sequence IDs inside each cluster. For example, when k = 2, show the contents of the two clusters; when k = 3, show the contents of the three clusters.

d. The figures from item 4 for k = 2 to 5.

7. We will compare the similarity of submitted code. Copying others’ code or reports will lead to 0 for this homework or F for this course. To protect yourself and your friends, keep your code to yourself.
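For the heatmaps in item 4, the rows and columns of the similarity matrix have to be rearranged so that members of the same cluster sit next to each other. The sketch below shows one way to do that reordering. The function name `reorder_by_clusters` and the toy data are hypothetical, and the plotting step is left out. The reordered matrix could then be handed to a plotting call such as matplotlib's `imshow`, although the assignment does not name a plotting library.

```python
def reorder_by_clusters(names, sim, clusters):
    """Return names and matrix with rows and columns grouped by cluster.

    `clusters` is a list of lists of row indices into `sim`.
    """
    order = [i for cluster in clusters for i in cluster]
    ordered_names = [names[i] for i in order]
    ordered_sim = [[sim[i][j] for j in order] for i in order]
    return ordered_names, ordered_sim


# Made-up data: s1 and s3 form one cluster, s2 and s4 the other.
names = ["s1", "s2", "s3", "s4"]
sim = [
    [1.0, 0.2, 0.9, 0.1],
    [0.2, 1.0, 0.3, 0.8],
    [0.9, 0.3, 1.0, 0.2],
    [0.1, 0.8, 0.2, 1.0],
]
ordered_names, ordered_sim = reorder_by_clusters(names, sim, [[0, 2], [1, 3]])
print(ordered_names)  # ['s1', 's3', 's2', 's4']
```

After reordering, high similarities within a cluster appear as blocks along the diagonal of the matrix.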

Don’t hardcode the input file’s path, because it will make our testing very difficult. -10 for hardcoding the input file.
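One way to avoid a hardcoded path is to read it from the command line. The following is a minimal sketch only: the helper name `get_input_path` and the script name `12345678.py` are hypothetical, and the argument list is passed in explicitly so the function can be tried without running a script.

```python
import sys


def get_input_path(argv):
    """Return the input file path given as the single command-line parameter."""
    if len(argv) != 2:
        raise SystemExit(f"usage: python {argv[0]} <path to similarity matrix>")
    return argv[1]


# In the real program this would be: path = get_input_path(sys.argv)
demo_argv = ["12345678.py", "/documents/hw5/SCOV2_96_matrix.txt"]
path = get_input_path(demo_argv)
print(path)  # /documents/hw5/SCOV2_96_matrix.txt
```

Exiting with a usage message when the parameter is missing makes it obvious how the program is meant to be run.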
