
ITNPBD7 Spring 2020 - Resit/Deferred Assessment

DNA Sequence Analysis

Your task for this assignment is to use Hadoop and the MapReduce approach to find the average number of letters between pairs of DNA tags across a sample genome. You are provided with two files: one containing the sample genome and the second containing the set of tag pairs that you should search for. Examples of some of the data held in these two files are given below:

Sample Genome

CAGGAAAGACAATTCCAAAATCAGTTAGAGTCCTGTTGGCGCGTGTAATACATCTCCACTTTGAAAATGAAGACAGGGGGTTACGAGTGTTATTAATGAG

TGGGAATGTAAATTAGTCCAGCCACTCTGGAGAACCGTATGGAGGTTCCTCCAAAAATTACAAATAGAACTACCATATGATCCAGCAATCCCATGCTATG

AGATTTCCCTGAGAAAGTCATATTTAAGCTGCCATTTGAAGACCAAGGAATCATGACTAGAGACAAGAAGAGAGAACATAGAGTGATTATGGAGAATCTT

AGTATCAGTCCAGTCCTCAGTGACGGGACCCTAACTGACCTGCCCTTCTTTGGCTTAGATTGCTTAAATGGTTCTGGATGTGATGATGGTGCACCTTGCC

TATATTAGAGTAGAGTCTAAAGATTAGAATGATCCACAGGTTAATATGGGCCATTATAAAGAGATTAGTGATATTAACAATNTAGTATCAACATGGAGAT

TCTATTATTTCATTGGGGTTGCAAAATTGTGATTTTCTAATCATTTCACTTTTCCTATATTTATTGCCTGGAACTTTGTAAAGAAGAAATTGATCTTATT

Sample Start/End Tag Pairs

CAG,AGA

CCA,TGT

TGG,TCA

TGG,TCC

TGG,TCT

CCA,TGA

CCA,TGC

CCA,TGG

GTG,TGA

GAA,CAT

As an example of what you must do, consider the first two lines of the above data, which are individually sent to a mapper:

CAGGAAAGACAATTCCAAAATCAGTTAGAGTCCTGTTGGCGCGTGTAATACATCTCCACTTTGAAAATGAAGACAGGGGGTTACGAGTGTTATTAATGAG

TGGGAATGTAAATTAGTCCAGCCACTCTGGAGAACCGTATGGAGGTTCCTCCAAAAATTACAAATAGAACTACCATATGATCCAGCAATCCCATGCTATG

Your program should identify that the start and end tag pairs above are located at the following positions in the first line:

CAG...AGA: 0..6

CAG...AGA: 21..26

CCA...TGT: 14..33

CCA...TGT: 55..87

TGG...TCC: 36..54

TGG...TCT: 36..52

CCA...TGA: 14..61

CCA...TGG: 14..36

GTG...TGA: 42..61

GTG...TGA: 86..96

GAA...CAT: 3..50

with the number of letters between these tags (not including the tags themselves) being:

CAG...AGA 3 2

CCA...TGT 16 29

TGG...TCC 15

TGG...TCT 13

CCA...TGA 44

CCA...TGG 19

GTG...TGA 16 7

GAA...CAT 44

For the second line, the tag pairs are located at:

CAG...AGA: 18..30

TGG...TCC: 0..16

TGG...TCC: 27..46

TGG...TCT: 0..25

CCA...TGA: 17..77

CCA...TGC: 17..93

CCA...TGG: 17..27

GAA...CAT: 3..73

with the number of letters between tags of:

CAG...AGA 9

TGG...TCC 13 16

TGG...TCT 22

CCA...TGA 57

CCA...TGC 73

CCA...TGG 7

GAA...CAT 67
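
The positions and gaps listed above for the two sample lines can be reproduced with a short regular-expression scan. The sketch below is only illustrative: the class name TagGapFinder and the method findMatches are hypothetical and are not part of the provided DNASeqCount.java. It assumes that each start tag is paired with the nearest following end tag and that matches within a line do not overlap. That rule is consistent with the listed positions but is not stated explicitly in the task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the provided DNASeqCount.java.
class TagGapFinder {

    /**
     * Scans one genome line for a start/end tag pair.
     * Each result is {start tag position, end tag position, letters between the tags}.
     * Assumption: nearest following end tag, non-overlapping matches (reluctant regex).
     */
    static List<int[]> findMatches(String line, String startTag, String endTag) {
        Pattern pattern = Pattern.compile(
                Pattern.quote(startTag) + "(.*?)" + Pattern.quote(endTag));
        Matcher matcher = pattern.matcher(line);
        List<int[]> matches = new ArrayList<>();
        while (matcher.find()) {
            int startPos = matcher.start();                // first letter of the start tag
            int endPos = matcher.end() - endTag.length();  // first letter of the end tag
            int gap = matcher.group(1).length();           // letters strictly between the tags
            matches.add(new int[] {startPos, endPos, gap});
        }
        return matches;
    }

    public static void main(String[] args) {
        // First sample genome line from the assignment text.
        String line1 = "CAGGAAAGACAATTCCAAAATCAGTTAGAGTCCTGTTGGCGCGTGTAATACATCTCCACTTTGAAAATGAAGACAGGGGGTTACGAGTGTTATTAATGAG";
        for (int[] m : findMatches(line1, "CAG", "AGA")) {
            System.out.println("CAG...AGA: " + m[0] + ".." + m[1] + " gap " + m[2]);
        }
    }
}
```

Run on the first sample line with the pair CAG,AGA, this reports positions 0..6 and 21..26 with gaps of 3 and 2, matching the values listed above.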

For the two data lines shown at the start of this example, the average gap between tags would therefore be:

CAG...AGA 4.6666665

TGG...TCC 14.666667

TGG...TCT 17.5

CCA...TGA 50.5

CCA...TGT 22.5

CCA...TGC 73.0

CCA...TGG 13.0

GTG...TGA 11.5

GAA...CAT 55.5
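
The averages above follow from pooling the gaps of both lines for each tag pair and dividing once. As a rough, Hadoop-free sketch of that calculation, the code below accumulates a sum and a count per tag pair. The class TagGapAverager and its method are hypothetical names, and the nearest-end-tag, non-overlapping matching rule is an assumption. It uses float division, which is consistent with values such as 4.6666665 in the list above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the provided DNASeqCount.java.
class TagGapAverager {

    /**
     * Averages the gap between tags over all lines.
     * tagPairs holds entries in the tag file format, e.g. "CAG,AGA".
     * Pairs with no match are left out of the result.
     */
    static Map<String, Float> averageGaps(List<String> lines, List<String> tagPairs) {
        Map<String, Float> averages = new LinkedHashMap<>();
        for (String tagPair : tagPairs) {
            String[] tags = tagPair.split(",");
            Pattern pattern = Pattern.compile(
                    Pattern.quote(tags[0]) + "(.*?)" + Pattern.quote(tags[1]));
            int sum = 0;    // total letters between tags, over all lines
            int count = 0;  // number of matches, over all lines
            for (String line : lines) {
                Matcher matcher = pattern.matcher(line);
                while (matcher.find()) {
                    sum += matcher.group(1).length();
                    count++;
                }
            }
            if (count > 0) {
                averages.put(tags[0] + "..." + tags[1], (float) sum / count);
            }
        }
        return averages;
    }

    public static void main(String[] args) {
        // The two sample genome lines and the ten sample tag pairs from the assignment text.
        List<String> lines = Arrays.asList(
                "CAGGAAAGACAATTCCAAAATCAGTTAGAGTCCTGTTGGCGCGTGTAATACATCTCCACTTTGAAAATGAAGACAGGGGGTTACGAGTGTTATTAATGAG",
                "TGGGAATGTAAATTAGTCCAGCCACTCTGGAGAACCGTATGGAGGTTCCTCCAAAAATTACAAATAGAACTACCATATGATCCAGCAATCCCATGCTATG");
        List<String> tagPairs = Arrays.asList(
                "CAG,AGA", "CCA,TGT", "TGG,TCA", "TGG,TCC", "TGG,TCT",
                "CCA,TGA", "CCA,TGC", "CCA,TGG", "GTG,TGA", "GAA,CAT");
        for (Map.Entry<String, Float> entry : averageGaps(lines, tagPairs).entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

Keeping the sum and the count separate until the final division is also what makes the calculation easy to split across several mappers later.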

Your task is to write the Map/Reduce code in Java needed to process the above data in such a way that it produces the final output of the averages shown above but for the entire genome data rather than just the two sample lines shown. You will submit a written report, detailing your design and the results you found. You must also submit a Java file containing your code.

Step 1, HDFS – 20 Marks

Before you write any code, you will need to copy the data onto your own space in HDFS. In your report, give details of how HDFS stores data such as this (assume the file is much bigger than it really is for the purpose of your description). This section should be around half a page long, plus a diagram. Describe what HDFS is for, the architecture it uses, and the roles of different nodes in the cluster. Document the hdfs commands you used to create a directory for the data and place it there. Make sure everything you put here, including the diagram, is your own work. Do not copy anything from other sources.

Step 2, Design – 20 Marks

Now consider the Map/Reduce design you will implement. Compare and contrast producing a design with and without a Combiner and describe the role that the Combiner plays in improving the efficiency of your solution. You should also describe what keys and values the mapper will emit, the combiner will emit and what the final reducer will emit. You should consider how much data will be moved across the network in each of your two designs and how many different reducers will be used in each case.
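
One point worth illustrating for the Combiner design is that an average cannot simply be averaged again. A common approach, sketched below without any Hadoop classes, is for the combiner to pass on a partial (sum, count) pair per tag pair so that the reducer divides only once. The class and method names are hypothetical, and the grouping of gaps into two separate map outputs is an assumption made purely for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, Hadoop-free sketch of the (sum, count) idea; not the provided code.
class SumCountSketch {

    /** Combiner step: merges {sum, count} partials for one key into a single {sum, count}. */
    static long[] combine(List<long[]> partials) {
        long sum = 0;
        long count = 0;
        for (long[] partial : partials) {
            sum += partial[0];
            count += partial[1];
        }
        return new long[] {sum, count};
    }

    /** Reducer step: merges the partials from all map outputs, then divides once. */
    static float reduce(List<long[]> partials) {
        long[] total = combine(partials);
        return (float) total[0] / total[1];
    }

    public static void main(String[] args) {
        // CAG...AGA gaps from the worked example, each emitted by a mapper as {gap, 1}.
        long[] fromLine1 = combine(Arrays.asList(new long[] {3, 1}, new long[] {2, 1}));
        long[] fromLine2 = combine(Collections.singletonList(new long[] {9, 1}));

        // Correct: the reducer sees {5, 2} and {9, 1} and divides 14 by 3.
        float correct = reduce(Arrays.asList(fromLine1, fromLine2));

        // Incorrect: averaging the two per-line averages gives a different number.
        float averageOfAverages = ((float) fromLine1[0] / fromLine1[1]
                + (float) fromLine2[0] / fromLine2[1]) / 2;

        System.out.println("sum/count average:   " + correct);
        System.out.println("average of averages: " + averageOfAverages);
    }
}
```

With the CAG...AGA gaps from the worked example (3 and 2 from the first line, 9 from the second), merging (5, 2) and (9, 1) gives 14/3, whereas averaging the per-line averages 2.5 and 9 would give 5.75. In this sketch the combiner also shrinks what is passed on for a key from one value per match to one (sum, count) pair per map output.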

Step 3, Implement – 60 Marks

Once you have completed your designs, you should implement the design that uses a Combiner and show how it improves the performance of the overall solution. It is advisable to use the DNASeqCount.java file provided on the assignment page in Canvas as a starting point. A file called TestSeqCount has been provided that will use the code from DNASeqCount.java and run it on the mochadoop Hadoop simulator. You are advised to develop your solution with this first before finally running it on Hadoop. TestSeqCount uses the sample data and tag pairs shown above so you can use it to check that you are getting the final answers shown above.

The Hadoop run will use the full set of data and a larger set of tag pairs to produce a more detailed result, so do not expect the two alternatives to produce the same output (although you can test your Hadoop job with the smaller data files if you wish).

If you have problems accessing Hadoop remotely, you can run your code with Mochadoop only, on the dna-40.txt sample, which contains the first 40 lines of DNA sequences, and submit the results for this. However, your submission may be tested on a much larger data set on Hadoop, so you should be sure that it works.

Whether or not you use Hadoop, you should still provide the commands that would be needed to run your solution on the real Hadoop system.

Submission Details

Please write up your work in a report and submit it via Canvas, clearly noting your 7-digit student ID number on the front of your report, but do not provide your name. Additionally, please submit your DNASeqCount.java file via Canvas, and ensure that your code is very well commented and that you have put your 7-digit ID number at the top of your Java code in the commented area. Make sure your report also contains the results you got when you ran your code. The deadline for submission is Monday 22nd of June at 4pm.

Plagiarism

Work which is submitted for assessment must be your own work. All students should note that the University has a formal policy on academic misconduct which can be found here.

Plagiarism means presenting the work of others as though it were your own. The University takes a very serious view of plagiarism, and the penalties can be severe (ranging from a reduced grade in the assessment, through a fail for the module, to expulsion from the University for more serious or repeated offences). Specific guidance in relation to Computing Science assignments may be found in the Computing Science Student Handbook. We check submissions carefully for evidence of plagiarism, and pursue those cases we find.

Late submission

If you cannot meet the assignment hand-in deadline and have good cause, please see the module coordinator to explain your situation and ask for an extension. Coursework will be accepted up to seven days after the hand-in deadline (or expiry of any agreed extension) but the mark will be lowered by three marks per day or part thereof. After seven days the work will be deemed a non-submission and will receive an X.
