 
ITNPBD7 Spring 2020 - Resit/Deferred Assessment 
 
DNA Sequence Analysis 
Your task for this assignment is to use Hadoop and the MapReduce approach to find the average number of 
letters between pairs of DNA tags across a sample genome. You are provided with two files - one containing 
the sample genome and the second containing the set of tag pairs that you should search for. Examples of 
some of the data held in these two files are given below: 
Sample Genome 
CAGGAAAGACAATTCCAAAATCAGTTAGAGTCCTGTTGGCGCGTGTAATACATCTCCACTTTGAAAATGAAGACAGGGGGTTACGAGTGTTATTAATGAG
TGGGAATGTAAATTAGTCCAGCCACTCTGGAGAACCGTATGGAGGTTCCTCCAAAAATTACAAATAGAACTACCATATGATCCAGCAATCCCATGCTATG
AGATTTCCCTGAGAAAGTCATATTTAAGCTGCCATTTGAAGACCAAGGAATCATGACTAGAGACAAGAAGAGAGAACATAGAGTGATTATGGAGAATCTT
AGTATCAGTCCAGTCCTCAGTGACGGGACCCTAACTGACCTGCCCTTCTTTGGCTTAGATTGCTTAAATGGTTCTGGATGTGATGATGGTGCACCTTGCC
TATATTAGAGTAGAGTCTAAAGATTAGAATGATCCACAGGTTAATATGGGCCATTATAAAGAGATTAGTGATATTAACAATNTAGTATCAACATGGAGAT
TCTATTATTTCATTGGGGTTGCAAAATTGTGATTTTCTAATCATTTCACTTTTCCTATATTTATTGCCTGGAACTTTGTAAAGAAGAAATTGATCTTATT
Sample Start/End Tag Pairs 
CAG,AGA 
CCA,TGT 
TGG,TCA 
TGG,TCC 
TGG,TCT 
CCA,TGA 
CCA,TGC 
CCA,TGG 
GTG,TGA 
GAA,CAT 
As an example of what you must do, consider the first two lines of the above data which are individually sent 
to a mapper: 
CAGGAAAGACAATTCCAAAATCAGTTAGAGTCCTGTTGGCGCGTGTAATACATCTCCACTTTGAAAATGAAGACAGGGGGTTACGAGTGTTATTAATGAG
TGGGAATGTAAATTAGTCCAGCCACTCTGGAGAACCGTATGGAGGTTCCTCCAAAAATTACAAATAGAACTACCATATGATCCAGCAATCCCATGCTATG
 
Your program should identify that the start and end tag pairs above are located at the following positions in 
the first line (each position is the zero-based index of the first letter of the tag): 
CAG...AGA: 0..6 
CAG...AGA: 21..26 
CCA...TGT: 14..33 
CCA...TGT: 55..87 
TGG...TCC: 36..54 
TGG...TCT: 36..52 
CCA...TGA: 14..61 
CCA...TGG: 14..36 
GTG...TGA: 42..61 
GTG...TGA: 86..96 
GAA...CAT: 3..50 
 
with the number of letters between these tags (not including the tags themselves) being: 
CAG...AGA 3 2 
CCA...TGT 16 29 
TGG...TCC 15 
TGG...TCT 13 
CCA...TGA 44 
CCA...TGG 19 
GTG...TGA 16 7 
GAA...CAT 44 
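
The per-line matching above can be sketched in plain Java, without any Hadoop classes. This is only an illustration under one assumption: after a match, scanning resumes just past the end tag. That rule is consistent with the sample positions listed above, but the brief does not state it. The class and method names (TagGapFinder, findGaps) are hypothetical and are not part of DNASeqCount.java.

```java
import java.util.ArrayList;
import java.util.List;

class TagGapFinder {

    /**
     * Returns the number of letters between each start/end tag match in one line.
     * Assumption: once an end tag is matched, the search for the next start tag
     * resumes immediately after that end tag.
     */
    static List<Integer> findGaps(String line, String startTag, String endTag) {
        List<Integer> gaps = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = line.indexOf(startTag, from);
            if (start < 0) {
                break; // no further start tag on this line
            }
            int afterStart = start + startTag.length();
            int end = line.indexOf(endTag, afterStart);
            if (end < 0) {
                break; // no end tag follows, so no later start tag can match either
            }
            gaps.add(end - afterStart); // letters strictly between the two tags
            from = end + endTag.length();
        }
        return gaps;
    }

    public static void main(String[] args) {
        String line1 = "CAGGAAAGACAATTCCAAAATCAGTTAGAGTCCTGTTGGCGCGTGTAATACATCTCCACTTT"
                + "GAAAATGAAGACAGGGGGTTACGAGTGTTATTAATGAG";
        System.out.println(findGaps(line1, "CAG", "AGA")); // [3, 2]
        System.out.println(findGaps(line1, "GTG", "TGA")); // [16, 7]
    }
}
```

In a real job the same loop would sit inside the mapper and run once per tag pair for each input line.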
 
For the second line, the tag pairs are located at: 
CAG...AGA: 18..30 
TGG...TCC: 0..16 
TGG...TCC: 27..46 
TGG...TCT: 0..25 
CCA...TGA: 17..77 
CCA...TGC: 17..93 
CCA...TGG: 17..27 
GAA...CAT: 3..73 
 
with the number of letters between tags of: 
CAG...AGA 9 
TGG...TCC 13 16 
TGG...TCT 22 
CCA...TGA 57 
CCA...TGC 73 
CCA...TGG 7 
GAA...CAT 67 
 
For the two data lines shown at the start of this example, the average gap between tags would therefore be: 
CAG...AGA 4.6666665 
TGG...TCC 14.666667 
TGG...TCT 17.5 
CCA...TGA 50.5 
CCA...TGT 22.5 
CCA...TGC 73.0 
CCA...TGG 13.0 
GTG...TGA 11.5 
GAA...CAT 55.5 
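
The trailing digits in values such as 4.6666665 suggest that the averages were computed in single-precision float arithmetic. That is an inference from the figures above, not something the brief states. A minimal sketch of the averaging step, with a hypothetical class name (GapAverage):

```java
class GapAverage {

    /** Averages the gaps for one tag pair using float arithmetic. */
    static float average(int[] gaps) {
        float sum = 0f;
        for (int gap : gaps) {
            sum += gap;
        }
        return sum / gaps.length;
    }

    public static void main(String[] args) {
        // CAG...AGA: gaps 3 and 2 from the first line, 9 from the second
        System.out.println(average(new int[] {3, 2, 9}));    // 4.6666665
        // TGG...TCC: gap 15 from the first line, 13 and 16 from the second
        System.out.println(average(new int[] {15, 13, 16})); // 14.666667
    }
}
```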
 
Your task is to write the Map/Reduce code in Java needed to process the above data in such a way that it 
produces the final output of the averages shown above but for the entire genome data rather than just the 
two sample lines shown. You will submit a written report, detailing your design and the results you found. You 
must also submit a Java file containing your code. 
 
Step 1, HDFS – 20 Marks 
Before you write any code, you will need to copy the data onto your own space in HDFS. In your report, give 
details of how HDFS stores data such as this (assume the file is much bigger than it really is for the purpose of 
your description). This section should be around half a page long, plus a diagram. Describe what HDFS is for, 
the architecture it uses, and the roles of different nodes in the cluster. Document the hdfs commands you 
used to create a directory for the data and place it there. Make sure everything you put here, including the 
diagram, is your own work. Do not copy anything from other sources. 
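
When describing how a large file is stored, it can help to estimate how many blocks and replicas it would occupy. The sketch below is a back-of-the-envelope calculation only. It assumes the usual Hadoop defaults of a 128 MB block size and a replication factor of 3; your cluster may be configured differently. The 300 MB file size and the class name (HdfsBlockEstimate) are hypothetical.

```java
class HdfsBlockEstimate {

    /** Number of blocks needed for a file; the last block may be only partly full. */
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    /** Total block replicas stored across the DataNodes. */
    static long replicaCount(long fileBytes, long blockBytes, int replication) {
        return blockCount(fileBytes, blockBytes) * replication;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long blockSize = 128 * mib; // assumed default block size
        int replication = 3;        // assumed default replication factor
        long fileSize = 300 * mib;  // hypothetical file size

        System.out.println(blockCount(fileSize, blockSize));                // 3
        System.out.println(replicaCount(fileSize, blockSize, replication)); // 9
    }
}
```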
Step 2, Design – 20 Marks 
Now consider the Map/Reduce design you will implement. Compare and contrast producing a design with and 
without a Combiner and describe the role that the Combiner plays in improving the efficiency of your 
solution. You should also describe what keys and values the mapper will emit, the combiner will emit and 
what the final reducer will emit. You should consider how much data will be moved across the network in 
each of your two designs and how many different reducers will be used in each case. 
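
One point worth keeping in mind for the Combiner design is that an average of averages is generally not the overall average, so a Combiner cannot simply emit per-line means. A common workaround is to carry a (sum, count) pair as the value and divide only in the reducer. The plain-Java sketch below illustrates that idea with the CAG...AGA gaps from the example; the SumCount class is hypothetical and is one possible design, not the required one.

```java
class SumCount {
    final long sum;
    final long count;

    SumCount(long sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    /** Builds a partial result from the gaps found in one input split. */
    static SumCount of(int... gaps) {
        long total = 0;
        for (int gap : gaps) {
            total += gap;
        }
        return new SumCount(total, gaps.length);
    }

    /** Merging is associative and commutative, so it is safe to run in a Combiner. */
    SumCount merge(SumCount other) {
        return new SumCount(sum + other.sum, count + other.count);
    }

    /** The division happens once, at the end (the reducer's job). */
    float average() {
        return (float) sum / count;
    }

    public static void main(String[] args) {
        SumCount line1 = SumCount.of(3, 2); // CAG...AGA gaps in the first line
        SumCount line2 = SumCount.of(9);    // CAG...AGA gap in the second line

        System.out.println(line1.merge(line2).average()); // 4.6666665

        // Averaging the per-line averages gives a different, incorrect value.
        System.out.println((line1.average() + line2.average()) / 2); // 5.75
    }
}
```

With this shape, the data crossing the network per tag pair shrinks from one record per match to one (sum, count) pair per combined group.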
Step 3, Implement – 60 Marks 
Once you have completed your designs, you should implement the design that uses a Combiner and show 
how it improves the performance of the overall solution. It is advisable to use the DNASeqCount.java file 
provided on the assignment page in Canvas as a starting point. A file called TestSeqCount has been provided 
that will use the code from DNASeqCount.java and run it on the mochadoop Hadoop simulator. You are 
advised to develop your solution with this first before finally running it on Hadoop. TestSeqCount uses the 
sample data and tag pairs shown above so you can use it to check that you are getting the final answers 
shown above. 
The Hadoop run will use the full set of data and a larger set of tag pairs to produce a more detailed result so 
do not expect the two alternatives to produce the same output (although you can test your Hadoop job with 
the smaller data files if you wish). 
If you have problems remotely accessing Hadoop, you can run your code with Mochadoop only, on the 
dna-40.txt sample, which contains the first 40 lines of DNA sequences, and submit the results for this. However, 
your submission may be tested on a much larger data set on Hadoop, so you should be sure that it works. 
Whether or not you use Hadoop, you should still provide the commands that would be needed to run your 
solution on the real Hadoop system. 
Submission Details 
Please write up your work in a report and submit it via Canvas, clearly noting your 7-digit student ID number 
on the front of your report, but do not provide your name. Additionally, please submit your DNASeqCount.java 
file via Canvas and ensure that your code is very well commented and that you have put your 7-digit ID 
number at the top of your Java code in the commented area. Make sure your report also contains the results 
you got when you ran your code. The deadline for submission is Monday 22nd of June at 4pm. 
 
Plagiarism 
Work which is submitted for assessment must be your own work. All students should note that the University 
has a formal policy on academic misconduct which can be found here. 
Plagiarism means presenting the work of others as though it were your own. The University takes a very 
serious view of plagiarism, and the penalties can be severe (ranging from a reduced grade in the assessment, 
through a fail for the module, to expulsion from the University for more serious or repeated offences). 
Specific guidance in relation to Computing Science assignments may be found in the Computing Science 
Student Handbook. We check submissions carefully for evidence of plagiarism, and pursue those cases we 
find. 
Late submission 
If you cannot meet the assignment hand-in deadline and have good cause, please see the module coordinator 
to explain your situation and ask for an extension. Coursework will be accepted up to seven days after the 
hand-in deadline (or expiry of any agreed extension) but the mark will be lowered by three marks per day or 
part thereof. After seven days the work will be deemed a non-submission and will receive an X. 
 
 
 