Asad I. Khan 2018
Monash University
Bigdata Assignment - Programming
Marks: This assignment is worth 140 marks and it forms 40% of all marks for the unit.
Type: Individual submission
Due Date: Week 10: Mon 01-Oct-2018, 2pm
Submission: The assignment must be uploaded to the unit’s portal on Moodle as a single ZIP
or GZIP file (using any other format will lead to your assessment being delayed),
including a completed assignment submission cover sheet.
Lateness: A late submission penalty of 5% marks deduction per day will apply, including
weekends and public holidays.
Extensions: If, due to circumstances beyond your control, you are unable to
complete the assignment by the due date, you should submit the
incomplete assignment, presented in a professional manner, and
complete a request for special consideration.
Authorship: This assignment is an individual assignment and the final submission must be
identifiably your own work. You may be required to attend an interview with the
marker to confirm that it is your own work (this submission will be submitted to
Turnitin for screening by the marker).
Specifications
1. Spark-Scala Programming Fundamentals [30 marks]
Provide spark-shell executable code for the following tasks in a file named q1.scala (plain
text). The program outputs must show clearly in spark-shell (failure to do so may lead to loss
of marks). Your file must be appropriately commented to ensure that all significant
programming steps have been clearly explained.
a. Create a Spark data frame from a CSV file which has the headers in the first row
(create a small CSV file or use ~/Documents/Datasets/simple.csv in the bigdata
virtual machine) and verify. [4+1 = 5 marks]
b. Print the data frame’s schema. [1 mark]
c. Convert the data frame to an RDD and display its contents. [1+1 = 2 marks]
d. Create an RDD by reading from a text file (create a text file or use
$SPARK_HOME/README.md in the bigdata vm). [2 marks]
e. Calculate the total length in characters, including white spaces, for all the lines in
the $SPARK_HOME/README.md file. [5 marks]
f. Count and display all the words as (String, Int) pairs, which occur in the
$SPARK_HOME/README.md file of the bigdata vm. [5 marks]
g. Write a program which does word count of the $SPARK_HOME/README.md file
using Spark. Explain the reduction operation. [2+3 = 5 marks]
h. A factorial is an integer calculated as the product of a number with all positive
integers below it, e.g. the factorial of 3, or 3! = 3x2x1 = 6. The factorial of 0 is always 1.
Using these rules, write a compact program which computes the factorials of an integer
array X(1,2,3,4,5) and then sums these up into a single value. [5 marks]
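By way of illustration only (a sketch, not a model answer; it assumes a running spark-shell where `sc` is predefined and SPARK_HOME is set in the environment), the word-count pattern of task (g) and the factorial sum of task (h) might be approached along these lines:

```scala
// Sketch, assuming spark-shell (sc predefined) and SPARK_HOME set in the environment.

// (g) Word count: split lines into words, pair each word with 1, then let
// reduceByKey merge the counts of identical words with the reduction (_ + _).
val counts = sc.textFile(sys.env("SPARK_HOME") + "/README.md")
  .flatMap(_.split("\\s+"))
  .filter(_.nonEmpty)
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.collect().foreach(println)   // prints (String, Int) pairs

// (h) Factorial of each element of X(1,2,3,4,5), then the sum of the factorials.
// (1L to n).product is 1 for n = 0 (empty product), matching the 0! = 1 rule.
def factorial(n: Int): Long = (1L to n).product
val X = Array(1, 2, 3, 4, 5)
val total = X.map(factorial).sum    // 1 + 2 + 6 + 24 + 120 = 153
```

The reduction in reduceByKey is associative and commutative, so Spark can combine partial counts on each partition before shuffling.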
Q1 Rubrics
Evaluation Criteria Marks %age
Correctness of the coded solutions 60%
Commenting 30%
Coding style and sophistication (e.g. use of fewer lines to express the program logic) 10%
Total 100%
2. Data Exploration with Spark [20 + 10 = 30 marks]
Provide spark-shell executable code for the following task in a file named q2-a.scala (plain
text) and a PDF (or Word) file, q2-b.pdf, explaining the program design approach. The
program outputs must show clearly in spark-shell (failure to do so may lead to loss of
marks). Your code file must be appropriately commented to ensure that all significant
programming steps have been clearly labelled.
a. Using the parquet-formatted dataset on flight data, flight_2008_pq.parquet/,
available in the bigvm’s ~/Documents/Datasets/flight_2008_pq.parquet
(and also provided as flight_2008_pq.parquet.zip on Moodle), calculate and display
the maximum flight departure delays (DepDelay) for up to 20 flights. Re-arrange and
display the delays in descending order (listing the flight with the highest delays at
the top).
b. Provide a written explanation, of no more than 500 words, of your Spark-Scala code
for calculating these delays. The explanation should include your choice of Spark
APIs and a brief explanation of each. Include a flowchart and explanatory figure/s where
applicable.
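For orientation only (a sketch under assumptions, not the required solution: the dataset path is taken from the specification, and the type of the DepDelay column is assumed to need a defensive cast in case it was stored as a string), a DataFrame-based approach could look like:

```scala
// Sketch, assuming spark-shell (spark predefined) and the bigvm dataset path.
import org.apache.spark.sql.functions.{col, desc}

val flights = spark.read.parquet(
  sys.env("HOME") + "/Documents/Datasets/flight_2008_pq.parquet")

// Cast DepDelay to an integer (defensive, in case it was read as a string),
// sort descending so the largest departure delays come first, show up to 20 rows.
flights
  .withColumn("DepDelay", col("DepDelay").cast("int"))
  .orderBy(desc("DepDelay"))
  .show(20)
```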
Q2 Rubrics
2-a Evaluation Criteria Marks
Correctness of the code 12
Commenting 4
Coding style and sophistication (e.g. use of fewer lines to express the coding logic) 4
2-b Evaluation Criteria
Write-up: clarity of explanation, well formatted, within word limit, referencing (if applicable) 6
Flow chart 2
Explanatory figure/s 2
Total 30
3. Provide short answers (in one to two paragraphs) to the following questions [20+40 = 60
marks]
Provide a PDF (or Word) file named q3.pdf for this question.
Programming Related:
a. Compare and contrast an Apache Spark dataset with a data frame. (10 marks)
b. Compare and contrast reservoir sampling with a Bloom filter. (10 marks)
Framework Related:
c. Discuss the main differences between Apache HBase and Apache Spark. (10 marks)
d. List the main benefits of integrating Apache Spark with Hadoop HDFS. (5 marks)
e. Explain how Hadoop implements computational parallelism in terms of the parallel dwarf/s
it employs and Flynn’s taxonomy. (5 + 5 = 10 marks)
f. Outline the main design features of the RDD abstraction for in-memory cluster computing.
(15 marks)
Q3 Rubrics
Evaluation Criteria Marks as %age
Write-up: clarity of explanation, well formatted, referencing (if applicable)
80%
Table/s and Explanatory figure/s 20%
Total 100%
Lab Test (A-Prog Demos): Week 11-12 [20 marks]
Your lab tutor will randomly select a small set of programming tasks and/or theory questions, in
total one to three*, from this assignment. These items must be re-worked and demonstrated
during the lab. Time allocation will be up to 25* minutes for completion of these tasks and 5
minutes/student to demonstrate these tasks.
This test is Closed Book, i.e. reference to your assignment submission, unit contents, personal
notes/files, and web/Internet resources is not allowed, with the exception of web resources
[1] and [3] listed at the end, under Resources, for this assignment.
Lab Test Submission Instructions:
1. Write your (i) full name, (ii) student id, (iii) lab day+time, and (iv) include the questions
with your answers.
2. Provide plain text with a .scala extension for code and a PDF/Word file for theory.
Upload your outputs as a single ZIP/GZIP file to the ‘Moodle Lab Test’ submission at the
end of the lab test. If there are any issues with the Moodle upload, you must email** the
zip archive to your tutor before the end of your lab session. NO LATE SUBMISSION ACCEPTED.
Lab Test Rubrics
Evaluation Criteria
All tasks demonstrated correctly 15
Tutor’s questions answered satisfactorily 5
Total 20
*The number and duration of tasks may vary; these are indicative values only.
** You may use sftp or scp to transfer files between the virtual machine and your lab PC or personal
notebook computer.
Interviews
During Weeks 11 and 12, each student in the lab will be interviewed as part of the in-lab test.
Interview duration: 5 minutes (max)
Interview procedure: Each student will be asked up to three questions to test their general
understanding of practical terms in bigdata, e.g. What is HBase? It is a non-relational database, which can
be distributed over a cluster for scalability.
Pass criterion: At least one question must be answered correctly to pass this hurdle.
Final In-Lab Test marks (out of 20) will be allowed upon passing the interview.
Resources
[1] Spark 2.2.1 Scala API Docs
http://spark.apache.org/docs/2.2.1/api/scala/index.html#org.apache.spark.package
[2] Spark SQL, DataFrames and Datasets http://spark.apache.org/docs/2.2.1/sql-programming-guide.html
[3] Scala Docs http://scala-docs-sphinx.readthedocs.io/en/latest/style-guide/scaladoc.html