School of Computer Science
Uwe Roehm
DATA3404: Data Science Platforms 1.Sem./2020
Big Data Analysis Assignment
Group Assignment (15%) 06.05.2020
Introduction
This is the practical assignment of DATA3404, in which you have to write a series of Apache Spark
programs to analyse an air traffic data set and then optimise your programs for scalability on
increasing data volumes. We provide you with the schema and dataset. Your task is to implement the
three given data analysis tasks, to evaluate their performance, and to decide which optimisations
are best suited to improve each task's performance.
You will find links to online documentation, data, and hints on tools and the schema needed for
this assignment in the ’Assignments’ section in Canvas.
Data Set Description and Preparation
This assignment is based on an Aviation On-time data set, which includes information about airports,
airlines, aircrafts, and flights. This data set has the following structure:
Airports  (airport_code, airport_name, city, state, country)
Aircrafts (tail_number, manufacturer, model, aircraft_type, year)
Airlines  (carrier_code, name, country)
Flights   (flight_id, carrier_code, flight_number, flight_date, origin,
           destination, tail_number, scheduled_departure_time,
           scheduled_arrival_time, actual_departure_time,
           actual_arrival_time, distance)
You will find a set of corresponding data files (as zip archives) on our course website in Canvas in
the ’Assignment’ module.
1. Download the linked air traffic data archives from the course website and unpack them.
2. Load the contained CSV files into the storage of your AWS Educate account (cf. tutorial
Week 9), typically S3 buckets; a loading sketch in PySpark follows this list. Important: Only do
this data load for the two smallest data sets. We will also provide you with a larger data set for
the performance evaluation. Due to its size, however, this one will only be available as a shared
resource later in this unit of study.
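As a starting point, the following minimal PySpark sketch shows one way to load the four CSV files
into DataFrames. The bucket and file names are placeholders to adapt to your own S3 layout, and the
header/schema options are assumptions about the file format:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("data3404-assignment").getOrCreate()

    # Placeholder bucket -- adapt to your own S3 layout.
    BASE = "s3://your-bucket/"

    # header=True and inferSchema=True assume the CSV files carry a header
    # row; if they do not, pass an explicit schema instead.
    airports  = spark.read.csv(BASE + "airports.csv",  header=True, inferSchema=True)
    aircrafts = spark.read.csv(BASE + "aircrafts.csv", header=True, inferSchema=True)
    airlines  = spark.read.csv(BASE + "airlines.csv",  header=True, inferSchema=True)
    flights   = spark.read.csv(BASE + "flights.csv",   header=True, inferSchema=True)

The subsequent sketches in this handout reuse these four DataFrames.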
Question 1: Data Analysis with Apache Spark
You shall implement three different analysis tasks on the given data set using plain Apache Spark
(using Apache Spark’s RDD API or Dataframe API, either with Java or Python); illustrative PySpark
sketches follow after the task list:
1. Task 1: Top-3 Cessna Models
Write an Apache Spark program that determines the top-3 Cessna aircraft models with regard
to the number of flights, listed in descending order of the number of flights. Output the Cessna
models in the form “Cessna 123”, as one string with only the initial ’C’ capitalised and the
model number reduced to just its three digits. The output file should have the following tab-
delimited format, ordered by number of flights in descending order:
Cessna XYZ \t numberOfDepartingFlights
2. Task 2: Average Departure Delay
In the second task, write an Apache Spark program that determines the average, minimum, and
maximum delay (in minutes) of flights by US airlines in a given (user-specified) year. Only
consider delayed flights, i.e. flights whose actual departure time is after their scheduled
departure time, and ignore any cancelled flights. The output file should have the following
tab-delimited format (ordered alphabetically by airline name):
airline_name \t num_delays \t average_delay \t min_delay \t max_delay
3. Task 3: Most Popular Aircraft Types
In the third task, you shall write an Apache Spark program that lists, for each airline of a
given (user-specified) country, the five most-used aircraft types (manufacturer, model). List the
airlines in alphabetical order, and show the five most-used aircraft types in descending order of
the number of flights as a single, comma-separated string enclosed in ’[’ and ’]’ (indicating
a list). Format the name of an aircraft type as MANUFACTURER ’ ’ MODEL (for
example, “Boeing 787” or “Airbus A350”).
The output should have the following tab-delimited format (alphabetically by airline name):
airline_name \t [aircraft_type1, aircraft_type2, ... , aircraft_type5]
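For Task 1, a minimal DataFrame-API sketch, building on the DataFrames loaded above, could look as
follows. It assumes that the manufacturer column spells out the maker (e.g. “CESSNA”) and that the
model column already holds the three-digit model number; the output path is a placeholder:

    from pyspark.sql import functions as F

    # Join flights to aircraft on tail_number and keep Cessna aircraft only;
    # upper() hedges against inconsistent capitalisation in the raw data.
    cessna_counts = (
        flights.join(aircrafts, "tail_number")
               .where(F.upper(F.col("manufacturer")) == "CESSNA")
               .groupBy("model")
               .count()
               .orderBy(F.desc("count"))
               .limit(3)
    )

    # Format as "Cessna XYZ" plus the flight count. This assumes the model
    # column already contains just the three digits; add extra parsing if
    # the raw data encodes models differently.
    result1 = cessna_counts.select(
        F.concat(F.lit("Cessna "), F.col("model").cast("string")).alias("model_name"),
        "count",
    )
    result1.write.csv("s3://your-bucket/output/task1", sep="\t")  # placeholder path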
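For Task 2, the sketch below makes several assumptions that you must verify against the real data:
the departure-time columns are “HHMM” strings, flight_date parses as a date, cancelled flights carry
a null actual departure time, and US airlines are marked with the country value “United States”.

    from pyspark.sql import functions as F

    YEAR = 1994  # user-specified year -- placeholder value

    # Assumption: the *_departure_time columns are "HHMM" strings; convert
    # them to minutes since midnight. Flights delayed past midnight would
    # need extra handling.
    def minutes(col_name):
        t = F.col(col_name).cast("int")
        return (t / 100).cast("int") * 60 + t % 100

    delayed = (
        flights.join(airlines, "carrier_code")
               .where(F.col("country") == "United States")  # assumed country value
               .where(F.year("flight_date") == YEAR)         # assumes parseable dates
               .where(F.col("actual_departure_time").isNotNull())  # drops cancelled flights
               .withColumn("delay", minutes("actual_departure_time")
                                    - minutes("scheduled_departure_time"))
               .where(F.col("delay") > 0)                    # delayed flights only
    )

    result2 = (
        delayed.groupBy("name")
               .agg(F.count("*").alias("num_delays"),
                    F.avg("delay").alias("average_delay"),
                    F.min("delay").alias("min_delay"),
                    F.max("delay").alias("max_delay"))
               .orderBy("name")
    )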
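For Task 3, one possible shape combines a window function for the per-airline top-5 with the
struct/sort_array trick to keep the collected list in descending flight-count order, since
collect_list alone gives no ordering guarantee. The country value is again a placeholder assumption:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    COUNTRY = "United States"  # user-specified country -- placeholder value

    # Flight counts per (airline name, "Manufacturer Model") pair; if the
    # raw data is all-caps, F.initcap can normalise the manufacturer name.
    counts = (
        flights.join(airlines, "carrier_code")
               .where(F.col("country") == COUNTRY)
               .join(aircrafts, "tail_number")
               .withColumn("aircraft", F.concat_ws(" ", "manufacturer", "model"))
               .groupBy("name", "aircraft")
               .count()
    )

    # Keep the five most-used aircraft types per airline.
    w = Window.partitionBy("name").orderBy(F.desc("count"))
    top5 = counts.withColumn("rn", F.row_number().over(w)).where("rn <= 5")

    # Collect (count, aircraft) structs, sort the array, and reverse it so
    # the types appear in descending order of flight count.
    result3 = (
        top5.groupBy("name")
            .agg(F.reverse(F.sort_array(F.collect_list(F.struct("count", "aircraft"))))
                  .alias("top"))
            .select("name",
                    F.concat(F.lit("["),
                             F.concat_ws(", ", F.col("top.aircraft")),
                             F.lit("]")).alias("top_types"))
            .orderBy("name")
    )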
General Coding Requirements
1. You should solve this assignment with Apache Spark version 2.4 as installed on AWS
EMR. You will need an AWS Educate account for this.
2. If you use any code fragments or code clichés from third-party sources (which you should
not need for these tasks...), you must reference them properly. Include a statement on which
parts of your submission are your own.
3. Always test your code using a small data set before running it on any larger data set.
Question 2: Performance Evaluation and Tuning
a) Conduct a performance evaluation of your implementation of each task on varying dataset
sizes. We will provide you with five different data sizes, the two largest of which will be
shared among all groups. You should execute your code on each data size and record the execution
times and the sizes of the intermediate results (communication effort).
b) Suggest some optimisations to the analysis task implementations such that the performance
of your task(s) improves, and show that they work; one candidate optimisation is sketched below.
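To illustrate what such an optimisation might look like: the dimension tables (airlines, aircrafts)
are tiny compared to flights, so a broadcast-join hint is one natural candidate. The sketch below,
reusing the placeholder DataFrames from Question 1, also shows crude wall-clock timing around the
triggering action:

    import time
    from pyspark.sql import functions as F

    # Hint Spark to replicate the small airlines table to every executor,
    # turning a shuffle join into a broadcast-hash join.
    joined = flights.join(F.broadcast(airlines), "carrier_code")

    # Crude wall-clock timing: the write action is what actually triggers
    # the distributed execution.
    start = time.time()
    joined.write.csv("s3://your-bucket/output/tuning-test", sep="\t")  # placeholder path
    print("elapsed: %.1f seconds" % (time.time() - start))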
Question 3: Documentation of Implementation and Tuning Decisions
Write a text document (plain text, Word document, or PDF file; no more than 5 pages plus an
optional appendix) in which you document your implementation and your performance evaluation.
Your document should contain the following:
1. Job Design Documentation
In your document, describe the Apache Spark jobs you use to implement Tasks 1 to 3. For
each job, briefly describe the different transformation functions. If you use any user-defined
functions, classes or operators, please describe those too.
2. Justification of any tuning decisions or optimisations: document the changes in the exe-
cution plans and the estimated execution costs for each individual analysis task before and
after your optimisations, using the DAG visualisations of Apache Spark (see the plan-printing
snippet after this list).
3. Briefly justify each tuning decision.
4. Performance Evaluation: Include a chart and a table with the average execution times of
your tasks for different data sets.
5. Include as appendix the S3 storage location of your final output files from various executions.
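To capture the execution plans before and after an optimisation, a DataFrame’s explain() output
complements the DAG visualisation in the Spark web UI, for example (using result1 from the Task 1
sketch):

    # Print the logical and physical plans of a job's final DataFrame; the
    # extended=True flag includes the analysed and optimised logical plans.
    result1.explain(True)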
Milestones
Have the first task ready in the Week 11 tutorial for the tutors to review and give feedback on.
Deliverables and Submission Details
There are three deliverables: source code, a brief program design and performance documen-
tation (up to 5 pages, as per the content description above), and a demo in Week 12 via Zoom.
All deliverables are due in Week 12, no later than 8 pm, Friday 22 May 2020. Late submission
penalty: -20% of the awarded marks per day late. We will make available a marking rubric in
Canvas.
Please submit the source code and a soft copy of the design documentation electronically in Canvas
as a zip or tar file, one per group. Name your zip archive after your UniKey: abcd1234.zip
Demo: A few points of the marking scheme will be awarded to any submission that can be
demoed successfully on our own cluster.
Students must retain electronic copies of their submitted assignment files and databases, as the
unit coordinator may request to inspect these files before marking of an assignment is completed. If
these assignment files are not made available to the unit coordinator when requested, the marking
of this assignment may not proceed.
All the best!
Group member participation
This is a group assignment. The mark awarded for your assignment is conditional on you being
able to explain any of your answers to your tutor or the subject coordinator if asked.
If members of your group do not contribute sufficiently, you should alert your tutor as soon as
possible. The tutor has the discretion to scale the group’s mark for each member as follows, based
on the outcome of the group’s demo in Week 12:
Level of contribution                                           Proportion of final grade received
No participation                                                  0%
Passive member, but full understanding of the submitted work     50%
Minor contributor to the group’s submission                      75%
Major contributor to the group’s submission                     100%