
School of Computer Science 
Uwe Roehm 
DATA3404: Data Science Platforms 1.Sem./2020 
Big Data Analysis Assignment 
Group Assignment (15%) 06.05.2020 
Introduction 
This is the practical assignment of DATA3404 in which you have to write a series of Apache Spark
programs to analyse an air traffic data set and then optimise your programs for scalability on
increasing data volumes. We provide you with the schema and dataset. Your task is to implement the
three given data analysis tasks, to evaluate their performance, and to decide which optimisations
are best suited to improve each task's performance.
You find links to online documentation, data, and hints on tools and schema needed for this 
assignment in the ’Assignments’ section in Canvas. 
Data Set Description and Preparation 
This assignment is based on an Aviation On-time data set which includes information about airports, 
airlines, aircrafts, and flights. This data set has the following structure: 
Airports  (airport_code, airport_name, city, state, country)
Aircrafts (tail_number, manufacturer, model, aircraft_type, year)
Airlines  (carrier_code, name, country)
Flights   (flight_id, carrier_code, flight_number, flight_date, origin,
           destination, tail_number, scheduled_departure_time,
           scheduled_arrival_time, actual_departure_time,
           actual_arrival_time, distance)
You find a set of corresponding data files (as zip archives) on our course website in Canvas in
the 'Assignment' module.
1. Download the linked air traffic data archives from the course website and unpack them. 
2. Load the contained CSV files into the storage of your AWS Educate account (cf. tutorial
Week 9), typically S3 buckets; a loading sketch follows this list. Important: Only do this data
load for the two smallest data sets. We will also provide you with a larger data set for the
performance evaluation. Due to its size, this one will however only be available as a shared
resource later in this unit of study.
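
A minimal PySpark loading sketch (Python being one of the two permitted languages); the bucket
path and file names below are assumptions, so adjust them to your own S3 layout:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data3404-assignment").getOrCreate()

BASE = "s3://<your-bucket>/data3404/"   # hypothetical S3 location

def load_csv(name):
    # inferSchema is convenient on the small data sets; prefer an explicit
    # schema when you move to the larger ones.
    return (spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .csv(BASE + name))

airports  = load_csv("airports.csv")    # file names are assumptions
aircrafts = load_csv("aircrafts.csv")
airlines  = load_csv("airlines.csv")
flights   = load_csv("flights.csv")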
Question 1: Data Analysis with Apache Spark 
You shall implement three different analysis tasks on the given data set using plain Apache Spark
(using Apache Spark's RDD API or DataFrame API, either with Java or Python); illustrative
PySpark sketches for all three tasks follow this list:
1. Task 1: Top-3 Cessna Models
Write an Apache Spark program that determines the top-3 Cessna aircraft models with regard
to the number of flights, listed in descending order of number of flights. Output the Cessna
models in the form "Cessna 123" as one string with only the initial 'C' capitalised and the
model number having just its three digits. The output file should have the following
tab-delimited format, ordered by number of flights in descending order:
Cessna XYZ \t numberOfDepartingFlights
2. Task 2: Average Departure Delay
In the second task, write an Apache Spark program that determines the average, min and max
delay (in minutes) of flights by US airlines in a given (user-specified) year. Only consider
delayed flights, i.e. flights whose actual departure time is after their scheduled departure
time, and ignore any cancelled flights. The output file should have the following tab-delimited
format (ordered alphabetically by airline name):
airline_name \t num_delays \t average_delay \t min_delay \t max_delay
3. Task 3: Most Popular Aircraft Types
In the third task, you shall write an Apache Spark program that lists, per airline of a given
(user-specified) country, the five most-used aircraft types (manufacturer, model). List the
airlines in alphabetical order, and show the five most-used aircraft in descending order of the
number of flights as a single, comma-separated string that is enclosed in '[' and ']' (indicating
a list). Format the name of an aircraft type as follows: MANUFACTURER ' ' MODEL (for
example, "Boeing 787" or "Airbus A350").
The output should have the following tab-delimited format (alphabetically by airline name):
airline_name \t [aircraft_type1, aircraft_type2, ... , aircraft_type5]
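
A sketch for Task 1, using the DataFrame API and the DataFrames from the loading sketch above;
the exact manufacturer spelling and the shape of the model field are assumptions to verify
against the data:

from pyspark.sql import functions as F

cessna_counts = (
    flights
    .join(aircrafts, "tail_number")
    .filter(F.upper(F.col("manufacturer")).contains("CESSNA"))
    # keep just the three-digit model number, e.g. "C-172" -> "172"
    .withColumn("model3", F.regexp_extract("model", r"(\d{3})", 1))
    .groupBy("model3")
    .count()
    .orderBy(F.desc("count"))
    .limit(3)
)

for row in cessna_counts.collect():
    print("Cessna {}\t{}".format(row["model3"], row["count"]))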
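
A sketch for Task 2, under the assumptions that departure times are "HHMM" strings, that
cancelled flights have an empty actual_departure_time, and that US airlines carry the country
value "United States"; all three must be checked against the real data:

from pyspark.sql import functions as F

YEAR = "1994"   # hypothetical user-specified year

def to_minutes(col):
    # zero-pad to "HHMM", then convert to minutes since midnight
    # (flights departing past midnight would need extra care)
    t = F.lpad(col, 4, "0")
    return F.substring(t, 1, 2).cast("int") * 60 + F.substring(t, 3, 2).cast("int")

us_airlines = airlines.filter(F.col("country") == "United States")

delays = (
    flights
    .filter(F.col("flight_date").startswith(YEAR))          # assumes YYYY-MM-DD dates
    .filter(F.col("actual_departure_time").isNotNull()
            & (F.col("actual_departure_time") != ""))       # assumed cancellation marker
    .withColumn("delay",
                to_minutes(F.col("actual_departure_time"))
                - to_minutes(F.col("scheduled_departure_time")))
    .filter(F.col("delay") > 0)                             # delayed flights only
    .join(us_airlines, "carrier_code")
    .groupBy("name")
    .agg(F.count("*").alias("num_delays"),
         F.avg("delay").alias("average_delay"),
         F.min("delay").alias("min_delay"),
         F.max("delay").alias("max_delay"))
    .orderBy("name")
)
# write tab-delimited output, e.g. delays.write.option("sep", "\t").csv(out_path)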
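
A sketch for Task 3, ranking aircraft types per airline with a window function; COUNTRY is a
hypothetical parameter, and the same loading-sketch DataFrames are assumed:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

COUNTRY = "United States"   # hypothetical user-specified country

type_counts = (
    flights
    .join(airlines.filter(F.col("country") == COUNTRY), "carrier_code")
    .join(aircrafts, "tail_number")
    .withColumn("aircraft_type",
                F.concat_ws(" ", F.col("manufacturer"), F.col("model")))
    .groupBy("name", "aircraft_type")
    .count()
)

# rank aircraft types within each airline by flight count
w = Window.partitionBy("name").orderBy(F.desc("count"))

top5 = (
    type_counts
    .withColumn("rank", F.row_number().over(w))
    .filter(F.col("rank") <= 5)
    # collect_list gives no order guarantee, so carry the rank and sort on it
    .groupBy("name")
    .agg(F.sort_array(F.collect_list(F.struct("rank", "aircraft_type"))).alias("ranked"))
    .withColumn("top_types",
                F.concat(F.lit("["),
                         F.concat_ws(", ", F.expr("transform(ranked, x -> x.aircraft_type)")),
                         F.lit("]")))
    .select("name", "top_types")
    .orderBy("name")
)

The struct-plus-sort_array step is one way to keep the per-airline order stable after the
aggregation; the transform higher-order function used here is available from Spark 2.4 onwards.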
General Coding Requirements 
1. You should solve this assignment with Apache Spark version 2.4 as installed in AWS
EMR. You will need an AWS Educate account for this.
2. If you use any code fragments or code clichés from third-party sources (which you should
not need for these tasks...), you must reference those properly. Include a statement on which
parts of your submission are your own work.
3. Always test your code using a small data set before running it on any larger data set. 
Question 2: Performance Evaluation and Tuning 
a) Conduct a performance evaluation of your implementations of each task on varying dataset
sizes. We will provide you with five different data sizes, the two largest ones to be shared
among all groups. You should execute your code on each data size and record the execution
times and the sizes of the intermediate results (communication effort); a minimal timing
sketch follows this list.
b) Suggest some optimisations to the analysis task implementations such that the performance
of your task(s) improves, and demonstrate that your optimisations are effective.
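
A minimal timing sketch, where run_task is a hypothetical wrapper around one of your own jobs
that ends in a Spark action; the sizes of intermediate results can be read off the stage details
in the Spark web UI:

import time

def timed(run_task, label):
    # wall-clock time of one complete job run; average over several runs
    start = time.time()
    run_task()   # must trigger an action, e.g. a write or collect
    print("{}: {:.1f} s".format(label, time.time() - start))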
Question 3: Documentation of Implementation and Tuning Decisions 
Write a text document (plain text or Word document or PDF file, no more than 5 pages plus 
optional Appendix) in which you document your implementation and your performance evaluation. 
Your document should contain the following: 
1. Job Design Documentation 
In your document, describe the Apache Spark jobs you use to implement Tasks 1 to 3. For 
each job, briefly describe the different transformation functions. If you use any user-defined 
functions, classes or operators, please describe those too. 
2. Justification of any tuning decisions or optimisations; document the changes in the execution
plans and the estimated execution costs for each individual analysis task before and after
your optimisations using the DAG Visualizations of Apache Spark (see the explain() sketch
after this list).
3. Briefly justify each tuning decision. 
4. Performance Evaluation: Include a chart and a table with the average execution times of 
your tasks for different data sets. 
5. Include as appendix the S3 storage location of your final output files from various executions. 
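
Besides the DAG visualization in the Spark web UI, the query plans can also be captured as text
for the before/after comparison; a one-line sketch, using the hypothetical top5 DataFrame from
the Task 3 sketch:

# prints the parsed, analysed, and optimised logical plans plus the physical plan
top5.explain(True)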
Milestones 
Have the first task ready in the Week 11 tutorials for the tutors to review and to give feedback. 
Deliverables and Submission Details 
There are three deliverables: source code, a brief program design and performance documentation
(up to 5 pages, as per the content description above), and a demo in Week 12 via Zoom.
All deliverables are due in Week 12, no later than 8 pm, Friday 22 May 2020. Late submission 
penalty: -20% of the awarded marks per day late. We will make available a marking rubric in 
Canvas. 
Please submit the source code and a soft copy of the design documentation as a zip or tar file
electronically in Canvas, one per group. Name your zip archive after your UniKey: abcd1234.zip
Demo: A few points of the marking scheme will be given to any submission which can be 
demoed successfully on our own cluster. 
Students must retain electronic copies of their submitted assignment files and databases, as the 
unit coordinator may request to inspect these files before marking of an assignment is completed. If 
these assignment files are not made available to the unit coordinator when requested, the marking 
of this assignment may not proceed. 
All the best! 
Group member participation 
This is a group assignment. The mark awarded for your assignment is conditional on you being 
able to explain any of your answers to your tutor or the subject coordinator if asked. 
If members of your group do not contribute sufficiently, you should alert your tutor as soon as
possible. The tutor has the discretion to scale the group’s mark for each member as follows, based 
on the outcome of the group’s demo in Week 12: 
Level of contribution                                           Proportion of final grade received
No participation.                                               0%
Passive member, but full understanding of the submitted work.   50%
Minor contributor to the group's submission.                    75%
Major contributor to the group's submission.                    100%
 