                Department of Computing
School of Mathematical, Physical and Computational Sciences
Assessed Coursework Set Front Page
Module code: CSMBD21
Coursework Description for Big Data and Cloud Computing
Module Title: Big Data and Cloud Computing
Lecturers responsible: Prof. Atta Badii, Dr Zahra Pooranian
Type of Assignment: Coursework
Individual/group Assignment: Individual
Total Weighting of the Assignment: 50%, comprising 25% for each of Big Data and Cloud Computing
Page limit/Word count for the technical report of the results:
Approximately 3000 words maximum, in two sections of at most 3 pages each, reporting on the
implementation of two tasks (Task A and Task B): Section A reports on the Big Data Task (Task A) and Section
B reports on the Cloud Computing Task (Task B). The report is a maximum of 6 pages in total, excluding appendices,
and should follow the School Style Guide.
Expected hours spent for this assignment: 30 hrs.
Items to be submitted: Two PDFs to be submitted via Blackboard (BB), each of 3 pages max, one for Section A (Task A) and
one for Section B (Task B). Each PDF is to include, on the first page, a link to the code, which must be accessible to the
assessors [z.pooranian@reading.ac.uk (rk929650), atta.badii@reading.ac.uk (sis04ab), r.faulkner@reading.ac.uk
(ei194011), weiwei.he@reading.ac.uk (in928478)] through GitLab or a similar repository. For Section A, as the
solution is to be provided in your free Azure account, please provide a temporary username and password for
your Azure account; for Section B the link can be to a GitLab repository.
Work to be submitted on-line via Blackboard Learn by: 12:00 hrs, Wednesday 20th March 2024
Work will be marked and returned by: 15 working days after the date of submission.
NOTES
By submitting this work, you are certifying that it is all your own work and that use of material from other sources
has been properly and fully acknowledged in the text. You are also confirming that you have read and
understood the University’s Statement of Academic Misconduct, available on the University web-pages.
If your work is submitted after the deadline, 10% of the maximum possible mark will be deducted for each
working day (or part of) it is late. A mark of zero will be awarded if your work is submitted more than 5 working
days late. You are strongly recommended to hand in your work by the deadline as a late submission on one piece
of work can have impacts on other work.
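As a worked illustration of the late-submission arithmetic above, the following is a minimal sketch only. The function name `penalised_mark`, the default maximum mark of 100, and the choice to floor the result at zero are assumptions made for illustration and are not stated in the rule itself.

```python
import math


def penalised_mark(raw_mark: float, working_days_late: float, max_mark: float = 100.0) -> float:
    """Apply the late rule: 10% of the maximum mark per working day (or part of a day)."""
    if working_days_late <= 0:
        return raw_mark  # submitted on time, no deduction
    if working_days_late > 5:
        return 0.0  # more than 5 working days late: mark of zero
    days_charged = math.ceil(working_days_late)  # "or part of" counts as a full day
    deduction = max_mark * days_charged / 10
    return max(0.0, raw_mark - deduction)  # assumption: a mark cannot go below zero
```

For example, under these assumptions a raw mark of 68 submitted one and a half working days late is charged for two days, giving `penalised_mark(68, 1.5)` a value of 48.0.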
If you believe that you have a valid reason for failing to meet a deadline then you should complete an
Extenuating Circumstances form and submit it to the Student Support Centre before the deadline, or as soon as
is practicable afterwards, explaining why.
Section A (The Big Data Task):
• Task A: Implement a solution for predicting flight delays based on historical
weather and airline data as provided in your free Azure account and
explain the reason for your preferred Machine Learning (ML) model.
Section B (The Cloud Computing Task)
• Task B: Implement a MapReduce solution to determine the passenger(s)
having had the highest number of flights based on flights and
passenger data provided in the Assignment Folder of the Module on
Blackboard.
Assignment Tasks based on the explanatory notes in the Appendix to this document
If you face any difficulties, make clear, in your submission, how far you were able to proceed with the
implementation and explain the challenges you faced.
Marking Criteria for Task A:
• Total marks for this Task A will be normalised for 25% credit towards the overall coursework.
• The table below indicates the level of performance expected for each range of assessment:
Classification Range Typically, the work should meet these requirements
First Class (>= 70%) The assignment demonstrates:
• Excellent technical skills in implementing the system, possibly also suggesting
any other solution deemed viable; including reasons for the preferred solution.
• Professional technical writing skills and style.
Upper Second (60-69) The assignment demonstrates:
• Excellent technical skills in implementing the solution.
• Appropriate technical writing skills and clear presentation; including reasons for
the preferred solution.
Lower Second (50-59) The assignment demonstrates:
• Excellent technical skills in implementing the system.
• Moderate technical writing skills and clear presentation; including reasons for
the preferred solution.
Third (40-49) The assignment demonstrates:
• Satisfactory technical skills in implementing the system.
• Some technical writing ability and clear presentation; including some reasoning
for the preferred solution.
Fail (<40) The coursework fails to demonstrate the technical skills to implement the system,
technical writing ability, or clear presentation; reasoning for the preferred solution is
inadequate or non-existent.
Marking Scheme and feedback template for Task A (Big Data Task)
• Total marks for this Task A will be normalised for 25% credit towards the overall coursework assessment
The key criteria for the assessment of the submitted coursework Contribution to Mark in %
Introduction
• Brief description of the background of the case study.
• Description of the tools and techniques deployed, including “Data
Factory”, “Databricks”, “Power BI” as used to analyse this solution
(explaining the solution architecture).
5
10
Solution Implementation
Implementation of Solution:
• Creating the Databricks cluster
• Loading sample data
• Setting up the Data Factory
• Data Factory pipeline
• Operation of ML
• Summarizing data
• Visualisation of data
Evaluation:
Your personal reflections on:
• Stating reasons for your preferred solution 10
Presentation of the report:
• Structure and layout of the report
• Professional writing style
• Use of figures, tables, references, citations, and captions
Marking Criteria for Task B (Cloud Computing Task)
• Total marks for this Task B will be normalised for 25% credit towards the overall coursework.
• The table below indicates the level of performance expected for each range of assessment:
Classification Range Typically, the work should meet these requirements
First Class (>= 70%) The assignment demonstrates:
• Deep understanding of the MapReduce paradigm and excellent technical skills in
implementing the system to fulfil the objectives of the task.
• Highest quality technical reporting including solution evaluation, addressing all
aspects, completely and clearly.
Upper Second (60..69) The assignment demonstrates:
• Good understanding of the MapReduce paradigm and good implementation
consistent with the objectives of the task.
• Good quality technical reporting, inclusive, complete, and clear.
Lower Second (50..59) The assignment demonstrates:
• Sufficient understanding of the MapReduce Paradigm and satisfactory
implementation consistent with the objective of the task.
• Acceptable technical reporting tackling the key aspects.
Third (40..49) The assignment demonstrates:
• Basic understanding of MapReduce and basic level of implementation of the
task.
• Basic standard of technical reporting; with some notable shortcomings.
Fail (<40) The coursework fails to demonstrate sufficient understanding of the MapReduce
paradigm and fails to provide reporting even to a basic standard.
Marking Scheme and feedback template for Task B (Cloud Computing Task)
• Total marks for this Task B will be normalised for 25% credit towards the overall coursework assessment
MapReduce Concepts
Concept                            Example                                    Max.
Map Phase                          Inputs and Outputs                         5
Reduce Phase                       Inputs and Outputs                         5
Segmentation of Roles              Split of work                              2
File Handling                      Use of Files and Buffers                   3
Distributed parallelism            Advantages, fault tolerance etc.           3
Explanation of additional process  Combining/Shuffling/partitioning etc.      1
Flowchart                          Illustration of MapReduce problem solving  1
Total                                                                         20
Software Prototyping
Concept             Example                              Max.
Project Structure   Object-Orientation/class hierarchy   7
Code Re-usability   Generics, Templating                 7
Solution Elegance   Design Optimality                    6
Total                                                    20
Implementation
Aspect                                      Max.
Task Implementation   Key/Value Selection   6
                      Correct Result        4
                      Output Format         4
Parallelisation       Multi-threading       6
Total                                       20
Documentation
Aspect                                                           Max.
Report Structure  Abstract, Sections, Length, References, etc.   2
Section Content   Description of development                     4
                  Evidence of use of Version Control             5
                  Evidence of understanding MapReduce            7
                  Conclusions                                    5
Report Quality    Overall Quality of Report                      5
Code Commenting   Use of comments in code                        12
Total                                                            40
Coursework description: Analysis of big data solution architecture
Implement and evaluate the big data solution to be provided to a customer who needs to modernise their
system.
Your task:
Implement a solution for predicting flight delays based on historical weather and airline data.
To deal with big data, the data must be processed in a distributed manner. Azure Synapse Analytics in
Azure Machine Learning (ML) provides a platform for data pre-processing, featurization, training and
deployment. It can connect to Spark pools in Azure Synapse Analytics, and PySpark helps to pre-process the data
interactively. This environment provides powerful big data analytics tools such as Data Factory, Databricks
and Power BI, which you will need to use in developing a solution for this coursework using the data for the case
study available on Azure Synapse Case Study as described below.
Please download the case study from the Big Data assignment guide in the assessment area:
Blackboard → Enrolments → CSMBD21-23-4MOD: Big Data and Cloud Computing (2023/24)
In the Assessment tab, select as follows:
Assessment → Big Data → Case Study
and develop your solution accordingly.
Assignment Case Study
Margie's Travel (MT) provides concierge services for business travellers. They need to modernise their system.
They want to focus on a web app for their customer service agents, who provide flight booking information
to the travellers. This could, for example, include features such as a prediction of a flight delay of 15 minutes or
longer due to weather conditions.
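The "15 minutes or longer" criterion can be read as a binary classification target. The following is a minimal sketch of that reading, not the case study's actual schema: the function name `delay_label`, the carrier codes, and the delay values are hypothetical and chosen only for illustration.

```python
DELAY_THRESHOLD_MINUTES = 15  # from the case study: a delay of 15 minutes or longer


def delay_label(departure_delay_minutes: float) -> int:
    """Return 1 if the flight counts as delayed (>= 15 minutes), otherwise 0."""
    return 1 if departure_delay_minutes >= DELAY_THRESHOLD_MINUTES else 0


# Hypothetical historical records: (carrier code, departure delay in minutes).
sample = [("AA", 3.0), ("DL", 15.0), ("UA", 42.0), ("WN", -5.0)]
labels = [delay_label(delay) for _, delay in sample]
print(labels)
```

A model trained against such a label predicts whether a flight is likely to be delayed, rather than by how many minutes.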
You are expected to analyse the design of solution-I to predict the flight delay by processing the data provided.
Your solutions will need to be responsive to the customers’ needs as specified.
Your report is to describe your progress on the objectives of the task, including the aspects set out below:
1. A brief description of the background of the case study;
2. A description of the implementation of the solution supported by your free Azure account, including
tools and techniques deployed such as Data Factory, Databricks, and Power BI;
3. The reason for your preferred solution.
By clicking on the link below, you can take the first step to create your free MS Azure account and should be
able to complete all the steps for Task A using the student's free $100 allowance. Please keep track of your usage
so that it does not exceed the free allowance limit, as you will be liable for any excess charges.
Azure for Students – Free Account Credit | Microsoft Azure
With Microsoft Azure for Students, get a $100 credit when you create your free account. No credit
card is needed, and you get 12 months of free Azure services.
Please register for this free account through https://azureforeducation.microsoft.com/devtools.
Registering this way means you will not be asked for a credit card number, so you do not risk
subsequently being billed; there is no need for you to incur charges in using MS Azure under this
scheme. Please follow this link and register correctly so that you do not risk incurring charges
for which you will be liable.
Assignment Case Study Description and Data Access Details for Task B
For this coursework there are two files containing lists of data. These are located on the Blackboard system in
the Big Data and Cloud Computing assignments directory – download them from:
Blackboard → Enrolments → CSMBD21-23-4MOD: Big Data and Cloud Computing (2023/24)
In the Assessment tab, select as follows:
Assessment → Cloud Computing → Coursework Data
The coursework data folder includes the files:
AComp_Passenger_data_no_error.csv
Top30_airports_LatLong.csv
The first data file contains details of passengers that have flown between airports over a certain period. The
data is in a comma delimited text file, one line per record, using the following format:
Passenger id                        Format: XXXnnnnXXn
Flight id                           Format: XXXnnnnX
From airport IATA/FAA code          Format: XXX
Destination airport IATA/FAA code   Format: XXX
Departure time (GMT)                Format: n[10] (Unix ‘epoch’ time)
Total flight time (mins)            Format: n[1..4]
Where X is an uppercase ASCII letter, n is a digit 0..9, and [n..m] is the min/max range of the number of
digits/characters in a string.
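The record format above translates directly into a validation pattern. The following is a minimal sketch of one way to parse and check a passenger record. The names `PassengerRecord`, `RECORD_PATTERN` and `parse_record` are hypothetical, and the sample line in the usage note is invented rather than taken from the coursework files.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class PassengerRecord(NamedTuple):
    passenger_id: str
    flight_id: str
    from_airport: str
    to_airport: str
    departure: datetime  # converted from Unix epoch seconds, in UTC
    flight_minutes: int


RECORD_PATTERN = re.compile(
    r"([A-Z]{3}[0-9]{4}[A-Z]{2}[0-9]),"  # passenger id: XXXnnnnXXn
    r"([A-Z]{3}[0-9]{4}[A-Z]),"          # flight id: XXXnnnnX
    r"([A-Z]{3}),"                       # from airport code: XXX
    r"([A-Z]{3}),"                       # destination airport code: XXX
    r"([0-9]{10}),"                      # departure time: 10-digit epoch
    r"([0-9]{1,4})"                      # total flight time: 1 to 4 digits
)


def parse_record(line: str) -> Optional[PassengerRecord]:
    """Return a parsed record, or None if the line does not match the format."""
    match = RECORD_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    pid, fid, src, dst, epoch, minutes = match.groups()
    departure = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
    return PassengerRecord(pid, fid, src, dst, departure, int(minutes))
```

For instance, `parse_record("ABC1234DE5,XYZ6789Q,LHR,JFK,1420564460,420")` yields a record, whereas a line with a missing or malformed field yields `None`, which makes such lines easy to count or log separately.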
The second data file is a list of airport data comprising the name, IATA/FAA code, and location of the airport.
The data is in a comma delimited text file, one line per record using the following format:
Airport Name            Format: X[3..20]
Airport IATA/FAA code   Format: XXX
Latitude                Format: n.n[3..13]
Longitude               Format: n.n[3..13]
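The airport file can be checked in a similar way. The following is a minimal sketch only: the names `Airport` and `parse_airport` are hypothetical, the latitude/longitude range check is an added assumption not stated in the format, and the coordinates in the usage note are illustrative values.

```python
from typing import NamedTuple, Optional


class Airport(NamedTuple):
    name: str
    code: str
    latitude: float
    longitude: float


def parse_airport(line: str) -> Optional[Airport]:
    """Return a parsed airport, or None if the line does not fit the format."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 4:
        return None
    name, code, lat_text, lon_text = fields
    if not 3 <= len(name) <= 20:  # airport name: 3 to 20 characters
        return None
    if not (len(code) == 3 and code.isascii() and code.isalpha() and code.isupper()):
        return None  # airport code: exactly three uppercase ASCII letters
    try:
        latitude, longitude = float(lat_text), float(lon_text)
    except ValueError:
        return None
    # Assumption: reject coordinates outside the usual geographic ranges.
    if not (-90.0 <= latitude <= 90.0 and -180.0 <= longitude <= 180.0):
        return None
    return Airport(name, code, latitude, longitude)
```

A dictionary keyed on `code` built from these records, for example from `parse_airport("LONDON,LHR,51.4775,-0.461389")`, could then be used to confirm that every airport code in the passenger file refers to a known airport.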
There are two additional data input files which can be used for analysis and validation but should not be
used for the final execution of the implemented jobs. These can be downloaded from this directory and are as
follows:
AComp_Passenger_data.csv
AComp_Passenger_data_no_error_DateTime.csv
Your Task:
Determine the passenger(s) having had the highest number of flights.
For this task, develop a MapReduce-like executable prototype (in Java, C, C++, or
Python). The objective is to develop the basic functional ‘building-blocks’ that will address the Task above, in a
way that emulates the MapReduce/Hadoop framework.
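To make the ‘building-blocks’ idea concrete, the following is a minimal single-process sketch of map, shuffle and reduce phases. It is one possible shape rather than a required design. The function names are hypothetical, each record is counted as one flight without de-duplication (an assumption), and the sample lines in the usage note are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit (passenger_id, 1) for every record with the expected six fields."""
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) == 6:
            yield fields[0], 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values under their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the values for one key to obtain that passenger's flight count."""
    return key, sum(values)


def top_passengers(lines: Iterable[str]) -> Tuple[List[str], int]:
    """Return every passenger sharing the highest flight count, and that count."""
    grouped = shuffle_phase(map_phase(lines))
    counts = dict(reduce_phase(key, values) for key, values in grouped.items())
    if not counts:
        return [], 0
    best = max(counts.values())
    return sorted(p for p, c in counts.items() if c == best), best
```

Given records in which passengers `AAA0001BB1` and `EEE0003FF3` each appear twice and `CCC0002DD2` once, `top_passengers` returns both tied passengers together with the count 2, which matches the "passenger(s)" wording of the task.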
The solution may use multi-threading as required. The marking scheme reflects the appropriate use of coding
techniques, succinct code comments as required, data structures and overall program design. The code should
be subject to version control best-practices using a hosted repository under your university username.
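As an illustration of how multi-threading might fit, the following is a minimal sketch that splits the input into chunks, runs a mapper with a local combiner on each chunk in a thread pool, and merges the partial counts. The function names and the default of four workers are hypothetical. Note that in CPython, threads mainly help with I/O-bound work, so any speed-up on this CPU-bound counting is not guaranteed.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def split_into_chunks(lines: Sequence[str], n_chunks: int) -> List[Sequence[str]]:
    """Split the input into at most n_chunks contiguous slices."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_and_combine(chunk: Sequence[str]) -> Counter:
    """Mapper with a local combiner: count flights per passenger within one chunk."""
    partial: Counter = Counter()
    for line in chunk:
        fields = line.strip().split(",")
        if len(fields) == 6:
            partial[fields[0]] += 1
    return partial


def count_flights_parallel(lines: Sequence[str], n_workers: int = 4) -> Counter:
    """Run one mapper per chunk in a thread pool, then merge the partial counts."""
    chunks = split_into_chunks(lines, n_workers)
    total: Counter = Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(map_and_combine, chunks):
            total.update(partial)  # reduce step: add partial counts together
    return total
```

Combining within each chunk before merging keeps the amount of data passed to the reduce step small, which mirrors the role of a combiner in a MapReduce framework.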
Write a brief report (no more than 3 pages, excluding any appendices), describing:
• The high-level description of the development of the prototype software;
• A simple description of the version control processes undertaken;
• A detailed description of the MapReduce functions implemented;
• The output format of any reports that the job is to produce.