首页 > > 详细

讲解 CS3DP Distributed and Parallel Computing讲解 Python语言

Department of Computer Science

Summative Coursework Set Front Page

Module Title:  Distributed and Parallel Computing

Module Code:  CS3DP

Type of Assignment:  Coursework

Individual / Group Assignment:  Individual

Weighting of the Assignment:  50%

Total marks: 100

Page limit/Word count:  4 pages

Expected hours spent for this assignment:  20 hours

Items to be submitted:  4-page report  (excluding title page, diagrams, graphs and references) in pdf format

Work to be submitted on-line via Blackboard Learn by:  Monday 12th  May 12 Noon

Work will be marked and returned by: Tuesday 4th June.

NOTES

By submitting this work, you are certifying that it is all your sentences, figures, tables, equations, code snippets, artworks, and illustrations in this report are original and have not been taken from any other person's work except where explicitly the works of others have  been acknowledged, quoted, and  referenced. You understand that failing to do so will be considered a case of plagiarism. Plagiarism is a form. of academic misconduct and will be penalised accordingly. The University’s Statement of Academic Misconduct is available on the University web pages.

If your work is submitted after the deadline, 10% of the maximum possible mark will be deducted for each working day (or part of) it is late. A mark of zero will be awarded if  your  work  is  submitted   more  than  5  working  days   late.  You   are  strongly recommended to hand work in by the deadline as a late submission on one piece of work can impact on other work.

If you believe that you have a valid reason for failing to meet a deadline then you should make an Exceptional Circumstances request and submit it before the deadline, or as soon as is practicable afterwards, explaining why. To make such a request log on to  RISIS  and  on  the  Actions  tab  select  Exceptional  Circumstance:  as  explained  at

https://www.reading.ac.uk/essentials/The-Important-Stuff/Rules-and- regulations/Exceptional-Circumstances

1. Assessment Classifications

First Class (>= 70%)

Outstanding/excellent work with correct results, a good presentation  of  the  theoretical  concepts  applied  to applications.  A   good  presentation  of  the   code  and results,   and   a   critical   analysis   of   the    results.   An outstanding  work  will  present  full  solutions  with  an insightful discussion.

Upper Second (60-69%)

Very good work with partial correct results: most work has  been  carried  out  correctly.  Some  tasks  have  not been  carried  out  or  are  not  completely  correct.  The presentation   is   good,    well   structured,   clear   and complete with respect to the work done.

Lower Second (50-59%)

Good work, which is missing some significant part of the assignment, and/or with partially correct results. Some tasks have not been carried out. The presentation is, in general,  accurate  and  complete,  but  it  lacks  clarity (presentation quality).

Third (40-49%)

Acceptable solutions to limited part of the assignment. Some tasks have not been carried out. Some results may not be complete or technically sound. The presentation is not accurate, complete and lacks clarity.

Pass (35-39%)

Partial solutions to limited part of the assignment. Some tasks have not been carried out. Some results may not be complete or technically sound. The presentation is not accurate, complete and lacks clarity.

Fail (0-34%)

Incomplete solutions to limited part of the assignment. Most tasks  have  not  been  carried  out  with  sufficient accuracy.  Results  may  not  be  correct  or  technically sound. The presentation is not accurate, complete and lacks clarity.

2. Assignment Description

Task 1. Understanding the CAP Theorem. [25 marks]

Choose a real-life distributed computing system, such as the distributed Domain Name Server (DNS), not one that we’ve introduced in the lectures. Analyse its distributed architecture and the rationale behind its design including aspects such as data storage, algorithms, and investigate how the CAP theorem is applied in the design and operations.

Analyse and discuss the following:

a)   The semantical properties: concurrency, availability and durability

b)   The wish-list for distributed software, consider whether and how the chosen system has the following properties: High-availability, Fault tolerance, scalability, Extensibility, Usability, debuggability and Efficiency.

c)   How is the CAP theorem applied for the chosen distributed system and why?

Please note: your discussion should incorporate specific function/service examples provided by the system to show your understanding of the CAP theorem. A general introduction to the chosen distributed computing system and the semantical properties is not sufficient.

Task 2. Understanding the Hadoop MapReduce Framework [35 marks]

You have been provided with a dataset below containing web server log files from a large e-commerce platform. Each log entry contains:

•    IP Address (e.g., 192.168.1.1)

•   Timestamp (e.g., [10/Jan/2024:10:05:23 +0000])

•    Requested URL (e.g., "GET /product/123 HTTP/1.1")

•    HTTP Status Code (e.g., 200)

•    User Agent (e.g., Mozilla/5.0)

192.168.1.1 - - [10/Jan/2024:10:05:23 +0000] "GET/home HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.2 - - [10/Jan/2024:10:06:12 +0000] "GET/product/123 HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.3 - - [10/Jan/2024:10:07:45 +0000] "GET/cart HTTP/1.1" 302 "Mozilla/5.0"

192.168.1.1 - - [10/Jan/2024:10:08:50 +0000] "GET/home HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.4 - - [10/Jan/2024:10:10:30 +0000] "GET/checkout HTTP/1.1" 500 "Mozilla/5.0"

192.168.1.2 - - [10/Jan/2024:10:15:42 +0000] "GET/product/456 HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.5 - - [10/Jan/2024:10:20:18 +0000] "GET/home HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.6 - - [10/Jan/2024:10:22:30 +0000] "GET/product/789 HTTP/1.1" 404 "Mozilla/5.0"

192.168.1.7 - - [10/Jan/2024:10:25:47 +0000] "GET/cart HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.3 - - [10/Jan/2024:10:30:59 +0000] "GET/home HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.8 - - [10/Jan/2024:10:35:10 +0000] "GET/checkout HTTP/1.1" 500 "Mozilla/5.0"

192.168.1.2 - - [10/Jan/2024:10:40:25 +0000] "GET/product/123 HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.2 - - [10/Jan/2024:10:42:47 +0000] "GET/cart HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.4 - - [10/Jan/2024:10:43:45 +0000] "GET /product/357 HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.9 - - [10/Jan/2024:10:45:33 +0000] "GET/product/123 HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.2 - - [10/Jan/2024:10:46:30 +0000] "GET/checkout HTTP/1.1" 500 "Mozilla/5.0"

192.168.1.5 - - [10/Jan/2024:10:46:40 +0000] "GET /product/235  HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.5 - - [10/Jan/2024:10:50:45 +0000] "GET /product/357 HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.9 - - [10/Jan/2024:10:51:40 +0000] "GET /product/246  HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.5 - - [10/Jan/2024:10:52:47 +0000] "GET/cart HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.5 - - [10/Jan/2024:10:52:52 +0000] "GET /product/579 HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.9 - - [10/Jan/2024:10:53:50 +0000] "GET /product/135 HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.5 - - [10/Jan/2024:10:55:50 +0000] "GET/checkout HTTP/1.1" 500 "Mozilla/5.0"

192.168.1.10 - - [10/Jan/2024:11:00:12 +0000] "GET/contact HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.6 - - [10/Jan/2024:11:05:29 +0000] "GET/product/456 HTTP/1.1" 404 "Mozilla/5.0"

192.168.1.2 - - [10/Jan/2024:11:15:00 +0000] "GET/product/123 HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.5 - - [10/Jan/2024:11:35:10 +0000] "GET/checkout HTTP/1.1" 500 "Mozilla/5.0"

192.168.1.8 - - [10/Jan/2024:11:40:42 +0000] "GET/product/789 HTTP/1.1" 404 "Mozilla/5.0"

192.168.1.9 - - [10/Jan/2024:11:50:30 +0000] "GET/home HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.10 - - [10/Jan/2024:12:00:05 +0000] "GET/checkout HTTP/1.1" 500 "Mozilla/5.0"

192.168.1.11 - - [10/Jan/2024:12:05:15 +0000] "GET/product/123 HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.12 - - [10/Jan/2024:12:10:25 +0000] "GET/home HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.13 - - [10/Jan/2024:12:15:40 +0000] "GET/cart HTTP/1.1" 302 "Mozilla/5.0"

192.168.1.14 - - [10/Jan/2024:12:20:00 +0000] "GET/checkout HTTP/1.1" 500 "Mozilla/5.0"

192.168.1.15 - - [10/Jan/2024:12:25:50 +0000] "GET/product/456 HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.20 - - [10/Jan/2024:12:50:10 +0000] "GET/cart HTTP/1.1" 200 "Mozilla/5.0"

192.168.1.21 - - [10/Jan/2024:12:55:55 +0000] "GET/checkout HTTP/1.1" 500 "Mozilla/5.0"

Your task is to use MapReduce to analyse the logs and extract the information - the Top 10 most frequently accessed URLs within the hour peak hour (the hour with the most requests), and the number of unique visitors to each of the Top 10 URLs within the peak traffic hour. You need to break down this complicated task into multiple MapReduce jobs and justify how the jobs can lead to the desired result in the end.

In your report, you are required to:

a)   Describe the decomposition of the task into a series of MapReduce jobs and justify the  reasoning.  For each MapReduce job,  identify the input, output, Mapper function, and Reducer function.

b)   Write python programs for the map and reduce functions and run the MapReduce jobs in the Hadoop local setup. The code should be included in an appendix to the report and is excluded from the 4-page length limit.

c)   In the main report, you need to provide screenshots for the generated result of each MapReduce job. Please note that the results should be the execution results of Hadoop, not the local testing results for your Python programs.

Task 3. Understanding Parallel Programming with Spark [35 marks]

Pi is an essential number in mathematics. It is both transcendental, which means it cannot be determined as a solution to an algebraic equation (polynomial with integer coefficients) and irrational, which means it cannot be expressed as a ratio 

of integers. Determining its value is not so simple. Here is a method which uses random numbers.

Figure 1: Unit circle with radius r   =  1 and a square with side of length one.

Figure 1 is a unit circle, with radius r   =   1 and a square with unit length side.

The area of the square is As   =  r   ×  r   =   1              (1)

The area of the circle is    Ac    =  πr2    =  π                   (2)

The area of the upper quadrant contained in the black square is a quarter of the circle. Darts are thrown at random into the black square and counted to see how many fall within the quadrant.  The  number of darts inside the quadrant  is proportional to the area, and so π can be estimated by taking the ratio of darts that fall inside and outside.

Here is the function in python using random numbers for x and y coordinates to compute whether a dart landing inside or outside of a unit circle.

import random # function to compute whether a "dart" is inside the unit circle def sample(p): x, y = random.random(), random.random() return 1 if x*x + y*y < 1 else 0

Based on the key computation given above, please produce a python program to calculate Pi using pyspark. The program should be able to accept the number of samples (darts) and the number of partitions as two input arguments, and output the value of Pi given the input values.

from pyspark import SparkContext import sys #Define the sample() function here # You should start from creating a Spark Context sc=SparkContext(appName="miPi") # Take the number of samples NUM_SAMPLES=int(sys.argv[1]) # Next you should parallelise the samples in Spark. # Remember to take the number of partitions from an input argument # Your code of using spark should go here. #The number of darts inside the quadrant should be stored in an integer variable count # Calculating Pi, python3 syntax print("Pi:"+str(NUM_SAMPLES)+":"+str(4.0*count/NUM_SAMPLES))

You should also investigate how the accuracy of the calculation scales with the number of darts and the number of parallel elements.

In the report, you are required to include the following components:

a)   Screenshot of your python code (as an appendix, excluded from the length limit)

b)   In the main report, two plots respectively showing how the accuracy varies versus

a.   the number of darts at a fixed partition number;

b.   the number of partitions at a fixed number of samples.

c)   Brief discussion and justification on the observations. Explore the reasons behind unexpected observations.




联系我们
  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-21:00
  • 微信:codinghelp
热点标签

联系我们 - QQ: 99515681 微信:codinghelp
程序辅导网!