辅导Assignment 2 Advanced Map-Reduce, Similar

Assignment 2: Advanced Map-Reduce, Similar
Items and Data Streams
Formative, Weight (15%), Learning objectives (1, 2, 3),
Abstraction (4), Design (4), Communication (4), Data (5), Programming (5)
Due date: 11 : 59pm, 11 April, 2022
1 Overview
This assignment must be done in groups consisting of TWO students. You
MUST use the created groups for assignment 2, accessible from the ”People”
tab on MyUni to enrol yourself to a group before submission. Submissions
made without enrolling in a group on MyUni may NOT be marked. Every
group need to make ONLY ONE submission. If you have problems/questions
regarding grouping or require assistance, please use the discussion forum, or
email the teaching assistants directly.
Exercise 1 S-curve (exercise 3.4.1 in Leskovec, Rajaraman and Ullman) (15
points)
Write a program that evaluates and plots the S-curve 1 − (1 − s
r
)
b
for the
values s ∈ [0, 1], for the following values of r and b (one separate evaluation for
each of the three):
1. r=3 and b=10.
2. r=6 and b=20.
3. r=5 and b=50.
Note for this exercise you are allowed to use either of the following programming
languages: C++, Java or Python. Note the readibility of the plots is important
here, i.e. Make sure you have approperiate axes names and range of values in
your visualisations. Please submit both your code and include the visualisations
in your report.
1
COMP SCI 3306, COMP SCI 7306 Mining Big Data Semester 1, 2022
Exercise 2 Filtering Streams (similar to Exercises of 4.3 in Leskovec, Rajaraman
and Ullman) (15 points)
Solve the following problems:
1. For the situation of our running example of Section 4.3.1 with changed
conditions (10 billion bits, 2 billion members of the set S), calculate the
false-positive rate when using three hash functions. Do the same for four
hash functions.
2. As a function of n, the number of bits and m the number of members
in the set S, what number of hash functions minimizes the false-positive
rate?
Exercise 3 Map-Reduce: Friend Recommendation System (40 points)
Write a MapReduce program in Hadoop that implements a People You Might
Know social network friendship recommendation algorithm. The key idea is
that if two people have a lot of mutual friends, then the system will recommend
them to one another to get connected. You will have to run the program in
Hadoop, using the system setup completed in Assignment 1 in order to receive
points for this exercise.
Input File: Download the input file from the assignment page, to your
Friend Recommendation System. This input file contains the adjacency list of
friends and has multiple lines in the following format:

Here, is a unique integer ID corresponding to a unique user and
is a comma separated list of unique IDs corresponding to the friends
of the user with the unique ID . Note that the friendships are mutual
(i.e., edges are undirected): if A is friend with B then B is also friend with A.
Algorithm: You will need to design an algorithm such that, for each user
U, the algorithm recommends N = 10 users who are not already friends with
U, but have the most number of mutual friends in common with U.
Expected Output: The output has to contain one line per user in the
following format:

where is a unique ID corresponding to a user and
2
COMP SCI 3306, COMP SCI 7306 Mining Big Data Semester 1, 2022
is a comma separated list of unique IDs corresponding to the algorithms recommendation
of people that might know, ordered in decreasing number
of mutual friends. Even if a user has less than 10 second-degree friends, output
all of them in decreasing order of the number of mutual friends. If there are
recommended users with the same number of mutual friends, then output those
user IDs in numerically ascending order. Also, please provide a description of
how you are going to use MapReduce jobs to solve this problem.
Note: It is possible to solve this question with a single MapReduce job. But if
your solution requires multiple map reduce jobs, then that is fine too.
Your submission is expected to include the following:
• Include all your source code, in the exact system setup that you have run
the program. This includes all the log files, input/ outputs etc.
• Write-up: describe the algorithm you have designed to solve this problem
in enough details to understand your approach. Note this is different
to writing comments about your code. It should describe the high level
idea of your algorithm. It may include diagrams, graphs etc. for ease of
understanding.
• Results: in your write-up include the recommendations for the users with
following IDs: 924, 8941, 8942, 9019, 9020, 9021, 9022, 9990, 9992, 9993.
Exercise 4 Data streams (10 points)
Follow the scenario 1 and 2 below and answer the related questions regarding
the Flajolet-Martin Algorithm. The hash functions are of the form h(x) = ax+b
mod 32 for some a and b. You should treat the result as a 5-bit binary integer
1. Scenario 1: Suppose a data stream consists of the integers 3, 1, 4, 6, 5, 9.
Determine (a) the maximum tail length for each stream element and (b)
the resulting estimate of the number of distinct elements for the hash
functions in Question 1-3 below.
– Question 1: Hash function: h(x) = (2x + 1) mod 32
– Question 2: Hash function: h(x) = (3x + 7) mod 32
– Question 3: Hash function: h(x) = 4x mod 32
2. Scenario 2: Suppose a data stream consists of the integers 4, 5, 6, 7, 10, 15.
Determine (a) the maximum tail length for each stream element and (b)
the resulting estimate of the number of distinct elements for the hash
functions in Question 4-6 below.
3
COMP SCI 3306, COMP SCI 7306 Mining Big Data Semester 1, 2022
– Question 4: Hash function: h(x) = (6x + 2) mod 32
– Question 5: Hash function: h(x) = (2x + 5) mod 32
– Question 6: Hash function: h(x) = 2x mod 32
Exercise 5 Summary of 3.6 and 3.7 (20 points)
For this exercise you have to read Section 3.6 and 3.7 in Leskovec, Rajaraman,
Ullman (third edition, 2020).
1. Summarize the content of 3.6 in your own words (approx. 600 words).
2. Summarize the content of 3.7 in your own words (approx. 600 words).
Note: it is expected that you demonstrate understanding of the above Sections.
You may do so by explaining the content in your own words (i.e. paraphrasing)
and using digrams/ figures the convey the concept.
2 General assignment submission guidelines
As stated in the beginning of the assignment, work MUST be submited using
the group’s interface on MyUni, and a single submission per group, ONLY. The
submissions will include the following, at minimum:
• PDF file of your solutions for theoretical exercises, descriptions of the
coding exercises or the results as requested per each exercise above.
• all source files, in the exact original form that is used on your system to
run the program. This includes your code, all the Hadoop log files and all
related project files.
• a README.txt file containing instructions to run the code, the two group
members’ names, student IDs, and email addresses.
• the submissions that do not follow the above guidlines may lose points
accordingly.
Please do not hesitate to reach out using the discussion forum, workshops,
or the contact details of the teaching assistants on the home page of MyUni,
should you have any questions or concerns.