Assignment 1: Basics and Map-Reduce
Formative, Weight (15%), Learning objectives (1, 2, 3),
Abstraction (4), Design (4), Communication (4), Data (5), Programming (5)
Due date: 11:59pm, 28 March, 2022
1 Overview
This assignment must be done in groups consisting of TWO students. You
MUST use the created groups on the assignment page to submit your group’s
work. Submissions made outside the group’s submission page may NOT be
marked. Every group needs to make ONLY ONE submission. If you have problems/questions
regarding grouping or require assistance, please use the discussion
forum, or email the teaching assistants directly. Note, in special cases, we
might have groups of a single member only (after seeking approval from the
course coordinator), but you will still need to follow the above instructions to
create a group and submit using the group interface.
2 Assignment
Exercise 1 Suspected Pairs (15 points)
Using the information from the first lecture (or Section 1.2.3 in the textbook),
what would be the number of suspected pairs if the following changes were made
to the data? (Note: all changes are to be applied at the same time.)
• The number of days of observation was raised to 5000.
• The number of people observed was raised to 5 billion (and there were
therefore 500,000 hotels).
• We only reported a pair as suspect if they were at the same hotel at the
same time on four different days.
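For reference, the calculation in Section 1.2.3 can be sketched in Java as below. This is a minimal sketch, not a model answer: the class and method names are hypothetical, and main uses the textbook's original setting (one billion people, 100,000 hotels, 1000 days, each person in a hotel on 1% of days, a pair suspected after two coincidences) rather than the modified one. It assumes that visits are independent and uniformly random, and that the textbook's "pairs of days" factor generalises to "sets of k days" when k coincidences are required.

```java
import java.math.BigInteger;

class SuspectedPairsSketch {
    // Exact binomial coefficient C(n, k) for small k; BigInteger avoids overflow.
    static BigInteger choose(long n, int k) {
        BigInteger result = BigInteger.ONE;
        for (int i = 0; i < k; i++) {
            result = result.multiply(BigInteger.valueOf(n - i))
                           .divide(BigInteger.valueOf(i + 1));
        }
        return result;
    }

    // Expected number of (pair of people, set of days) combinations that look suspicious.
    // Assumption: independent, uniformly random hotel visits.
    static double expectedSuspectedPairs(long people, long hotels, long days,
                                         int coincidences, double visitProbability) {
        // Probability that two given people are in the same hotel on one given day.
        double sameHotelOneDay = visitProbability * visitProbability / hotels;
        double pairsOfPeople = choose(people, 2).doubleValue();
        double setsOfDays = choose(days, coincidences).doubleValue();
        return pairsOfPeople * setsOfDays * Math.pow(sameHotelOneDay, coincidences);
    }

    public static void main(String[] args) {
        // Textbook baseline, not the modified setting of this exercise.
        double baseline = expectedSuspectedPairs(1_000_000_000L, 100_000L, 1_000L, 2, 0.01);
        System.out.printf("Expected suspected pairs (textbook baseline): %.0f%n", baseline);
    }
}
```

Because the sketch uses exact binomial coefficients, the baseline comes out slightly below the 250,000 that the textbook obtains with its approximations.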
COMP SCI 3306, COMP SCI 7306 Mining Big Data Semester 1, 2022
Exercise 2 TF-IDF (15 points)
• Q1: Explain what TF.IDF is and provide its formulation (Note that you
might see slightly different definitions in different sources; here the definition
in the textbook is acceptable).
• Q2: Suppose there is a repository of ten million documents. What (to the
nearest integer) is the IDF for a word that appears in (a) 40 documents
(b) 10,000 documents?
• Q3: Suppose there is a repository of ten million documents, and word w
appears in 320 of them. In a particular document d, the maximum number
of occurrences of a word is 15. Approximately what is the TF.IDF score
for w if that word appears (a) once (b) five times?
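The textbook definition (term frequency normalised by the most frequent term in the document, multiplied by the base-2 logarithm of the inverse document fraction) can be sketched in Java as follows. The class and method names are hypothetical, and main uses the textbook's own worked example (2^20 documents, a word appearing in 2^10 of them, a maximum term count of 20) rather than the figures from the questions above.

```java
class TfIdfSketch {
    // TF: occurrences of the word divided by the count of the most frequent word in the document.
    static double tf(int occurrences, int maxOccurrencesInDoc) {
        return (double) occurrences / maxOccurrencesInDoc;
    }

    // IDF: log base 2 of (total documents / documents containing the word).
    static double idf(long totalDocs, long docsWithWord) {
        return Math.log((double) totalDocs / docsWithWord) / Math.log(2.0);
    }

    static double tfIdf(int occurrences, int maxOccurrencesInDoc,
                        long totalDocs, long docsWithWord) {
        return tf(occurrences, maxOccurrencesInDoc) * idf(totalDocs, docsWithWord);
    }

    public static void main(String[] args) {
        long totalDocs = 1L << 20;     // 1,048,576 documents (textbook example)
        long docsWithWord = 1L << 10;  // the word appears in 1024 of them
        System.out.println("IDF = " + idf(totalDocs, docsWithWord));
        System.out.println("TF.IDF, 20 of max 20 = " + tfIdf(20, 20, totalDocs, docsWithWord));
        System.out.println("TF.IDF, 1 of max 20  = " + tfIdf(1, 20, totalDocs, docsWithWord));
    }
}
```

In the textbook's example these values work out to an IDF of 10 and TF.IDF scores of 10 and 1/2, up to floating-point rounding.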
Exercise 3 Hadoop Basics (20 points)
For this exercise, you will need to set up and configure your system to use
Hadoop in a virtual machine. Follow the instructions in the attached Hadoop
document to set up the virtual machine as described in Section 1. Run the example
program of Section 2 and carry out the different steps given in that
section. Note that, depending on your system, you might face some hurdles in
getting Hadoop running. You are expected to attempt to resolve the issues; if unsuccessful,
you may use the discussion forum or the workshops to seek assistance
from the teaching assistants. After you have Hadoop running according to the
attached document, follow the instructions below:
• Run your job on the attached file 100-0.txt in standalone mode and
pseudo-distributed mode and record the outputs. Describe every step
you take to check the outputs in different modes.
• Describe what task the provided code is trying to achieve and how the
two modes (i.e. standalone and pseudo-distributed) differ. Do you see
any difference in the outputs of these modes? Explain.
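One possible way to check the outputs of the two modes against each other is sketched below in Java. It assumes that each run produced a text output file of tab-separated key/value lines (the default layout of Hadoop's text output, typically a file named part-r-00000) and that the pseudo-distributed output has already been copied out of HDFS to the local file system. The class name, method names, and command-line arguments are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;

class OutputDiffSketch {
    // Parses "key<TAB>value" lines into a sorted map; lines without a tab are skipped.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> result = new TreeMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab >= 0) {
                result.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return result;
    }

    // Returns, in sorted order, the keys whose values differ or that occur in only one output.
    static List<String> differingKeys(Map<String, String> a, Map<String, String> b) {
        Set<String> keys = new TreeSet<>(a.keySet());
        keys.addAll(b.keySet());
        List<String> diffs = new ArrayList<>();
        for (String key : keys) {
            if (!Objects.equals(a.get(key), b.get(key))) {
                diffs.add(key);
            }
        }
        return diffs;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: OutputDiffSketch <standalone-output> <pseudo-distributed-output>");
            return;
        }
        Map<String, String> standalone = parse(Files.readAllLines(Paths.get(args[0])));
        Map<String, String> pseudo = parse(Files.readAllLines(Paths.get(args[1])));
        List<String> diffs = differingKeys(standalone, pseudo);
        System.out.println(diffs.isEmpty() ? "Outputs match." : "Differing keys: " + diffs);
    }
}
```

Comparing parsed key/value pairs rather than raw lines makes the check independent of the order in which the records were written.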
Exercise 4 Map-Reduce in Hadoop (30 points)
This exercise has 4 parts. You will write and implement
two separate MapReduce programs, described below. For each part,
you will have to run your program in pseudo-distributed mode and record the
output results.
Note 1: both problems below are NOT case sensitive, so you should transform
the words to lowercase or uppercase first, to avoid counting duplicates.
Note 2: you may use the StringTokenizer to find the correct answers.
Part 1: Write a program that processes the FirstInputFile (pg100.txt) and
the SecondInputFile (3399.txt) attached to the assignment. Your program will
need to count the number of words with a specific number of letters in those files;
for example, count the number of words with 4 letters, 5 letters, and so on.
If a specific word is repeated 20 times in the text, count it individually 20 times.
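The logic of Part 1 can be sketched in plain Java, without the Hadoop API, as below. This is only an illustration of the map and reduce steps, not a Hadoop job: the class and method names are hypothetical, and it assumes that a "word" is whatever StringTokenizer returns with its default whitespace delimiters, so punctuation attached to a token counts towards its length.

```java
import java.util.*;

class WordLengthCountSketch {
    // "Map" step: for each token in a line, emit its length (standing in for a (length, 1) pair).
    static List<Integer> mapLine(String line) {
        List<Integer> lengths = new ArrayList<>();
        // Lowercasing follows Note 1; it does not change the counts in this part.
        StringTokenizer tokens = new StringTokenizer(line.toLowerCase(Locale.ROOT));
        while (tokens.hasMoreTokens()) {
            lengths.add(tokens.nextToken().length());
        }
        return lengths;
    }

    // "Reduce" step: sum the emitted ones for each length.
    static Map<Integer, Long> countByLength(List<String> lines) {
        Map<Integer, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (int length : mapLine(line)) {
                counts.merge(length, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("The cat sat", "on the mat again");
        System.out.println(countByLength(sample));
    }
}
```

In a real job, mapLine would correspond to the body of the mapper and the summation to the reducer, with the word length as the key.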
Part 2: Answer Questions 1-6.
• Q1: How many words are there with length 10 in FirstInputFile?
• Q2: How many words are there with length 4 in FirstInputFile?
• Q3: What is the longest word length and what is its frequency
in FirstInputFile?
• Q4: How many words are there with length 2 in SecondInputFile?
• Q5: How many words are there with length 5 in SecondInputFile?
• Q6: What is the most frequent length and what is its frequency in SecondInputFile?
Part 3: Write a second program that again processes the FirstInputFile (pg100.txt)
and the SecondInputFile (3399.txt). As before, count the number of words with
a specific number of letters, but this time, if a word is repeated several
times, count it only once. So, your output will be the frequency of words with
the same length, counting each repeated word once only (i.e. unique words).
Note: solutions using either one or two MapReduce jobs are accepted.
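The unique-word variant can be sketched in plain Java in two phases, mirroring a two-job design: first collect the distinct lowercase words, then count them by length. As with any such sketch, the class and method names are hypothetical, no Hadoop API is used, and a "word" is assumed to be a token from StringTokenizer with its default whitespace delimiters.

```java
import java.util.*;

class UniqueWordLengthCountSketch {
    // Phase 1: collect each distinct word once, ignoring case.
    static Set<String> distinctWords(List<String> lines) {
        Set<String> words = new HashSet<>();
        for (String line : lines) {
            StringTokenizer tokens = new StringTokenizer(line.toLowerCase(Locale.ROOT));
            while (tokens.hasMoreTokens()) {
                words.add(tokens.nextToken());
            }
        }
        return words;
    }

    // Phase 2: count the distinct words by their length.
    static Map<Integer, Long> countDistinctByLength(List<String> lines) {
        Map<Integer, Long> counts = new TreeMap<>();
        for (String word : distinctWords(lines)) {
            counts.merge(word.length(), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("The cat sat", "on the mat again THE");
        System.out.println(countDistinctByLength(sample));
    }
}
```

Here lowercasing matters: without it, "The", "the" and "THE" would be counted as three different words of length 3.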
Part 4: Answer Questions 7-12 below:
• Q7: How many words are there with length 10 in FirstInputFile?
• Q8: How many words are there with length 4 in FirstInputFile?
• Q9: What is the most frequent length and what is its frequency in FirstInputFile?
• Q10: How many words are there with length 5 in SecondInputFile?
• Q11: How many words are there with length 2 in SecondInputFile?
• Q12: What is the second-most frequent length and what is its frequency
in SecondInputFile?
Exercise 5 Summary of Sections 2.4 and 2.5 (10 + 10 points)
For this exercise you will need to carefully read and understand Sections
2.3.9-2.3.11, 2.4, and 2.5 in Leskovec, Rajaraman, Ullman (third edition, 2020).
Then:
• Q1: Summarize the contents of Section 2.4 in your own words (approx.
600 words).
• Q2: Summarize the contents of Section 2.5 in your own words (approx.
600 words).
Note: it is expected that you demonstrate understanding of the above sections.
You may do so by explaining the content in your own words (i.e. paraphrasing)
and using diagrams/figures to help convey the concepts. The quality of
the summary is important, not just the word count.
3 General assignment submission guidelines
As stated at the beginning of the assignment, work MUST be submitted using
the group’s interface on MyUni, with a single submission per group ONLY. Each
submission must include the following, at minimum:
• a PDF file with your solutions to the theoretical exercises, and the
descriptions or results requested for each coding exercise above.
• all source files, in the exact original form that is used on your system to
run the program. This includes your code, all the Hadoop log files and all
related project files.
• a README.txt file containing instructions to run the code, the names of
the group members, student IDs, and email addresses.
Submissions that do not follow the above guidelines may lose points accordingly.
Please do not hesitate to reach out using the discussion forum, workshops,
or the contact details of the teaching assistants on the home page of MyUni,
should you have any questions or concerns.