Final Exam
MSBD 5002 Data Mining, Spring 2020
Due: 8:00am June 1st
Exam Guidelines
• This is an open book examination.
• Exam Duration: 8:00pm May 29 to 8:00am Jun 1.
• Turn In: You must submit your solutions via canvas before 8:00am Jun 1.
In order to avoid network congestion and submission failures, please submit
your solutions in advance. Any late submission will receive a mark of
zero directly.
• You must pack all solutions together into one zip file, named itsc studentID final.zip.
An example of the directory structure is shown in Fig. 1.
• Academic integrity:
– Your program and report should be based on your own effort; students cannot collaborate with anyone.
– If you seek help from any reference source, you should state it
clearly in your report. Failure to do so is considered plagiarism, which
will lead to appropriate disciplinary action.
– Plagiarism will lead to zero mark directly.
Figure 1: An example of the directory structure to be submitted.
Datasets
The data sets for Q1, Q2, Q5, Q6 and Q7 can be downloaded via the link or
Canvas. The data set for Q3 can be downloaded via the link.
Grading
Your score will be given based on:
1. Correctness, completeness and clarity in your report.
2. Code must be runnable, and code comments are necessary; missing the
necessary comments will result in a deduction.
3. Evaluation performance (if any).
4. You can earn a bonus if you propose a novel approach to any question other
than Q7.
Notes
Please read the following notes carefully.
1. Please read Exam Guidelines carefully.
2. Please work hard and manage your time sensibly.
3. In principle, Python is the preferred programming language.
4. Some questions are computationally demanding; students may use a cloud
computing platform such as Azure.
5. If your code or answers refer to any blog, GitHub repository, paper, and so on, please
provide the links in the corresponding QX report.pdf.
6. If you have any question that Google cannot answer, please send an email to
{hlicg,lwangcg,sdiaa}@connect.ust.hk. Questions from other channels will
not be answered.
1 Dimensionality Reduction on Data Features (10 Points)
In real-world problems, some data have high-dimensional features that contain
noise and impede the performance of machine learning models. In this question,
for simplicity, we provide a small toy dataset, and you are required to design two
dimensionality reduction algorithms to reduce the dimension of the features.
You are required to implement your dimensionality reduction algorithms manually,
but you are allowed to use any existing package for the classification model
or clustering model used to test the effect of your algorithms. For example, you feed
the features obtained from different dimensionality reduction algorithms into the same
classification or clustering model and compare the performance.
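As one possible hand-coded method (among many you could choose), principal component analysis (PCA) can be implemented with only basic NumPy operations; the array sizes below are toy placeholders, not the actual dataset:

```python
import numpy as np

def pca_reduce(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)             # center each feature
    # SVD of the centered data; rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T                # shape: n_samples x k

# Toy usage: reduce 400-dimensional points to 10 dimensions
X = np.random.rand(100, 400)
X_low = pca_reduce(X, 10)
print(X_low.shape)  # (100, 10)
```

The reduced features can then be fed to any off-the-shelf classifier or clustering model for comparison, as described above.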
1.1 Data Description
The dataset contains 3,686 points, each with a 400-dimensional feature vector.
In the dataset, X is the multi-dimensional point data and y is the labels (0 or 1).
1.2 Submission
1. Pack all code files in folder Q1 code.
2. You should write the pseudocode of your dimensionality reduction algo-
rithms in the report Q1 report.pdf and describe them in detail.
3. You should write the experiment settings in your report Q1 report.pdf.
For example, if you split the dataset, you should state it clearly.
4. You should compare the performance of your algorithms and analyze them
in the report Q1 report.pdf.
5. Please state clearly how to run your program in Q1 readme.md.
1.3 Notes
1. If your pseudocode includes the classification model or the clustering model,
you can denote it directly by a function instead of writing it out in detail.
2. How many dimensions you want to reduce depends on you.
2 Imbalanced Data Classification (15 Points)
In this task, you are required to build classifiers for five imbalanced data sets.
2.1 Data Description
You are provided with five imbalanced training data sets, as follows:
1. Bi-class Datasets v train.csv and p train.csv are data sets with binary
classes (e.g., positive, negative).
2. Multi-class Datasets y train.csv, e train.csv and a train.csv are data
sets with multiple classes.
2.2 Submission
1. Pack all code files in folder Q2 code.
2. Predict the test data and put your prediction results in folder Q2 output.
3. Please report the algorithm you utilized in detail, the experiment settings,
and the training accuracy in your report Q2 report.pdf.
4. Please state clearly how to run your program in Q2 readme.md.
2.3 Notes
1. You are allowed to use any existing package for this question. But please
clearly state your reference link in Q2 report.pdf.
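One common remedy for class imbalance, shown here as a minimal NumPy sketch over synthetic placeholder data, is random oversampling of the minority class before training; the matrix sizes and class ratio below are illustrative assumptions:

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples until every class matches the majority size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    Xs, ys = [X], [y]
    for c, n in zip(classes, counts):
        idx = np.where(y == c)[0]
        extra = rng.choice(idx, size=target - n, replace=True)
        Xs.append(X[extra])
        ys.append(y[extra])
    return np.concatenate(Xs), np.concatenate(ys)

# Toy usage: 90 negatives vs 10 positives -> 90 vs 90 after oversampling
X = np.random.rand(100, 5)
y = np.array([0] * 90 + [1] * 10)
X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))  # [90 90]
```

Alternatives such as undersampling, class weighting, or SMOTE-style synthetic samples are equally acceptable; whatever you use, state it in Q2 report.pdf.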
3 Fake Video Detection (15 Points)
Nowadays, many people enjoy sharing their lives via video platforms such as TikTok
and Kuaishou. However, due to the development of deepfake technologies,
many online videos are actually synthesised intentionally. In this problem, you
need to design and implement a fake video detection algorithm yourself.
3.1 Data Description
1. You can download the whole dataset from here.
2. Each sub-folder denotes one category of videos. For example, for videos in
the ./ApplyEyeMakeup folder, the labels are all ’ApplyEyeMakeup’. These
videos are all real, not synthesised.
3.2 Submission
1. Please write down your algorithm’s principles and details in Q3 report.pdf.
If your code refers to blogs, GitHub code, papers or anything else, please
cite them clearly.
2. Please put all your code of this question in folder Q3 code.
3. You need to put your fake video detection results in Q3 output.csv. You
can choose arbitrary fake and real videos online to do the validation. The
output format should be as follows:
The Name of Video      | Is This Video real?
Diving Bird.avi        | False
Obama Presenting.avi   | True
4. Please state clearly how to run your program in Q3 readme.md.
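A minimal way to write results in this format is Python's standard csv module; the two video entries below are placeholders, and the output filename should of course match the one required above:

```python
import csv

# Hypothetical results: video name -> whether the detector judged it real
results = {"Diving Bird.avi": False, "Obama Presenting.avi": True}

with open("Q3_output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["The Name of Video", "Is This Video real?"])
    for name, is_real in results.items():
        writer.writerow([name, is_real])
```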
3.3 Notes
1. You can use any algorithm that you know. You can NOT directly use complete
models that others have already trained to do classification without
any detailed process.
2. We will test your algorithm on a separate dataset and use its performance
as a grading criterion.
3. The term fake video means that the video does not originate from the
natural world; instead, it is generated by some algorithm from noise or
some distribution.
4 Stock Prediction (15 Points)
Everybody wants to earn money from the stock market. Is it possible to predict
future stock prices with data mining and machine learning technologies?
In this problem, you need to design and implement a stock price prediction
algorithm to predict the closing prices of AAPL (Apple), PG (Procter & Gamble)
and GM (General Motors) on June 2, 2020.
4.1 Data Description
1. You are free to collect arbitrary data and information from the internet,
such as stock prices, tweets, newspapers, etc.
2. Your data collection method is also not restricted.
4.2 Submission
1. Please write down your algorithm’s principles and details in Q4 report.pdf.
If your code refers to blogs, GitHub code, papers or anything else, please
cite them clearly.
2. Please put all your codes of this question in Q4 code folder.
3. You need to give your predictions for the closing prices of AAPL (Apple),
PG (Procter & Gamble) and GM (General Motors) on June 2 in Q4 output.txt.
The .txt file should contain only the float numbers representing the stock
prices, e.g., 318.25.
4. Please state clearly how to run your program in Q4 readme.md.
4.3 Notes
1. You can use any algorithm that you know. You can NOT directly use
complete models that others have already trained to do sentiment analysis
without any detailed process.
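As a deliberately naive baseline (not the required method), one could extrapolate a linear trend fitted to recent closing prices; the price series below is synthetic, and real predictions would need actual market data:

```python
import numpy as np

def linear_trend_forecast(closes, lookback=30):
    """Fit a least-squares line to the last `lookback` closing prices
    and extrapolate one trading day ahead."""
    recent = np.asarray(closes[-lookback:], dtype=float)
    t = np.arange(len(recent))
    slope, intercept = np.polyfit(t, recent, 1)   # degree-1 least-squares fit
    return slope * len(recent) + intercept        # next-day estimate

# Synthetic example: prices rising by 1.0 per day starting from 300
closes = [300.0 + i for i in range(60)]
print(round(linear_trend_forecast(closes), 2))  # 360.0
```

A serious solution would combine such price history with the other sources mentioned above (news, tweets, etc.) in a model of your own design.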
5 Frequent Itemset Mining (15 Points)
You are required to write programs to mine the frequent itemsets (min sup
= 100) in the provided data. Based on the frequent itemsets you mined, please
write programs to mine the closed frequent itemsets and the maximal frequent
itemsets.
5.1 Data Description
The benchmark data set is provided by the IBM Almaden Quest research
group and contains 1,000 items and 100,000 transactions. For simplicity,
each number uniquely identifies an item.
5.2 Task Description
1. Task 1 (5 points) Mine the frequent itemsets (min sup = 100).
2. Task 2 (5 points) Mine the closed frequent itemsets.
3. Task 3 (5 points) Mine the maximal frequent itemsets.
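Once the frequent itemsets and their supports are available (from Task 1), closed and maximal itemsets follow directly from their definitions; a brute-force sketch over a toy itemset dictionary (the data here is illustrative, not from the benchmark):

```python
def closed_and_maximal(freq):
    """freq: dict mapping frozenset itemset -> support count.
    Closed: no proper superset has the same support.
    Maximal: no proper superset is frequent at all."""
    closed, maximal = {}, {}
    for s, sup in freq.items():
        supersets = [t for t in freq if s < t]          # proper supersets of s
        if not any(freq[t] == sup for t in supersets):
            closed[s] = sup
        if not supersets:
            maximal[s] = sup
    return closed, maximal

# Toy example: {a} and {a,b} have equal support, so {a} is not closed
freq = {frozenset("a"): 3, frozenset("b"): 3,
        frozenset("ab"): 3, frozenset("bc"): 2, frozenset("c"): 2}
closed, maximal = closed_and_maximal(freq)
```

This quadratic scan is fine for a report-sized example; on 100,000 transactions you would pair it with an efficient frequent-itemset miner (e.g., Apriori or FP-growth) and report the running times as required.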
5.3 Submission
1. Pack all code files in folder Q5 code.
2. Following the sample submission, put your outputs in folder Q5 output.
3. Describe the algorithm you utilized for task 1 and the running time of
each task in Q5 report.pdf.
4. Please state clearly how to run your program in Q5 readme.md.
5.4 Notes
1. You are allowed to use any existing package for this question. But please
clearly state your reference link in Q5 report.pdf.
6 Community Detection (10 Points)
In many social networks, several different relations usually exist among the same
set of users. In this task, given several data sets, you are required to detect
the communities among 419 Twitter users. The ground truth consists of five
communities.
6.1 Data Description
419 users are referenced by their unique Twitter IDs in data ids.mtx. You are
provided with three different sources of these users’ social networks:
1. follow/followedby This folder records “follow” information between users.
2. mention/mentionedby This folder records “mention” information be-
tween users.
3. retweet/retweetedby This folder records “retweet” information be-
tween users.
6.2 Task Description
1. Task 1 (10 points) Detect 5 communities among the 419 users.
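One possible approach (among many) is spectral clustering on an adjacency matrix merged from the three relation sources; the sketch below uses a tiny synthetic graph with k = 2 in place of the 419-user, 5-community setting:

```python
import numpy as np

def spectral_communities(A, k, seed=0):
    """Cluster the nodes of adjacency matrix A into k communities using the
    unnormalized graph Laplacian and a basic k-means on its eigenvectors."""
    L = np.diag(A.sum(axis=1)) - A                  # unnormalized Laplacian
    _, vecs = np.linalg.eigh(L)                     # eigenvalues in ascending order
    emb = vecs[:, :k]                               # spectral embedding of the nodes
    rng = np.random.default_rng(seed)
    centers = emb[rng.choice(len(emb), size=k, replace=False)]
    for _ in range(100):                            # plain Lloyd iterations
        labels = np.argmin(((emb[:, None] - centers) ** 2).sum(-1), axis=1)
        for c in range(k):                          # keep the old center if a cluster empties
            pts = emb[labels == c]
            if len(pts):
                centers[c] = pts.mean(axis=0)
    return labels

# Toy graph: two triangles joined by a single bridge edge
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
labels = spectral_communities(A, 2)
```

For the real task you would build A from the follow/mention/retweet files (e.g., by summing or weighting the three adjacency matrices) and set k = 5; existing packages for modularity-based methods such as Louvain are also allowed here.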
6.3 Submission
1. Pack all code files in folder Q6 code.
2. Following the sample submission, put your outputs in folder Q6 output.
3. Describe in detail the algorithm you are using in Q6 report.pdf.
4. Please state clearly how to run your program in Q6 readme.md.
6.4 Notes
1. You are allowed to use any existing package for this question. But please
clearly state your reference link in Q6 report.pdf.
2. Do NOT use any external data.
7 Matrix Factorization in Graph Representation
and Recommendation System (20 Points)
7.1 Preliminary
Matrix Factorization (MF) techniques (e.g., Probabilistic Matrix Factorization and Non-Negative Matrix Factorization) have become the crux of many real-world scenarios, including graph representation and recommendation systems (RecSys), because they are powerful models for finding the hidden properties behind the data. More specifically, Non-Negative Matrix Factorization (NNMF) is a group of models in multivariate analysis and linear algebra where a matrix $A \in \mathbb{R}^{|B| \times |C|}$ is decomposed into $B \in \mathbb{R}^{|B| \times d}$ and $C \in \mathbb{R}^{|C| \times d}$ according to:
$$B, C = \arg\min_{B', C'} \left\| A - B'C'^\top \right\|_F^2, \qquad (1)$$
where $\|\cdot\|_F$ denotes the Frobenius norm.
Graph Representation Given a graph $G = (E, V)$ ($V$ is the set of nodes and $E$ is the set of edges), the adjacency matrix of $G$ is represented by $A \in \mathbb{R}^{|V| \times |V|}$, where $a_{ij} = 1$ if there is an edge $e \in E$ between node $v_i$ and node $v_j$, and $a_{ij} = 0$ otherwise. Following Eq. 1, you can represent the nodes $V$ by $\mathbf{V} \in \mathbb{R}^{|V| \times d}$ by minimizing the loss $\mathcal{L}_g = \left\| A - \mathbf{V}\mathbf{V}^\top \right\|_F^2$. You may refer to the last lecture slides for more details.
User-Movie RecSys Given a collection of users $U$ and a collection of movies $V$, let $X \in \mathbb{R}^{|U| \times |V|}$ denote the users' rating history, where $x_{ij} = 0$ if the $i$-th user $u_i$ has not rated the $j$-th movie $v_j$, and otherwise $x_{ij} \in (0, 5]$ represents the rating of $u_i$ for movie $v_j$. Following Eq. 1, you can represent the users $U$ and movies $V$ by $\mathbf{U} \in \mathbb{R}^{|U| \times d}$ and $\mathbf{V} \in \mathbb{R}^{|V| \times d}$ respectively, by minimizing the loss $\mathcal{L}_r = \left\| X - \mathbf{U}\mathbf{V}^\top \right\|_F^2$.
User-Movie RecSys with Movie Tags In real-world applications, it is important to handle the cold-start issue in RecSys, since new movies come out every day. The above user-movie RecSys cannot handle this issue well, since no user has seen a new movie before. This is a variant of MF techniques that alleviates the cold-start issue by incorporating the information of movie tags. Given a collection of movie tags $T$ and a collection of movies $V$, let $Y \in \mathbb{R}^{|T| \times |V|}$ denote the movie tag information, where $y_{ij} = 1$ if movie $v_j$ is tagged with $t_i$, and $y_{ij} = 0$ otherwise. Therefore, you can represent the users $U$, movies $V$, and tags $T$ by $\mathbf{U} \in \mathbb{R}^{|U| \times d}$, $\mathbf{V} \in \mathbb{R}^{|V| \times d}$, and $\mathbf{T} \in \mathbb{R}^{|T| \times d}$ respectively, by minimizing the loss $\mathcal{L}_t = \left\| X - \mathbf{U}\mathbf{V}^\top \right\|_F^2 + \left\| Y - \mathbf{T}\mathbf{V}^\top \right\|_F^2$ as:
$$\mathbf{U}, \mathbf{V}, \mathbf{T} = \arg\min_{\mathbf{U}', \mathbf{V}', \mathbf{T}'} \left( \left\| X - \mathbf{U}'\mathbf{V}'^\top \right\|_F^2 + \left\| Y - \mathbf{T}'\mathbf{V}'^\top \right\|_F^2 \right).$$
7.2 Task Description
You are required to code the three different matrix factorization techniques introduced above.
1. Task 1 (5 points) Given the graph data graph.npy, please represent nodes
V by minimizing Lg.
2. Task 2 (5 points) Given the rating data rating.csv, please represent users
U and movies V by minimizing Lr.
3. Task 3 (10 points) Given the rating data rating.csv and the movie data
movie.csv, please represent users U , movies V , and tags T by minimizing
Lt.
7.3 Submission
1. Pack all code files in folder Q7 code.
2. Please report the experiment settings (e.g., learning rate), your final losses
(Lg, Lr, and Lt), and the corresponding learning curves (X-axis: iteration;
Y-axis: loss) for the three tasks in Q7 report.pdf.
3. Please state clearly how to run your program in Q7 readme.md.
7.4 Notes
1. You are allowed to use any existing package for this question. But please
clearly state your reference link in Q7 report.pdf.
2. Hint: learning in Eq. 1 is usually more stable if U or V is updated in
each subproblem by gradient descent, reducing L slightly each time, such
as:
• while not converged:
– Minimize L over U by gradient descent with V fixed.
– Minimize L over V by gradient descent with U fixed.
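The alternating scheme above can be sketched for Eq. 1 in plain NumPy as follows; the learning rate, dimension d, and iteration count are illustrative assumptions, and the non-negativity constraint of NNMF is not enforced in this sketch:

```python
import numpy as np

def mf_alternating(A, d=8, lr=0.01, iters=2000, seed=0):
    """Minimize ||A - B C^T||_F^2 by alternating gradient steps on B and C.
    (Non-negativity constraints are omitted here for simplicity.)"""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    B = 0.1 * rng.standard_normal((n, d))
    C = 0.1 * rng.standard_normal((m, d))
    for _ in range(iters):
        B = B + lr * (A - B @ C.T) @ C       # step along -grad_B (constant factor folded into lr)
        C = C + lr * (A - B @ C.T).T @ B     # step along -grad_C, using the updated B
    return B, C

# Toy check: a random rank-3 matrix should be recovered almost exactly with d = 3
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 15))
B, C = mf_alternating(A, d=3)
loss = np.linalg.norm(A - B @ C.T) ** 2
```

The same loop structure carries over to Tasks 1-3 by substituting (V, V), (U, V), or the joint (U, V, T) updates for the two gradient steps, and by logging the loss each iteration to produce the required learning curves.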