
Final Exam 
MSBD 5002 Data Mining, Spring 2020 
Due: 8:00am June 1st 
Exam Guidelines 
• This is an open book examination. 
• Exam Duration: 8:00pm May 29 to 8:00am Jun 1. 
• Turn In: You must submit your solutions via Canvas before 8:00am Jun 1. 
In order to avoid network congestion and submission failure, please submit 
your solutions in advance. Any late submission will directly receive a zero 
mark. 
• You must pack all solutions together into one zip file, named itsc studentID final.zip. 
An example of the directory structure is shown in Fig. 1. 
• Academic integrity: 
– Your program and report should be based on your own effort. Students 
cannot collaborate with anyone. 
– In case you seek help from any reference source, you should state it 
clearly in your report. Failure to do so is considered plagiarism, which 
will lead to appropriate disciplinary actions. 
– Plagiarism will lead to zero mark directly. 
Figure 1: An example of the directory structure to be submitted. 
Datasets 
The data sets for Q1, Q2, Q5, Q6 and Q7 can be downloaded via the link or 
Canvas. The data set for Q3 can be downloaded via the link. 
Grading 
Your score will be given based on: 
1. Correctness, completeness and clarity in your report. 
2. Code must be runnable, and code comments are necessary. Missing the 
necessary comments will result in a deduction. 
3. Evaluation performance (if any). 
4. You can get a bonus if you propose a novel approach to any question other 
than Q7. 
Notes 
Please read the following notes carefully. 
1. Please read Exam Guidelines carefully. 
2. Please work hard and arrange your time reasonably. 
3. In principle, Python is the preferred programming language. 
4. Some questions are computationally demanding; students may use a cloud 
computing platform such as Azure. 
5. If your code or answers refer to any blog, GitHub repository, paper, and so on, 
please provide the links in the corresponding QX report.pdf. 
6. If you have any question that Google cannot answer, please send an email to 
{hlicg,lwangcg,sdiaa}@connect.ust.hk. Questions from other channels will 
not be answered. 
1 Dimensionality Reduction on Data Feature (10 Points) 
In real-world problems, some data have high-dimensional features that contain 
noise and impede the performance of machine learning models. In this question, 
for simplicity, we give a small toy dataset, and you are required to design two 
dimensionality reduction algorithms to reduce the dimension of the features. 
You are required to implement your dimensionality reduction algorithms 
yourself, but you are allowed to use any existing package for the classification 
or clustering model used to test the effect of your algorithms. For example, you feed 
the features obtained by the different dimensionality reduction algorithms into the 
same classification or clustering model and compare the performance. 
1.1 Data Description 
The dataset contains 3686 points, where each point has a 400-dimensional feature 
vector. In the dataset, X = multi-dimensional point data, y = labels (0 or 1). 
1.2 Submission 
1. Pack all code files in folder Q1 code. 
2. You should write the pseudo code of your dimensionality reduction algorithms 
in the report Q1 report.pdf and describe them in detail. 
3. You should write the experiment settings in your report Q1 report.pdf. 
For example, if you split the dataset, you should state it clearly. 
4. You should compare the performance of your algorithms and analyze them 
in the report Q1 report.pdf. 
5. Please state clearly how to run your program in Q1 readme.md. 
1.3 Notes 
1. If your pseudo code includes the classification model or the clustering model, 
you can denote it by a function directly instead of writing it out in detail. 
2. The number of dimensions you reduce to is up to you. 
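
For illustration only (this is not the required solution), below is a minimal sketch of one possible manually implemented reduction, PCA via SVD, evaluated with an off-the-shelf classifier as allowed above; the file names X.npy and y.npy and the target dimensions are assumptions about how the toy data might be stored.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def pca_reduce(X, k):
    """Project X onto its top-k principal components (implemented manually via SVD)."""
    Xc = X - X.mean(axis=0)                        # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                           # shape: (n_samples, k)

# Hypothetical file names; the actual dataset format may differ.
X = np.load("X.npy")    # expected shape (3686, 400)
y = np.load("y.npy")    # binary labels 0/1

for k in (400, 50, 10):
    Xk = X if k == 400 else pca_reduce(X, k)
    acc = cross_val_score(LogisticRegression(max_iter=1000), Xk, y, cv=5).mean()
    print(f"dim={k}: 5-fold accuracy={acc:.3f}")
```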
2 Imbalanced Data Classification (15 Points) 
In this task, you are required to model classifiers for 5 imbalanced data sets. 
2.1 Data Description 
You are provided with 5 imbalanced training data sets as follows: 
1. Bi-class Datasets: v train.csv and p train.csv are data sets with binary 
classes (e.g., positive, negative). 
2. Multi-class Datasets: y train.csv, e train.csv and a train.csv are data 
sets with multiple classes. 
2.2 Submission 
1. Pack all code files in folder Q2 code. 
2. Predict the test data and put your prediction results in folder Q2 output. 
3. Please report the algorithm you utilized in detail, the experiment settings, 
and the training accuracy in your report Q2 report.pdf. 
4. Please state clearly how to run your program in Q2 readme.md. 
2.3 Notes 
1. You are allowed to use any existing package for this question. But please 
clearly state your reference link in Q2 report.pdf. 
3 Fake Video Detection (15 Points) 
Nowadays, people all enjoy sharing their lives via video platforms like TikTok 
and Kuaishou. However, due to the development of deep fake technologies, 
many online videos are actually synthesised intentionally. In this problem, you 
need to design and implement a fake video detection algorithm by yourself. 
3.1 Data description 
1. You can download the whole dataset from here. 
2. Each sub-folder denotes one category of videos. For example, the videos in 
the ./ApplyEyeMakeup folder are all labelled ’ApplyEyeMakeup’. These 
videos are all real, not synthesised. 
3.2 Submission 
1. Please write down your algorithm’s principles and details in Q3 report.pdf. 
If your code refers to blogs, GitHub codes, papers or anything else, please 
cite them clearly. 
2. Please put all your code of this question in folder Q3 code. 
3. You need to put your fake video detection results in Q3 output.csv. You 
can choose arbitrary fake and real videos online to do the validation. The 
output format should be like: 

   The Name of Video        Is This Video real? 
   Diving Bird.avi          False 
   Obama Presenting.avi     True 
4. Please state clearly how to run your program in Q3 readme.md. 
3.3 Notes 
1. You can use any algorithm that you know, but you may NOT directly use 
complete models that others have already trained to do the classification 
without any detailed process of your own. 
2. We will test your algorithm on a separate dataset and use its performance 
as the grading criterion. 
3. The term fake video means that the video does not originate from the 
natural world; instead, it is generated by some algorithm from noise or 
some distribution. 
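
Purely as a sketch of one possible pipeline (still subject to the rule above that you must train your own model), the code below samples frames from a video with OpenCV and averages per-frame fake probabilities; frame_classifier stands for whatever frame-level model you train yourself and is not provided here.

```python
import cv2
import numpy as np

def sample_frames(path, n_frames=16):
    """Uniformly sample up to n_frames RGB frames from a video file."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, max(total - 1, 0), n_frames).astype(int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

def is_real(path, frame_classifier):
    """Return True if the average per-frame fake probability is at most 0.5."""
    frames = sample_frames(path)
    fake_probs = [frame_classifier(f) for f in frames]  # your own trained model
    return float(np.mean(fake_probs)) <= 0.5
```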
4 Stock Prediction (15 Points) 
Everybody wants to earn money from the stock market. Is it possible to predict 
future stock prices with data mining and machine learning technologies? 
In this problem, you need to design and implement a stock price prediction 
algorithm to predict the closing prices of AAPL (Apple), PG (Procter & Gamble) 
and GM (General Motors) on June 2, 2020. 
4.1 Data Description 
1. You are free to collect arbitrary data and information from the internet, 
such as stock prices, tweets, newspapers, etc. 
2. Your data collection method is also not restricted. 
4.2 Submission 
1. Please write down your algorithm’s principles and details in Q4 report.pdf. 
If your code refers to blogs, GitHub codes, papers or anything else, please 
cite them clearly. 
2. Please put all your codes of this question in Q4 code folder. 
3. You need to give your predictions for the closing prices of AAPL (Apple), 
PG (Procter & Gamble) and GM (General Motors) on June 2 in Q4 output.txt. 
The .txt file should contain only one float number representing the stock 
price, like 318.25. 
4. Please state clearly how to run your program in Q4 readme.md. 
4.3 Notes 
1. You can use any algorithm that you know, but you may NOT directly use 
complete models that others have already trained to do sentiment analysis 
without any detailed process of your own. 
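
As one hedged illustration only (the data source and the model are free choices here), the sketch below pulls daily closes with the yfinance package and fits a naive autoregressive linear model; the lag length, date range, and the use of yfinance are assumptions, and such a simple model is unlikely to be competitive on its own.

```python
import numpy as np
import yfinance as yf                      # one possible (unofficial) data source
from sklearn.linear_model import LinearRegression

def predict_next_close(ticker, lags=5):
    """Fit a naive autoregressive linear model on daily closes and predict the next close."""
    closes = yf.download(ticker, start="2019-06-01", end="2020-05-29")["Close"].dropna().values
    X = np.array([closes[i:i + lags] for i in range(len(closes) - lags)])
    y = closes[lags:]
    model = LinearRegression().fit(X, y)
    return float(model.predict(closes[-lags:].reshape(1, -1))[0])

for ticker in ("AAPL", "PG", "GM"):
    print(ticker, round(predict_next_close(ticker), 2))
```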
5 Frequent Itemset Mining (15 Points) 
You are required to write programs to mine the frequent itemsets (min sup 
= 100) in the provided data. Based on the frequent itemsets you mined, please 
write programs to mine the closed frequent itemsets and the maximal frequent 
itemsets. 
5.1 Data Description 
The benchmark data set is provided by the IBM Almaden Quest research 
group and contains 1,000 items and 100,000 transactions. For simplicity, 
each number uniquely identifies an item. 
5.2 Task Description 
1. Task 1 (5 points) Mine the frequent itemsets (min sup = 100). 
2. Task 2 (5 points) Mine the closed frequent itemsets. 
3. Task 3 (5 points) Mine the maximal frequent itemsets. 
5.3 Submission 
1. Pack all code files in folder Q5 code. 
2. Following the sample submission, put your outputs in folder Q5 output. 
3. Describe the algorithm you utilized for task 1 and the running time of 
each task in Q5 report.pdf. 
4. Please state clearly how to run your program in Q5 readme.md. 
5.4 Notes 
1. You are allowed to use any existing package for this question. But please 
clearly state your reference link in Q5 report.pdf. 
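
Since existing packages are allowed here, a minimal sketch using mlxtend's Apriori is shown below, followed by straightforward (quadratic-time) filters for closed and maximal itemsets; the input file name and its one-transaction-per-line format are assumptions about how the data is stored.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Assumed format: one transaction per line, item IDs separated by whitespace.
with open("transactions.txt") as f:
    transactions = [line.split() for line in f if line.strip()]

te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# min_sup = 100 out of 100,000 transactions -> relative support of 0.001
freq = apriori(df, min_support=100 / len(transactions), use_colnames=True)

# Closed: no proper superset has the same support; maximal: no proper superset is frequent.
support = {frozenset(s): p for s, p in zip(freq["itemsets"], freq["support"])}
closed = [s for s, p in support.items()
          if not any(s < t and support[t] == p for t in support)]
maximal = [s for s in support if not any(s < t for t in support)]
print(len(support), len(closed), len(maximal))
```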
6 Community Detection (10 Points) 
In many social networks, many different relations usually exist among the same 
set of users. In this task, given several datasets, you are required to detect 
the communities among 419 Twitter users. The ground truth consists of five 
communities. 
6.1 Data Description 
The 419 users are referenced by their unique Twitter IDs in data ids.mtx. You are 
provided with three different sources for these users’ social networks: 
1. follow/followedby: This folder records “follow” information between users. 
2. mention/mentionedby: This folder records “mention” information between 
users. 
3. retweet/retweetedby: This folder records “retweet” information between 
users. 
6.2 Task Description 
1. Task 1 (10 points) Detect the 5 communities among the 419 users. 
6.3 Submission 
1. Pack all code files in folder Q6 code. 
2. Following the sample submission, put your outputs in folder Q6 output. 
3. Describe in detail the algorithm you are using in Q6 report.pdf. 
4. Please state clearly how to run your program in Q6 readme.md. 
6.4 Notes 
1. You are allowed to use any existing package for this question. But please 
clearly state your reference link in Q6 report.pdf. 
2. Do NOT use any external data. 
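
As a hedged sketch of one allowed-package approach, the code below merges the relation matrices into a single undirected graph with networkx and applies greedy modularity maximization to obtain 5 communities; the .mtx file names are placeholders, the way the three relations are combined is a design choice rather than part of the specification, and the best_n argument requires a reasonably recent networkx.

```python
import numpy as np
import networkx as nx
from scipy.io import mmread
from networkx.algorithms.community import greedy_modularity_communities

# Placeholder file names: one 419x419 matrix per relation, in Matrix Market format.
relations = ["follow.mtx", "mention.mtx", "retweet.mtx"]
mats = [mmread(p) for p in relations]
A = sum(m.toarray() if hasattr(m, "toarray") else m for m in mats)

# Build an undirected, unweighted graph over the 419 users.
G = nx.from_numpy_array(((A + A.T) > 0).astype(int))

# Greedy modularity maximization, asking for exactly 5 communities.
communities = greedy_modularity_communities(G, cutoff=5, best_n=5)
for i, c in enumerate(communities):
    print(f"community {i}: {sorted(c)[:10]} ...")
```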
7 Matrix Factorization in Graph Representation and Recommendation System (20 Points) 
7.1 Preliminary 
Matrix Factorization (MF) techniques (e.g., Probabilistic Matrix Factorization and Non-Negative 
Matrix Factorization) have become the crux of many real-world scenarios, including graph 
representation and recommendation systems (RecSys), because they are powerful models for 
finding the hidden properties behind the data. More specifically, Non-Negative Matrix 
Factorization (NNMF) is a group of models in multivariate analysis and linear algebra where a 
matrix $A \in \mathbb{R}^{|B| \times |C|}$ is decomposed into $B \in \mathbb{R}^{|B| \times d}$ and $C \in \mathbb{R}^{|C| \times d}$ according to: 

$$B, C = \arg\min_{B', C'} \left\| A - B' C'^{\top} \right\|_F^2, \qquad (1)$$

where $\| \cdot \|_F$ denotes the Frobenius norm. 
Graph Representation. Given a graph $G = (E, V)$ ($V$ is the set of nodes 
and $E$ is the set of edges), the adjacency matrix of $G$ is represented by 
$A \in \mathbb{R}^{|V| \times |V|}$, where $x_{ij} = 1$ if there is an edge $e \in E$ between node $v_i$ and 
node $v_j$, and $x_{ij} = 0$ otherwise. Following Eq. 1, you can represent the nodes $V$ as 
$\mathbf{V} \in \mathbb{R}^{|V| \times d}$ by minimizing the loss $L_g = \left\| A - \mathbf{V} \mathbf{V}^{\top} \right\|_F^2$. You may refer to the 
last lecture slides for more details. 
User-Movie RecSys. Given a collection of users $U$ and a collection of 
movies $V$, let $X \in \mathbb{R}^{|U| \times |V|}$ denote the user history of rating items, where 
$x_{ij} = 0$ if the $i$-th user $u_i$ has not rated the $j$-th movie $v_j$, and otherwise $x_{ij} \in (0, 5]$ 
represents the rating of $u_i$ for movie $v_j$. Following Eq. 1, you can represent the 
users $U$ and the movies $V$ as $\mathbf{U} \in \mathbb{R}^{|U| \times d}$ and $\mathbf{V} \in \mathbb{R}^{|V| \times d}$ respectively, by 
minimizing the loss $L_r = \left\| X - \mathbf{U} \mathbf{V}^{\top} \right\|_F^2$. 
User-Movie RecSys with Movie Tags. In real-world applications, it is 
important to handle the cold-start issue in RecSys, since new movies 
come out every day. The above user-movie RecSys cannot handle this 
issue well, since no user has seen a new movie before. This is a variant of MF 
techniques that alleviates the cold-start issue by incorporating the information of 
movie tags. Given a collection of movie tags $T$ and a collection of movies $V$, let 
$Y \in \mathbb{R}^{|T| \times |V|}$ denote the movie tag information, where $y_{ij} = 1$ if movie $v_j$ is tagged 
with $t_i$, and $y_{ij} = 0$ otherwise. Therefore, you can represent the users $U$, the movies $V$, and 
the tags $T$ as $\mathbf{U} \in \mathbb{R}^{|U| \times d}$, $\mathbf{V} \in \mathbb{R}^{|V| \times d}$, and $\mathbf{T} \in \mathbb{R}^{|T| \times d}$ respectively, by minimizing 
the loss $L_t = \left\| X - \mathbf{U} \mathbf{V}^{\top} \right\|_F^2 + \left\| Y - \mathbf{T} \mathbf{V}^{\top} \right\|_F^2$, i.e.: 

$$\mathbf{U}, \mathbf{V}, \mathbf{T} = \arg\min_{\mathbf{U}', \mathbf{V}', \mathbf{T}'} \left( \left\| X - \mathbf{U}' \mathbf{V}'^{\top} \right\|_F^2 + \left\| Y - \mathbf{T}' \mathbf{V}'^{\top} \right\|_F^2 \right)$$
7.2 Task Description 
You are required to code three different matrix factorization techniques as 
introduced above. 
1. Task 1 (5 points) Given the graph data graph.npy, please represent nodes 
V by minimizing Lg. 
2. Task 2 (5 points) Given the rating data rating.csv, please represent users 
U and movies V by minimizing Lr. 
3. Task 3 (10 points) Given the rating data rating.csv and the movie data 
movie.csv, please represent users U , movies V , and tags T by minimizing 
Lt. 
7.3 Submission 
1. Pack all code files in folder Q7 code. 
2. Please report the experiment settings (e.g., learning rate), your final losses 
($L_g$, $L_r$, and $L_t$), and the corresponding learning curves (X-axis: iteration; 
Y-axis: loss) for the three tasks in Q7 report.pdf. 
3. Please state clearly how to run your program in Q7 readme.md. 
7.4 Notes 
1. You are allowed to use any existing package for this question. But please 
clearly state your reference link in Q7 report.pdf. 
2. Hint: Learning in Eq. 1 is usually more stable if only one of U and V is updated in 
each subproblem by gradient descent, reducing L slightly each time, such 
as: 
• while not converged: 
– Minimize L over U by gradient descent with V fixed. 
– Minimize L over V by gradient descent with U fixed. 
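
A minimal NumPy sketch of the alternating scheme above (for the plain Eq. 1 loss; extending it to $L_t$ just adds the gradients of the tag term) is given below; the dimension d, learning rate, and iteration count are arbitrary assumptions that will need tuning, and an optional non-negativity projection is shown commented out in case NNMF constraints are required.

```python
import numpy as np

def factorize(X, d=16, lr=1e-4, iters=500, seed=0):
    """Minimize ||X - U V^T||_F^2 by alternating gradient steps on U and V."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(X.shape[0], d))
    V = rng.normal(scale=0.1, size=(X.shape[1], d))
    losses = []
    for _ in range(iters):
        R = X - U @ V.T
        U += lr * 2 * R @ V          # gradient step on U with V fixed
        # U = np.maximum(U, 0)       # optional projection for non-negative MF
        R = X - U @ V.T
        V += lr * 2 * R.T @ U        # gradient step on V with U fixed
        # V = np.maximum(V, 0)
        losses.append(float(np.linalg.norm(X - U @ V.T) ** 2))
    return U, V, losses
```

Recording the per-iteration losses as above directly gives the learning curve that the submission section asks you to report.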
 