辅导COMPSCI 753编程、辅导python编程、data留学生程序讲解辅导R语言程序|讲解R语言编程

COMPSCI 753
Algorithms for Massive Data
Assignment 4 / Semester 2, 2020
Recommender Systems
Kaiqi Zhao
Submission:
Please submit a single pdf file on CANVAS by 23:59 NZST, Sunday 25 October. The answer
file must contain your student ID and name.
Penalty Dates:
The assignment will not be accepted after the last penalty date unless there are special
circumstances (e.g., sickness with certificate). Penalties will be calculated as follows as a
percentage of the mark for the assignment.
• 23:59 NZST, Sunday 25 October – No penalty
• 23:59 NZST, Monday 26 October – 25% penalty
• 23:59 NZST, Tuesday 27 October – 50% penalty
1 Assignment problem (50 pts)
Recommender systems are widely used in many industrial companies, especially locationbased
service such as Yelp 1
. Yelp provides a very big dataset for research purpose (See
the Kaggle page https://www.kaggle.com/yelp-dataset/yelp-dataset). In this assignment,
we will explore this dataset using the learned recommendation algorithms. To make
the task feasible on most of the laptops and PCs, we have extracted a managable subset
of the datasets, which contains the reviews on businesses in Las Vegas. The dataset could
be found in the assignment page. The data format for users and businesses are exactly the
same as in the description on the Kaggle page. The review file only preserves the review
id, business id, user id, stars (ratings) and date to keep the dataset small.
Preprocessing: Split the review data into training and test datasets. Use the 20% latest
reviews (with largest “date” attribute) for testing and the remaining 80% for training.
The collaborative filtering algorithm does not need to train, but the similarity computation
can only conduct using ratings in the training set.
This assignment has the following tasks:
1. Item-based collaborative filtering (25 pts): In this task, we will build an itembased
CF algorithm on Yelp.
(a) Our first step is to pre-compute the similarity between each pair of businesses
(items) on the training data, such that we don’t have to compute the similarities
on-the-fly. Here, we use cosine similarity for simplicity. Report (1) the time
for computing all pairwise similarities, and (2) report the similarities
between the following pairs of businesses. (10 pts)
Pairs Business 1 Business 2
1 rjZ0L-P1SRLJfiK5epnjYg cL2mykvopw-kodSyc-z5jA
2 6H8xfhoZ2IGa3eNiY5FqLA XZbuPXdyA0ZtTu3AzqtQhg
3 rfwJFFzW6xW2qYfJh14OTA G58YATMKnn-M-RUDWg3lxw
4 0QSnurP5Ibor2zepJmEIlw 6-lmL3sC-axuh8y1SPSiqg
(b) Try to compute the above cosine similarity using matrix multiplication. Hints:
Let R be the m × n rating matrix, where m and n are the number of users
and number of businesses, respectively. RT R gives the pairwise dot product of
business vectors. Also notice that the diagonal of RT R gives the self dot product
of each business, which is the square of the L2-norm of each business vector used
1https://www.yelp.com/
1
in the denominator of the cosine similarity! Report the time for computing
the whole pairwise similarity matrix. (5 pts)
(c) Implement the item-based collaborative filtering algorithm based on the above
similarity matrix. Report the RMSE. (5 pts)
(d) Extend the above CF method by incorporating the bias from all the transactions
(bg), bias from each user u (bu = [avg rating of user u]−bg), and the bias from each
business i (bi = [avg rating for business i] − bg), e.g., the recommendation score
of user u to business i is ˆrui = bui +
P
j∈Pu
wij (ruj−buj )
P
j∈Pu
wij
, where bui = bg +bu +bi
. We
choose the top-20 similar businesses rated by the user u as Pu in this assignment.
Report the RMSE. (5 pts)
2. Latent factor model (25 pts): Hold-out 1/8 of the latest reviews in the training
data as validation set to tune the number of latent factors. This results in a data split
of 70% training, 10% validation, and 20% testing.
(a) Implement a latent factor model with number of latent factors k. Use stochastic
gradient descent with fixed learning rate η = 0.01 and regularization hyperparameter
λ1 = λ2 = 0.3. Run SGD for 20 epochs and report RMSE for
k = {8, 16, 32, 64} on both training (i.e., the minimized loss) and validation
sets. (10 pts)
(b) Extend the latent factor model with bias. Use stochastic gradient descent with
fixed learning rate η = 0.01 and regularization hyperparameter λ1 = λ2 = λ3 =
λ4 = 0.3. Run SGD for 20 epochs and report RMSE for k = {8, 16, 32, 64}
on both training and validation sets. (10 pts)
(c) Select the best k for each of the above two versions of latent factor models,
and apply on the test dataset. Report RMSE of the two models on test
dataset. (5 pts)
Note: You could use any python package to validate your implementation, to compute
the pairwise cosine similarity, or to do the stochastic gradient descent. However,
directly calling any package of CF and latent factor models will get zero mark.
2 What to submit?
An answer.pdf file reports the requested values and explanation of each task.
A source code file contains detailed comments.
2

辅导COMPSCI 753编程、辅导python编程、data留学生程序讲解 辅导R语言程序|讲解R语言编程

辅导COMPSCI 753编程、辅导python编程、data留学生程序讲解辅导R语言程序|讲解R语言编程