The goal of the assignment is to get familiar with different types of recommender
systems. Specifically, we are going to build a system that can recommend Yelp
businesses to users.
Dataset
The dataset that we will be using contains information about Yelp businesses. More
precisely, for 14397 active users and 1000 popular businesses on Yelp, we know whether a
given user visited and rated a given business (for example, a restaurant).
The folder contains:
• user-business.csv - This is the ratings matrix R, where each row corresponds to a
user and each column corresponds to a business. R_ij = 1 if user i has visited and rated
business j; otherwise R_ij = 0. (To simplify the question, we ignore the exact ratings.)
The columns are separated by a space.
• business.csv - This is a file containing the names of the businesses, in the same order
as the columns of R.
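As a minimal sketch of loading the two files with NumPy: it assumes user-business.csv has no header row and business.csv holds one business name per line (both are assumptions about the file layout, so check the actual files). The helper names load_ratings and load_business_names are hypothetical, and the demo writes tiny stand-in files rather than using the real data.

```python
import os
import tempfile

import numpy as np


def load_ratings(path):
    """Load the space-separated 0/1 ratings matrix R (users x businesses)."""
    # np.loadtxt splits on whitespace by default, matching "separated by a space".
    return np.loadtxt(path, dtype=int, ndmin=2)


def load_business_names(path):
    """Read business names, assuming one name per line in column order of R."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Demo on tiny stand-in files (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    r_path = os.path.join(tmp, "user-business.csv")
    b_path = os.path.join(tmp, "business.csv")
    with open(r_path, "w", encoding="utf-8") as f:
        f.write("0 1 0\n1 1 0\n")
    with open(b_path, "w", encoding="utf-8") as f:
        f.write("Cafe A\nCafe B\nCafe C\n")
    R_demo = load_ratings(r_path)
    names_demo = load_business_names(b_path)

print(R_demo.shape, names_demo)
```

With the real files, the number of names should equal the number of columns of R; a mismatch usually means a header row or a different delimiter.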
Overview
In this assignment we are going to implement three types of recommender systems,
namely:
● User-user recommender system
● Item-item recommender system
● Latent factor model recommender system
We are then going to compare the results of these systems for the 4th user (index
starting from 1) of the dataset. Let’s call him Alex. In order to do so, we have erased the
CS/INFO 5304 Assignment 2
first 100 entries of Alex’s row in the matrix, and replaced them by 0s. This means that
we don’t know which of the first 100 businesses Alex has visited. Based on Alex’s
behavior on the other businesses, you need to give Alex recommendations on the first
100 businesses. We will then see whether our recommendations match the businesses Alex
actually visited.
To help you verify the output of the recommenders below, the erased entries that were 1s
correspond to the following businesses:
● Piece of Cake
● Papi's Cuban & Caribbean Grill
● Loca Luna
● Farm Burger
● Little Rey
● Seven Lamps
● Vatica Indian Cuisine
● Shake Shack
● Truva Turkish Kitchen
● Yoi Yoi Japanese Steakhouse & Sushi
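A small sketch of how this list could be used to check a recommender's output: it counts how many recommended names appear among Alex's erased visits. The helper name count_hits and the example recommendation list are hypothetical, and the check assumes the names in business.csv match the spellings above exactly.

```python
# The ten businesses listed above as Alex's erased 1s.
ALEX_ERASED_VISITS = {
    "Piece of Cake",
    "Papi's Cuban & Caribbean Grill",
    "Loca Luna",
    "Farm Burger",
    "Little Rey",
    "Seven Lamps",
    "Vatica Indian Cuisine",
    "Shake Shack",
    "Truva Turkish Kitchen",
    "Yoi Yoi Japanese Steakhouse & Sushi",
}


def count_hits(recommended):
    """Return the recommended names that Alex actually visited, in order."""
    return [name for name in recommended if name in ALEX_ERASED_VISITS]


# Made-up top-5 list, for illustration only (not a real recommender output).
example_top5 = [
    "Shake Shack",
    "Hypothetical Diner",
    "Farm Burger",
    "Loca Luna",
    "Another Made-Up Place",
]
example_hits = count_hits(example_top5)
print(len(example_hits), "of", len(example_top5), "recommendations were visited")
```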
Part A: user – user recommender system [10 points]
In a user-user recommender system, you need to find users who have visited and rated
similar businesses. Then, among these users, you can recommend the most-visited businesses.
For all businesses b, compute
$r_{\text{Alex},b} = \sum_{x \in \text{users}} \text{cos-sim}(x, \text{Alex}) \cdot R_{x,b}$
where cos-sim(x, Alex) is the cosine similarity between user x and Alex (computed excluding
the entries of the first 100 businesses), and R is the ratings matrix. In the above equation you
first find the similarity between each user and Alex, and then multiply it by that user's rating
for each business. So the businesses with a higher $r_{\text{Alex},b}$ are the businesses that are
popular among the users similar to Alex.
Let S denote the set of the first 100 businesses (the first 100 columns of the matrix).
From all the businesses in S, which are the five that have the highest similarity scores
($r_{\text{Alex},b}$) for Alex? What are their similarity scores? In case of ties between two
businesses, choose the one with a smaller index. Do not write the index of the
businesses; write their names using the file business.csv.
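The Part A score can be sketched with NumPy as follows. This is one possible reading of the formula, not a reference solution: the function names user_user_scores and top_k are hypothetical, users with no visible ratings are given similarity 0 (an assumption), and the demo uses a made-up 4 x 5 matrix with the first 2 columns treated as erased instead of the real data.

```python
import numpy as np


def user_user_scores(R, user, hidden):
    """Score the first `hidden` businesses for `user`:
    r[user, b] = sum over users x of cos-sim(x, user) * R[x, b]."""
    R = np.asarray(R, dtype=float)
    visible = R[:, hidden:]                 # similarity ignores the erased columns
    norms = np.linalg.norm(visible, axis=1)
    norms[norms == 0] = 1.0                 # assumption: empty rows get similarity 0
    sims = (visible @ visible[user]) / (norms * norms[user])
    # The target user's own row is all zeros in the erased columns,
    # so including it in the sum contributes nothing.
    return sims @ R[:, :hidden]


def top_k(scores, k):
    """Indices of the k largest scores; ties go to the smaller index."""
    return np.argsort(-np.asarray(scores), kind="stable")[:k]


# Made-up toy matrix: 4 users, 5 businesses, first 2 columns erased for user 0.
R_toy = np.array([
    [0, 0, 1, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
])
toy_scores = user_user_scores(R_toy, user=0, hidden=2)
toy_best = top_k(toy_scores, 1)
print(toy_scores, toy_best)
```

For the assignment itself, Alex would be row index 3 (the 4th user, counting from 1) with hidden=100, and the returned indices would be mapped to names from business.csv.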
Part B: item – item recommender system [10 points]
In an item-item recommender system, you need to find items that have similar ratings
and recommend them to Alex. For all businesses b, compute
$r_{\text{Alex},b} = \sum_{x \in \text{businesses}} \text{cos-sim}(x, b) \cdot R_{\text{Alex},x}$
where R is the ratings matrix and cos-sim(x, b) is the cosine similarity of each pair of
businesses (computed excluding Alex's entries). Here you find the similarity between businesses
and then multiply it by Alex’s ratings for those businesses. So the businesses that are similar to
the businesses already visited by Alex will have a higher score $r_{\text{Alex},b}$.
From all the businesses in S (first 100 businesses), which are the five that have the
highest similarity scores for Alex? In case of ties between two businesses, choose the
one with a smaller index. Again, hand in the names of the businesses and their
similarity scores.
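A corresponding sketch for the Part B score, under the same caveats: item_item_scores is a hypothetical name, "excluding entries of Alex" is interpreted here as dropping Alex's row before computing business-business similarities, businesses with no ratings get similarity 0 (an assumption), and the toy matrix is made up.

```python
import numpy as np


def item_item_scores(R, user, hidden):
    """Score the first `hidden` businesses for `user`:
    r[user, b] = sum over businesses x of cos-sim(x, b) * R[user, x]."""
    R = np.asarray(R, dtype=float)
    others = np.delete(R, user, axis=0)     # interpretation: drop the user's own row
    norms = np.linalg.norm(others, axis=0)
    norms[norms == 0] = 1.0                 # assumption: unrated businesses get similarity 0
    unit = others / norms                   # unit-length business columns
    sims = unit[:, :hidden].T @ unit        # shape (hidden, n_businesses)
    return sims @ R[user]


# Made-up toy matrix: 4 users, 5 businesses, first 2 columns erased for user 0.
R_toy_items = np.array([
    [0, 0, 1, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
])
toy_item_scores = item_item_scores(R_toy_items, user=0, hidden=2)
print(toy_item_scores)
```

Because the user's erased entries are 0, only the businesses the user is known to have visited contribute to each sum.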
Part C: Latent hidden model recommender system [15 points]
The latent model recommender system is the most popular type of recommender system in
the market today. Here we perform a matrix factorization of the ratings matrix R into two
matrices U and V, where U is considered the user features matrix and V is the business
features matrix. Note that the features are ‘hidden’ and need not be understandable to
users, hence the name latent hidden model. (Refer to the slides for more information.)
The latent model can be implemented by performing a singular value decomposition
(SVD) that factors the matrix into three matrices
$R = U \Sigma V^T$
where R is the user ratings matrix, U is the user “features” matrix, Σ is the diagonal matrix of
singular values (essentially weights), and $V^T$ is the business “features” matrix. U and $V^T$ are
orthogonal, and represent different things. U represents how much users “like” each
feature and $V^T$ represents how relevant each feature is to each business.
To get the lower rank approximation, we take these matrices and keep only the top k
features (k factors), which we think of as the k most important underlying taste and
preference vectors.
With k set to 10, perform SVD to identify the U and V matrices. You can then multiply
the truncated matrices to estimate the following
$R^* = U \Sigma V^T$
From the $R^*$ matrix, select the top 5 businesses for Alex in S (first 100 businesses). In
case of ties between two businesses, choose the one with a smaller index. Again, hand
in the names of the businesses and their scores in $R^*$.
Hint: You can use the SVD implementation in the surprise package, NumPy, or SciPy.
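One way to sketch the rank-k approximation is with numpy.linalg.svd, keeping only the k largest singular values. The function names low_rank_approx and svd_scores are hypothetical, and the demo matrices are made up; this is an illustration of truncated SVD, not necessarily the approach the course staff used.

```python
import numpy as np


def low_rank_approx(R, k):
    """Return R* = U_k Sigma_k V_k^T, the rank-k truncated SVD reconstruction."""
    U, s, Vt = np.linalg.svd(np.asarray(R, dtype=float), full_matrices=False)
    # NumPy returns singular values in descending order, so the first k are the largest.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


def svd_scores(R, user, hidden, k):
    """Scores in R* for `user` over the first `hidden` businesses."""
    return low_rank_approx(R, k)[user, :hidden]


# A rank-1 matrix is reproduced exactly by its rank-1 approximation.
R_rank1 = np.outer([1.0, 2.0, 3.0], [1.0, 0.0, 1.0])
approx_rank1 = low_rank_approx(R_rank1, 1)

# Truncating diag(3, 1) to rank 1 keeps only the larger singular value.
approx_diag = low_rank_approx([[3.0, 0.0], [0.0, 1.0]], 1)

print(approx_rank1)
print(approx_diag)
```

For the assignment, the call would use k=10, Alex's row index, and hidden=100, followed by the same tie-breaking rule (smaller index first) when picking the top 5.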
Part D: bonus [10 points]
Your goal is to build a good recommendation system for Yelp with an ensemble of
predictors. You can use any individual predictor and any method to combine them (for
example, a linear weighted combination or a vote).
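As an illustration of a linear weighted combination, the sketch below rescales each predictor's scores to [0, 1], averages them with weights, and thresholds the result into 0/1 predictions. Everything here is an assumption made for the example: the names min_max and ensemble_predict, the min-max rescaling, the weights, and the 0.5 threshold. In practice the weights and threshold would be tuned on a validation set.

```python
import numpy as np


def min_max(scores):
    """Rescale scores to [0, 1]; a constant vector maps to all zeros."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span


def ensemble_predict(score_list, weights, threshold=0.5):
    """Weighted average of rescaled scores, thresholded into 0/1 predictions."""
    combined = sum(w * min_max(s) for w, s in zip(weights, score_list))
    combined = combined / sum(weights)
    return (combined >= threshold).astype(int)


# Made-up scores from two hypothetical predictors for three businesses.
scores_a = [0.0, 2.0, 4.0]
scores_b = [10.0, 0.0, 10.0]
demo_pred = ensemble_predict([scores_a, scores_b], weights=[1.0, 1.0])
print(demo_pred)
```

Rescaling first keeps a predictor with a large numeric range from dominating the average; a majority vote over per-predictor 0/1 decisions would be an alternative design.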
- Test set:
  - 5 new users x the same 1000 businesses. Their records for the first 100
    businesses are also erased.
- Submission:
  - The predictions for the erased records, as 1s and 0s.
  - Submission format: sample_bonus_submission.csv (5 rows, 100
    columns, separator = comma, integers). Please make sure your raw text
    exactly matches the sample format. Otherwise you might get 0 points,
    since we run auto grading.
- Evaluation metrics:
  - Since the test set is sparse (most entries are 0s), we use the F1 score as
    our evaluation metric.
  - You can split a validation set out of the training set (for example, users 5-9)
    if you want to test your model.
- Code and write-up:
  - Write your code for the test set in a separate Jupyter notebook. At the top
    of the notebook, add brief write-ups to explain each predictor you used
    and how you combined them.
- Your bonus points = $\max\left(10 \cdot \min\left(\frac{\text{yourF1} - 0.12}{0.6 - 0.12},\ 1\right),\ 0\right)$
  - This means that you will get some points as long as you attempt! For
    reference, a random guess (all entries predicted as 1s) scores 0.12, and 0.6 is pretty accurate.
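To make the scoring rule concrete, here is a small sketch of a binary F1 computation and the bonus-point formula. The function names are hypothetical, and pooling all 5 x 100 entries into a single F1 score is an assumption; the handout does not say whether the auto grader pools entries or averages per user.

```python
import numpy as np


def f1_score_binary(y_true, y_pred):
    """F1 for 0/1 labels, pooling all entries (assumption about the grader)."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def bonus_points(f1):
    """Bonus points = max(10 * min((F1 - 0.12) / (0.6 - 0.12), 1), 0)."""
    return max(10 * min((f1 - 0.12) / (0.6 - 0.12), 1), 0)


# Made-up labels: one true positive, one false positive, one false negative.
demo_f1 = f1_score_binary([1, 0, 1, 0], [1, 1, 0, 0])
print(demo_f1, bonus_points(demo_f1))
```

Under this formula, an F1 of 0.12 or below earns 0 points and an F1 of 0.6 or above earns the full 10.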
Turn in:
#A2
a) A Jupyter notebook a2.ipynb with the code and answers
(if you work in a study group, write the members' names at the top to avoid any
trouble in the plagiarism check.)
b) An a2.py exported from your .ipynb
#A2-bonus
c) bonus_submission.csv
d) bonus.ipynb
e) A bonus.py exported from your .ipynb