COMP9414: Assignment 2: Rating Prediction

COMP9414: Artificial Intelligence

Assignment 2: Rating Prediction

Value: 25%

This assignment is inspired by a typical real-life scenario. Imagine you have been hired as a Data

Scientist by a major e-commerce retailer. Your job is to analyse customer reviews to determine

whether you can predict the ratings of new products so they can be promoted on your website.

For this assignment, you will be given a collection of Amazon customer reviews. Each review

consists of a short text (a few sentences) and one of five ratings: a number from 1 to 5. You are

required to evaluate various supervised machine learning methods using a variety of features and

settings to determine what methods work best for rating prediction in this domain (these features

could then be used to recommend items to users based on their interests).

The assignment has two components: programming to produce a collection of models for rating

prediction, and a report to evaluate the effectiveness of the models. The programming part involves

development of Python code for data preprocessing of reviews and experimentation with methods

using NLP and machine learning toolkits. The report involves evaluating and comparing the

models using various metrics.

You will use the NLTK toolkit for basic language preprocessing, and scikit-learn for feature construction and evaluating the machine learning models. You will be given an example of how to use

NLTK and scikit-learn to define the machine learning methods (example.py), and an example of

how to plot metrics in a graph (plot.py).

Data and Methods

A training dataset is a .tsv (tab-separated values) file containing a number of reviews, with one

review per line, and linebreaks within reviews removed. Each line of the .tsv file has three fields:

instance number, text and rating (a number from 1 to 5). A test dataset is a .tsv file in the same

format as a training dataset except that your code should ignore the rating field. Training and test

datasets can be drawn from the supplied file reviews.tsv (see below). For evaluation of the models,

we will use one 80–20 split of this file.
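
The file format and split described above can be sketched as follows. This is a minimal illustration, not the course's example.py: the function names `read_reviews` and `split_80_20` are hypothetical, the sample rows are invented, and taking the first 80% of lines as training data is an assumption, since the specification does not say which lines form each part.

```python
import csv
import io


def read_reviews(stream):
    """Parse lines of the form: instance number <TAB> text <TAB> rating."""
    # QUOTE_NONE keeps quotation marks inside reviews as ordinary characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        if len(row) != 3:
            continue  # skip malformed lines (assumption: they may be ignored)
        instance_id, text, rating = row
        rows.append((instance_id, text, int(rating)))
    return rows


def split_80_20(rows):
    """Assumed split: first 80% of rows for training, the rest for testing."""
    cut = int(len(rows) * 0.8)
    return rows[:cut], rows[cut:]


# Invented sample data standing in for reviews.tsv.
sample = (
    "1\tGreat product, works well.\t5\n"
    "2\tBroke after a week.\t1\n"
    "3\tDoes the job, nothing special.\t3\n"
    "4\tBetter than expected for the price.\t4\n"
    "5\tNot what was pictured.\t2\n"
)

rows = read_reviews(io.StringIO(sample))
train, test = split_80_20(rows)
print(len(train), len(test))
```

For a test dataset, the third field would still be present in each line, but a classifier should not read it when predicting.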

For all models, consider a review to be a collection of words, where a word is a string of at least two

letters, numbers or the symbols / (slash), - (hyphen), $ or %, delimited by a space, after replacing

two successive hyphens --, the tilde symbol ~ and any ellipsis (three or more dots ...) by a space,

then removing tags (minimal text spans between < and > inclusive) and all other characters.

Two characters is the default minimum word length for CountVectorizer in scikit-learn. Note

that deleting “junk” characters may create longer words that were previously separated by those

characters, for example after removing tags, commas and full stops.
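
One way to read the word definition above is as three substitutions followed by a length filter. The sketch below is an interpretation rather than a reference implementation: the names `preprocess` and `tokenize` are hypothetical, and it assumes "letters" means ASCII letters and that case is left unchanged.

```python
import re


def preprocess(text):
    # 1. Replace "--", "~" and any run of three or more dots by a space.
    text = re.sub(r"--|~|\.{3,}", " ", text)
    # 2. Remove tags: the minimal span from "<" to the next ">" inclusive.
    text = re.sub(r"<.*?>", "", text)
    # 3. Remove everything except letters, digits, / - $ % and whitespace
    #    (assumption: ASCII letters only).
    text = re.sub(r"[^A-Za-z0-9/$%\s-]", "", text)
    return text


def tokenize(text):
    # A word is a space-delimited string of at least two allowed characters.
    return [word for word in preprocess(text).split() if len(word) >= 2]


review = "Great<br />value...I paid $5--well, 100% worth it."
print(tokenize(review))
# ['Greatvalue', 'paid', '$5', 'well', '100%', 'worth', 'it']
```

The output shows the effect noted above: removing the tag joins "Great" and "value" into a single longer word, while the one-letter word "I" is dropped by the length filter.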

Use the supervised learning methods discussed in the lectures: Decision Trees (DT), Bernoulli

Naive Bayes (BNB) and Multinomial Naive Bayes (MNB). Do not code these methods: instead

use the implementations from scikit-learn. Read the scikit-learn documentation on Decision Trees [1]

and Naive Bayes [2], and the linked pages describing the parameters of the methods.

[1] https://scikit-learn.org/stable/modules/tree.html

[2] https://scikit-learn.org/stable/modules/naive_bayes.html

Look at example.py to see how to use CountVectorizer and train and test the machine learning

algorithms, including how to generate metrics for the models developed, and plot.py to see how

to plot these metrics on a graph for inclusion in your report.
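
Since example.py is not reproduced here, the following is only a rough sketch of the kind of workflow it may demonstrate: building a bag-of-words matrix with CountVectorizer, then fitting and scoring the three scikit-learn classifiers. The tiny dataset is invented, all parameters are scikit-learn defaults apart from a fixed `random_state`, and accuracy is used as one possible metric; none of these choices come from the assignment itself.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Invented toy data; real experiments would use reviews.tsv.
train_text = [
    "great product works well",
    "love it great value",
    "broke after a week",
    "terrible waste of money",
    "okay nothing special",
    "average does the job",
]
train_y = [5, 5, 1, 1, 3, 3]
test_text = ["great value", "waste of money"]
test_y = [5, 1]

# Fit the vocabulary on training text only, then reuse it for the test text.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)

models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "BNB": BernoulliNB(),
    "MNB": MultinomialNB(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, train_y)
    predicted = model.predict(X_test)
    results[name] = accuracy_score(test_y, predicted)
    print(name, results[name])
```

With so few examples the scores mean nothing; the point is only the shape of the pipeline, in which the same vectorizer is applied to both training and test text.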

The programming part of the assignment is to produce DT, BNB and MNB models and your own

model for rating prediction in Python programs that can be called from the command line to train

and classify reviews read from correctly formatted .tsv files. The report part of the assignment

is to analyse these models using a variety of parameters, preprocessing tools and scenarios.

Programming

You will submit four Python programs: (i) DT_classifier.py, (ii) BNB_classifier.py, (iii)

MNB_classifier.py and (iv) my_classifier.py. The first three of these are standard models as

defined below. The last is a model that you develop following experimentation with the data. Use

the given dataset reviews.tsv containing 2500 labelled reviews to develop and test the models,

as described below.

These programs, when called from the command line, will read from standard input (not a hard-coded file reviews.tsv), and should print to standard output (not a hard-coded file output.txt),