Don’t be Sentimental!
Due: Monday February 15, 2021 @ 6 a.m. - pushed to GitHub and a release issued.
Introduction
Have you ever read a tweet and thought, “Gee, what a positive outlook!” or “Wow, why so negative, friend?” Can computers make the same determination? They can surely try!
In Machine Learning, the task of assigning a label to a data item is called classification (putting things into different classes or categories). The more specific name for what we’re going to do is sentiment analysis because you’re trying to determine the “sentiment” or attitude based on the words in a tweet. So, Project 1 is to build a sentiment classifier! Aren’t you excited?? ( ← That would be positive sentiment!)
You’ll be given a set of tweets that are already pre-classified as positive or negative based on their content. You’ll analyze the word frequency patterns among all of those tweets to develop a classification algorithm. Using your classification algorithm, you’ll then classify another set of tweets to determine if they are positive or negative.
Building a Classifier
The goal in classification is to assign a class label to each element of a data set. Of course, we would want this done with the highest accuracy possible. For this project, we will have only two classes or labels: positive sentiment and negative sentiment. At a high level, the process to build a classifier (and many other machine learning models) is this:
1.Train
○Use a training data set with pre-classified members.
○Assume you have 10 tweets and each is pre-classified with + or - sentiment. How might you go about analyzing the words in the tweets to find words more commonly associated with negative sentiment and words more commonly associated with positive sentiment?
○The result of the training step will be two lists of words: one list of positive words and one list of negative words (one possible way to build them is sketched after this list).
2.Test
○Now, you give your classifier un-labeled tweets from a testing data set and ask it to output the class it determines.
○But behind the scenes, you already know what class each tweet actually belongs to.
○Compare the predicted class of each tweet (the output of your classifier) against its actual class to determine the accuracy. In other words, how often was your classifier correct?
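One possible (and intentionally simple) approach to the training step is sketched below in C++: count how often each word appears in positive versus negative tweets, then keep the words that lean toward one class. Every name here (TrainExample, tokenize, buildWordLists) and the "appears more often in one class" rule are illustrative assumptions, not requirements of the assignment.

// Sketch only: build positive/negative word lists from per-class word counts.
#include <cctype>
#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

struct TrainExample {
    int sentiment;     // 0 = negative, 4 = positive (matches the data files)
    std::string text;  // the tweet text
};

// Very naive tokenizer: lowercase alphabetic words split on whitespace.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> words;
    std::istringstream in(text);
    std::string raw;
    while (in >> raw) {
        std::string cleaned;
        for (char c : raw)
            if (std::isalpha(static_cast<unsigned char>(c)))
                cleaned += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        if (!cleaned.empty()) words.push_back(cleaned);
    }
    return words;
}

// Count each word per class, then keep words seen more often in one class
// than the other. The "more often" rule is an arbitrary placeholder.
void buildWordLists(const std::vector<TrainExample>& training,
                    std::set<std::string>& positiveWords,
                    std::set<std::string>& negativeWords) {
    std::map<std::string, int> posCount, negCount;
    for (const TrainExample& ex : training)
        for (const std::string& w : tokenize(ex.text))
            (ex.sentiment == 4 ? posCount : negCount)[w]++;

    auto countIn = [](const std::map<std::string, int>& m, const std::string& w) {
        auto it = m.find(w);
        return it == m.end() ? 0 : it->second;
    };
    for (const auto& [word, n] : posCount)
        if (n > countIn(negCount, word)) positiveWords.insert(word);
    for (const auto& [word, n] : negCount)
        if (n > countIn(posCount, word)) negativeWords.insert(word);
}

int main() {
    // Two tiny training examples taken from the handout's sample lines.
    std::vector<TrainExample> training = { {4, "Beat TCU"}, {0, "Beat SMU"} };
    std::set<std::string> pos, neg;
    buildWordLists(training, pos, neg);
    for (const auto& w : pos) std::cout << "positive: " << w << '\n';
    for (const auto& w : neg) std::cout << "negative: " << w << '\n';
}

In this toy run, "beat" appears once in each class, so it lands in neither list, while "tcu" and "smu" end up in the positive and negative lists respectively. With the real data set you will want a better tokenizer and a more thoughtful rule; designing that is part of the project.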
The Real Data
The data set we will be using in this project comes from real tweets posted around 11-12 years ago. The original data was retrieved from Kaggle at https://www.kaggle.com/kazanova/sentiment140. I’ve pre-processed it into the file format we are using for this project. For more information, please see
Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.
Input Files
There will be 3 different input files:
1.Training Data
2.Testing Data (no sentiment column)
3.Testing sentiment (tweet ID and sentiment for the testing data, for you to compare against).
The training data set is formatted as follows:
●A comma-separated-values (CSV) file containing a list of tweets, each one on a separate line. Each line of the data files includes the following fields:
○the sentiment value (negative = 0, positive = 4)
○the tweet ID
○the date the tweet was posted
○the query status (you will ignore this column)
○the Twitter username that posted the tweet
○the text of the tweet itself
The testing data set is broken into two files:
●A CSV file formatted just like the training data, EXCEPT with no sentiment column
●A CSV file containing the tweet ID and sentiment for the testing dataset (so you can compare your sentiment predictions against the actual sentiment ground truth)
Below are two example tweets from the training dataset (a sketch of parsing such lines follows these examples):
4,1467811594,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,peruna_pony,"Beat TCU"
0,1467811595,Mon Apr 06 22:22:03 PDT 2009,NO_QUERY,the_frog,"Beat SMU"
Here is a tweet from the testing dataset:
1467811596,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,peruna_pony,"SMU > TCU"
The sentiment file for that testing tweet would be:
4, 1467811596
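One way to read a line of these files is sketched below, under the assumption that only the final, quoted text field can contain commas (which holds for the examples above). The Tweet struct and the parseTrainingLine name are illustrative, not required.

// Sketch: parse one line of the training CSV.
#include <iostream>
#include <sstream>
#include <string>

struct Tweet {
    int sentiment = 0;   // 0 = negative, 4 = positive (absent in the testing file)
    std::string id;
    std::string date;
    std::string query;   // ignored per the spec
    std::string user;
    std::string text;
};

Tweet parseTrainingLine(const std::string& line) {
    std::istringstream ss(line);
    std::string sentimentField;
    Tweet t;
    std::getline(ss, sentimentField, ',');
    t.sentiment = std::stoi(sentimentField);
    std::getline(ss, t.id, ',');
    std::getline(ss, t.date, ',');
    std::getline(ss, t.query, ',');
    std::getline(ss, t.user, ',');
    std::getline(ss, t.text);   // everything left is the tweet text
    // Strip the surrounding quotes shown in the examples, if present.
    if (t.text.size() >= 2 && t.text.front() == '"' && t.text.back() == '"')
        t.text = t.text.substr(1, t.text.size() - 2);
    return t;
}

int main() {
    Tweet t = parseTrainingLine(
        "4,1467811594,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,peruna_pony,\"Beat TCU\"");
    std::cout << t.sentiment << " | " << t.user << " | " << t.text << '\n';
}

For the testing data file you would skip the sentiment field, since that column is not present there.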
Output Files
There will be one output file organized as follows:
●The first line of the output file will contain the accuracy, a single floating point number with exactly 3 decimal places of precision. See the section “How good is your classifier” below to understand Accuracy.
●The remaining lines of the file will contain the Tweet IDs of the tweets from the testing data set that your algorithm incorrectly classified.
Example of the testing data tweet classifications file (these tweet IDs are fake; a sketch of writing such a file follows the example):
0.500
2323232323
1132553423
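A minimal sketch of producing this file is below. It assumes the misclassified tweet IDs have already been collected into a vector; the filename "results.txt" and all variable names are placeholders. std::fixed together with std::setprecision(3) prints exactly three decimal places, as required.

// Sketch: write the accuracy followed by the misclassified tweet IDs.
#include <fstream>
#include <iomanip>
#include <string>
#include <vector>

void writeResults(const std::string& outputFilename,
                  const std::vector<std::string>& misclassifiedIds,
                  std::size_t totalTweets) {
    std::ofstream out(outputFilename);
    double accuracy = totalTweets > 0
        ? static_cast<double>(totalTweets - misclassifiedIds.size()) / totalTweets
        : 0.0;
    out << std::fixed << std::setprecision(3) << accuracy << '\n';  // e.g. 0.500
    for (const std::string& id : misclassifiedIds)
        out << id << '\n';
}

int main() {
    // Reproduces the example above: 2 wrong out of 4 gives 0.500.
    writeResults("results.txt", {"2323232323", "1132553423"}, 4);
}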
Running your Program
Your program will have two modes:
●Running Catch TDD tests
○this will be indicated by passing no arguments to the executable
○this is explained more below.
●Training & Testing the Classifier
○this mode will have 4 command line arguments (a sketch of handling them appears after this list):
■training data set filename - the file with the training tweets
■testing data set filename - tweets that your program will classify
■testing data set sentiment filename - the file with the classes for the testing tweet data
■output file name - see Output Files section above
■Example:
./classifier.out <training data file> <testing data file> <testing sentiment file> <output file>
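A sketch of selecting between the two modes in main() is below. runCatchTests and runClassifier are placeholders for your own code, and the details of handing control to the Catch test runner depend on how you have Catch set up.

// Sketch: pick the mode based on the number of command line arguments.
#include <iostream>
#include <string>

int runCatchTests() {
    // hand control to your Catch test runner here
    return 0;
}

int runClassifier(const std::string& trainingFile, const std::string& testingFile,
                  const std::string& testingSentimentFile, const std::string& outputFile) {
    // train on trainingFile, classify testingFile, score against
    // testingSentimentFile, and write results to outputFile
    return 0;
}

int main(int argc, char** argv) {
    if (argc == 1)   // no arguments: Catch TDD test mode
        return runCatchTests();
    if (argc == 5)   // four arguments, in the order listed above
        return runClassifier(argv[1], argv[2], argv[3], argv[4]);
    std::cerr << "usage: " << argv[0]
              << " [trainingFile testingFile testingSentimentFile outputFile]\n";
    return 1;
}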