Text Processing: Emails For SPAM/HAM Classification
STA141B, Fall 2020
You may not look for or use code that addresses this dataset or use a package that can read an email
message into an R structure. You must implement the computational approach yourself. You can use
tutorials and Stack Overflow for guidance on general questions about text processing.
We are all familiar with SPAM email messages. Ultimately, we would like to be able to use statistics to
classify new email messages as SPAM or HAM (valid mail). There are many statistical techniques we could
use. But before we can use them we need “data”. Each message is an observation. We need to “measure”
variables on each message in a sample of email messages. We also need to know if a message is SPAM or
HAM. We can then train a statistical classifier using these variables to predict if a new message is SPAM or
HAM.
In this assignment, you will create a data.frame from a set of email messages. Each row corresponds to an
email message and will contain 16 or more variables derived from that message. One variable is whether
the message is actually HAM or SPAM, which can be derived from the name of the folder in which the
message is located.
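As a minimal sketch of deriving the SPAM/HAM variable from the folder name, consider the following. The file paths are hypothetical, and the rule that SPAM folders have names starting with "spam" is an assumption to check against the actual folders in the course data.

```r
# Hypothetical file paths; the real folder names in the course data may differ.
paths <- c("easy_ham/00001.abc", "spam/00002.def",
           "hard_ham/00003.ghi", "spam_2/00004.jkl")

# The folder is the last directory component of each path.
folder <- basename(dirname(paths))

# Assumption: any folder whose name starts with "spam" holds SPAM messages.
isSpam <- grepl("^spam", folder)

# One row per message; further derived variables would be added as columns.
messages <- data.frame(file = basename(paths), folder = folder,
                       isSpam = isSpam, stringsAsFactors = FALSE)
```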
The data we use come from the SpamAssassin project. The specific data you are to work with are available
on Canvas (see footnote 1).
Your job is to
• process each email message into a structure (list) that contains the header, the body and any
attachments (each of which will also have a header and a body)
• create an R data frame of “derived” variables that give various measures of the email messages
• explore the data, using plots and numerical summaries, to see which variables help to discriminate
between SPAM and HAM messages
These derived variables might be, for example, the number of recipients to whom the mail was sent, the
percentage of capital words in the body of the text, or whether the message is a reply to another message.
See below for a list of 25 possible variables. You are to write code to compute at least 15 of these, plus
whether the message is SPAM or HAM.
Many of the variables can be computed using similar approaches. For the most part, you should use regular
expressions rather than using strsplit to break the strings into individual characters.
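To illustrate the regular-expression style, here is a small sketch that computes three measures without splitting strings into characters. The subject and body text are made up, the variable names are hypothetical, and "capital words" is interpreted here as words written entirely in upper case, which is only one possible definition.

```r
# Made-up subject and body text, used only for illustration.
subject <- "Re: FREE offer"
body <- "WIN a FREE prize NOW. Reply today."

# Is the subject a reply? Look for "Re:" at the start, ignoring case.
isRe <- grepl("^[[:space:]]*Re:", subject, ignore.case = TRUE)

# Count capital letters by counting matches of a character class.
# gregexpr() returns -1 when there is no match at all.
m <- gregexpr("[[:upper:]]", body)[[1]]
numCaps <- if (m[1] == -1) 0L else length(m)

# Extract words as runs of letters, then test each one.
# Assumption: a "capital word" is a word made up only of upper-case letters.
words <- regmatches(body, gregexpr("[[:alpha:]]+", body))[[1]]
pctCapWords <- 100 * mean(grepl("^[[:upper:]]+$", words))
```

Note that `mean()` of an empty vector is `NaN`, so a body with no words would need its own handling.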
You are very welcome to define new variables that you can compute from each email message that you think
will help to classify it as SPAM or HAM. Clearly state the meaning and specifics of the variable, why you
think it might be useful in classifying HAM or SPAM emails, and then show the code to compute it.
1 https://canvas.ucdavis.edu/files/10090382/download
Once you have this data frame of “derived” variables, explore the relationships among these variables and
especially how they might be used to classify SPAM and HAM messages. In other words, look at frequency
tables and scatterplots of the variables, and color code the points according to whether the message is
SPAM or HAM. Which variables seem to do best at discriminating between SPAM and HAM messages?
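A rough sketch of this kind of exploration is shown next. The data frame and all of its values are invented purely for illustration, and the column names are hypothetical; the real data frame comes from your own processing code.

```r
# Entirely made-up values for three hypothetical derived variables.
derived <- data.frame(
  isSpam        = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  isRe          = c(FALSE, FALSE, TRUE, TRUE, TRUE, FALSE),
  pctCapWords   = c(40, 55, 10, 5, 2, 8),
  numRecipients = c(12, 30, 1, 1, 2, 3)
)

# Frequency table of a logical variable against the class label.
tab <- table(isRe = derived$isRe, isSpam = derived$isSpam)

# Column proportions: the distribution of isRe within HAM and within SPAM.
prop <- prop.table(tab, margin = 2)

# Scatterplot of two numeric variables, colored by class.
plot(derived$numRecipients, derived$pctCapWords,
     col = ifelse(derived$isSpam, "red", "blue"), pch = 19,
     xlab = "Number of recipients", ylab = "Percent capital words")
legend("topright", legend = c("SPAM", "HAM"),
       col = c("red", "blue"), pch = 19)
```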
Your report should be similar in structure to the first assignment in that it has two primary sections:
• describe the computational approach and high- and intermediate-level details of how you implemented
this approach.
• explore the data and interpret them in the context of
Submit both the report (as a PDF document) and the R scripts for creating the data.frame and for exploring
the data, and, importantly, the R files containing functions you write and use in the script(s).
Some functions you might find useful include:
• grep, gsub, gregexpr, substring, nchar, strsplit
• sprintf, paste
• read.dcf
• table, plot, scatter.smooth, hexbin
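As one example of how these functions might be used, read.dcf can read "Key: value" text from a connection. The header lines below are made up, and whether read.dcf copes with every header in the data set is an assumption you would need to verify.

```r
# A short made-up header in "Key: value" form; the "To" field continues
# onto a second line that starts with white space.
headerLines <- c(
  "From: alice@example.com (Alice)",
  "To: bob@example.com,",
  " carol@example.com",
  "Subject: Meeting notes"
)

# read.dcf() accepts a connection, so the lines can be read from memory.
con <- textConnection(headerLines)
hdr <- read.dcf(con)
close(con)

# The result is a character matrix with one column per key.
subjectValue <- hdr[1, "Subject"]

# Collapse any line break left by the continuation line into a single space.
toValue <- gsub("[[:space:]]+", " ", hdr[1, "To"])
```

Real headers can repeat the same key several times; the `all = TRUE` argument of read.dcf gathers repeated fields rather than keeping a single value.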
1 The Anatomy of an E-mail message
Electronic mail, usually called e-mail, consists of simple text messages: pieces of text sent to a recipient
via the internet. An e-mail message consists of two parts, the header and the body. The body of the e-mail
message is separated from the header by a single blank line. When an attachment is added to an e-mail
message, the attachment is included in the body of the message. Even with attachments, e-mail messages
are still only text messages.
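The header/body separation described above can be sketched as follows. The message is made up, and splitMessage is a hypothetical helper name, not a prescribed design. It splits only at the first blank line, because the body may contain blank lines of its own.

```r
# A made-up message: header lines, one blank line, then the body.
msg <- c(
  "From: alice@example.com",
  "Subject: Hello",
  "",
  "Hi Bob,",
  "",
  "See you tomorrow."
)

# Hypothetical helper: split a message at its FIRST blank line.
splitMessage <- function(lines) {
  blank <- which(lines == "")
  if (length(blank) == 0L) {
    # No blank line found: treat everything as header.
    return(list(header = lines, body = character(0)))
  }
  brk <- blank[1]
  list(header = lines[seq_len(brk - 1L)],  # lines before the blank line
       body   = lines[-seq_len(brk)])      # lines after the blank line
}

parts <- splitMessage(msg)
```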
1.1 The E-mail Header
The header contains information about the message such as the sender’s address, the recipient’s address, and
the date of transmission. This information is relayed in a special format that consists of KEY:VALUE pairs.
Below is a sample header from a message found on the SpamAssassin website.
Return-Path: whisper@oz.net
Delivery-Date: Fri Sep 6 20:53:36 2002
From: whisper@oz.net (David LeBlanc)
Date: Fri, 6 Sep 2002 12:53:36 -0700
Subject: [Spambayes] Deployment
In-Reply-To: