ISE 529 Predictive Analytics
Homework 4
Submit on Oct 18 by 4 p.m. PST
The file spam.csv contains data on 4601 e-mails; the variable type identifies whether each e-mail is spam or
non-spam. The data come from https://archive.ics.uci.edu/dataset/94/spambase
There are 57 predictor variables indicating the frequency of certain words and characters in the e-mail.
These variables are expected to be useful for predicting whether an e-mail is spam.
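As a rough illustration of the data layout described above, the sketch below separates the predictors from the type variable. The column names and the label values "spam" and "nonspam" are assumptions made for this example, and the tiny in-memory table only stands in for spam.csv.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for spam.csv. Assumption: the real file has 57
# predictor columns plus a `type` column; the names and values here are
# hypothetical.
csv_text = """freq_free,freq_money,freq_exclaim,type
0.00,0.64,0.778,spam
0.21,0.28,0.372,spam
0.00,0.00,0.000,nonspam
0.06,0.00,0.135,nonspam
"""
df = pd.read_csv(io.StringIO(csv_text))  # with the real data: pd.read_csv("spam.csv")

# Predictors are every column except `type`; the target is 1 for spam, 0 otherwise.
X = df.drop(columns="type")
y = (df["type"] == "spam").astype(int)

print(X.shape, y.value_counts().to_dict())
```

If the real file encodes the labels differently, only the comparison on the `type` column would need to change.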
For questions 1 and 2, split the dataset into a non-test set (75%) and a test set (25%). Use the non-test
data for k-fold cross validation and the test set to compute the test accuracy rate. For questions that ask
you to scale the data, use StandardScaler (from sklearn.preprocessing import StandardScaler). Use
random_state=1 wherever a random seed is needed.
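One possible way to set up the split and the scaling described above is sketched here. The data are a synthetic stand-in generated with make_classification (an assumption, used only so the example runs without spam.csv), and fitting the scaler on the non-test rows only is a design choice of this sketch, not a requirement stated in the assignment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for spam.csv (assumption): 4601 rows, 57 predictors.
X, y = make_classification(n_samples=4601, n_features=57, random_state=1)

# 75% non-test / 25% test split, using the seed named in the assignment.
X_nontest, X_test, y_nontest, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Fit the scaler on the non-test rows only, then apply it to both parts,
# so that no information from the test set enters the scaling.
scaler = StandardScaler().fit(X_nontest)
X_nontest_scaled = scaler.transform(X_nontest)
X_test_scaled = scaler.transform(X_test)

print(X_nontest.shape, X_test.shape)
```

After scaling, each non-test column has mean close to 0 and standard deviation close to 1; the test columns are only approximately standardized because they reuse the non-test statistics.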
1. (20 pts.) Use 5-fold cross validation to build a KNN model to predict whether an e-mail from the test
set is spam. Search for the best value of the hyperparameter K (the number of nearest neighbors). Do not
scale the predictors. Plot the validation accuracy rates for K = 1, 2, ..., 20. Report the best K value and
the test accuracy rate.
2. (20 pts.) Use 5-fold cross validation to build a KNN model to predict whether an e-mail from the test
set is spam. Use StandardScaler to scale the predictors. Search for the best value of the hyperparameter
K (the number of nearest neighbors). Plot the validation accuracy rates for K = 1, 2, ..., 20. Report the
best K value and the test accuracy rate.
3. (15 pts.) Use holdout cross validation to build a logistic regression model to predict whether an e-mail
from the test set is spam. Split the dataset into a train set (75%) and a test set (25%). Use
random_state=1. Do not scale the predictors. Report the test accuracy rate.
4. (15 pts.) Use holdout cross validation to build a logistic regression model to predict whether an e-mail
from the test set is spam. Split the dataset into a train set (75%) and a test set (25%). Use
random_state=1. Use StandardScaler to scale the predictors. Report the test accuracy rate.
5. (15 pts.) Use 5-fold cross validation to build a logistic regression model to predict whether an e-mail
from the test set is spam. Do not scale the predictors. Report the test accuracy rate.
6. (15 pts.) Use 5-fold cross validation to build a logistic regression model to predict whether an e-mail
from the test set is spam. Use StandardScaler to scale the predictors. Report the test accuracy rate.
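The workflow the questions above describe (5-fold cross validation on the non-test data, then a single evaluation on the test set) could be sketched roughly as follows. This is one possible arrangement and not a model answer: the data are a synthetic stand-in from make_classification, the use of Pipeline and GridSearchCV is a choice of this sketch, and max_iter=1000 is an assumption added to help the logistic regression solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for spam.csv (assumption); smaller than the real data
# so the sketch runs quickly.
X, y = make_classification(n_samples=1000, n_features=57, random_state=1)
X_nontest, X_test, y_nontest, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# KNN with scaling inside a pipeline, so each CV fold fits its own scaler.
# For the unscaled variant, drop the "scale" step.
knn = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
k_values = list(range(1, 21))
search = GridSearchCV(knn, {"knn__n_neighbors": k_values}, cv=5, scoring="accuracy")
search.fit(X_nontest, y_nontest)

val_acc = search.cv_results_["mean_test_score"]  # one mean accuracy per K
best_k = search.best_params_["knn__n_neighbors"]
knn_test_acc = search.score(X_test, y_test)  # best model, refit on all non-test rows

# Logistic regression: 5-fold validation accuracy on the non-test rows,
# then a final fit and a single evaluation on the test set.
logit = Pipeline(
    [("scale", StandardScaler()), ("logit", LogisticRegression(max_iter=1000))]
)
logit_cv_acc = cross_val_score(logit, X_nontest, y_nontest, cv=5, scoring="accuracy")
logit_test_acc = logit.fit(X_nontest, y_nontest).score(X_test, y_test)

print("best K:", best_k, "KNN test accuracy:", round(knn_test_acc, 3))
print("logistic CV accuracy:", round(logit_cv_acc.mean(), 3),
      "test accuracy:", round(logit_test_acc, 3))
```

The array val_acc holds the 20 mean validation accuracies in the order of k_values, so it can be plotted against K directly. Keeping the scaler inside the pipeline means it is refit on the training part of every fold, which avoids using validation rows when computing the scaling statistics.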
Submit your report online as a PDF file (no screen captures) with your name and USC ID. The report must
use letter-size pages in portrait orientation (not landscape). Incomplete or truncated Python commands
are not acceptable. Only one submission per student.