CS434 Machine Learning and Data Mining – Homework 2
Linear Models for Regression and Classification
Overview and Objectives. In this homework, we are going to work through some exercises on alternative losses for linear regression, practice recall and precision calculations, and implement a logistic regression model to predict whether a tumor is malignant or benign. Substantial skeleton code is provided with this assignment to take care of details you already practiced in the previous assignment, such as cross-validation, data loading, and computing accuracies.
How to Do This Assignment.
• Each question that you need to respond to is in a blue "Task Box" with its corresponding point-value listed.
• We prefer typeset solutions (LaTeX / Word) but will accept scanned written work if it is legible. If a TA can’t read your work, they can’t give you credit.
• Programming should be done in Python and numpy. If you don’t have Python installed, you can install it from here; the same link also shows how to install numpy. You can also search the internet for numpy tutorials if you haven’t used it before. Google and APIs are your friends!
You are NOT allowed to...
• Use machine learning packages such as sklearn.
• Use data analysis packages such as pandas or seaborn.
• Discuss low-level details or share code / solutions with other students.
Advice. Start early. There are two sections to this assignment – one involving working with math (20% of the grade) and another focused more on programming (80% of the grade). Read the whole document before deciding where to start.
How to submit. Submit a zip file to Canvas. Inside, you will need to have all your working code and hw2-report.pdf.
You will also submit test set predictions to a class Kaggle. This is required to receive credit for Q8.
1 Written Exercises: Linear Regression and Precision/Recall [5pts]
I’ll take any opportunity to sneak in another probability question. It’s a small one.
1.1 Least Absolute Error Regression
In lecture, we showed that the solution for least squares regression was equivalent to the maximum likelihood estimate
of the weight vector of a linear model with Gaussian noise. That is to say, our probabilistic model was
$$y_i \sim \mathcal{N}\left(\mu = w^T x_i,\, \sigma\right) \;\longrightarrow\; P(y_i \mid x_i, w) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(y_i - w^T x_i)^2}{2\sigma^2}} \tag{1}$$
and we showed that the MLE estimate under this model also minimized the sum-of-squared-errors (SSE):
$$\underbrace{\operatorname*{argmax}_{w} \; \prod_{i=1}^{N} P(y_i \mid x_i, w)}_{\text{Likelihood}} \;=\; \underbrace{\operatorname*{argmin}_{w} \; \sum_{i=1}^{N} \left(y_i - w^T x_i\right)^2}_{\text{Sum of Squared Errors}} \tag{2}$$
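As a reminder of how that equivalence works (this is the argument from lecture, sketched here for reference): because the logarithm is monotonically increasing, maximizing the likelihood is the same as maximizing the log-likelihood,

$$\log \prod_{i=1}^{N} P(y_i \mid x_i, w) = \underbrace{N \log \frac{1}{\sigma\sqrt{2\pi}}}_{\text{constant in } w} \; - \; \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left(y_i - w^T x_i\right)^2,$$

and since the first term and the factor $\frac{1}{2\sigma^2}$ do not depend on $w$, the $w$ that maximizes the log-likelihood is exactly the $w$ that minimizes the SSE. Q1 below asks you to carry out the same style of argument for the Laplace model.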
However, we also demonstrated that least squares regression is very sensitive to outliers – large errors squared can
dominate the loss. One suggestion was to instead minimize the sum of absolute errors.
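To see the outlier sensitivity concretely, here is a minimal numpy sketch (purely illustrative — the residual values are made up for this example):

```python
import numpy as np

# Residuals for five points: four small errors and one outlier.
residuals = np.array([0.5, -0.3, 0.2, -0.4, 10.0])

sse = np.sum(residuals ** 2)     # outlier contributes 100.0 of 100.54
sae = np.sum(np.abs(residuals))  # outlier contributes 10.0 of 11.40

print(f"SSE = {sse:.2f}, SAE = {sae:.2f}")
```

Under the squared loss the single outlier accounts for over 99% of the total error, while under the absolute loss its contribution is far smaller relative to the other points — which is exactly why the sum-of-absolute-errors objective is more robust.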
In this first question, you’ll show that changing the probabilistic model to assume Laplace-distributed error yields a least absolute error regression objective. To be more precise, we will assume the following probabilistic model for how $y_i$ is produced given $x_i$:
$$y_i \sim \mathrm{Laplace}\left(\mu = w^T x_i,\, b\right) \;\longrightarrow\; P(y_i \mid x_i, w) = \frac{1}{2b}\, e^{-\frac{|y_i - w^T x_i|}{b}} \tag{3}$$
I Q1 Linear Model with Laplace Error [2pts]. Assuming the model described in Eq.3, show that the
MLE for this model also minimizes the sum of absolute errors (SAE):
$$\mathrm{SAE}(w) = \sum_{i=1}^{N} \left| y_i - w^T x_i \right| \tag{4}$$
Note that you do not need to solve for an expression for the actual MLE of w to do this problem. It is sufficient to show that the negative log-likelihood equals the SAE up to additive and positive multiplicative constants that do not depend on w, because the likelihood and the SAE would then share the same optimizing w.
1.2 Recall and Precision
y    P(y|x)         y    P(y|x)
0    0.1            0    0.55
0    0.1            1    0.7
0    0.25           1    0.8
1    0.25           0    0.85
0    0.3            1    0.9
0    0.33           1    0.9
1    0.4            1    0.95
0    0.52           1    1.0
Beyond just calculating accuracy, we discussed recall and precision as two other measures of a classifier’s abilities. Remember that we defined recall and precision in terms of true positives, false positives, true negatives, and false negatives:
$$\text{Recall} = \frac{\#\text{TruePositives}}{\#\text{TruePositives} + \#\text{FalseNegatives}} \tag{5}$$

and

$$\text{Precision} = \frac{\#\text{TruePositives}}{\#\text{TruePositives} + \#\text{FalsePositives}} \tag{6}$$
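If you would like to sanity-check your hand computations, a minimal numpy sketch of these two definitions might look like the following (the function and variable names are illustrative, not part of the provided skeleton code):

```python
import numpy as np

def recall_precision(y_true, scores, t):
    """Recall and precision, predicting positive when P(y|x) > t."""
    y_pred = (scores > t).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    recall = tp / (tp + fn)
    # Precision is undefined when nothing is predicted positive (e.g., t = 1);
    # handle that edge case explicitly rather than dividing by zero.
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    return recall, precision
```

Note that at t = 1 no point satisfies P(y|x) > t, so there are no predicted positives — keep that edge case in mind when filling in your answers.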
I Q2 Computing Recall and Precision [3pts]. To get a feeling for recall and precision, consider the set
of true labels (y) and model predictions P(y|x) shown in the tables above. We compute Recall and Precision
at a specific threshold t – considering any point with P(y|x) > t as being predicted to be the positive class
(1) and ≤ t to be the negative class (0). Compute and report the recall and precision for thresholds t = 0,
0.2, 0.4, 0.6, 0.8, and 1.
2 Implementing Logistic Regression for Tumor Diagnosis [20pts]
In this section, we will implement a logistic regression model for predicting whether a tumor is malignant (cancerous)
or benign (non-cancerous). The dataset has eight attributes – clump thickness, uniformity of cell size, uniformity
of cell shape, marginal adhesion, single epithelial cell size, bland chromatin, normal nucleoli, and mitoses – all rated
between 1 and 10. You will again be submitting your predictions on the test set via the class Kaggle. You’ll need to
download the train_cancer.csv and test_cancer_pub.csv files from the Kaggle data page to run the code.
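Before diving into the skeleton code, you may want to peek at the data yourself. A minimal numpy loading sketch might look like the following — note the column layout here is an assumption; defer to the skeleton code and the Kaggle data page for the authoritative format:

```python
import numpy as np

# Assumption: a header row, the eight attribute ratings (1-10) in the
# leading columns, and the class label in the last column.
data = np.genfromtxt("train_cancer.csv", delimiter=",", skip_header=1)
X, y = data[:, :-1], data[:, -1]

print(X.shape, y.shape)   # sanity check: expect (N, 8) and (N,)
print(np.unique(y))       # sanity check: the two class labels
```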
2.1 Implementing Logistic Regression
Logistic Regression. Recall from lecture that the logistic regression algorithm is a binary classifier that learns a linear
decision boundary. Specifically, it predicts the probability of an example x ∈ R