Homework 3
CS534 Machine Learning, Fall 2021
This homework will explore validation and model selection procedures
using regularized logistic regression models and nested cross validation.
Problem 1 - Cross validation (60 points)
This elastic-net regularized logistic regression model is derived by minimiz-
ing the negative log likelihood function for samples (xi, gi), gi ∈ {1, 2}
max
β,β0
`(β, β0) = max
β,β0
1
N
N∑
i=1
{I(gi?1)log p(xi)+I(gi?2)log (1?p(xi))}?λPα(β),
where the regularization penalty Pα(β) = (1?α)12‖β‖22+α‖β‖1 is a function
of the free parameters λ, α, and where the indicator function I(0) = 1 and
zero elsewhere. The weight λ determines the magnitude of regularization,
and the mixing parameter α determines the proportion of the penalty allo-
cated to the 1 and 2 norms. The mixing parameter α is often not chosen
through cross validation but just set using intuition or experience, where
the weight λ is typically determined through cross validation.
In this problem you will use a built-in function for elastic-net logistic
regression. The mixing parameter will be fixed α = 0.95 and you will search
for the optimal regularization weight lambda in the range λ ∈ [0, 100]. These
are often spaced on a logarithmic scale at 100 or more intervals.
Perform a nested cross validation using five folds to determine the op-
timal regularization weight and report test error. In each step, 4/5 of the
data will be used for training and validation, and 1/5 will be used to report
test error.
1.a. Validation diagram (20 points) Draw a diagram of the cross valida-
tion where you indicate which samples are in training, testing, and validation
in each of the 5 nested folds. Depict both the inner and outer loops of the
nested CV. Indicate clearly how the data is segmented, and the fraction of
the data used in each segment.
1
1.b. Model selection (20 points) For each outer CV fold, generate a plot
of the classification error on the validation set as a function of the sequence
of λ values (five plots total). Use the inner CV folds to calculate standard
deviations of the error σE(λ) at each λ, as well as the mean error μE(λ).
Choose the optimal lambda λ? as the largest λ within 1-standard devi-
ation of the minimum average error
λmin = argmin
λ
μE(λ),
λ? = max{λ} subject to λ ≤ μE(λmin) + σE(λmin)
In each plot, indicate μE(λ) with a solid black line, and the intervals of
variance μE(λ) ± σE(λ) in red, μE(λmin) as a blue point and μE(λ?) as a
green point.
1.c. Test error (20 points) Generate a box plot of the five test errors
generated by cross-validation. Display the validation errors μE(λ?) in two
additional boxplots in the same graph (there are 20 training errors and 20
validation errors). Compare and discuss the errors.
2
Problem 2 - Elastic net logistic regression (40 points)
Implement the elastic net regression algorithm using the soft-thresholding
and iterative reweighted least-squares approach described in Friedman 2010.
This problem can be solved by optimizing the penalized negative log likeli-
hood
min`(β, β0) =
1
N
N∑
i=1
{I(gi? 1)log p(xi) + I(gi? 2)log (1? p(xi))}?λPα(β),
where the indicator function I(0) = 1 and zero elsewhere, and p(xi) repre-
sents the class probability
Pr(G = 1|X = xi) = p(xi) = 1
1 + e?(β0+xTi β)
.
A quadratic approximation of this likelihood `Q can be used to transform
this problem into a weighted least-squares problem with weights wi and
response zi (see Equation 10 in Friedman paper)
`Q(β, β0) = ? 1
2N
N∑
i=1
wi(zi ? β0 ? xTi β)2.
where
zi = β?0 + x
T
i β? +
yi ? p?(xi)
wi
,
wi = p?(xi)(1? p?(xi)).
2.a. Solution path
Implement this algorithm using the mixing parameter α = 0.95 and plot the
solution path β(λ) for a sequence of 1000 logarithmically spaced λ in the
range λ ∈ [1, 100]. Label each feature j using text on the plot at the point
where βj enters the model (when this model coefficient becomes non-zero).
3
Notes
Understanding the approach at a high-level is important. You are
going to solve the quadratic approximation repeatedly using equation
10 from the paper, and at each iteration you will recalculate zi, wi to
update the quadratic approximation `Q.
You will generate a solution for each value of λ. A fixed number of
iterations is the easiest approach for deciding when each problem is
solved (50 worked for me).
To accelerate convergence you can use a warm-starting technique.
Start with the largest λ which will have the fewest non-zero coeffi-
cients. Use the solution of this regularization level as the start for the
next smallest lambda, etc.
This problem has some numerical issues that need to be addressed.
The weights, wi can shrink to zero and cause NaN values where used
in division. The parameters β will continue to grow in an attempt to
push the probabilities pi to zero or one. For this reason, you can clamp
these to 0, 1 once they are within a reasonable distance (say 1e ? 5).
The solutions may also become unstable when the regularization gets
too small (this will be apparent in the solution path plot).