Homework 7 for Math 173A - Fall 2022
1. Indicate whether the following functions are strongly convex. Justify your answer.
(a) f(x) = x
(b) f(x) = x²
(c) f(x) = log(1 + e^x)
2. Let f : Rn → R be a differentiable function. Prove the following two conditions
are equivalent (i.e. each one implies the other):
(a) f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (c/2)‖x − y‖² for all x, y ∈ Rn
(b) The function g(x) := f(x) − (c/2)‖x‖² is convex.
Hint: Recall that g(x) is convex if and only if g(y) ≥ g(x) + ∇g(x)ᵀ(y − x).
3. Here we will prove the inequality used in class to prove fast convergence for strongly
convex functions.
Let F(x) be a strongly convex function with constant c. Our goal is to show
F(x) − F(x*) ≤ (1/(2c)) ‖∇F(x)‖² for all x ∈ Rd. (1)
(a) Fix x ∈ Rd and define the quadratic function
q(y) = F(x) + ∇F(x)ᵀ(y − x) + (c/2)‖x − y‖².
Find the y* that minimizes q(y).
(b) Show that q(y*) = F(x) − (1/(2c))‖∇F(x)‖².
(c) Use the above to deduce (1).
(d) Explain the proof technique in your own words to demonstrate understanding
of what we did.
4. Let A be a diagonal matrix with diagonal entries Aii = i (i.e. the entries run from
1 to n), and let f(x) = (1/2)xᵀAx + bᵀx + c. We plan to run gradient descent on this
function, and want to understand in theory how fast this will converge.
(a) Show which assumptions are satisfied by f (i.e. bounded Hessian, strong convexity).
(b) What should we pick as a step size?
(c) What can we say about f(x^(t)) − f(x*) and its rate of decay?
5. We will implement the SVM algorithm with gradient descent to classify two Gaussians
in 2D.
(a) Sample 100 points from one Gaussian with mean (−3, −3) and one Gaussian
with mean (3, 3). Both Gaussians have identity covariance. You can do this by
calling randn (Matlab) or np.random.randn (Python) to generate each point
cloud, and then add or subtract 3 from each coordinate. Create a vector Y
of the labels that’s 1’s on one class and -1’s on the other class. Create and
turn in a scatter plot of the data points colored by label to ensure you have
correctly created your data.
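A minimal sketch of one way to do this in Python (assuming NumPy and Matplotlib; the variable names X, Y and the fixed seed are illustrative choices, not part of the assignment):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)  # optional: fix the seed so the plot is reproducible

# 100 points around (-3, -3) and 100 points around (3, 3), identity covariance
X_neg = np.random.randn(100, 2) - 3
X_pos = np.random.randn(100, 2) + 3

X = np.vstack([X_neg, X_pos])                      # (200, 2) data matrix
Y = np.concatenate([-np.ones(100), np.ones(100)])  # labels: -1 on one class, +1 on the other

plt.scatter(X[:, 0], X[:, 1], c=Y, cmap="coolwarm")
plt.title("Data colored by label")
plt.show()
```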
(b) Create a function for the gradient of the loss
L(w) = (1/2)‖w‖² + ∑_{i=1}^{n} max(0, 1 − yi〈xi, w〉),
∇L(w) = w + ∑_{i=1}^{n} (−yi xi) · 1{1 − yi〈xi, w〉 > 0},
where the indicator 1{1 − yi〈xi, w〉 > 0} equals 1 if 1 − yi〈xi, w〉 > 0 and 0 otherwise. Also, here
n = 200. To compute the gradient, you’ll have to compute an indicator of whether
1 − yi〈xi, w〉 is positive or negative at every point, and sum up the contribution
of this term for all points where it’s positive.
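One possible vectorized version of this gradient (a sketch assuming the X and Y arrays from part (a); the name grad_loss is just a placeholder):

```python
def grad_loss(w, X, Y):
    """Gradient of L(w) = 0.5*||w||^2 + sum_i max(0, 1 - y_i <x_i, w>)."""
    margins = 1 - Y * (X @ w)   # 1 - y_i <x_i, w> for every point
    active = margins > 0        # indicator of the points where this is positive
    # each active point contributes -y_i * x_i; the rest contribute nothing
    return w - (Y[active, None] * X[active]).sum(axis=0)
```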
(c) Setting the step size μ = 10⁻⁴ and starting at w^(0) = (1, 1), run 1000 iterations
of gradient descent. You will create two plots.
i. Plot the classification error (averaged over all the points) as a function
of the iterations. The classification of xi is determined by sign(〈xi, w〉).
ii. Plot the margin 2/‖w‖ as a function of the iterations. This shows how much
of a gap you have between the classes you’ve learned.
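A sketch of the descent loop and the two plots, assuming the grad_loss function and the data arrays from the previous parts:

```python
mu = 1e-4
w = np.array([1.0, 1.0])   # w^(0) = (1, 1)
errors, margins = [], []

for t in range(1000):
    w = w - mu * grad_loss(w, X, Y)        # one gradient descent step
    preds = np.sign(X @ w)                 # classify each x_i by sign(<x_i, w>)
    errors.append(np.mean(preds != Y))     # fraction of points misclassified
    margins.append(2 / np.linalg.norm(w))  # margin 2/||w||

plt.figure(); plt.plot(errors)
plt.xlabel("iteration"); plt.ylabel("classification error")
plt.figure(); plt.plot(margins)
plt.xlabel("iteration"); plt.ylabel("margin 2/||w||")
plt.show()
```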
(d) Create another scatter plot of your data, but this time color the points by the
function f(xi) = 1 − yi〈xi, w〉. The values closest to 0 (the smallest positive values or
the largest negative values) will show you which points were “most important”
in determining the classification.
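For example, something along these lines (again assuming the arrays and the final w from the previous parts):

```python
f_vals = 1 - Y * (X @ w)   # f(x_i) = 1 - y_i <x_i, w>
plt.scatter(X[:, 0], X[:, 1], c=f_vals, cmap="viridis")
plt.colorbar(label="1 - y_i <x_i, w>")
plt.title("Points colored by 1 - y_i <x_i, w>")
plt.show()
```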
Note: Here we’re only defining a subspace classifier (i.e. the classifier goes through
the origin). This is fine for our problem as the Gaussians are on opposite sides of
the origin. If you want to create an intercept term, simply append a vector of all
1’s as a column of your data, and now your weight vector will be of dimension 3
instead of 2. This is done the same way as when running least squares regression.
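If you do want the intercept, the augmentation could look like this (the names X_aug and w0 are illustrative):

```python
# append a column of 1's; the last weight then acts as the intercept
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # shape (200, 3)
w0 = np.array([1.0, 1.0, 0.0])                     # starting point now lives in R^3
```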