Problem Set 2: Scaling with Sampling
CS4787/5777 — Principles of Large-Scale ML Systems
This problem set is assigned on September 7, 2022 and is due two weeks later on September 21,
2022 at 11:59PM. You may work in groups of up to three students. Submit your solutions on
Gradescope.
Problem 1: Empirical Risk Minimization. Consider a machine learning task where we want to
train a model h that maps examples x ∈ R^d to labels y ∈ R. We do so by selecting from a hypothesis
class h_w parameterized by weights w ∈ R^d, defined by
h_w(x) = x^T w.
(Recall here that x^T w denotes the dot product of the vectors x and w.) To train this model given a
dataset of n labeled examples (x_1, y_1), (x_2, y_2), . . . , (x_n, y_n), we solve the empirical risk minimization
problem
minimize: (1/n) ∑_{i=1}^n (y_i − x_i^T w)^2 + λ∥w∥^2
over: w ∈ R^d,
where λ > 0 is a hyperparameter.
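For concreteness, here is a minimal NumPy sketch of this objective; the names empirical_risk, X, y, and lam are our own illustrative choices, not part of the problem statement.

```python
import numpy as np

def empirical_risk(w, X, y, lam):
    """Regularized empirical risk from the problem statement (illustrative helper).

    w   : (d,) weight vector
    X   : (n, d) array whose rows are the examples x_i
    y   : (n,) array of labels y_i
    lam : the hyperparameter lambda > 0
    """
    residuals = y - X @ w                    # y_i - x_i^T w, for all i at once
    return np.mean(residuals ** 2) + lam * np.dot(w, w)
```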
1(a). The given empirical risk minimization problem is an example of one of the following types of
machine learning models:
(A) Soft-margin support vector machine
(B) Hard-margin support vector machine
(C) k-Nearest neighbors classifier
(D) Perceptron
(E) Linear regression
(F) Logistic regression
Which one is it?
1(b). One way to solve this empirical risk minimization problem is to use gradient descent on the
loss. Define the loss function ℓ : R^d → R to be the function
ℓ(w) = (1/n) ∑_{i=1}^n (y_i − x_i^T w)^2 + λ∥w∥^2.
Let’s try to understand some things about this empirical risk minimization problem. Answer each of
the following questions, and provide a brief justification.
(i) Is the objective ℓ convex? (Recall that a function f is convex if it is “bowl-shaped,” i.e. if any
line segment drawn between two points on its graph lies above the graph.)
(ii) Is ℓ strongly convex? If so, can you find a simple data-independent value for its constant of
strong convexity μ? (Recall that a twice-differentiable function f is μ-strongly convex if its
second directional derivative along any direction is always no less than μ, i.e. ∂²/∂α² f(w + αu) ≥ μ
for any u, w ∈ R^d with ∥u∥ = 1. A numerical sanity check of this definition is sketched just after this list.)
(iii) Is the solution of this empirical risk minimization problem necessarily unique?
(iv) Is the gradient ∇ℓ Lipschitz continuous? If so, can you find a simple data-independent value
for its constant of Lipschitz continuity L? (Recall that a function F is L-Lipschitz continuous if
∥F(u) − F(v)∥ ≤ L∥u − v∥ for any u, v ∈ R^d.)
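The definition recalled in (ii) can be sanity-checked numerically on a small synthetic instance before you commit to a value of μ. The sketch below is entirely illustrative (the synthetic data, the loss helper, and second_directional_derivative are all our own assumptions); it estimates the second directional derivative by a centered finite difference, and is a check on your reasoning rather than a substitute for the justification.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1                       # small synthetic instance, chosen arbitrarily
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def loss(w):
    r = y - X @ w
    return np.mean(r ** 2) + lam * np.dot(w, w)

def second_directional_derivative(f, w, u, eps=1e-4):
    """Centered finite-difference estimate of (d^2/d alpha^2) f(w + alpha*u) at alpha = 0."""
    return (f(w + eps * u) - 2.0 * f(w) + f(w - eps * u)) / eps ** 2

w = rng.standard_normal(d)
u = rng.standard_normal(d)
u /= np.linalg.norm(u)                       # enforce ||u|| = 1, as in the definition in (ii)
print(second_directional_derivative(loss, w, u))  # compare against your candidate mu
```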
1(c). Write out the expression for the update step of gradient descent with learning rate α on this
problem. You will need to derive the gradient to do this.
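Once the gradient is derived, the update itself is a single line. A minimal sketch, assuming a hypothetical callable grad_loss that stands in for whatever expression you derive in this part:

```python
def gradient_descent_step(w, alpha, grad_loss):
    """One gradient descent step; grad_loss(w) should return your derived gradient of the loss."""
    return w - alpha * grad_loss(w)
```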
1(d). Now, suppose you wanted to use stochastic gradient descent (SGD) for this task to make
learning more scalable. (For this problem set, just assume the baseline SGD algorithm with no
minibatching, i.e. a batch size of 1.) Write out the expression for the update step of SGD on this
problem. You will need to derive the gradient to do this.
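Structurally, one SGD step with batch size 1 samples a single example uniformly at random and applies its gradient. A sketch under that assumption, with the hypothetical callable grad_example standing in for the single-example gradient this part asks you to derive:

```python
import numpy as np

def sgd_step(w, alpha, X, y, grad_example, rng):
    """One SGD step with batch size 1.

    grad_example(w, x_i, y_i) should return the stochastic gradient you
    derive in this part for a single sampled example.
    """
    i = rng.integers(len(y))                 # draw one index uniformly at random
    return w - alpha * grad_example(w, X[i], y[i])
```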
1(e). Your classmate, Iago, claims that gradient descent for this task is guaranteed to converge
asymptotically to some local minimum of the loss ℓ, as long as the step size is α = 1/(4λ). Is Iago
correct? If so, prove it. If not, find a counterexample.
1(f). Iago also claims that if gradient descent for this task is guaranteed to converge
asymptotically to a local minimum of the loss ℓ for some fixed step size α > 0, then
stochastic gradient descent with that same step size α (and batch size 1) is also guaranteed to
converge asymptotically to a local minimum of the loss ℓ with probability 1. Is Iago’s second claim
correct? If so, prove it. If not, find a counterexample.
Problem 2: Computational Costs of Learning Algorithms.
In this problem, we will explore the computational cost of solving the ERM task we saw in Problem
1 using gradient descent and SGD. Each part of this problem asks you to evaluate how many
floating-point operations are needed for a particular task. Note that you may need to use any of the
following types of operations:
Additions (a + b)
Subtractions (a − b)
Multiplications (a × b)
Divisions (a / b)
Be sure to show your work, including a breakdown of the floating-point operations used by type.
Also note that your answer may depend on the problem size constants n and d. Note that there are
multiple possible correct answers here, which differ depending on how accumulators are initialized
in the implementations of the various operations: any answer with a good justification will be
accepted.
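To see how accumulator initialization changes the count, consider a plain dot-product computation; the sketch below (our own illustration, not a required convention) tallies both common conventions.

```python
def dot_product_flops(d):
    """FLOP count of sum_i a_i * b_i under two accumulator conventions."""
    mults = d                    # one multiplication per coordinate
    adds_if_zero_init = d        # accumulator initialized to 0.0: d additions
    adds_if_first_term = d - 1   # accumulator initialized to a_1 * b_1: d - 1 additions
    return mults + adds_if_zero_init, mults + adds_if_first_term

print(dot_product_flops(784))    # (1568, 1567): both counts are defensible
```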
2(a). Suppose that we have already trained a model h_w from the ERM task of Problem 1. We are
going to use h_w to produce an output prediction for an example x. Suppose that we are going to
do this on a CPU using ordinary floating-point arithmetic. How many total floating-point operations
are required to compute this prediction, h_w(x)?
2(b). Now suppose that we want to know the empirical risk associated with a given weight vector.
That is, we want to compute ℓ(w). How many total floating-point operations are required to compute
this quantity, ℓ(w)?
2(c). Now imagine that we are going to run gradient descent for the Problem 1 task on a CPU
using ordinary floating-point arithmetic. Using your answer from 1(c), how many total floating-point
operations are required to compute one update step of gradient descent?
2(d). Now imagine that we are going to run SGD (with batch size 1) for the Problem 1 task
on a CPU using ordinary floating-point arithmetic. Using your answer from 1(d), how many total
floating-point operations are required to compute one update step of SGD?
2(e). Now, let’s look at a concrete task. Suppose that you are training a model where the training
examples have dimension d = 784, and there are n = 60000 total training examples. For
single-precision arithmetic, the H100 GPU accelerator has a maximum performance of 60 teraFLOPS
(60 × 10^12 floating-point operations per second). Assume that you are using single-precision arithmetic
to run gradient descent on this GPU. Using your answer from 2(c), what is the maximum number of
iterations of gradient descent you could run in an hour on this device, assuming that you are bound
only by FLOPS limitations?
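The final arithmetic can be organized as in the sketch below; flops_per_step is a loudly hypothetical placeholder for your answer from 2(c), so the printed number is only as accurate as that count.

```python
n, d = 60000, 784
peak_flops = 60e12                           # 60 teraFLOPS, from the problem statement
seconds_per_hour = 3600.0
budget = peak_flops * seconds_per_hour       # total FLOPs available in one hour

# Hypothetical stand-in: replace 4 * n * d with your actual count from 2(c).
flops_per_step = 4 * n * d

max_iters = int(budget // flops_per_step)
print(f"at most {max_iters:,} iterations")   # only as accurate as flops_per_step
```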
