CSE 5523: Machine Learning - Midterm

Due: 11:59 pm 03/07/2024

1) True/False (30 = 10 × 3 pts). Are the following statements true or false? (No explanation is needed unless you feel the question is ambiguous and want to justify your answer.)

(a) The core assumption of Naive Bayes classifiers is that all observed variables (features) are statistically independent, i.e., $P(X_1, X_2, \dots, X_D) = P(X_1) P(X_2) \cdots P(X_D)$.

True False

(b) Linear Discriminant Analysis can only be applied when the dataset in question is linearly separable.

True False

(c) Regularized linear regression (with L2 regularization) can be interpreted as the MAP estimate of a model in which the weights w are endowed with a Gaussian prior.

True False

(d) The specification of a probabilistic discriminative model can often be interpreted as a method for creating new, “fake” data.

True False

(e) Gaussian discriminant analysis as an approach to classification cannot be applied if the true class-conditional density for each class is not Gaussian.

True False

(f) The optimal soft-margin hyperplane classifier tends to have a larger margin when the parameter C increases.

True False

(g) Given a labeled training set $D_{tr} = \{(x_i, y_i)\}_{i=1}^{N}$ of a five-class classification problem (for example, “cat”, “dog”, “panda”, “bird”, “fish”), five-fold cross-validation separates $D_{tr}$ into five folds, where each fold contains data of a single class.

True False

(h) Let both $k_1(\cdot, \cdot)$ and $k_2(\cdot, \cdot)$ be valid kernel functions $\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. Define the new function $\tilde{k}(u, v) := k_1(u, v) - k_2(u, v)$. Then $\tilde{k}$ is also a valid kernel function.

True False

(i) Gradient descent is guaranteed to converge to a global minimum for any kind of function.

True False

(j) Assume we have trained a model for linear discriminant analysis, and we obtained parameters $\Sigma$, the covariance matrix, and $\mu_1, \mu_2$, the class means. We learned in class that the decision boundary between classes c = 0 and c = 1, i.e., the set $\{x : P(y = c \mid x; \Sigma; \mu_1; \mu_2) = 0.5\}$, is linear in the input space. But it is not linear at thresholds other than 0.5; for example, the set $\{x : P(y = c \mid x; \Sigma; \mu_1; \mu_2) = 0.9\}$ is not an affine subspace.

True False

2) Multiple choice (20 = 5 × 4 pts).

(a) Linear regression can be interpreted from the probabilistic perspective, where the model $y = w^T x + \epsilon$ has a noise term $\epsilon$. Given a training set $\{(x_i, y_i)\}_{i=1}^{N}$, which choice of distributions for the noise and the weights turns MAP (maximum a posteriori) estimation into the following optimization problem

where λ > 0 is the hyper-parameter. Hint: you can assume the Gaussian has zero mean and an identity covariance matrix. The Laplacian distribution has the form $p(z) \propto \exp(-\|z\|_1)$, where $\|\cdot\|_1$ is the L1 norm.

A. p(ϵ) is a Gaussian distribution; p(w) is a Gaussian distribution

B. p(ϵ) is a Gaussian distribution; p(w) is a Laplacian distribution

C. p(ϵ) is a Laplacian distribution; p(w) is a Gaussian distribution

D. p(ϵ) is a Laplacian distribution; p(w) is a Laplacian distribution

(b) Choose the correct statements below.

A. In regularized empirical risk minimization, L1 regularization tends to produce sparser solutions than L2 regularization.

B. Overfitting means the model cannot perform well (for example, cannot classify correctly) on the training data.

C. The optimal solution of the empirical risk minimization problem is the Bayes optimal classifier.

D. The Bayes optimal classifier can achieve 0 training and test error (or loss) on arbitrary data.

(c) Which of the following are generative models?

A. Gaussian discriminant analysis

B. Naive Bayes

C. Logistic regression

D. Support vector machine

(d) Which of the following algorithms cannot generate a nonlinear decision boundary?

A. Nearest neighbor

B. Perceptron

C. Gaussian discriminant analysis

D. Naive Bayes classifier

(e) Choose the correct statements about the perceptron algorithm below.

A. The perceptron algorithm finds the maximum margin classifier if the data is linearly separable.

B. If the data is linearly separable, the perceptron algorithm is guaranteed to converge.

C. If you do T sequential updates with the perceptron, you will arrive at the same final w regardless of the order of the updates.

D. The perceptron algorithm learns a generative model.
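
For reference when reading the perceptron statements above, here is a minimal sketch of the classic mistake-driven perceptron update. The function name `perceptron`, the bias handling, the epoch cap, and the toy data are all illustrative assumptions, not part of the exam or the course material.

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Mistake-driven perceptron with a bias term (illustrative sketch).

    X: (n, d) array of features; y: (n,) array of labels in {-1, +1}.
    Returns (w, b, converged), where converged is True if a full pass
    over the data produced no mistakes.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            # A point is misclassified (or on the boundary) when the
            # signed score is non-positive.
            if yi * (w @ xi + b) <= 0:
                w += yi * xi
                b += yi
                mistakes += 1
        if mistakes == 0:
            return w, b, True
    return w, b, False

# Hypothetical, linearly separable toy data.
X_toy = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y_toy = np.array([1, 1, -1, -1])

w_fit, b_fit, converged = perceptron(X_toy, y_toy)
print(w_fit, b_fit, converged)
```

Because each update depends on the current `w`, presenting the same points in a different order can change which points trigger updates; the sketch makes that dependence easy to experiment with by shuffling `X_toy` and `y_toy`.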

3) Bayes Optimal Classifier (10 pts).

(a) (5 pts) Consider a binary classification problem where

Use the Bayes optimal classifier to predict the labels of the feature vectors $x_1 = [-1, 5]^T$ and $x_2 = [3, 3]^T$.

(b) (5 pts) Consider a one-dimensional feature $X \in \mathbb{R}$ and a binary label $Y \in \{+1, -1\}$. Given the following

Suppose we are given a new feature x = 4 and we want to find the prediction $\hat{Y}$ that minimizes the expected loss $E[\mathbb{1}(\hat{Y} \neq Y)]$. What is the prediction for x = 4?

4) MLE for the Exponential Distribution. Consider an exponential distribution. The density function is given by

(a) Given a dataset $\{x_1, x_2, \dots, x_n\}$, what is the maximum likelihood estimate $\hat{\lambda}_{ML}$ of the parameter λ?

(b) What probability would the model assign to a new data point $x_{n+1}$ using $\hat{\lambda}_{ML}$? (i.e., find $P(x_{n+1}; \hat{\lambda}_{ML})$)
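
The density for this question did not survive in the text above, so the following sketch assumes the common rate parameterization $p(x; \lambda) = \lambda e^{-\lambda x}$ for $x \ge 0$. Under that assumption the log-likelihood is $n \log \lambda - \lambda \sum_i x_i$, which is maximized at $\hat{\lambda} = n / \sum_i x_i$. The function names and the sample values are hypothetical; if the exam uses a different parameterization, the estimate changes accordingly.

```python
import numpy as np

def exp_log_likelihood(lam, x):
    """Log-likelihood of rate `lam` under the ASSUMED density lam * exp(-lam * x)."""
    return len(x) * np.log(lam) - lam * np.sum(x)

def exp_mle(x):
    """Closed-form maximizer of the log-likelihood above: n / sum(x)."""
    return len(x) / np.sum(x)

def exp_density(x_new, lam):
    """Density the fitted model assigns to a new point (zero for negative x)."""
    return lam * np.exp(-lam * x_new) if x_new >= 0 else 0.0

# Hypothetical sample, chosen only for illustration.
x_demo = np.array([0.5, 1.0, 1.5, 2.0])

lam_hat = exp_mle(x_demo)

# Cross-check the closed form against a brute-force grid search.
grid = np.linspace(0.01, 5.0, 2000)
lam_grid = grid[np.argmax([exp_log_likelihood(l, x_demo) for l in grid])]

print(lam_hat, lam_grid, exp_density(1.0, lam_hat))
```

The grid search is only a sanity check on the closed form; it is not needed to answer the question.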

5) Linear regression (10 pts). Consider a linear regression problem, where we have two data points

x1 = 0; x2 = 1

y1 = 0; y2 = 2

Suppose we want to find w to minimize the following:

where |w| is the absolute value of w. What is the optimal value for w? Note that we do not have bias term b in our regression model (f(x) = w · x is our linear model).

6) Maximum Margin Classifier and SVM (15 pts).

(a) (10 pts) Consider a dataset with two data points (x1 , y1 ), (x2 , y2 ):

$x_1 = [1, 1]^T$; $y_1 = -1$

$x_2 = [-1, -1]^T$; $y_2 = +1$

Find the parameters w* , b* of maximum margin classifier by solving the following optimization:

(b) (5 pts) In class we learnt that an SVM can classify linearly inseparable data by transforming it to a higher-dimensional space with a kernel $K(x, z) = \phi(x)^T \phi(z)$, where $\phi(x)$ is a feature mapping. Let $K_1$ be a kernel over $\mathbb{R}^n \times \mathbb{R}^n$, let $c \in \mathbb{R}^+$ be a positive constant, and let $\phi_1 : \mathbb{R}^n \to \mathbb{R}^d$ be the feature mapping of $K_1$. Explain how to use $\phi_1$ to obtain the kernel $K(x, z) = cK_1(x, z)$.
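
Both parts of question 6 can be checked numerically. The sketch below rests on two assumptions that are not stated in the text. For part (a), it assumes the optimization is the standard hard-margin problem, minimize $\frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) \ge 1$; with only two points of opposite labels, both constraints are tight and the minimum-norm `w` lies along the difference of the two points. For part (b), it uses a hypothetical base kernel $K_1(x, z) = (x^T z)^2$ on $\mathbb{R}^2$ with its explicit feature map, and tests the candidate mapping $\phi(x) = \sqrt{c}\,\phi_1(x)$.

```python
import numpy as np

# ---- Part (a): two-point hard-margin classifier (ASSUMED standard form) ----
x_neg = np.array([1.0, 1.0])     # x1, label -1
x_pos = np.array([-1.0, -1.0])   # x2, label +1

# Tight constraints give w.(x_pos - x_neg) = 2; the minimum-norm w
# satisfying this is parallel to the difference vector.
d = x_pos - x_neg
w_star = 2.0 * d / (d @ d)
b_star = 1.0 - w_star @ x_pos

# Functional margins y_i * (w.x_i + b) for both points.
margins = np.array([
    -1.0 * (w_star @ x_neg + b_star),
    +1.0 * (w_star @ x_pos + b_star),
])
print(w_star, b_star, margins)

# ---- Part (b): scaling a kernel through its feature map ----
def phi1(x):
    """Explicit feature map of the HYPOTHETICAL kernel K1(x, z) = (x.z)^2 on R^2."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def k1(x, z):
    """Hypothetical base kernel used only for this illustration."""
    return float(x @ z) ** 2

def phi_scaled(x, c):
    """Candidate feature map for c * K1: scale phi1 by sqrt(c)."""
    return np.sqrt(c) * phi1(x)

c = 3.0
x = np.array([1.0, 2.0])
z = np.array([-0.5, 4.0])

lhs = phi_scaled(x, c) @ phi_scaled(z, c)   # inner product in feature space
rhs = c * k1(x, z)                          # scaled kernel value
print(lhs, rhs)
```

The scaling works for any base kernel because the inner product is bilinear: the factor $\sqrt{c}$ appears once in each argument and multiplies out to $c$.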