CSCC11程序讲解、辅导Python程序设计、Python编程调试讲解数据库SQL|解析Haskell程序

CSCC11 Introduction to Machine Learning, Winter 2021
Assignment 2, Due Thursday, February 25, 10am
This assignment makes use of material from week 3 to week 5 (specifically Chapter 9.6). To begin the programming
component, download a2.tgz from the course website and untar it. A directory A2 will be created; please
don’t change its structure, nor the headers of the python methods therein. Please read the assignment handout
carefully. Submission instructions are at the end.
We prefer you ask questions on Piazza. Otherwise, for your first point of contact, please reach out to your
TA, Mustafa Ammous, through email: mustafa.ammous@mail.utoronto.ca.
Written Component
1) Understanding Binary Class Logistic Regression
In this question, we investigate the assumptions and limitations of the binary class logistic regression model.
Suppose we have two classes with labels c1 and c2, we can express the posterior probability of class c1 given
input x as
P(c1|x) = 1
1 + e−wT x
= g(wT x), (1)
where w is the weight vector with bias included and g is the Sigmoid function.
Written Problems.
1. First, let’s investigate the decision boundary of a binary class logistic regression model. Recall that the
decision boundary is the set of points where P(c1|x) = 0.5 (i.e. when the decision function α(x) = 0).
Show that when P(c1|x) = 0.5, the decision boundary α(x) = 0 is a linear function.
2. Knowing that logistic regression has a linear decision boundary, what datasets would this model perform
poorly on? In this case, performing poorly means the model is unable to classify all the training data
correct. Let’s look at a toy example – the XOR (Exclusive OR) dataset. XOR is a binary input Boolean
function that outputs TRUE when both inputs are the same and outputs FALSE otherwise. The dataset
is represented as the table below. Our goal is to predict the output y ∈ {0, 1} given the input vector
x = [x1, x2]T, where x1, x2 ∈ {0, 1}.
(a) Assume we are using binary class logistic regression on the XOR dataset. What is the maximum
classification accuracy we can obtain? Explain.
(b) As in basis function regression, we can apply basis functions to create a more sophisticated model.
Consider the following feature map (basis functions) on the inputs:
Then, we can express the posterior probability of class c1 given input x as P(c1|x) = g(wTψ(x)),
where w = [w1, w2, w3]
T
is the weight vector of the model. Note that we exclude the bias term in
this question. Specify the conditions and provide an example for the weight vector w such that this
model perfectly classifies the XOR dataset.
2) Multiclass Logistic Regression
In this question we will consider multi-class logistic regression. We will begin by extending the formulation of
2-class logistic regression to the multi-class case. For background, see Chapter 9.6 of the online lecture notes.
We will denote input feature vectors by x, and let’s assume the first element of x is 1 to allow for a bias term;
i.e., x = (1, x1, ..., xd)
T
. For now, suppose there are 2 classes, with class labels 1 and 2. And let c denote the
class variable. In these terms, the conditional probability of class 1, given the data, is simply
p(c = 1 | x) = p(c = 1) p(x | c = 1)
p(c = 1) p(x | c = 1) + p(c = 2) p(x|c = 2) (3)
For the purposes of logistic regression, we define a mapping from an input feature vector to what we might call
unnormalized probabilities as follows:
for some unknown constant α, where k is either 1 or 2 in our 2-class case. Such unnormalized probabilities
are just an exponentiated linear functions of the inputs (with the addition of the bias term which is the first
element of the weight vector). We call them unnormalized because we don’t know the value of α, and hence the
exponentiated inner products e
wT
k
x don’t sum to one like a probability distribution should.
For two classes the model is specified by two weight vectors, w1 and w2. And according to the model, Eqn.
(3) can be rewritten as follows.
This is immediately recognizable as the sigmoidal function g(·) from Eqn. (39) in Chapter 9.6 of the online
lecture notes, but here it is applied to (w1 − w2)
T x instead of wT x. In other words, in the online notes on
2-class logistic regression, the weights we learned were equal to the difference of the weights used here, i.e.,
w1 − w2, which defines the normal vector to the decision boundary (w1 − w2)
T x = 0.
Now we’re ready to look at the more general case with K classes. Assume again that inputs are x =(1, x1, ..., xd)T. And for each of K classes, let there be a weight vector, wk = (bk, wk,1, ..., wk,d)T. Assuming c
denotes the class variable, we can write the probability for class c = k, where 1 ≤ k ≤ K, as
p(c = k | x) = p(c = k) p(x | c = k)PK`=1 p(c = `) p(x | c = `)
(4)
As above, the logistic regression model defines a parametric mapping from inputs to unnormalized probabilities.
Given the model parameters, w1:K, the model specifies that
p(c = k | x, w1:K) = e. (6)
The sum in the denominator in Eqn. (6), ` takes on all values between 1 and K, except k. Note that while Eqn.
(6) more closely resembles the sigmoidal function in 2-class logistic regression, it will be more convenient to use
Eqn. (5) in what follows. The function in Eqn. (5), which maps K score values (i.e., the inner products wTkx) to
probabilities, is used often in deep learning; in neural network jargon it is called a softmax function.
Now let’s consider the objective for learning. As with 2-class logistic regression we are going to use a log
loss. Given training data {(xi, yi)}Ni=1, assumed to be IID, we form the loss under the model as
E(w1:K) = − log YNi=1p(yi| xi, w1:K) .
In the lectures and notes on the 2-class case we manipulated the form of the log probability of the model in terms
of the sigmoid function with the class labels being either 0 or 1. We can do the same thing here but it’s just a
little trickier. Rather than assume that yi
takes on a value from 1 to K, instead we let yi be a so-called one-hot
binary vector. That is, we define yi
to be a vector of length K whose elements are 0 or 1. When the correct class
for the ith data point is k, then all elements of yi are 0 except for the kth; i.e.,
1 when k is the correct class
0 otherwise
With this one-hot label encoding we can express the negative log likelihood of the training data as follows:
Next we need to derive the form of the gradients of E so we can figure out the necessary conditions for the
optimal values of the parameters w1:K. To this end, just as we defined the sigmoid function for 2-class logistic
regression, here we use a similar notation to define the softmax function: i.e., (8)
We will solve the learning problem by minimizing E using gradient descent to find the weight vectors. But we
still need to find the gradient first.
Written Problems.
1. Find the gradient of σ(z1:K, k) in Eqn. (7) with respect to zj :
∂σ(z1:K, k)
∂zj
. (9)
Hint: Consider the cases when k = j and k 6= j. You may find it helpful to look at the structure of the
gradient in the 2-class case in Eqn. (46) in Chapter 9.6 of the online lecture notes.
2. Find the gradient of the log likelihood for a single point (x, y) with respect to wj :
∂
∂wj
X
K
k=1
yk log σ(z1:K, k) (10)
3. Find the gradient of the loss with respect to wj :
∂E(w1:K)
∂wj
. (11)
Hint: Use the results above. And the gradient should have a form similar to Eqn. (48) in Chapter 9.6 of
the online lecture notes.
4. Now, suppose we have D dimensional inputs and K classes. For each of K classes, let there be a weight
vector, wk = (bk, wk,1, ..., wk,d)
T
. If we include regularization, with a (diagonal) Gaussian prior, the
negative log-posterior becomes, up to an additive constant,
Eˆ(w1:K) = − log ", (12)
where p(w1:K) is the joint distribution over K indepdendent Gaussian densities with the same diagonal
covariance. That is,(13)
where the covariance matrix is given by C = diag(β, α, α, . . . , α) ∈ R
(D+1)×(D+1). Here, α denotes the
variance of the prior on each of the weights, and β is the prior variance on the bias term. Usually we don’t
want a strong prior on the bias term so β α. Derive ∂Eˆ
∂wk
for this regularized objective function.
Hint: Your negative log-posterior should have form Eˆ(w1:K) = E(w1:K) + E2(w1:K), where E2(w1:K)
is the regularization term.
Programming Component
You are asked to implement a logistic regression classification algorithm (we provide gradient descent code). The
starter code is in directory A2. You will then apply this algorithm on four datasets, each comprises a training set
and a test set. You should train your algorithm on the training data, then evaluate performance on the test set.
The test performance should be measured as the average 0-1 test loss; i.e., once you’ve fit a model to the training
data, evaluate it by counting the number of correctly labelled test points and dividing the count by the size of the
test set.
Datasets: In the starter file there is a datasets directory. In that directory there are several pickle files for the
different datasets, as described below.
• Generic datasets: There are three generic datasets, generic_.pkl where = {1, 2, 3},
all of which have the same format. Given 2D, real-valued measurements (features), we want to classify
whether a test point is class 0 or class 1 for the first two datasets, and class 0, class 1, or class 2 for the third
dataset. Each pickle file contains four arrays, train_X, train_y, test_X, and test_y. The arrays
train_X ∈ R
100×2
and train_y ∈ R
100×1
form 100 input/output pairs of the training set. The arrays
test_X ∈ R
50×2
and test_y ∈ R
50×1
form 50 input/output pairs of the test set.
• Wine dataset: The wine dataset, wine.pkl, has 13 attributes: Alcohol, Malic acid, Ash, Alcalinity of
ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue,
OD280/OD315 of diluted wines and Proline (i.e. 13D measurements/features). Given the 13D measurements,
we want to classify whether the wine class is wine 0, wine 1, or wine 2 (i.e. 3 classes). The file
contains four arrays, train_X, train_y, test_X, and test_y. The arrays train_X ∈ R
148×13 and
train_y ∈ R
148×1
form 148 input/output pairs of the training set. The arrays test_X ∈ R
30×13 and
test_y ∈ R
30×1
form 30 input/output pairs of the test set.
Visualizing the Generic Datasets: You need to modify visualize_generic.py to visualize the three
generic datasets. Once you have visualized the datasets (NOTE: You do not need to submit the generated plots),
answer the following questions under Visualization in Questions.txt.
1. Do you expect logistic regression to perform well on generic_1? Why? What if we apply the feature
map defined in Eqn. 2?
2. Do you expect logistic regression to perform well on generic_2? Why? What if we apply the feature
map defined in Eqn. 2?
3. Do you expect logistic regression to perform well on generic_3? Why?
4. Why can’t we directly visualize the wine dataset? What are some ways to visualize it?
Implementing Logistic Regression: You should implement the multi-class logistic regression with regularization
explained in the written component and Chapter 9.6 of the online lecture notes. The gradient for the model is
derived in the written component above. You only need to modify the file logistic_regression.py. You
might find utils.py to be helpful for your implementation. The relevant methods are
• _compute_loss_and_gradient computes the negative log likelihood and its gradient given the inputs
and outputs. It is essential for learning the logistic regression model parameters. When the hyperparameters
α
−1
and β
−1
are non-zero, we compute the negative log posterior instead. IMPORTANT
NOTE: For numerical stability, divide the negative log likelihood by the number of data points, drop all
log constants, and drop all constant factors. Your loss should be in the form Eˆ
avg(w) = E(w)
N + E2(w),
where N is the number of data points. Then, compute the gradient based on Eˆ
avg.
• learn is a template for a gradient descent method, with line search, for minimizing the negative log likelihood
of the data. It uses _compute_loss_and_gradient extensively, and it provides a facility to
check your gradients numerically. Specifically, by setting the check_grad flag, it computes the numerical
gradient using finite difference. It uses your implementation of negative log likelihood and negative
log posterior for the computation.
• predict evaluates the logistic regression model given an array of exemplars and the weight parameters.
It computes the posterior class probability given the data and model parameters. This function is used
mainly for classification.
Training Logistic Regression and Evaluating Performance: To train and evaluate the classifier, you will learn
models with different parameter settings, and compute the test errors for them on different datasets. The Python
file train_logistic_regression.py allows you to train and evaluate a model given a dataset, random
seed, hyperparameters, and initial parameters for training. You will need to implement the train function and
the feature_map in train_logistic_regression.py. Specifically, you will need to implement the
code to initialize logistic regression model, initialize its parameters, train the model, and evaluate the training
and test performance with and without the feature map defined in Eqn. 2. Note that only the generic datasets
can be used with the feature map since it is meant for 2D inputs. To apply feature map on the inputs, set
apply_data_preprocessing = True.
Finally, run logistic regression using train_logistic_regression.py and answer the following
questions under Analysis in Questions.txt:
1. First, we have an experiment on generic_1 dataset. Run logistic regression without regularization and
without feature map. Did you run into any numerical errors? If so, why do you think this is the case? Now,
run logistic regression with regularization. What happens? What are the train and test accuracies?
2. Now, let’s look at generic_2 dataset. Run logistic regression without regularization and without feature
map. What are the train and test accuracies? Run it with feature map now, did the performance get better?
Why do you think that is the case?
3. generic_3 is a multi-class classification problem. Run logistic regression without regularization and
without feature map. What are the train and test accuracies? What if we run it with feature map?
4. Wine dataset: Does logistic regression perform well on this dataset?
Submission
You will need to submit two files, one for each component. The specifications of the file format and naming
convention are specified below.
1. Written Component
You may hand write or type up (e.g. Word, LateX, etc.) the solutions. We strongly recommend typing it
up for legibility. Compile the document as a PDF named A2sol UTORID.pdf (e.g. A2sol student1.pdf).
Then you should submit the pdf file onto Quercus.
2. Programming Component
Compress the A2 directory into a tar file named A2sol UTORID.tgz (e.g., A2sol student1.tgz).
Please don’t modify the method/function headers or the structure of the A2 directory tree since the
automarker will fail to run. Failing to follow the instructions will result in a 25% penalty.
To tar and zip the contents, use the command: tar -cvzf name of tar.tgz A2. Make sure to
decompress the tar file to ensure the entire A2 directory is there. Then you should submit the tar file onto
Quercus.

联系我们

QQ：99515681

邮箱：99515681@qq.com

工作时间：8:00-21:00

微信：codinghelp

热点文章

更多

辅导 comm2000 creating socia... 2026-01-08

讲解 isen1000 – introductio... 2026-01-08

讲解 cme213 radix sort讲解 c... 2026-01-08

辅导 csc370 database讲解迭代 2026-01-08

讲解 ca2401 a list of colleg... 2026-01-08

讲解 nfe2140 midi scale play... 2026-01-08

讲解 ca2401 the universal li... 2026-01-08

辅导 engg7302 advanced compu... 2026-01-08

辅导 comp331/557 – class te... 2026-01-08

讲解 soft2412 comp9412 exam辅... 2026-01-08

讲解 scenario # 1 honesty讲解... 2026-01-08

讲解 002499 accounting infor... 2026-01-08

讲解 comp9313 2021t3 project... 2026-01-08

讲解 stat1201 analysis of sc... 2026-01-08

辅导 stat5611: statistical m... 2026-01-08

辅导 mth2010-mth2015 - multi... 2026-01-08

辅导 eeet2387 switched mode ... 2026-01-08

讲解 an online payment servi... 2026-01-08

讲解 textfilter辅导 r语言 2026-01-08

讲解 rutgers ece 434 linux o... 2026-01-08

热点标签

mktg2509 csci 2600 38170 lng302 csse3010 phas3226 77938 arch1162 engn4536/engn6536 acx5903 comp151101 phl245 cse12 comp9312 stat3016/6016 phas0038 comp2140 6qqmb312 xjco3011 rest0005 ematm0051 5qqmn219 lubs5062m eee8155 cege0100 eap033 artd1109 mat246 etc3430 ecmm462 mis102 inft6800 ddes9903 comp6521 comp9517 comp3331/9331 comp4337 comp6008 comp9414 bu.231.790.81 man00150m csb352h math1041 eengm4100 isys1002 08 6057cem mktg3504 mthm036 mtrx1701 mth3241 eeee3086 cmp-7038b cmp-7000a ints4010 econ2151 infs5710 fins5516 fin3309 fins5510 gsoe9340 math2007 math2036 soee5010 mark3088 infs3605 elec9714 comp2271 ma214 comp2211 infs3604 600426 sit254 acct3091 bbt405 msin0116 com107/com113 mark5826 sit120 comp9021 eco2101 eeen40700 cs253 ece3114 ecmm447 chns3000 math377 itd102 comp9444 comp(2041|9044) econ0060 econ7230 mgt001371 ecs-323 cs6250 mgdi60012 mdia2012 comm221001 comm5000 ma1008 engl642 econ241 com333 math367 mis201 nbs-7041x meek16104 econ2003 comm1190 mbas902 comp-1027 dpst1091 comp7315 eppd1033 m06 ee3025 msci231 bb113/bbs1063 fc709 comp3425 comp9417 econ42915 cb9101 math1102e chme0017 fc307 mkt60104 5522usst litr1-uc6201.200 ee1102 cosc2803 math39512 omp9727 int2067/int5051 bsb151 mgt253 fc021 babs2202 mis2002s phya21 18-213 cege0012 mdia1002 math38032 mech5125 07 cisc102 mgx3110 cs240 11175 fin3020s eco3420 ictten622 comp9727 cpt111 de114102d mgm320h5s bafi1019 math21112 efim20036 mn-3503 fins5568 110.807 bcpm000028 info6030 bma0092 bcpm0054 math20212 ce335 cs365 cenv6141 ftec5580 math2010 ec3450 comm1170 ecmt1010 csci-ua.0480-003 econ12-200 ib3960 ectb60h3f cs247—assignment tk3163 ics3u ib3j80 comp20008 comp9334 eppd1063 acct2343 cct109 isys1055/3412 math350-real math2014 eec180 stat141b econ2101 msinm014/msing014/msing014b fit2004 comp643 bu1002 cm2030

联系我们 - QQ: 99515681 微信：codinghelp

© 2024 www.7daixie.com 站长地图

程序辅导网！

CSCC11程序讲解、辅导Python程序设计、Python编程调试 讲解数据库SQL|解析Haskell程序

CSCC11程序讲解、辅导Python程序设计、Python编程调试讲解数据库SQL|解析Haskell程序