首页 > > 详细

辅导ISYE 6740解析Python语言程序

ISYE 6740 Homework 3 
Total 100 points. 
1. Basic optimization. (30 points.) 
Consider a simplified logistic regression problem. Given m training samples (xi, yi), i = 1, . . . ,m. 
The data xi ∈ R (note that we only have one feature for each sample), and yi ∈ {0, 1}. To fit a 
logistic regression model for classification, we solve the following optimization problem, where θ ∈ R 
is a parameter we aim to find: 
max θ (θ), (1) 
where the log-likelhood function 
`(θ) = m∑ i=1 {− log(1 + exp{−θxi}) + (yi − 1)θxi} . 
(a) (10 points) Show step-by-step mathematical derivation for the gradient of the cost function `(θ) 
in (1) and write a pseudo-code for performing gradient descent to find the optimizer θ∗. This is 
essentially what the training procedure does. (pseudo-code means you will write down the steps 
of the algorithm, not necessarily any specific programming language.) 
(b) (10 points) Present a stochastic gradient descent algorithm to solve the training of logistic 
regression problem (1). 
(c) (10 points) We will show that the training problem in basic logistic regression problem 
is concave. Derive the Hessian matrix of `(θ) and based on this, show the training problem (1) 
is concave (note that in this case, since we only have one feature, the Hessian matrix is just a 
scalar). Explain why the problem can be solved efficiently and gradient descent will achieve a 
unique global optimizer, as we discussed in class. 
2. Comparing Bayes, logistic, and KNN classifiers. (30 points) 
In lectures we learn three different classifiers. This question is to implement and compare them. We are 
suggest use Scikit-learn, which is a commonly-used and powerful Python library with various machine 
learning tools. But you can also use other similar library in other languages of your choice to perform 
the tasks. 
Part One (Divorce classification/prediction). (20 points) 
This dataset is about participants who completed the personal information form and a divorce predic- 
tors scale. 
The data is a modified version of the publicly available at https://archive.ics.uci.edu/ml/datasets/ 
Divorce+Predictors+data+set (by injecting noise so you will not replicate the results on uci web- 
site). There are 170 participants and 54 attributes (or predictor variables) that are all real-valued. The 
dataset marriage.csv. The last column of the CSV file is label y (1 means “divorce”, 0 means “no 
divorce”). Each column is for one feature (predictor variable), and each row is a sample (participant). 
A detailed explanation for each feature (predictor variable) can be found at the website link above. 
Our goal is to build a classifier using training data, such that given a test sample, we can classify (or 
essentially predict) whether its label is 0 (“no divorce”) or 1 (“divorce”). 
Build three classifiers using (Naive Bayes, Logistic Regression, KNN). Use the first 80% data for 
training and the remaining 20% for testing. If you use scikit-learn you can use train test split to split 
the dataset. 
(a) (10 points) Report testing accuracy for each of the three classifiers. Comment on their perfor- 
mance: which performs the best and make a guess why they perform the best in this setting. 
(b) (10 points) Use the first two features to train three new classifiers. Plot the data points and 
decision boundary of each classifier. Comment on the difference between the decision boundary 
for the three classifiers. Please clearly represent the data points with different labels using different 
colors. 
Part Two (Handwritten digits classification). (10 points) Repeat the above using the MNIST 
Data in our Homework 2. Here, give “digit” 6 label y = 1, and give “digit” 2 label y = 0. All the 
pixels in each image will be the feature (predictor variables) for that sample (i.e., image). Our goal 
is to build classifier to such that given a new test sample, we can tell is it a 2 or a 6. Using the first 
80% of the samples for training and remaining 20% for testing. Report the classification accuracy on 
testing data, for each of the three classifiers. Comment on their performance: which performs the best 
and make a guess why they perform the best in this setting. 
3. Naive Bayes for spam filtering. (40 points) 
In this problem we will use the Naive Bayes algorithm to fit a spam filter by hand. This will en- 
hance your understanding to Bayes classifier and build intuition. This question does not involve any 
programming but only derivation and hand calculation. 
Spam filters are used in all email services to classify received emails as “Spam” or “Not Spam”. A 
simple approach involves maintaining a vocabulary of words that commonly occur in “Spam” emails 
and classifying an email as “Spam” if the number of words from the dictionary that are present in the 
email is over a certain threshold. We are given the vocabulary consists of 15 words 
V = {secret, offer, low, price, valued, customer, today, dollar, million, sports, is, for, play, healthy, pizza}. 
We will use Vi to represent the ith word in V . As our training dataset, we are also given 3 example 
spam messages, 
• million dollar offer 
• secret offer today 
• secret is secret 
and 4 example non-spam messages 
• low price for valued customer 
• play secret sports today 
• sports is healthy 
• low price pizza 
Recall that the Naive Bayes classifier assumes the probability of an input depends on its input feature. 
The feature for each sample is defined as x(i) = [x 
T , i = 1, . . . ,m and the class of the 
ith sample is y(i). In our case the length of the input vector is d = 15, which is equal to the number 
of words in the vocabulary V . Each entry x 
j is equal to the number of times word Vj occurs in the 
i-th message. 
(a) (5 points) Calculate class prior P(y = 0) and P(y = 1) from the training data, where y = 0 
corresponds to spam messages, and y = 1 corresponds to non-spam messages. Note that these 
class prior essentially corresponds to the frequency of each class in the training sample. 
(b) (10 points) Write down the feature vectors for each spam and non-spam messages. 
(c) (15 points) In the Naive Bayes model, assuming the keywords are independent of each other (this 
is a simplification), the likelihood of a sentence with its feature vector x given a class c is given 
where 0 ≤ θc,k ≤ 1 is the probability of word k appearing in class c, which satisfies 
Given this, the complete log-likelihood function for our training data is given by 
(In this example, m = 7.) Calculate the maximum likelihood estimates of θ0,1, θ0,7, θ1,1, θ1,15 by 
maximizing the log-likelihood function above. (Hint: We are solving a constrained maximization 
problem. To do this, remember, you only need to introduce one Lagrangian multiplier because 
you only have one constraint.) 
(d) (10 points) Given a test message “today is secret”, using the Naive Bayes classier that you have 
trained in Part (a)-(c), to calculate the posterior and decide whether it is spam or not spam. 
联系我们
  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-21:00
  • 微信:codinghelp
热点标签

联系我们 - QQ: 99515681 微信:codinghelp
程序辅导网!