CSC 503辅导、辅导Data Mining、Python语言讲解、Python程序设计调试讲解SPSS|讲解留学生Processing

Data Mining (CSC 503/SENG 474)
Instructions:
• You must complete this assignment on your own; this includes any coding/implementing,
running of experiments, generating plots, analyzing results, writing up results, and working
out problems. Assignments are for developing skills to make you strong. If you do the
assignments well, you’ll almost surely do better on the final and beyond.
• On the other hand, you can certainly have high-level discussions with classmates and can
post questions to the conneX forum. You can also discuss preprocessing (data normalization)
with classmates. You also are welcome to come to office hours (or otherwise somehow ask me
and the TAs things if you are stuck).
• You must type up your analysis and solutions; I strongly encourage you to use LaTeX to do
this. LaTeX is great both for presenting figures and math.
• Please submit your solutions via conneX by the due date/time indicated above. This is a
hard deadline. Any late assignment will result in a zero (I won’t accept assignments even
one day late; “I forgot to hit submit on conneX”, etc. will not be met with any sympathy).
However, I will make an exception if I am notified prior to the deadline with an acceptable
excuse and if you further can (soon thereafter) provide a signed note related to this excuse.
1
Introduction
This assignment has two parts. The first part involves implementing and applying (or finding
and applying) some of the methods you have already seen and analyzing their results on a binary
classification problem; this part is for all students (both CSC 503 and SENG 474). The theme of
this part is using cross-validation for hyperparameter tuning and model selection. The second part
is actually only for graduate students (only CSC 503) and is about training logistic regression with
stochastic gradient descent.
1 Experiments and Analysis
First, “implement” the following methods:
• Logistic regression. Your implementation should allow for some form of regularization.
I suggest going with l2 regularization. This means that you penalize the training objective
according to kwk
2
, i.e., the sum of the squares of the weights (not including the bias term).
To be clear, for binary labels yi ∈ {0, 1}, the suggested training objective is:
where:
• `(y, yˆ) = −y log ˆy − (1 − y) log(1 − yˆ) is the cross-entropy loss;
• σ(z) = 1
1+e−z is the sigmoid function;
is the squared Euclidean norm of the vector w.
Here, C ≥ 0 is the regularization parameter. Higher C means less regularization (more
emphasis given to the training error).
• Support vector machines. This is the soft-margin version of SVM (the one that works
for data that aren’t necessarily linearly separable). Your implementation should allow for
specifying a kernel function (you’ll need the linear kernel and Gaussian kernel). Please see
this page for a description of the soft-margin SVM problem, along with the regularization
parameter that is used: https://scikit-learn.org/stable/modules/svm.html#svc.
The relevant part is the equation following the text “SVC solves the following primal problem”.
Note that the regularization parameter C plays the same role as it did in logistic regression.
Higher C means less regularization.
• k-fold cross-validation. Please implement this yourself so that you actually know how
it works. Various machine learning/statistics software includes it, but I still want you to
implement it yourself. We will be checking that you coded this up. For all tasks involving
using k-fold cross-validation, you will use the implementation you coded up.
I put “implement” in quotes because I won’t actually require you to implement logistic regression
and SVMs; you can instead use machine learning software that you find online. For any
implemenations you do, you can use whatever programming language you like. Logistic regression
is implemented in scikit-learn. For SVMs in particular, it is quite difficult to implement a method
for learning “kernelized SVMs”, i.e., SVMs that use kernel functions. I therefore very strongly
recommend that you use an existing implementation. For example, scikit-learn has SVMs and lets
you specify the kernel.
2
What to test on
You’ll be analyzing the performance of these methods on a somewhat-recently devised classification
problem, known as Fashion-MNIST. Basically, the idea is to classify 28 × 28 grayscale images of
clothing/apparel items, where each image belongs to one of ten classes. You can read all about it
here if you would like to know more:
https://github.com/zalandoresearch/fashion-mnist
You can obtain the dataset from the previous link. Alternatively, you can use this direct link:
https://github.com/zalandoresearch/fashion-mnist/archive/master.zip
For instructions on how to load the dataset, use this README:
https://github.com/zalandoresearch/fashion-mnist/blob/master/README.md
I recommend the section titled “Loading data with Python (requires Numpy)”. Moreover,
you should be able to use scikit-learn for this assignment. If you encounter any trouble loading the
data, please see Appendix A before contacting the instructor/TAs.
As mentioned above, Fashion-MNIST corresponds to a multi-class problem with ten classes;
however, this assignment is about binary classification. Therefore, once you load the dataset, use
only the data points (from both the training and test sets) from classes 5 and 7 (corresponding
to “Sandal” and “Sneaker” respectively). You can let class 5 from Fashion-MNIST be class 0 of
your binary classification problem (and class 7 from Fashion-MNIST can be class 1 of your binary
classification problem). To make training easier, I recommend rescaling the inputs by dividing them
by 255, so they lie in the range [0, 1]. This rescaling is a form of preprocessing. You (optionally)
can also try other preprocessing techniques. You can feel free to ask classmates about preprocessing
techniques that work well or don’t work well.
How to do the analysis
Now that you have implementations and the data prepared, it’s time for the main part of the
assignment: the experiment analysis. Your analysis will go into a file called report.pdf and
consists of the following parts, which should be presented in separate sections in this order:
First. For logistic regression, plot how the training and test error change as you vary the regularization
parameter C. For training, use the whole training set (all 12,000 examples), and for testing,
use the whole test set (all 2,000 examples). You should try at least ten values of the regularization
parameter. I suggest using a logarithmically spaced grid for initial value C0 > 0 and base α > 1,
so you try C0, αC0, α
2C0, and so on. The values you try ideally should capture the regime of
both underfitting (too much regularization) and overfitting (too little regularization). Discuss the
results.
Second. For linear SVM (SVM with the linear kernel), plot how the training and test error
change as you vary the regularization parameter C. I suggest proceeding as you did for logistic
regression, but note that the range of values you try for the regularization parameter likely will
need to be different. Again, discuss your results.
Third. Now that you have some understanding of the regularization parameter of each algorithm,
it’s time to do some hyperparameter tuning. The goal is to do a fair, head-to-head comparison of
an optimally regularized logistic regression classifier and an optimally regularized linear SVM. For
this, using only the training data, use k-fold cross-validation (for k being some integer between 5
3
and 10) to select the optimal value of the logistic regression regularization parameter. Likewise,
use k-fold cross-validation to select the optimal value of the SVM regularization parameter. Now
that you have optimal regularization parameters for each, train each optimally-tuned method on
the entire training set and report the method’s test error. Discuss the results. Are the test errors
significantly different? Think about confidence intervals (which depend on the test set size) to
assess significance.
Fourth. The final part of the assignment is to go beyond linear classifiers. For this, you’ll use
kernelized SVMs with the Gaussian kernel. This kernel has a scale parameter (sometimes called
a “bandwidth” parameter). For each value γ in a good range of values of the scale parameter
(I suggest using a logarithmic spacing as you did with the regularization parameter), use k-fold
cross-validation on the training set to select the optimal regularization parameter Cγ. Now, for
each (γ, Cγ) pair, you have a tuned SVM that can be trained. For each tuned SVM, train on the
full training set and compute both the training error and the test error. Finally, plot the results.
The x-axis will be γ, and the y-axis will be the training error (or test error) based on an SVM with
parameters tuned to (γ, Cγ).
How do these test errors compare to the test error of linear SVM (with regularization parameter
tuned via cross-validation on the training set, which you already did)?
Advice if you training an SVM is taking too long
You might find that training an SVM can take a long time. This is especially the case when using
the Gaussian (rbf) kernel. In order to complete the assignment within a reasonable amount of time,
it is OK if you reduce the size of the original training set. If you do this, please use the same
reduced-size training set throughout your assignment.
For example, each of classes 5 and 7 consists of 6000 training examples (for 12000 training
examples total). To help with reducing the training time, you could only use 3000 training examples
for each class (for 6000 classes total). If that still takes too long, you might even go as low as using
only 1000 training examples for each class.
Final advice on the report.
Please make your analysis concise (yet thorough). Don’t ramble, and don’t make stuff up. Act as
if you are a respected scientist presenting some work to your colleagues (and you want them to
continue being your colleagues after the presentation).
What to submit
In all, for the analysis portion of the assignment, you should submit (as a zip file):
• the file report.pdf explained above;
• a file called README.txt which contains instructions for running your code (for any code you
wrote) or gives proper attribution to any code/datasets you got from other sources (like the
Internet, for example). If you mostly used some existing code/software but needed to modify
a few files, then give attribution and mention the files you modified.
• a file called code.zip, containing your code. Please organize this file well (embrace the idea
of directories). In case you mostly used some existing code/software but needed to modify a
few files, then just provide those files here, and make sure your README.txt mentions those
files in relation to the existing code/software.
• any additional files that you need.
4
2 Logistic regression and SGD - CSC 503 only
(undergrads, see Section 3 below)
In class, we saw that training a logistic regression model involves minimizing the training error
according to the cross-entropy loss, and we typically further add to the objective an l2 regularization
term.
Implement1
stochastic gradient descent (SGD) for regularized logistic regression, where you
allow the learning rate to change in each round (note: this is basically one line of code!). Therefore,
instead of having a fixed learning rate η, in round t the learning rate will be ηt
. Note that each
choice of the regularization parameter and learning rate schedule corresponds to a different logistic
regression method. For a fixed value of the regularization parameter:
Experiment with different schedules for the learning rate. One schedule, for example, is
ηt =
α
t0+t
, for different choices of α > 0 and t0 > 0. It would be fine if you limit yourself
to this schedule and focus on finding a good value of α and t0. A good strategy could
be to use cross-validation to select from a logarithmically spaced grid for each of α and
t0 (so, a grid in two dimensions). Once you select these, a good way to proceed is to
train the resulting SGD method on the entire training set and compute its test error.
Now, repeat the above for a few different values of the regularization parameter and report the results.
A good idea, if you can run experiments fast enough, is to select the value of the regularization
parameter using cross-validation on the training set.
Report your results by augmenting report.pdf with an additional section.
3 How you’ll be marked
• For each of CSC 503 and SENG 474, the total number of marks is 100, and you receive 10
marks for your code
• For undergrads (SENG 474), the analysis (in report.pdf) is worth 90 marks.
• For grad students (CSC 503), the analysis in part one of report.pdf is worth 70 marks and
the analysis in part two of report.pdf is worth 20 marks.
1
In this case, you actually do need to write code yourself.
5
A Loading Fashion-MNIST
For loading the Fashion-MNIST data, I believe everything should work fine if you are using Python
in a terminal. However, in case you have difficulty loading the data from a Jupyter notebook, then
you can also get the data in CSV format here (you will have to get a Kaggle account to download
the csv files)