COMP9417讲解、Data Mining辅导、辅导Python、Python编程设计调试辅导R语言编程|调试Matlab程序

COMP9417 Machine Learning and Data Mining – Final Examination
1. TIME ALLOWED — 24 HOURS
2. THIS EXAMINATION PAPER HAS 14 PAGES
3. TOTAL NUMBER OF QUESTIONS — 5
4. ANSWER ALL 5 QUESTIONS
5. TOTAL MARKS AVAILABLE — 100
6. OPEN BOOK EXAM - LECTURE NOTES, TUTORIALS, AND ONLINE RESOURCES
ARE PERMITTED. PLEASE USE REFERENCES WHERE NECESSARY.
7. SUBMISSION — YOU WILL SUBMIT A SINGLE PDF FILE answers.pdf
CONTAINING YOUR ANSWERS FOR ALL QUESTIONS ATTEMPTED. START EACH
SUB-QUESTION ON A NEW PAGE. MARKS MAY BE DEDUCTED FOR UNCLEAR
WORK. YOU MAY TYPE YOUR SOLUTIONS USING LATEX, OR TAKE CLEAR
PHOTOS OF HANDWRITTEN WORK, OR MAKE USE OF A DIGITAL TABLET.
FOR QUESTIONS THAT REQUIRE CODING, YOU WILL SUBMIT A SINGLE
PYTHON FILE solutions.py (SEE TEMPLATE) CONTAINING YOUR CODE,
THOUGH ALL GENERATED PLOTS/TABLES MUST BE INCLUDED IN THE
PDF FILE. CODE MUST ONLY BE SUBMITTED IN THE PYTHON
FILE solutions.py.
8. DISCUSSION WITH OTHER STUDENTS IS STRICTLY PROHIBITED. CODE
SUBMISSIONS WILL BE CHECKED FOR PLAGIARISM. CHEATING WILL RESULT
IN A FAILING GRADE FOR THE COURSE AND POTENTIAL FURTHER
1 Please see over
DISCIPLINARY ACTION.
9. IF NEEDED, YOU ARE PERMITTED TO SEEK CLARIFICATION ON THE EXAM
WEBCMS FORUM. QUESTIONS SPECIFIC TO CONTENT WILL NOT BE ANSWERED
AND MAY BE REMOVED. OFFENDERS MAY HAVE EXAM MARKS DEDUCTED IF
INFORMATION IS DIVULGED ON THE FORUM DURING OR FOLLOWING THE
EXAM.
10. ENSURE YOUR ANSWER FORMATTING IS CLEAR AND LEGIBLE. BEGIN EACH
SUB-QUESTION ON A NEW PAGE WITH A CLEAR HEADING. HIGHLIGHT FINAL
ANSWERS AND MAKE USE OF DOT POINTS FOR DISCUSSION QUESTIONS. MARKS
MAY BE DEDUCTED FOR WORK THAT IS DIFFICULT TO READ OR UNDERSTAND.
11. SUBMIT YOUR EXAM ANSWERS AND SOLUTIONS VIA THE CSE GIVE SYSTEM.
THE COMMAND TO TO SUBMIT WILL BE:
give cs9417 exam answers.pdf solutions.py
SUBMISSION WILL BE OPEN FROM 09:00 AUSTRALIAN EASTERN STANDARD
TIME (AEST) FOR 24 HOURS.
12. BY STARTING THIS EXAM AS A STUDENT OF THE UNIVERSITY OF NEW
SOUTH WALES, YOU DO SOLEMNLY AND SINCERELY DECLARE THAT YOU HAVE
NOT SEEN ANY PART OF THIS SPECIFIC EXAMINATION PAPER FOR THE
ABOVE COURSE PRIOR TO ATTEMPTING THIS EXAM, NOR HAVE ANY DETAILS
OF THE EXAM’S CONTENTS BEEN COMMUNICATED TO YOU. IN ADDITION,
YOU WILL NOT DISCLOSE TO ANY UNIVERSITY STUDENT ANY INFORMATION
CONTAINED IN THE ABOVEMENTIONED EXAM, AND YOU WILL COMPLETE THE
EXAM ON YOUR OWN WITHOUT ANY EXTERNAL HELP. VIOLATION OF THIS
AGREEMENT IS CONSIDERED ACADEMIC MISCONDUCT AND PENALTIES MAY
APPLY.
2 Please see over
Question 1 [12 marks]
This question covers Naive Bayes and requires you to refer to the training data given in the
table below, which shows the number of times each of four words A, B, C and D occurs in each
of eight documents, four in the positive class, and four negative.
Document No. A B C D Class
Consider a new document x? with 1 occurrence of A, no occurrences of B, no occurrence of C,
and 2 occurrences of D. Round all answers to four decimal places.
(a) [4 marks]
Recall that if X = (X1, X2, X3, X4) ∼ Multinomial(p1, p2, p3, p4), then
P(X = (x1, x2, x3, x4)) =P4i=1 xi.
Construct a Naive Bayes classifier for the problem of predicting whether a document should be
classified as positive or negative using the multinomial distribution to model the probability of
a word occurring or not in a class. Under your model, what is the value of P(+|x?)? Do not use
smoothing.
(b) [2 marks]
Now, apply (add-1) smoothing to your probability estimates. Under the smoothed probabilities,
what is P(−|x?) under the multinomial model?
(c) [4 marks]
Recall that if X ∼ Bernoulli(p) then
P(X = x) = px(1 − p)1−x.
3 Please see over
Construct a Naive Bayes classifier for the problem of predicting whether a document should
be classified as positive or negative using the multivariate Bernoulli distribution to model the
probability of a word occurring or not in a class. What is p(+|x?) now? Do not use smoothing.
(d) [2 marks]
Now, apply (add-1) smoothing to your probability estimates. Under the smoothed probabilities,
what is P(−|x?) under the multivariate Bernoulli model?
4 Please see over
Question 2 [23 marks] In this question, we will apply gradient descent to a simulated dataset.
You may make use of numpy and matplotlib. You are not permitted to make use of any existing
numpy implementations of gradient descent (if they exist). Generate data using the following
Python code:
1 import numpy as np
2 import matplotlib . pyplot as plt
3
4 np . random . seed (42) # make sure you run this line for consistency
5 x = np . random . uniform (1 , 2 , 100)
6 y = 1.2 + 2.9 * x + 1.8 * x **2 + np . random . normal (0 , 0.9 , 100)
7 plt . scatter (x , y )
8 plt . show ()
Your data should look like the following:
Then, consider the loss function,
where c ∈ R is a hyper-parameter.
5 Please see over
(a) [3 marks] Consider the (simple) linear model ˆyi = w0 +w1xi
for i = 1, . . . , n. We can write
this more succinctly by letting w = (w0, w1)
T and Xi = (1, xi)
T
, so that ˆyi = w
T Xi
. The loss
achieved by our model (w) on a given dataset of n observations is then
Lc(y, yˆ) = Xni=1Lc(yi, yˆi) = Xni=1 "r1c2
(yi − wT Xi)2 + 1 − 1,
Compute the following derivatives:
∂Lc(yi, yˆi)∂w0
and ∂Lc(yi, yˆi)∂w1.
If you are unable to figure this out, you may use Wolfram Alpha or a similar program to compute
the derivatives in order to make use of them in later questions. Note that you must show your
working for full marks, so solutions using a tool like Wolfram Alpha will receive a mark of zero.
(b) [2 marks] Using the gradients computed in (a), write down the gradient descent updates
for w0 and w1 (using pseudocode), assuming a step size of α. Note that in gradient descent we
consider the loss over the entire dataset, not just at a single observation. For simplicity, assume
that your updates are always based on the values at the previous time step, even if you have
access to the value at the current time step (i.e., when updating multiple parameters, the update
for w
(t+1)
1 might depend on w0, so you could use the new value of w0, w
(t+1)
0
, since it has already
been computed, but here we assume that we just use the old value w
(t)
0
).
(c) [12 marks] In this section, you will implement gradient descent from scratch on the generated
dataset using the gradients computed in (a), and the pseudocode in (b). Initialise your
weight vector to w
(0) = [1, 1]T
, and let c = 2. Consider step sizes α ∈ [1, 0.1, 0.01, 0.001, . . . , 0.00000001]
(a total of 9 step sizes). For each step-size, generate a plot of the loss achieved at each iteration
of gradient descent. You should use 100 iterations in total. Generate a 3 × 3 grid of figures
showing performance for each step-size. Add a screen shot of the Python code for this question
in your answers PDF file (as well as pasting the code into your solutions.py file). The following
code may help with plotting:
1 fig , ax = plt . subplots (3 ,3 , figsize =(10 ,10) )
2 alphas = [10 e -1 , 10 e -2 , 10 e -3 ,10 e -4 ,10 e -5 ,10 e -6 ,10 e -7 , 10 e -8 , 10 e -9]
3 for i , ax in enumerate ( ax . flat ) :
4 # losses is a list of 9 elements . Each element is an array of length 100
storing the loss at each iteration for
5 # that particular step size
6 ax . plot ( losses [ i ])
7 ax . set_title ( f" step size : { alphas [i]}") # plot titles
8 plt . tight_layout () # plot formatting
9 plt . show ()
6 Please see over
(d) [1 mark] Comment on your results in part (c): what do you see happening for different
step sizes? Why does this occur?
(e) [3 marks] To find an appropriate step-size, re-run the gradient descent to find the optimal
model parameters. State this step-size, the final model, and generate two plots: the first for
the weights w0 and w1 over the 100 iterations, and the second of the data with the final model
super-imposed. What do you think of the final model? Are there any obvious ways to improve
it? Be sure to include the plots in your answers PDF file, and the code used in your solutions.py
file.
(f) [2 marks] Consider the following scenario: you re-run the analysis for various values of c
and you notice that the optimal step-size varies across different values of c. Describe an approach
to choosing an appropriate value of c.
7 Please see over
Question 3 [25 marks] This question covers the (kernel) perceptron and requires you to refer
to the following training data for parts (a)-(c). You are only permitted to make use of numpy
and matplotlib. You are not permitted to make use of any existing numpy implementations of
perceptrons (if they exist).
x1 x2 y
-0.8 1 1
3.9 0.4 -1
1.4 1 1
0.1 -3.3 -1
1.2 2.7 -1
-2.45 0.1 -1
-1.5 -0.5 1
1.2 -1.5 1
Table 1: Data for Question 3
(a) [3 marks] Plot the data in Table 1. Recall that the polynomial kernel is defined as
k(x, y) = (m + x
T
y)
d
for m ∈ {0, 1, 2, . . . } and d ∈ {1, 2, . . . }. Each such kernel corresponds to
a feature representation of the original data. Find the simplest polynomial kernel for which this
data becomes linearly separable (note: simplest in this case is defined as the polynomial kernel
with the smallest values of both m and d).
(b) [2 marks] The optimal kernel found in (a) induces a feature representation in R
p
for some
integer p determined by your choice of kernel. Choose a subset of two coordinates from this
p-dimensional vector and plot the transformed data. For example, for vector (w, x, y, z)
T ∈ R
4
,
a subset of two coordinates could be the first two coordinates (w, x), or the first and third
coordinates (w, y), etc.). Is your transformed data linearly separable?
(c) [10 marks] Train a kernel perceptron on this data with initial weight vector w
(0) = 1p
(the vector of ones in R
p
). Use a learning rate of η = 0.2. Note: this should be done in
numpy. Provide a table outlining all updates of the weight vector, and the iteration number at
which the update occured. State the final learned perceptron and the number of iterations until
convergence. Demonstrate that your perceptron correctly classifies each of the points. You may
use the following table as a template for presenting your results:
8 Please see over
Iteration No. w0 w1 . . . wp
0 1 1 . . . 1
first update iteration 1+δ0 1+δ1 . . . 1+δp
where δj
is the update for wj computed at the first iteration. To demonstrate that your perceptron
classifies each point correctly, use the following table:
xi φ(xi) yiφT(xi)w∗[−0.8, 1]T φ([−0.8, 1]T) r1 > 0
[1.2, −1.5]T φ([1.2, −1.5]T) r8 > 0
where ri = yiφT(xi)w∗
should be positive if your perceptron has converged. Along with your
table(s) of results, provide a screen shot of your code in your answers PDF file, and add the code
to your solutions.py file.
(d) [5 marks] Let x, y ∈ R
2(i.e., x and y are two dimensional vectors), and consider the kernel
k(x, y) = (2xTy + 3)3.
Compute the feature vector φ(x) correpsonding to this kernel (in other words, the feature representation
of x and y such that φ(x)
T φ(y) = k(x, y)).
(e) [5 marks]
Consider the following dataset. The positive examples (class = +1) are:
(2, 0), (0, 2), (1, 1),
and the negative examples (class = −1) are:
(−2, 0), (0, −2), (−1, −1),
Claim: An ordering of the examples in this dataset on which a perceptron (with learning rate
η = 1 ) makes at least 5 mistakes during training cannot exist.
If you agree with this claim, provide a proof of why it must be true. Otherwise, provide an
ordering of the examples such that the number of mistakes is achieved.
9 Please see over
Question 4 [20 marks] This question covers the Support Vector Machine and requires you to
refer to the following two-dimensional training data:
x1 x2 y
-3 9 -1
-2 4 -1
-1 1 -1
0 -3 1
1 1 1
2 4 -1
3 9 -1
(a) [1 mark] Plot the data (no need to provide any code if you do this in Python). Is the data
linearly separable?
(b) [11 marks] In this section, we will build an SVM classifier by hand. Recall that the dual
formulation of the SVM classifier for a dataset of size n is
arg max, subject to αi ≥ 0, for all i, Xn
i=1
αiyi = 0.
Solve for (α1, . . . , α7) for the provided dataset. (Hint: Using the plot in the previous section
and your knowledge of SVMs, try to simplify the (dual) objective before doing any calculus.)
(c) [3 marks] For your SVM, compute the corresponding weight vector (w ∈ R
2
) and bias t.
Superimpose a plot of the SVM onto the scatter in (a). What is the margin for this model?
(d) [2 marks] Discuss briefly the following claim: Linear classifiers are unable to represent
non-linear functions.
(e) [3 marks] In your own words, explain what is meant by the ‘kernel trick’. Why is this such
an important technique for machine learning?
10 Please see over
Question 5 [20 marks] In this question, we will consider the Scikit-learn implementations of
the following classifiers:
• Decision Trees
• Random Forest
• AdaBoost
• Logistic Regression
• Multilayer Perceptron (Neural Network)
• Support Vector Machine
You are required to compare the performance of the above models on a binary classification task.
The following code loads in these classifiers and defines a function to simulate a toy dataset:
1 import numpy as np
2 import matplotlib . pyplot as plt
3 from matplotlib . colors import ListedColormap
4 import warnings
5 warnings . simplefilter ( action =’ignore ’, category = FutureWarning )
6
7 import time
8 from sklearn . svm import SVC
9 from sklearn . linear_model import LogisticRegression
10 from sklearn . ensemble import AdaBoostClassifier
11 from sklearn . ensemble import RandomForestClassifier
12 from sklearn . tree import DecisionTreeClassifier
13 from sklearn . neural_network import MLPClassifier
14 from sklearn . model_selection import train_test_split
15 from sklearn . preprocessing import StandardScaler
16 from sklearn . datasets import make_classification
17
18 def create_dataset () :
19 X , y = make_classification ( n_samples =1250 ,
20 n_features =2 ,
21 n_redundant =0 ,
22 n_informative =2 ,
23 random_state =5 ,
24 n_clusters_per_class =1)
25 rng = np . random . RandomState (2)
26 X += 3 * rng . uniform ( size = X . shape )
27 linearly_separable = (X , y )
28 X = StandardScaler () . fit_transform ( X )
29 return X , y
11 Please see over
(a) [3 marks] Generate a dataset. Then, randomly split the dataset into training set Xtrain
and test set Xtest, with 80 examples for training and 20 for testing. Plot the decision boundaries
of each of the classifiers on the test set. You may wish to use the following plotter function which
plots the decision boundary and can work for any sklearn model.
1 def plotter ( classifier , X , X_test , y_test , title , ax = None ) :
2 # plot decision boundary for given classifier
3 plot_step = 0.02
4 x_min , x_max = X [: , 0]. min () - 1 , X [: ,0]. max () + 1
5 y_min , y_max = X [: , 1]. min () - 1 , X [: ,1]. max () + 1
6 xx , yy = np . meshgrid ( np . arange ( x_min , x_max , plot_step ) ,
7 np . arange ( y_min , y_max , plot_step ) )
8 Z = classifier . predict ( np . c_ [ xx . ravel () , yy . ravel () ])
9 Z = Z . reshape ( xx . shape )
10 if ax :
11 ax . contourf ( xx , yy , Z , cmap = plt . cm . Paired )
12 ax . scatter ( X_test [: , 0] , X_test [: , 1] , c = y_test )
13 ax . set_title ( title )
14 else :
15 plt . contourf ( xx , yy , Z , cmap = plt . cm . Paired )
16 plt . scatter ( X_test [: , 0] , X_test [: , 1] , c = y_test )
17 plt . title ( title )
Note that you can use the same general approach for plotting a grid as used in Question 2,
and the plotter function supports an ‘ax’ argument. Note that this plotter function differs
slightly from the one in the sample final. Be sure to include all your code in solutions.py,
and any figures generated in your answers PDF file.
(b) [10 marks] Now, test how the performance of each of the classifiers varies as you increase
the size of the training set. Fix your training and test sets from part (a). Then, starting from a
random subset (chosen with replacement) of your training set of size 50, train your classification
model, and compute the accuracy on the test set. Repeat this process for training set sizes of
[50, 100, 200, 300, . . . , 1000]. Repeat the experiment a total of 10 times for each classifier. Then,
for each classifier, plot its average accuracy at each training set size. Compare the accuracy
across different algorithms in a single figure, and in 5 lines or less, discuss your observations.
Please use the following color code for your plots: [Decision Tree, Random Forest, AdaBoost,
Logistic Regression, Neural Network, SVM]. Be sure to include all your code in solutions.py, and
any figures generated in your answers PDF file.
12 Please see over
(c) [7 marks] Using the time module, record the training time for each of the classifiers at
each of the training set sizes. Plot the average training time over the 10 trials of each classifier
as a function of the training size. You may add code for this section to your code in part (b).
What do you observe? In 5 lines or less, discuss your observations. Use the same color scheme
as in (b). Be sure to include your code in solutions.py, and any figures generated in your answers
PDF file.