# STAT3006 Assignment 3 Classification

STAT3006 Assignment 3, 2022

Classification
Weighting 30% - due 18/10/2022
This assignment involves constructing and assessing classifiers. We will first consider the
now familiar Iris dataset, collected by Anderson (1936) and first statistically analysed by
Fisher (1936), which requires one to solve the species problem, that is to predict which
(known) species a given specimen belongs to.
The second dataset to be used to train classifiers is the Modified National Institute of
Standards and Technology (MNIST) database of handwritten digits. This contains 70,000
images, of which 10,000 were reserved for testing. In each case you will use the given
labelled data and attempt to construct classifiers which can accurately classify unlabelled
observations.
You should select four classifiers, with at least one classifier based on a probability model,
and one which is not, with each preferably mentioned in this course. Classifiers discussed in
the course which are based on a probability model include linear, quadratic, mixture and
kernel density discriminant analysis. Classifiers discussed which are not based on a
probability model include k nearest neighbours, classification trees and support vector
machines. All of these are implemented via various packages in R. If you wish to use a
different method, please check with the lecturer. You cannot use a (classifier, dataset)
combination that you have used or are using for an assignment in another course.
1. Assume that we will perform linear discriminant analysis for three classes in two
dimensions. Additionally assume that we know the class proportions are equal, that common
covariance matrix is known to be Σ = � 𝜎𝜎1
2 𝜌𝜌 𝜎𝜎1𝜎𝜎2
𝜌𝜌 𝜎𝜎1𝜎𝜎2 𝜎𝜎2
2 � and that the class means are also
known to be 𝜇𝜇𝐴𝐴 = �
𝜇𝜇
𝜇𝜇
𝐴𝐴1
𝐴𝐴2�, 𝜇𝜇𝐵𝐵 = �
𝜇𝜇
𝜇𝜇
𝐵𝐵1
𝐵𝐵2� and 𝜇𝜇𝐶𝐶 = �
𝜇𝜇𝐶𝐶1
𝜇𝜇𝐶𝐶2�, respectively, where all the mean
scalar parameters are real-valued, the standard deviations are positive and the correlation 𝜌𝜌 ∈
[−1,1].
(a) Prove that the decision boundaries between the pairs of classes meet at a single point
under most conditions, and state those conditions. [4 marks]
(b) Determine this point in terms of the given parameters. [1 mark]
2. Apply one probability-based and one non-probability-based classifier to the Iris dataset
(with all 4 predictor dimensions) using R, report the results and interpret them.
Results for each classifier should include the following:
(a) Characterisation of each class as modeled by the classifier. Note that, where possible, this
should include parameter estimates for each class. Where not possible, another attempt
should be made to characterise the class, such as via summaries of the observations which the
classifier puts into each class. [1 mark]
(b) Cross-validation (CV)-based estimates of the overall and class-specific error rates:
obtained by training the classifier on a large fraction of the whole dataset and then applying it
to the remaining data and checking error rates. You may use 5-fold, 10-fold or leave-one-out
cross-validation to estimate performance, but you should give a statistical reason for your
choice. Also include an approximate 95% confidence interval for each error rate, along with a
description of how this was obtained. [2 marks]
(c) Find, list and discuss any Iris observations which were misclassified in the CV checks. [1
mark]
(d) Plots of the predicted classes as they apply to the data and the data space, including visual
representation of the decision boundaries, covering all unique pairs of explanatory variables.
Note: you do not need to derive these boundaries – they can emerge from a plot. [1 mark]
(e) Compare and contrast the decision boundaries between classes produced by the two
methods and try to explain their shapes. Which method do you think was best for this
dataset? Explain. Describe some aspects of either method that you think are appropriate or
inappropriate for this classification problem. [2 marks]
(f) You are now asked to predict the class of new observations collected from an area where
the class proportions have changed to 0.5, 0.2 and 0.3 for setosa, virginica, and versicolor,
respectively.
Describe (with mathematical details) how you would change or refit each classifier to give it
the lowest possible expected cost of prediction under these circumstances. Change or refit the
classifiers as necessary to do this and report point estimates of the new classifier parameters,
or class characterisation, as in (a). Explain the nature of the changes. [3 marks]
Note: we will not compare our classification results with those of Fisher’s 1936 paper “The
Use of Multiple Measurements in Taxonomic Problems”. However, this paper is worth
reading that paper for background on the dataset and some of the aims of its analysis.
3. Choose two methods of classification that you have not used on the Iris dataset and apply
them to the MNIST dataset – see http://yann.lecun.com/exdb/mnist/ . Leave the train and test
split as it is, but feel free to use some of the training data to help choose a model, if desired.
Aim for best possible predictive performance, but view this as primarily a learning exercise.
I.e. you do not need to choose methods with the best performance. However, you should aim
to get reasonable performance out of any method chosen, e.g. with reasonable choice of any
hyperparameters (say 20% error rate or less). You should not pre-process the data in a way
which makes use of any knowledge you have of the digit recognition problem. I.e. don’t try
to produce new explanatory variables which represent image features, even though this would
likely help performance. You can use dimension reduction if you wish (e.g. PCA).
(a) Give a brief introduction to the dataset, including quantitative aspects. [1 mark]
(b) Give a summary of the predictive performance on the test set for each classifier. Make
sure you do not use the test set at all before doing this. Include at least estimated overall error
rate and class-specific error rates, along with approximate 95% confidence intervals for these.
[2 marks]
(c) For each classifier, also report error rates as estimated using the training set. Attempt to
explain any differences between the error rates estimated from the training and test sets. Note
that reference to training and test sets here are to the labelling of the original data, not to how
you may have used them. [2 marks]
(d) Explain why you chose each classifier type and describe some of their apparent strengths
and weaknesses for this problem. [2 marks]
(e) For each classifier, show 1 example per possible digit (i.e. 20 in total) of handwritten
digits which were classified into the correct class with the most certainty, and quantify what
you mean by certainty. Explain why you think the classifiers were particularly successful at
classifying these correctly and with certainty. Note: defining and implementing “certainty”
may take some thought, creativity, maths and coding. [3 marks]
(f) For each classifier, show 1 example per digit (i.e. 20 in total) of the worst errors made by
your classifier and quantify what you mean by worst. Explain why you think some of these
errors may have been made by your classifier and been among the worst seen. [2 marks]
(g) What is the difference between a handwritten 7 and a 1 according to each classifier? Try
to explain what each classifier is doing in this case, i.e. what are the main things the classifier
considers to make this decision and how are they used? [3 marks]
Notes:
(i) Some R commands you might find useful:
objects() – gives the current list of objects in memory.
attributes(x) – gives the set of attributes of an object x.
(ii) For the Iris dataset, we will assume that each species was collected from an environment
where all three are equally likely to be selected in a random sample. We can view the sample
as representative and the prevalence of each species is similar in some environments. (See
section VI of Fisher, 1936 for some details on how the observations were collected.)
(iii) Make it a habit to give reasons or justifications for decisions or statements.
(iv) Please put all the R commands in a separate file or files and submit these separately via a
single text file or a zip file. You should not give any R commands in your main report and
should not include any raw output – i.e. just include figures from R (each with a title, axis
labels and caption below) and put any relevant numerical output in a table or within the text.
marking.
submit should be your own work. Even where working from sources, you should endeavour
to write in your own words. Use consistent notation throughout your assignment and define
all of it.
(vii) Some references
R:
Maindonald, J. and Braun, J. Data Analysis and Graphics Using R - An Example-Based
Approach, 3rd edition, Cambridge University Press, 2010.
Venables, W.N. and Ripley, B.D., Modern Applied Statistics with S, Fourth Edition, Springer,
2002.
Wickham, H. and Grolemund, G. R for Data Science, O'Reilly, 2017.
Classification and Clustering:
Bishop, C. Pattern Recognition & Machine Learning, Springer, 2006.
Devroye, L., Gyorfi, L. and Lugosi, G., A Probabilistic Theory of Pattern Recognition,
Springer, 1996.
Duda, R.O., Hart, P.E. and Stork, D.G., Pattern Classification, Wiley, 2001.
Goodfellow, I., Bengio, Y. & Courville, A. (2016), Deep Learning, MIT Press
Hardle, W.K. and Simar, L., Applied Multivariate Statistical Analysis, 4th ed., Springer,
2015.
Hastie, T. and Tibshirani, R. Discriminant analyses by Gaussian mixtures, Journal of the
Royal Statistical Society B, 8, 155-176. (MDA paper), 1996.
Hastie, T. and Tibshirani, R. and Friedman, J. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction, 2nd edition, Springer, 2009.
McLachlan, G.J. and Peel, D. Finite Mixture Models, Wiley, 2000.
McLachlan, G.J. Discriminant Analysis and Statistical Pattern Recognition, Wiley, 1992.
Scholkopf, B. and Smola, A. J. Learning with Kernels, MIT Press, 2001.
Data Sources:
Anderson, E. The species problem in Iris, Annals of the Missouri Botanical Garden 23 (3):
457–509, 1936.
Fisher, R.A. The use of multiple measurements in taxonomic problems, Annals of Eugenics 7
(2): 179–188, 1936.
LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. Gradient-Based Learning Applied to
Document Recognition, Proceedings of the IEEE, 86, 2278-2324, 1998.