讲解CMP-6002B Machine Learning

School of Computing Sciences
Module: CMP-6002B Machine Learning
Assignment: Classification with Decison Trees
Value: 50%
Date due: Wednesday 18th May 2022 3pm
Returned by: Wednesday 17th June 2022
Submission: Blackboard
Learning outcomes
The students will learn about implementing variants of decision trees and ensemble techniques
and the general issues involved with implementing any classifier. They will understand better
how to evaluate and compare classifiers on a range of data and for a specific problem. They will
appreciate the difficulty in presenting technical results in a way to support or refute a hypothesis.
The exercises will improve the following transferable skills: programming in Java; analysis
of data and results; writing clearly and concisely; presenting data coherently. The students will
also learn the basics of how to use GitHub to work with open source software.
Specification
Overview
The task for this assignment is to implement components of decision trees and ensembles,
then to evaluate/compare them on a range of real world data.
1
Part 1: Building a Tree by Hand (10%)
The purpose of this exercise is to make sure you understand how the basic algorithms work and
to provide you with a bench check for later implementations.
The data in Table 1 shows a training data set for a classification problem involving predicting
the region of a whisky based on the three taste attributes: Peaty; Woody; and Sweet. Train a
Table 1: Whisky Region Classification Example
Peaty Woody Sweet Region
yes no yes Islay
yes yes yes Islay
yes no no Islay
yes no no Islay
no yes no Islay
no yes yes Speyside
no yes yes Speyside
no yes yes Speyside
no no yes Speyside
no no yes Speyside
decision tree on this data using the information gain as a splitting criteria.
Construct decision trees for this data using
1. the chi-squared statistic as the splitting criterion;
2. the information gain as the splitting criterion.
The stopping criteria for both trees should be to stop at a pure node (node of just one class) or
if there are no more attributes to use. Include all calculations and draw the final decision tree.
2
Part 2: Implement Variants of Decision Trees and Ensembles (50%)
This exercise involves implementing, debugging and testing variants of decision tree classifiers.
A java code base, tsml, is provided. To get started, you need to clone the ml6002b-coursework
branch on the tsml GitHub repository. There is guidance on how to do this on blackboard.
Please follow method and class names specified in the instructions. Failure to do so may result
in a marks penalty.
Static functions to help assess attribute quality (15%)
Implement a class called AttributeMeasures in the package ml 6002b coursework that contains
four static methods, each of which measures the quality of an attribute split at a node.
They should all take a two dimensional array of integers as an argument and return a double.
You can assume that the rows represent different values of the attribute being assessed, and
the columns the class counts. The methods are:
1. measureInformationGain returns the information gain for the contingency table.
2. measureInformationGainRatio returns the information gain ratio for the contingency
table.
3. measureGini returns the gini measure for the contingency table.
4. measureChiSquared returns the chi-squared statistic for the contingency table.
The formal definitions for information gain, gain ratio, gini and chi-squared are given in the lecture
on Decision Trees, and I want you to follow these formula. Comment the code to indicate
how you have dealt with any edge cases. Include a main method test harness that tests each
function by finding each measure for the attribute Peaty in terms of the diagnosis. Print each
measure to the console, in the form “measure for Peaty = ”.
Alternative attribute selection methods for CourseworkTree (15%)
Your task is to facilitate using your attribute measures as selection mechanisms within the
CourseworkTree classifier provided in the package ml 6002b coursework. It is advisable to
revisit the lab sheet on decision trees and to look at the original source for ID3. There is an
interface called AttributeSplitMeasure that is used in CourseworkTree. You need to implement
variants of this interface and use them within the CourseworkTree. The interface
contains a single abstract method, computeAttributeQuality, that takes an Instances and
an Attribute, that should return the quality of the attribute passed. There is also a default
method splitData to split Instances by the Attribute.
1. Implement and test the skeleton class IGAttributeSplitMeasure so that the split is
performed using information gain or information gain ratio. The choice of measure should
be controlled by a boolean variable called useGain.
2. Implement and test a class ChiSquaredAttributeSplitMeasure that implements
AttributeSplitMeasure and measures the quality using the chi-squared statistic.
3. Implement and test a class GiniAttributeSplitMeasure that implements
AttributeSplitMeasure and measures the quality using the Gini index statistic.
3
4. Configure CourseworkTree so that it can be used with any of the four attribute split measures
you have implemented (including both information gain and information gain ratio).
Adjust CourseworkTree so that the split criterion can be set through setOptions.
5. Currently, CourseworkTree can only be used with nominal attributes. Implement a
method in AttributeSplitMeasure called splitDataOnNumeric that makes a binary
split of instances into those below the value and those above the value. This should be
done prior to measuring the attribute quality. Please research how best to do this.
The majority of the marks will be for correctness. However, some consideration will be given
to efficiency and code quality. Include a main method in all three AttributeSplitMeasure
classes that prints out the split criteria value for Peaty, Woody and Sweet for the data from Section
1 for the whole data. Print in the form “measure for attribute splitting
diagnosis = ”. Add a main method to the class DecisionTreeCoursework that loads
the problem optdigits (all nominal attributes) from the directory test data (do not use absolute
file paths for this), performs a random split, then builds classifiers using IG, ChiSquared
and Gini split criteria, outputting the test accuracy of each to console in the form
“DT using measure on optdigits problem has test accuracy = ”.
Repeat this for the data set ChinaTown (all continuous attributes).
Implement an Ensemble Classifier (20%)
The task is to implement an Ensemble classifier that can be used with variants of your enhanced
CourseworkTree classifier. You must implement this classifier from scratch rather than use any
existing classifiers in tsml. Implement a classifier, TreeEnsemble, that extends
AbstractClassifier and consists of an ensemble of CourseworkTree classifiers stored in an
array or List. The size of the ensemble should be controlled by an integer attribute numTrees,
set to a default value of 50. TreeEnsemble should be in the package ml 6002b coursework.
1. The method buildClassifier should construct a new set of instances for each base
classifier of the ensemble by selecting a random subset (without replacement) of the
attributes. The proportion of attributes to select should be a parameter (defaults to 50%).
The TreeEnsemble will need to store which attributes are used with which classifier in
order to recreate the attribute selections in classifyInstance and
distributionForInstance.
2. Further diversity should be injected into the ensemble by randomising the decision tree
parameters. This diversification should include attribute selection mechanisms you have
implemented, but can also involve other tree parameters such as stopping conditions.
3. Implement classifyInstance so that it returns the majority vote of the ensemble. i.e.,
classify a new instance with each of the decision trees, count how many classifiers predict
each class value, then return the class that receives the largest number of votes.
4. Implement distributionForInstance so that it returns the proportion of votes for
each class.
5. Implement an option where rather than counting votes, the ensemble can average probability
distributions of the base classifiers. This should be controlled by a boolean attribute
called averageDistributions.
6. Implement a main method in the class TreeEnsemble that loads the problems optdigits
and ChinaTown, prints out the test accuracy but also prints out the probability estimates
for the first five test cases.
4
Evaluation of Decision Trees and Ensembles Classifiers (40%)
Your task is to perform a series of classification experiments and write them up as a research
paper. Your experiments will address the question of whether the variations you have implemented
for decision trees improve performance over a range of problems and whether your
ensemble is a good approach for a specific case study data set. You have been assigned a specific
data set (see blackboard) and a link to further information. Please note that for this part
of the coursework we mark the paper itself, not the code used to generate the results. We
advise you reserve significant time for reading and checking your paper, and recommend you
ask someone else to read it before submission. Imagine you are writing a paper for a wide
audience, not just for the marker. Aim for a tight focus on the specific questions the paper
addresses. We have provided two sets of datasets to perform these experiments: these are
in files UCIDisrete.zip and UCIContinuous.zip. A list of the problems in these files is given in
the otherwise empty class DatasetLists.java. You will be assigned a completely different
problem from timeseriesclassification.com for the case study.
There are four issues to investigate:
1. Decision Trees: Test whether there is any difference in average accuracy for the attribute
selection methods on the classification problems we have provided. Compare your versions
of DecisionTree to the Weka ID3 and J48 classifiers.
2. Tree Ensemble vs Tree Tuning: Test whether tuning CourseworkTree, including choosing
attribute selection criteria method, is better than ensembling. It is up to you to select
ranges of values to tune and ensemble over, but you should ensure that the range of
parameters you tune over is the same as those you use to ensemble. Perform this experiment
with the proportion of attributes set to 100%, then repeat the experiment with the
proportion set to 50%. You can use any code available in tsml to help you do the tuning.
3. Compare your ensemble against a range of built in Weka classifiers, including other ensembles,
on the provided classification problems.
4. Perform a case study on your assigned data set to propose which classifier from those
you have used would be best for this particular problem.
Experiments
You should compare classifiers based on accuracy (or, equivalently, error) and any other metrics
described in the evaluation lecture that you feel are appropriate. You can use the built
in classifier evaluation tools provided in tsml, or use any other mechanism. You should think
about the experimental design for each experiment, including deciding on a sampling method
(train/test, cross validate or resample) and performance measures. The choice of classifiers for
part 3 is up to you, but we recommend you include random forest and rotation forest in the list.
Include a table of all parameter settings of all classifiers in the paper. You can also compare the
time each algorithm takes to build a model. You can use both the discrete data and continuous
data, although it may be advisable to split the analysis into two parts if you use both. For the
case study, the data you have been given has all continuous attributes. You should experiment
with the data in this format, and compare results to those obtained by first discretizing the data
so that it is converted to nominal attributes. The convertion can be done using the Weka Filter
Discretize or any other means
5
The Write Up
You should write a paper in Latex using the style file available on blackboard. The paper should
be called “An experimental assessment of decision trees and decision tree ensembles”. This
should be around 4 pages, and should not be longer than 8 pages, including references and
appendices. Your paper should include the following sections:
1. Introduction: start with the aims of the paper, a description of the hypotheses you are
testing and an overview of the structure. Include in this your prior beliefs as to what you
think the outcome of the test will be, with some rationalisation as to why. So, for example,
do you think attribute evaluation will make a difference? Can you find any published
literature that claims to answer any of the questions?
2. Data Description: an overview describing the data, including data characteristics summarised
in a table (number of attributes, number of train/test cases, number of classes,
class distribution). For your case study data set you should have more detail, including a
description of the source of the data and the nature of the problem the data describes.
Look for references for similar types of problem. Do not simply copy the data description
from timeseriesclassification.com.
3. Classifier Description: Include a description of your C45Coursework classifier, including
details of design choices you made and data structures you employed. You should include
an example of how to use the classifier with the refinements you have included.
Also provide an overview of the other classifiers you use in the case study, including references.
4. Results: A description of the experimental procedure you used and details of the results.
Remember, graphs are good. There should be a subsection for each hypothesis. The
case study section should go into greater depth and remember, accuracy is not the only
criteria.
5. Conclusions: how do you answer the questions posed? Are there any biases in your
experiment and if so, how would you improve/refine them?
Presenting the output of experiments is a key transferable skill in any technical area. I stress
again we mark the paper for this section, not the experimental code. Avoid writing a description
of what you did. We do not need to know about any blind alleys you went down or problems
you encountered, unless they are relevant to the experiments (e.g. memory constraints).
Please also take care to include references where appropriate, and to check the formatting of
references.
These experiments will require a reasonable amount of computation. We suggest you store
the results of experiments in the format used on the module to avoid data loss in the case of
computer failure. You can remotely log on to the lab machines and run experiments on this
hardware if yours is limited.
Submission Requirements
You should submit a single zipped file containing your solution to part 1, 2 and 3 to the blackboard
submission point.
For part 1, submit a single pdf with your worked solution. For part 2, submit a zipped Intellij
project with your code. For part 3, submit a single pdf of your paper. A template format is here
https://www.overleaf.com/read/ycgpvxbgwmkf
6
Feel free to copy this project.
The blackboard submission portal will go live one week before the deadline.
Plagiarism and collusion: Stackoverflow and other similar forums (such as the module discussion
board) are valuable resources for learning and completing formative work. However, for
assessed work such as this assignment, you may not post questions relating to coursework
solutions in such forums. Using code from other sources/written by others without acknowledgement
is plagiarism, which is not allowed (General Regulation 18).
Marking Scheme
1. Part 1. Decision trees by hand (10 marks)
2. Part 2. Implement tree variants (50 marks)
3. Part 3. Evaluate Classifiers (40 marks)