Information School
INF6028 Coursework 2019-20
Mining and Evaluating a Structured Dataset
1. Introduction
The assessment for INF6028 Data Mining consists of a piece of individual coursework to assess your
ability to understand key data mining, analysis and evaluation concepts. You will be assigned a single
dataset and an associated complete KNIME workflow. Each workflow applies appropriate data mining
methods to the dataset in order to solve a supervised prediction problem (this might be regression or
classification) and to evaluate the relative performance of these different approaches/algorithms. You
will interpret and critically discuss the various techniques and best practices employed in the workflow
and will evaluate the performance of the algorithms.
Note: a video taking you through the workflow step-by-step will also be provided.
You should write a 2,000-word structured report (see Section 3) that includes the following headings
(more details on how the report will be assessed are provided below):
• Introduction - introduce the prediction problem.
• Data mining theory - provide a theoretical description of the two supervised data mining
methods used in the workflow (for example, the classification or regression techniques that have
been used) and why they are appropriate to the prediction task.
• Data exploration and preparation - describe the approaches used in the workflow for feature
selection, transformation and normalisation, where appropriate.
• Experimental setup - describe the experimental setup and the evaluation measures used in the
workflow and how the data has been handled to ensure that the models were not over-fitted.
You should explain which nodes were used in KNIME and provide a rationale for the various
parameter settings that were used.
• Results - present the results for each data mining method and compare the performance of the
different methods using graphical and tabular methods. What insights can you gain from the
models? For example, which are the most important features? Are there any outliers in the
predictions?
• Conclusion and reflections - summarise the main findings of your report and reflect on the
methods used.
Charts, tables, references and appendices are not included in the word count.
Remember: your report should be a critical evaluation of the workflow in the context of the data mining
problem posed; it should not be merely a description of what was done.
This assessment is worth 100% of the overall module mark for INF6028. A pass mark of 50 is required to
pass the module. Submission deadline: June 8 via Turnitin. See Section 4 for more general information
about Coursework Submission Requirements within the Information School.
2. The Datasets and KNIME Workflows
You will be assigned a single dataset and KNIME workflow to base your report on. Please ensure before
you start working on the assessment that you are using the correct dataset and workflow.
Note: You should try to open the workflow in KNIME and work from there. However, should you be
unable to open the workflow or install KNIME on your machine, you will also be provided with a video,
which will take you through the workflow step-by-step.
The datasets have been derived from Kaggle competitions and are downloadable from MOLE in the
Coursework Brief Information section. A brief description of the attributes in each dataset is given at
the end of this document. Note that in both cases the data are different to the standard Kaggle
datasets.
Titanic-derived dataset
The data is split across two files, each of which contains 1204 entries representing 1204 passengers,
although it should be noted that the passengers are not necessarily the same in the two files. The two
files are titanic_ticket_data.csv and titanic_personal_data.csv.
The aim of this challenge is to build a model that is able to predict whether or not a passenger will survive
the sinking of the Titanic.
Australian Weather-derived dataset
The Australian weather dataset consists of weather data for 16 cities and towns in Australia over a
period of nearly 10 years.
The aim of this challenge is to predict the total daily rainfall based on other features of the weather.
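Because the rainfall target is a continuous value in millimetres, this is a regression task. As a reminder of how two common regression error measures behave, the sketch below computes mean absolute error (MAE) and root mean squared error (RMSE). It is only an illustration: the brief does not say which measures the assigned workflow uses, and the function names and the four rainfall values are made up for the example.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average size of the prediction errors, in the same units as the target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared error; large misses are weighted more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up daily rainfall values (mm), not taken from the coursework dataset.
actual_mm = [0.0, 2.0, 10.0, 0.0]
predicted_mm = [1.0, 2.0, 6.0, 1.0]

mae = mean_absolute_error(actual_mm, predicted_mm)
rmse = root_mean_squared_error(actual_mm, predicted_mm)
print(f"MAE = {mae:.2f} mm, RMSE = {rmse:.2f} mm")
```

Here the single 4 mm miss on the wettest day pulls RMSE well above MAE, which is one way outliers in the predictions can show up when you compare measures.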
3. Report Structure
You are required to produce a structured report that includes all the sections detailed in Table 1. You
must state the word count somewhere in the report. As there is a word count limit, you should aim to
make your writing as concise and informative as possible. The emphasis of the report should be on
clarity, accuracy and quality in communicating your findings.
Table 1: Required content of the structured report.
| Section | Description | Maximum allocated marks |
| --- | --- | --- |
| Structured abstract | This should provide a summary of your report in a structured manner. This is not included in the word count. | Required, but 0 marks |
| Introduction | This section should introduce the data mining task that is addressed in the report. You should indicate the property/data value that is predicted and give a brief overview of the dataset and methods used. | 10 marks |
| Data Mining Theory | This section should provide an overview of the algorithms for predictive data mining used in the workflow from a theoretical aspect. Explain why they are relevant to the prediction problem. Support your rationale by providing references to the literature where the techniques have been applied to similar problems. Include a short discussion of the most appropriate methods for evaluating the performance of these data mining methods. | 25 marks |
| Data Exploration and Preparation | This section should provide a brief description of the data and of the approaches used to pre-process the data. You should present an investigation of the attributes (including the data value to be predicted) and describe any data cleaning employed, including handling of missing data, data transformations and data aggregations. | 10 marks |
| Experimental Setup | This section should describe the experimental design in the workflow. You should describe the process followed in order to find the best performing model for each method and how this was validated. For example, which KNIME nodes were used? How were they configured? Was any cross-validation or a separate validation set used and why? | 20 marks |
| Results and Discussion | Present the results of the data mining process including the results of experiments to find the best model for each data mining method. Compare the best performance of the different methods and, if appropriate, consider which attribute contributes most to each model. Discuss the advantages and disadvantages of the data mining methods. Which of the chosen methods produced the best model and why? | 20 marks |
| Conclusion and reflections | Summarise the main findings of the analysis and reflect on the choice of methods for the problem, for example, how might the models be improved with hindsight? Use evidence from the literature to support your arguments. | 15 marks |
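The Experimental Setup row of Table 1 asks whether cross-validation or a separate validation set was used. As a reminder of the idea, the sketch below shows how k-fold cross-validation partitions row indices so that every row is used for testing exactly once. It is a minimal illustration in plain Python, not a description of how the assigned KNIME workflow is configured; the function name, the default of 5 folds and the fixed seed are assumptions made for the example.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Rows are shuffled once with a fixed seed, dealt into k folds, and each
    fold takes one turn as the test set while the rest form the training set.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Ten rows split into five folds: each iteration trains on 8 rows and tests on 2.
for train, test in k_fold_indices(10, k=5):
    print(len(train), len(test))
```

Because a model is always scored on rows it did not see during training, the averaged score across folds gives a less optimistic estimate than the training error, which is why this kind of design helps to detect over-fitting.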
4. Information School Coursework Submission Requirements
It is the student's responsibility to ensure no aspect of their work is plagiarised or the result of other
unfair means. The University’s and Information School’s Advice on unfair means can be found in your
Student Handbook, available via http://www.sheffield.ac.uk/is/current
Your assignment has a word count limit. A deduction of 3 marks will be applied for coursework that is
5% or more above or below the word count as specified above or that does not state the word count.
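As a worked illustration of this rule, assume the 2,000-word limit given in Section 1 and read "5% or more" as including the boundary itself. Under that reading, 5% of 2,000 is 100 words, so reports of 1,901 to 2,099 words avoid the deduction. The function below is a hypothetical helper written for this example; it is not an official School tool.

```python
def word_count_penalty(word_count, target=2000, tolerance_percent=5, penalty=3):
    """Return the mark deduction for a stated word count.

    Assumption: "5% or more above or below" includes the boundary, so exactly
    1,900 or 2,100 words is penalised. Pass None when no word count is stated.
    """
    if word_count is None:
        return penalty
    # Integer arithmetic avoids floating-point rounding at the boundary.
    if abs(word_count - target) * 100 >= target * tolerance_percent:
        return penalty
    return 0


for count in (1900, 1901, 2000, 2099, 2100, None):
    print(count, word_count_penalty(count))
```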
It is your responsibility to ensure your coursework is correctly submitted before the deadline. It is
highly recommended that you submit well before the deadline. Coursework submitted after 10am on
the stated submission date will result in a deduction of 5% of the mark awarded for each working day
after the submission date/time up to a maximum of 5 working days, where ‘working day’ includes
Monday to Friday (excluding public holidays) and runs from 10am to 10am. Coursework submitted
after the maximum period will receive zero marks.
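One possible reading of the late penalty can be sketched as follows. It assumes the deduction is 5% of the awarded mark (not 5 percentage points) for each working day, that any part of a working day counts as a whole day, and that no rounding is applied; the brief does not spell out the last two points, so treat them as assumptions. The function name is hypothetical.

```python
def late_mark(awarded_mark, working_days_late, rate_percent=5, max_days=5):
    """Return the mark after the late deduction described above.

    working_days_late is the number of 10am-to-10am working days (started or
    completed) after the deadline; 0 means the work was on time.
    """
    if working_days_late <= 0:
        return awarded_mark
    if working_days_late > max_days:
        return 0  # submitted after the maximum period
    return awarded_mark * (100 - rate_percent * working_days_late) / 100


# A report awarded 60 that arrives two working days late loses 10% of 60.
print(late_mark(60, 2))
```

For example, under these assumptions a mark of 60 becomes 54 after two working days, 45 after five, and 0 after six.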
Work submitted electronically, including through Turnitin, should be reviewed to ensure it appears as
you intended.
Before the submission deadline, you can submit coursework to Turnitin numerous times. Each
submission will overwrite the previous submission. Only your most recent submission will be assessed.
However, after the submission deadline, the coursework can only be submitted once.
Details about the submission of work via Turnitin can be found at http://youtu.be/C_wO9vHHheo
If you encounter any problems during the electronic submission of your coursework, you should
immediately contact the module coordinator and one of the Information School Teaching Support
Team (Julie Priestley 0114 2222839). This does not negate your
responsibilities to submit your coursework on time and correctly.
Titanic Dataset
The Titanic data consist of two files that need to be merged.
The titanic_ticket_data.csv data consists of the following variables:
PassengerId: the identifier
Survived: the value to predict
Ticket: the Ticket Number
Fare: the passenger fare
Cabin: Cabin number
Embarked: Port of embarkation. C = Cherbourg, Q = Queenstown, S = Southampton
The personal data titanic_personal_data.csv consists of the following variables:
PassengerId: the identifier
Name: the name of the passenger
Sex: male or female
Age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
SibSp: number of siblings/spouses where family relations are defined as follows:
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife
Parch: number of parents/children where family relations are defined as follows:
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore Parch = 0 for them
Salary: in dollars
Job: job title
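Since the two Titanic files share only the PassengerId column and do not necessarily contain the same passengers, merging them amounts to a join on that key. The sketch below illustrates an inner join using Python's standard csv module on a few made-up rows (the values are invented for the example and are not from the coursework files). In KNIME a join of this kind is typically done with the Joiner node; whether the join should be inner, left or outer is a design choice that your report can discuss.

```python
import csv
import io

# Made-up toy rows standing in for titanic_ticket_data.csv.
TICKET_CSV = """PassengerId,Survived,Ticket,Fare,Cabin,Embarked
1,0,T001,7.25,,S
2,1,T002,71.00,C10,C
4,1,T004,53.00,C20,S
"""

# Made-up toy rows standing in for titanic_personal_data.csv.
PERSONAL_CSV = """PassengerId,Name,Sex,Age,SibSp,Parch,Salary,Job
1,Passenger One,male,22,1,0,100,Clerk
2,Passenger Two,female,38,1,0,200,Teacher
3,Passenger Three,female,26,0,0,150,Nurse
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def inner_join(left_rows, right_rows, key="PassengerId"):
    """Keep only rows whose key appears in both inputs, combining their columns."""
    right_by_key = {row[key]: row for row in right_rows}
    merged = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is not None:
            merged.append({**row, **match})
    return merged


merged = inner_join(read_rows(TICKET_CSV), read_rows(PERSONAL_CSV))
print([row["PassengerId"] for row in merged])  # passengers 3 and 4 are dropped
```

Passengers that appear in only one file are lost by an inner join, so it is worth checking how many rows survive the merge before modelling.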
Australian Weather Dataset
The Australian weather dataset consists of a single CSV file, which contains weather data for 16 cities
and towns in Australia over a period of nearly 10 years. The file consists of the following variables:
Date: date of observation
Location: name of town/city where observation was made
MinTemp: minimum temperature recorded (Celsius)
MaxTemp: maximum temperature recorded (Celsius)
Rainfall: total daily rainfall (mm)
Sunshine: total daily sunshine (hours)
WindDir9am: wind direction at 9am
WindDir3pm: wind direction at 3pm
WindSpeed9am: wind speed at 9am (kph)
WindSpeed3pm: wind speed at 3pm (kph)
Humidity9am: humidity at 9am (%)
Humidity3pm: humidity at 3pm (%)
Pressure9am: atmospheric pressure at 9am (hPa)
Pressure3pm: atmospheric pressure at 3pm (hPa)
Temp9am: temperature at 9am (Celsius)
Temp3pm: temperature at 3pm (Celsius)
RainToday: did it rain? (Boolean)
RISK_MM: total daily rainfall the following day (mm)
RainTomorrow: did it rain the following day? (Boolean)
Note: this dataset is different from the “Rain in Australia” dataset on Kaggle.