
Lab 1: Classification and Regression


This is the first lab. It covers both classification and regression and is composed of four tasks, each consisting of several subtasks, described below. The lab will test your problem-solving skills using classification and regression techniques, as well as your ability to use different platforms to achieve similar outcomes.
What you should already know
KNIME & Python: If you have not worked with KNIME or Python before, it is important that you finish the exercise-tools and complete the second and third exercises at your earliest convenience.
What you will learn in this lab
This lab will introduce you to a number of new things:
Exploring and preprocessing datasets
Using aggregations and graphs to get to know your data
Solving classification and regression tasks using different machine learning techniques
Solving classification and regression tasks using different platforms
Using a methodological approach to evaluation
Classification
The data set can be found and retrieved here or here.
Introduction to the dataset
We will use the classic Titanic dataset. The data consists of demographic and traveling information for 1,309 of the Titanic passengers, and the goal is to predict the survival of these passengers. The full Titanic dataset is available from the Department of Biostatistics at the Vanderbilt University School of Medicine. The Encyclopedia Titanica website (https://www.encyclopedia-titanica.org/) is the website of reference regarding the Titanic. It contains all the facts, history, and data surrounding the Titanic, including a full list of passengers and crew members. The Titanic dataset is also the subject of the introductory competition on Kaggle.com (https://www.kaggle.com/c/titanic, requires opening an account with Kaggle and does not contain target data for all instances). 
The Titanic data contains a mix of textual, Boolean, continuous, and categorical variables. It exhibits interesting characteristics such as missing values, outliers, and text variables ripe for text mining – a rich database that will allow us to demonstrate data transformations.
Here’s a brief summary of the 14 attributes:
pclass: Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
survival: A Boolean indicating whether the passenger survived or not (0 = No; 1 = Yes); this is our target
name: A field rich in information as it contains title and family names
sex: male/female
age: Age; a significant portion of the values are missing
sibsp: Number of siblings/spouses aboard
parch: Number of parents/children aboard
ticket: Ticket number
fare: Passenger fare (British pounds)
cabin: Does the location of the cabin influence chances of survival?
embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat: Lifeboat, many missing values
body: Body Identification Number
home.dest: Home/destination
Take a look at http://campus.lakeforest.edu/frank/FILES/MLFfiles/Bio150/Titanic/TitanicMETA.pdf for more details on these variables.
We have 1,309 records and 14 attributes, three of which we will discard: the home.dest attribute has too many missing values, the boat attribute is only present for passengers who survived, and the body attribute is only present for passengers who did not survive. You will have to begin by removing these attributes, as sketched below.
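As a starting point, here is a minimal Python (pandas) sketch of loading the data and discarding the three attributes. The file name titanic3.csv and the exact column names are assumptions; adjust them to match the copy of the dataset you downloaded.

    import pandas as pd

    # Load the full Titanic dataset; "titanic3.csv" is an assumed file name --
    # point this at wherever you saved the data you retrieved.
    df = pd.read_csv("titanic3.csv")

    # Discard the three attributes discussed above: boat and body leak the
    # outcome, and home.dest is too sparse to be useful.
    df = df.drop(columns=["boat", "body", "home.dest"])

    print(df.shape)   # expected: (1309, 11)
    print(df.dtypes)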
Subtasks to be performed in KNIME and Python
Before solving the tasks, make sure you have done the following:
Read up on the data so that you understand the problem
  o Use the discussion forum on Kaggle for some further input
  o Look at the kernels on Kaggle for suggestions on how to get to know the data and solve part of the subtasks
Load all necessary data into KNIME/Python
Get to know your data using aggregations and graphs (see the sketch after this list)
You may use the Python notebooks available on Kaggle as your Python testbed.
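The sketch below shows one way to get a first feel for the data with aggregations and a simple graph. It assumes the DataFrame df from the previous sketch and that the target column is named survived (the documentation lists it as survival); both are assumptions to check against your own copy of the data.

    import matplotlib.pyplot as plt

    # Survival rate broken down by sex and passenger class.
    print(df.groupby(["sex", "pclass"])["survived"].mean())

    # Count missing values per attribute -- age and cabin are the usual suspects.
    print(df.isna().sum())

    # Compare the age distributions of survivors and non-survivors.
    df[df["survived"] == 1]["age"].plot(kind="hist", alpha=0.5, label="survived")
    df[df["survived"] == 0]["age"].plot(kind="hist", alpha=0.5, label="did not survive")
    plt.xlabel("age")
    plt.legend()
    plt.show()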
Perform the following subtasks in both KNIME and Python and report the results (illustrative Python sketches follow the task list):
1. Build a transparent classifier (e.g. a decision tree) using all data and identify the most important attributes.
  a. List the top 5 attributes that you identify as most important
  b. Motivate your selection
2. Evaluate three different kinds of classifiers and compare the results using both accuracy and AUC (Area Under ROC Curve).
  a. Use a proper evaluation methodology
  b. Motivate the setup used for evaluation
  c. Identify the most appropriate classifier for the problem
3. Optimize the parameters of the classifier identified in 2c, using AUC as the optimization criterion.
4. Handle the class imbalance problem on the training set, train a classifier (using the setup found in 3) with both the original and the manipulated data, and compare the performance using precision and recall.
  a. Motivate which setup is most suitable for the task
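For subtask 1, a sketch of a transparent classifier (a decision tree) and its impurity-based attribute ranking, assuming the DataFrame df from the earlier sketches. The feature subset, dummy encoding, and median imputation used here are illustrative shortcuts, not the required preprocessing.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Simple illustrative feature set; name, ticket and cabin would need text
    # processing first, so they are left out of this sketch.
    features = ["pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]
    X = pd.get_dummies(df[features], columns=["sex", "embarked"])
    X = X.fillna(X.median())          # crude imputation, for illustration only
    y = df["survived"]

    tree = DecisionTreeClassifier(max_depth=4, random_state=0)
    tree.fit(X, y)

    # Rank the attributes by the tree's feature importance.
    importances = pd.Series(tree.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False).head(5))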
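For subtask 2, a sketch of comparing three kinds of classifiers with stratified cross-validation, reporting both accuracy and AUC. It assumes X and y from the previous sketch; the three classifiers chosen here are examples only.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_validate
    from sklearn.tree import DecisionTreeClassifier

    # Three example classifiers of different kinds -- substitute your own choices.
    models = {
        "decision tree": DecisionTreeClassifier(random_state=0),
        "logistic regression": LogisticRegression(max_iter=1000),
        "random forest": RandomForestClassifier(random_state=0),
    }

    # Stratified 10-fold cross-validation keeps the class proportions in every fold.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for name, model in models.items():
        scores = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "roc_auc"])
        print(f"{name}: accuracy={scores['test_accuracy'].mean():.3f}, "
              f"AUC={scores['test_roc_auc'].mean():.3f}")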
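For subtask 3, a sketch of parameter optimization with AUC as the scoring criterion, again assuming X and y from above. The random forest and its parameter grid are placeholders for whichever classifier you identified in 2c.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, StratifiedKFold

    # Grid search over a small, illustrative parameter grid, optimizing AUC.
    param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid,
        scoring="roc_auc",
        cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    )
    search.fit(X, y)
    print(search.best_params_, search.best_score_)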
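For subtask 4, a sketch of one way to handle class imbalance: random oversampling of the minority class on the training split only, followed by a precision/recall comparison on an untouched test set. The choice of oversampling (rather than undersampling, SMOTE, or class weights) and the default random forest are illustrative assumptions; substitute the setup you found in subtask 3.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import precision_score, recall_score
    from sklearn.model_selection import train_test_split
    from sklearn.utils import resample

    # Hold out a test set first; only the training data is rebalanced.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)

    # Randomly oversample the minority class (survived = 1) in the training set.
    train = pd.concat([X_train, y_train], axis=1)
    minority = train[train["survived"] == 1]
    majority = train[train["survived"] == 0]
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=0)
    balanced = pd.concat([majority, minority_up])

    # Train on the original and on the rebalanced data, evaluate on the same test set.
    for label, data in [("original", train), ("rebalanced", balanced)]:
        model = RandomForestClassifier(random_state=0)  # replace with the setup from 3
        model.fit(data.drop(columns="survived"), data["survived"])
        pred = model.predict(X_test)
        print(f"{label}: precision={precision_score(y_test, pred):.3f}, "
              f"recall={recall_score(y_test, pred):.3f}")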
 
 
Regression
Datasets
In this part, you will practice algorithm evaluation on a larger scale. When discussing the evaluation and comparison of classifiers, there are three major questions:
How should the future error rate (i.e. on novel data) of a specific classifier be estimated using only results on available data?
  o What performance can we expect on new data?
How should the results of two classifiers or two different algorithms be compared against each other on a specific data set?
  o What algorithm works best on my problem?
How should the results of several classifiers or algorithms be compared against each other over several data sets?
  o Valuable for research and method development purposes
You will use datasets from the Delve repository. Only use datasets with the task type set to R, i.e., only use regression datasets for this part. You can also use the set of datasets made available on Canvas.
For more information on evaluation and statistical comparisons, read the paper by Demšar (2006).
Subtasks to be performed in KNIME and Python
1. Select a dataset from the repository, select a suitable algorithm to evaluate, and use the holdout method to estimate the future performance.
  a. What performance can be expected on your problem?
  b. How confident can you be that your estimate is close to the true performance?
  c. How does the size of the training/test sets affect the reliability?
2. Select a dataset from the repository, select a suitable algorithm to evaluate, and use cross-validation to estimate the future performance.
  a. What performance can be expected on your problem?
  b. How confident can you be that your estimate is close to the true performance?
  c. Which result is most reliable, the result from 1 (using the holdout method) or 2 (using cross-validation)? Why?
3. Select a dataset from the repository and compare the performance of two different algorithms on the same dataset.
  a. Which algorithm works best?
  b. Is the difference significant (in a statistical sense)?
4. Select at least 10 different datasets from the repository and compare the performance of at least two different algorithms on all of them.
  a. Which algorithm works best?
  b. Is the difference significant (in a statistical sense)?
Illustrative Python sketches for these subtasks follow below.
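For subtask 1, a sketch of the holdout method with a linear regression as an example algorithm. The file name and the target column name are placeholders; replace them with the Delve dataset you actually selected.

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # Hypothetical file and column names -- adapt to your chosen dataset.
    data = pd.read_csv("my_regression_dataset.csv")
    X, y = data.drop(columns="target"), data["target"]

    # Holdout: fit on one part of the data, estimate future performance on the rest.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"holdout MSE estimate: {mse:.3f}")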
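For subtask 2, the same estimate with 10-fold cross-validation instead of a single holdout split, assuming X and y from the previous sketch.

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold, cross_val_score

    # Every instance is used for testing exactly once, which typically gives a
    # more reliable estimate than a single holdout split.
    cv = KFold(n_splits=10, shuffle=True, random_state=0)
    mse = -cross_val_score(LinearRegression(), X, y, cv=cv,
                           scoring="neg_mean_squared_error")
    print(f"CV MSE estimate: {mse.mean():.3f} (std {mse.std():.3f})")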
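For subtask 3, a sketch that scores two algorithms on the same cross-validation folds and applies a paired t-test to the fold-wise errors. Linear regression and a random forest are example algorithms; note that fold-wise scores are not fully independent, so treat the p-value as indicative (see Demšar, 2006).

    from scipy.stats import ttest_rel
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold, cross_val_score

    # Score both algorithms on the same folds so the comparison is paired.
    cv = KFold(n_splits=10, shuffle=True, random_state=0)
    mse_a = -cross_val_score(LinearRegression(), X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    mse_b = -cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=cv,
                             scoring="neg_mean_squared_error")

    # Paired t-test over the fold-wise errors.
    stat, p = ttest_rel(mse_a, mse_b)
    print(f"mean MSE A={mse_a.mean():.3f}, B={mse_b.mean():.3f}, p={p:.4f}")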
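For subtask 4, a sketch of comparing two algorithms over several datasets with the Wilcoxon signed-rank test, which Demšar (2006) recommends for this setting. The datasets dictionary is hypothetical: you are assumed to have loaded each selected Delve dataset into an (X, y) pair yourself.

    import pandas as pd
    from scipy.stats import wilcoxon
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold, cross_val_score

    def mean_cv_mse(model, X, y):
        # Mean 10-fold cross-validation MSE for one model on one dataset.
        cv = KFold(n_splits=10, shuffle=True, random_state=0)
        return -cross_val_score(model, X, y, cv=cv,
                                scoring="neg_mean_squared_error").mean()

    # "datasets" is a hypothetical dict {name: (X, y)} that you build yourself
    # from the (at least 10) regression datasets you selected.
    results = []
    for name, (X_d, y_d) in datasets.items():
        results.append({
            "dataset": name,
            "A": mean_cv_mse(LinearRegression(), X_d, y_d),
            "B": mean_cv_mse(RandomForestRegressor(random_state=0), X_d, y_d),
        })
    table = pd.DataFrame(results).set_index("dataset")
    print(table)

    # Wilcoxon signed-rank test over the per-dataset scores.
    stat, p = wilcoxon(table["A"], table["B"])
    print(f"Wilcoxon statistic={stat:.3f}, p={p:.4f}")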
Submission
Upload your KNIME solutions as well as your Python solutions for both the classification and the regression tasks. Submit a PM with your answers and motivations to the questions asked above. If your answer is the same for the KNIME and Python solutions, you do not have to comment on both, but if they differ, reflect upon how and why.
Reference
Demšar, Janez. "Statistical comparisons of classifiers over multiple data sets." Journal of Machine Learning Research 7 (2006): 1–30.
 