Assessment Task 3: Data Mining Assignment 2

Scenario

This assignment is a practical data analytics project that follows on from the data exploration you did in Assignment 2.

You will be acting as a data scientist at a consulting company and you need to make a prediction on a dataset. The dataset can be found below.

You need to build classifiers using the techniques covered in the lectures to predict the class attribute. At the very minimum, you need to produce a classifier for each method we have covered. However, if you explore the problem thoroughly (as you should in industry) by preprocessing the data, comparing different methods, choosing their best parameter settings and identifying the best classifier in a principled and explainable way, you should be able to get a better mark. If you choose to use KNIME and you show 'expert' use (i.e. exploring multiple classifiers with different settings, choosing the best in a principled way and being able to explain why you built the model the way you did), this will attract a better mark. If you choose to use R or Python to build, optimise and test different models, this will also attract better marks.
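As an illustration of choosing a parameter setting in a principled way, the sketch below picks the number of neighbours for a toy one-dimensional k-nearest-neighbour classifier by comparing accuracy on a held-out validation set. Everything in it is an assumption made for illustration: the function names (knn_predict, validation_accuracy, select_k), the toy data, and the use of accuracy as the score. It is not the required method, and any classifier and parameter grid could take its place.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Return the majority label among the k training points nearest to query."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def validation_accuracy(train_x, train_y, val_x, val_y, k):
    """Fraction of validation points that the k-NN rule labels correctly."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y for x, y in zip(val_x, val_y))
    return hits / len(val_y)


def select_k(train_x, train_y, val_x, val_y, candidates):
    """Score every candidate k on the validation set and return the best one."""
    scores = {k: validation_accuracy(train_x, train_y, val_x, val_y, k) for k in candidates}
    best_k = max(scores, key=scores.get)  # first candidate with the top score
    return best_k, scores


# Toy data (made up): the true boundary is near 4.5, with two noisy training labels.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
train_y = [0, 0, 1, 0, 1, 0, 1, 1]
val_x = [2.8, 3.2, 5.8, 6.2]
val_y = [0, 0, 1, 1]

best_k, scores = select_k(train_x, train_y, val_x, val_y, candidates=[1, 3, 5])
```

On this toy data k = 1 follows the noisy labels and scores worst, which is the kind of evidence you can cite when explaining why a setting was chosen.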

You need to write a short report describing how you solved the problem and the results you found. See below for the requirements for the report.

You also need to attend a short oral defence of your classifier of around 5 minutes where you show the classifier (e.g. using the KNIME workflow or Python/R code) and answer some questions about it. Details about the oral defences will be given by email and in class.

Kaggle

Competition

For this assignment, you will use the Kaggle website (kaggle.com) to submit your assignment solution. The report itself will be submitted through Canvas as with the other assignments. Go to this link to sign up to the competition on Kaggle:

https://www.kaggle.com/t/aa4b260382b54263bdbc206dd70ff2e6

You need to use the link to access the competition because it is a private competition for students in 32130 (along with the undergraduate version, 31250) only. Sharing the competition with anyone not relevant to the subject is strictly prohibited. To submit to Kaggle you will need to make a Kaggle login using your UTS email address, and set your display name (in My Profile -> Edit Profile -> Display Name) as UTS_32130_xxxx where xxxx is your student ID. Please make sure you follow these instructions exactly in order to access the competition.

Datasets

Below you will find 3 datasets: a training dataset for training and optimising your model (it contains the target values), an "unknown" dataset for the final model assessment (it does not have the target values - you need to predict them) and a submission sample which shows you what the file submitted to Kaggle should look like. In particular, you will need to set the column names in your submission file correctly - that is, "row ID" and "AIRLINE-NAME". These datasets can also be found on the Kaggle competition page under the "Data" tab.

Assignment3-FlightDataset.csv.zip

Assignment3-UnknownDataset.csv.zip

Assignment3-Kaggle-Submission-Random-Sample.csv

The attribute description for the dataset is similar to that from Assignment 2: Assignment3-Dataset-Attribute-Description-updated.docx
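A minimal sketch of writing a submission file with the two column names given above, "row ID" and "AIRLINE-NAME". The helper name write_submission and the example row identifiers are hypothetical, and the sketch assumes predictions are written as 0/1 labels. Check the submission sample file for the exact identifier and value format before relying on it.

```python
import csv
import io


def write_submission(row_ids, predictions, out):
    """Write one 'row ID,AIRLINE-NAME' line per prediction to a file-like object."""
    if len(row_ids) != len(predictions):
        raise ValueError("row_ids and predictions must have the same length")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["row ID", "AIRLINE-NAME"])  # header names from the brief
    for row_id, label in zip(row_ids, predictions):
        writer.writerow([row_id, int(label)])  # assumption: labels are 0 or 1


# Made-up identifiers and predictions, written to memory instead of disk.
buffer = io.StringIO()
write_submission(["Row0", "Row1", "Row2"], [0, 1, 0], buffer)
submission_text = buffer.getvalue()
```

To produce a real file, pass a handle opened with open("submission.csv", "w", newline="") in place of the in-memory buffer.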

Assessment

Assessment is real-time. This means that as soon as you submit the file, Kaggle will assess the performance of your classifier and provide you with the result. You can submit multiple times, but Kaggle has a limit for the number of times you can do this per day.

Do not treat the performance measure reported by Kaggle as your estimate of test error for the final competition, and do not optimise to it. Kaggle has two measures: a public measure, which it reports to you, and a private measure, which it keeps hidden. Instead, develop several models and estimate the test error yourself before submitting to Kaggle. Remember that your estimate of test error is just that: an estimate. The actual private measure will probably be a little different.
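One common way to estimate test error yourself is k-fold cross-validation on the training data. The sketch below is a plain-Python illustration under stated assumptions: it scores with accuracy (this brief does not state the competition metric), it uses a majority-class baseline as a stand-in for a real classifier, and the function names and toy labels are made up.

```python
import random
from collections import Counter


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle the row indices and deal them into k folds of near-equal size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def majority_class(labels):
    """Baseline 'model': always predict the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]


def cross_validated_accuracy(labels, k=5, seed=0):
    """Return the mean held-out accuracy of the baseline and the per-fold scores."""
    scores = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held_out_set = set(held_out)
        train_labels = [y for i, y in enumerate(labels) if i not in held_out_set]
        prediction = majority_class(train_labels)  # fitted on training folds only
        correct = sum(1 for i in held_out if labels[i] == prediction)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores), scores


# Made-up labels: 80 rows of class 0 and 20 rows of class 1.
toy_labels = [0] * 80 + [1] * 20
mean_accuracy, fold_scores = cross_validated_accuracy(toy_labels, k=5)
```

The spread of the per-fold scores gives a rough sense of how far the hidden private measure might sit from your own estimate.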

Classification task

Build a classifier that predicts the "AIRLINE-NAME" attribute. The classification goal is to predict whether the airline is Southwest Airlines Co. or not (target attribute: AIRLINE-NAME, binary: 1 --> Southwest Airlines Co., 0 --> other airlines such as Delta Airlines Inc., American Airlines Inc., United Airlines Inc., ...). You can apply different data pre-processing steps and transformations (e.g. grouping values of attributes, converting them to binary, etc.), but explain why you chose them. You may need to split the provided training set further into training, validation and/or test sets to accurately set the parameters and evaluate the quality of the classifier.
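The split into training, validation and test sets can be sketched as follows. This is one possible approach, not a required one: it assumes a 60/20/20 split and stratifies by class so that each part keeps roughly the same proportion of Southwest (1) and other (0) rows. The function name stratified_split and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=42):
    """Return (train, validation, test) index lists with similar class ratios."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, validation, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class before cutting
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train.extend(indices[:n_train])
        validation.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])  # remainder goes to the test part
    return train, validation, test


# Made-up labels: 80 rows of class 0 and 20 rows of class 1.
toy_labels = [0] * 80 + [1] * 20
train_idx, val_idx, test_idx = stratified_split(toy_labels)
```

Use the validation part to set parameters and keep the test part untouched until you have chosen a final model, so that it still gives an honest estimate.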

You can use KNIME to build classifiers, or feel free to use any other tool such as R, Weka, Python, Orange, scikit-learn or other software. If you do this, though, please explain more about your classifier - and be sure that you are producing valid results! You don't need to limit yourself to the classifiers we used in class, but if you do use other classifiers you need to describe them in your report and make sure you are producing valid results.

A hint: Usually it's not a case of having a 'better' classifier that will produce good results. Rather, it's a case of identifying or generating good features that can be used to solve the problem.
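As a small example of the pre-processing ideas mentioned in the classification task (grouping values of attributes and converting them to binary), the sketch below merges rare category values into a single "OTHER" value and then one-hot encodes the result. The attribute values "A", "B" and "C", the threshold of 2 and the function names are invented for illustration and do not come from the flight dataset.

```python
from collections import Counter


def group_rare_values(values, min_count, other_label="OTHER"):
    """Replace values seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


def one_hot(values):
    """Return the sorted category list and one binary row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Made-up categorical attribute with one rare value ("C").
raw_values = ["A", "A", "B", "A", "C", "B"]
grouped = group_rare_values(raw_values, min_count=2)
categories, encoded = one_hot(grouped)
```

Whether a grouping like this helps is an empirical question, so compare validation results with and without it and report the reasoning.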

Assignment report and submission

Report

Your report should include the following information:

A description of the data mining problem;

The data preprocessing and transformations you did (if any);

How you went about solving the problem;

Classification techniques used and summary of the results and parameter settings;

The best classifier that you selected - the type, its performance, how it solved the problem (if it makes sense for that type of classifier), and reasons for selecting it;

Reflection: One page reflecting on your learning in Assignment 3. What did you learn about data mining and yourself as a result of doing the assignment? How would you approach the problem differently if you were to do it again? The more incisive and thoughtful your reflection is, the better your mark.

The report should be a PDF (preferred) or MS Word document. The filename should include your student ID and/or name.

The report should be around 10-12 pages, in 11 or 12 point Times or Arial font.

Kaggle

The predictions on the unknown set should be submitted as a .csv file to the Kaggle competition here:

https://www.kaggle.com/c/2021s-uts-data-analytics-assignment-3/

Submission to Kaggle is not mandatory, but you do need to make predictions on the unknown dataset.

On average each student will require between 24 and 36 hours to complete this assignment.

Assessment

This assignment is assessed as individual work.

The report contributes up to 30 marks out of the 50. The marking criteria are here: Assignment3-Marking-Criteria-32130.pdf.

The oral defence contributes up to 20 marks out of the 50. At the oral defence, students need to explain how they solved the problem and answer questions about their solutions, showing the workflow in KNIME or working code in Python, R or other tools.

Students receive either 0, 10, 15 or 20 marks as follows.

Students using baseline classifiers who are able to satisfactorily answer questions about them will receive 10 out of 20.

Students showing an investigation with many classifiers using Python/R/KNIME or other tools, with basic data preprocessing, parameter estimation and model evaluation, will receive 15 marks out of 20.

Students showing an in-depth investigation using Python/R/KNIME - multiple classifiers with valid data preprocessing, parameter estimation and model evaluation - will receive 20 marks out of 20.

Students who fail the oral defence will be permitted to undertake it one more time. If they pass, they will receive a maximum of 10 marks out of 20.

Due Date:

11.59pm Friday 20 May 2022
