CMPT459: Data Mining Assignment 1

CMPT459: Data Mining, Spring 2021

Assignment 1 [total marks: 100]

The goal of this assignment is to implement a Decision Tree and test it on a dataset, classifying people into those who earn less than 50k and those who earn more than 50k based on their attributes. The adults dataset consists of 14 features (6 continuous and 8 categorical) and one class label. The provided data.zip file includes three files:

• data.summary.txt: information about the features

• adult.data.csv: training data

• adult.test.csv: testing data

a) [25 marks] Present pseudocode for a simple Decision Tree with error reduction pruning:

i) The information gain is used as a split criterion. [5 marks]

ii) The tree is grown deep, i.e. it is grown until all training examples corresponding to a leaf node belong to the same class. [5 marks]

iii) Works on categorical data. [5 marks]

iv) Works on numerical data. [5 marks]

v) Error reduction pruning using validation data. [5 marks]
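
As a rough illustration of the split criterion in a), the sketch below computes entropy and information gain for a categorical feature (one branch per distinct value) and searches for a binary threshold on a numerical feature. The function names, the multiway categorical split, and the midpoint candidate thresholds are assumptions made for this example; the assignment does not prescribe them.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(labels, partitions):
    """Entropy of `labels` minus the size-weighted entropy of `partitions`.

    `partitions` is a list of label lists that together contain every
    element of `labels` exactly once.
    """
    total = len(labels)
    remainder = sum(len(p) / total * entropy(p) for p in partitions if p)
    return entropy(labels) - remainder


def gain_categorical(values, labels):
    """Gain of a multiway split with one branch per distinct value."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return information_gain(labels, list(groups.values()))


def best_threshold(values, labels):
    """Best binary split `value <= t` for a numerical feature.

    Candidate thresholds are midpoints between consecutive distinct
    sorted values. Returns (gain, threshold), or (0.0, None) if no
    candidate split has positive gain.
    """
    distinct = sorted(set(values))
    best = (0.0, None)
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        gain = information_gain(labels, [left, right])
        if gain > best[0]:
            best = (gain, t)
    return best
```

This version rescans the data for every candidate threshold, which keeps the logic easy to follow but is slow on large columns; sorting once and updating class counts incrementally is a common refinement.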

b) [75 marks] Implement your pseudocode using Python. Your implementation should include the following functions (method signatures might be different based on the specific needs of your implementation):

● grow(dataset) -> tree: grows a deep tree on the given dataset and returns the tree object. [20 marks]

● prune(dataset, tree) -> tree: accepts a tree object and prunes it using the validation dataset. Returns the pruned tree. [15 marks]

● test(dataset, tree) -> accuracy: returns the accuracy of the given tree on the given dataset. [10 marks]

You should use the above methods to complete the following tasks:

1. Train and evaluate on the dataset adult.data.csv using 5-fold cross-validation. Each time you grow a tree, you need to prune it before evaluation. You can leave 10% of the training data as validation data. Report the average accuracy. [15 marks]

2. Train one final tree on the dataset adult.data.csv, use it to predict the samples in adult.test.csv, and save the outputs in a CSV file. [10 marks]

3. You need to properly handle missing values. Explain briefly how you did that. [5 marks]
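
The evaluation loop in task 1 and one simple option for task 3 could be sketched as follows. Everything here is illustrative: the function names are hypothetical, the "?" missing-value marker is an assumption that should be checked against data.summary.txt, and replacing missing entries with the most frequent value is only one of several reasonable strategies.

```python
import random
from collections import Counter


def k_fold_splits(n_rows, k=5, val_fraction=0.1, seed=0):
    """Yield (train_idx, val_idx, test_idx) index lists for each of k folds.

    Rows are shuffled once. Each fold serves as the test set in turn; of
    the remaining rows, the first `val_fraction` are held out for pruning.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k near-equal folds
    for i in range(k):
        test_idx = folds[i]
        rest = [j for f, fold in enumerate(folds) if f != i for j in fold]
        n_val = int(len(rest) * val_fraction)
        yield rest[n_val:], rest[:n_val], test_idx


def cross_validate(dataset, grow, prune, test, k=5):
    """Average accuracy over k folds, pruning each tree before evaluation."""
    accuracies = []
    for train_idx, val_idx, test_idx in k_fold_splits(len(dataset), k):
        tree = grow([dataset[j] for j in train_idx])
        tree = prune([dataset[j] for j in val_idx], tree)
        accuracies.append(test([dataset[j] for j in test_idx], tree))
    return sum(accuracies) / len(accuracies)


def impute_mode(column, missing="?"):
    """Replace `missing` entries with the most frequent observed value."""
    observed = [v for v in column if v != missing]
    if not observed:
        return list(column)
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v == missing else v for v in column]
```

If imputation is used, computing the replacement value from the training rows only and then applying it to the validation and test rows avoids leaking information from the held-out data.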

[IMPORTANT] Submit a file [student-id].zip which includes the following:

● A file report.pdf with the pseudocode and your answers for tasks 1 and 3.

● One and only one .py file which includes all your implemented functions and classes.

● A file predictions.csv with the predictions for the test data.

● A requirements.txt file, including all the required packages to run your code.

● data/ directory with the datasets.

Running the python command on your .py file should reproduce all your reported results. You need to use relative paths to access your data in your code so that it runs on all machines.
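
One way to keep data access independent of the machine is to build paths relative to the script's own location. The sketch below shows this together with a small helper for writing predictions.csv; the helper names and the single-column layout with a "prediction" header are assumptions, since the assignment does not fix the output format.

```python
import csv
from pathlib import Path


def data_path(filename, script_file):
    """Path to `filename` inside the data/ directory next to the script."""
    return Path(script_file).parent / "data" / filename


def save_predictions(predictions, out_file):
    """Write one predicted label per row to a single-column CSV file."""
    with open(out_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["prediction"])     # header name is an assumption
        for label in predictions:
            writer.writerow([label])
```

Inside the submitted script this would be called as data_path("adult.data.csv", __file__), so the data/ directory is found regardless of the directory the python command is run from.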

Deadline: The deadline is 23:59 on Feb 11th. We accept submissions up to 24 hours late but deduct 10% of the marks. Submissions after that will receive no marks.

Libraries: You can use libraries including math, numpy, scipy, random, etc. You MUST provide YOUR OWN implementation for the information gain calculation, decision tree growing, pruning, and the methods to perform 5-fold cross-validation and predictions. These MUST be implemented from scratch, i.e. without using scikit-learn libraries. You will be marked on the correctness of your implementation.
