首页 >
> 详细

Data Mining (CSC 503/SENG 474)

Assignment 1 Due on Friday, June 26th, 11:55pm

Instructions:

• You must complete this assignment on your own; this includes any coding/implementing,

running of experiments, generating plots, analyzing results, writing up results, and working

out problems. Assignments are for developing skills to make you strong. If you do the

assignments well, you’ll almost surely do better on the midterm and final.

• On the other hand, you can certainly have high-level discussions with classmates about course

material. You also are welcome to come to office hours (or otherwise somehow ask me and

the TAs things if you are stuck).

• You must type up your analysis and solutions; I strongly encourage you to use LaTeX to do

this. LaTeX is great both for presenting figures and math.

• Please submit your solutions via conneX by the due date/time indicated above. This is a

hard deadline. Any late assignment will result in a zero (I won’t accept assignments even

one day late; “I forgot to hit submit on conneX”, etc. will not be met with any sympathy).

However, I will make an exception if I am notified prior to the deadline with an acceptable

excuse and if you further can (soon thereafter) provide a signed note related to this excuse.

1

Assignment 1

This assignment has two parts. The first part involves implementing some of the methods that

you have already seen and analyzing their results on some classification problems; this part is for

all students (both CSC 503 and SENG 474). The second part is actually only for graduate students

(only CSC 503). The first part might seem like a lot of work. However, if you do it well, you’ll

become strong, and also gain valuable skills for your project. This is a good thing.

1 Experiments and Analysis

First, “implement” the following methods:

• Decision trees (with pruning). For simplicity, I suggest going with reduced error pruning

(this is the form of pruning we covered in class). For the split criterion, you can use information

gain or the Gini index or some other good criterion. You might even try a few different

ones as part of your analysis (explained further below).

• Random forests (no pruning). What forest size should you use? Well, you should experiment

with different forest sizes and see what happens. For the size of the random sample

(sampled with replacement) used to learn each tree, I suggest setting the number of samples

with replacement to be equal to the original sample size. For the number of random features

when selecting the split feature at each decision node, I suggest starting at √

d (for d features)

and experimenting by going up and down from there.

• Neural networks. Any number of layers is fine, as long at there is at least one hidden layer;

I suggest going with 3 layers (i.e. the input layer, 1 hidden layer, and the output layer).

I put “implement” in quotes because I won’t actually require you to implement these methods;

you can instead use machine learning software that you find online. If you do implement anything,

you can use whatever programming language you like. For neural networks in particular, I shamefully

recommend using an existing implementation here, but eventually (after this assignment) for

your own edification it would be great if you implement a neural network yourself.

What to test on

You’ll be analyzing the performance of these methods on two classification problems.

For the first problem, you’ll be using the heart disease dataset from the UCI repository:

https://archive.ics.uci.edu/ml/datasets/Heart+Disease

In order to make your life a bit easier, I have provided a slightly cleaned and modified dataset,

cleaned_processed.cleveland.data. In case you’re interested, you can find details on how I

prepared this dataset in Appendix A. The last attribute/feature (the label) takes values 0 and 1,

where 0 indicates no heart disease and 1 indicates some level of heart disease. Keep in mind that

the dataset has not been split into a training and test set. You should do this. For any training/test

split, a good rule of thumb is 80% for the training set and 20% for the test set.

The second problem will actually be designed by you. It will again be a classification problem,

meaning that it consists of a set of training examples and a set of test examples for a classification

task. For simplicity, I suggest going with a binary classification task. You can either generate

the data yourself or find some data online. The important thing is that the problem should be

interesting: that is to say, the learning task should not be so easy that every method can obtain

zero test error, but it should be easy enough where some method is able to obtain a reasonably

small test error. What does “reasonably small test error” mean? Well, for binary classification,

this would be a test error well below 50%. Why? Because a classifier that just randomly predicts

with equal probability will always get close to 50% test error.

2

How to do the analysis

The core of this assignment, meaning what is worth the most marks, is the experiment analysis.

Your analysis will go into a file called report.pdf. This file should:

• contain a description of the second classification problem (the one you came up with or

selected), including an explanation of why the problem is interesting; specifically, explain

why it is non-trivial and why it enables good comparison of the methods.

• present the performance of the methods in terms of the training and test error on each of

the problems. In particular, you should present graphs that show how each of training error

and test error vary with training set size. But beyond that, you should also play with the

parameters of the methods to see the effect. For instance, what happens when you change the

learning rate (or number of nodes in the hidden layer, or the number of iterations of training)

for the neural network? What happens if you change the number of random features used

for the random forest? What happens if you change the pruning rule (or use a different split

criterion) for the decision tree? As much as possible, try to present your results with plots.

• contain a detailed analysis of your results, including various design choices you made (like

choice of split criterion for decision trees, choice of nonlinearity for neural networks, etc.).

Try and explain the results that you got for the various experiments you ran, and use these

results to compare the methods. Think about the best parameter settings for the methods

(and maybe think about how the best parameter setting might change as the training sample

size increases). Ideally, use your analysis to come up with ideas on how you might improve

the methods.

Please make your analysis concise (yet thorough). Don’t ramble, and don’t make stuff up. Act

as if you are a respected scientist presenting some work to your colleagues (and you want them to

continue being your colleagues after the presentation).

What to submit

In all, for the analysis portion of the assignment, you should submit (as a zip file):

• the file report.pdf explained above;

• a file called README.txt which contains instructions for running your code (for any code you

wrote) or gives proper attribution to any code/datasets you got from other sources (like the

Internet, for example). If you mostly used some existing code/software but needed to modify

a few files, then give attribution and mention the files you modified.

• a file called code.zip, containing your code. Please organize this file well (embrace the idea

of directories). In case you mostly used some existing code/software but needed to modify a

few files, then just provide those files here, and make sure your README.txt mentions those

files in relation to the existing code/software.

• any additional files that you need; in particular, this should include the training and test sets

for the second classification problem in case you came up with your own problem.

3

2 Problem-solving part - CSC 503 only

In this problem, we consider the gradient descent algorithm for a two-dimensional regression problem.

So, we have 2 continuous features x1 and x2, and the task is to predict a continuous target y.

Suppose that for a given input xi =

x1,i

x2,i!

, we predict using hypotheses of the following form:

fw(xi) = w1 x1,i + w2 x2,i + w3 x1,i x2,i + b.

Assume that we have n training examples (x1, y1), . . . ,(xn, yn). Suppose that we measure the

error according to the squared error with a further penalty on the squared Euclidean norm of w.

Then, for a fixed, positive number λ, the training error can be written as

E(w) = 1

In the below, assume the learning rate (also called the step size) is some positive number η.

(a) Derive the gradient descent update for b.

(b) Derive the gradient descent update for w1.

(c) Derive the gradient descent update for w3.

(d) Show the full gradient descent update for the vector w.

3 How you’ll be marked

• For each of CSC 503 and SENG 474, the total number of marks is 100, and you receive 10

marks for your code (in case you write code) together with the classification problem that

you design.

• For undergrads (SENG 474), the analysis (in report.pdf) is worth 90 marks.

• For grad students (CSC 503), the analysis (in report.pdf) is worth 80 marks and the

Problem-solving part is worth 10 marks.

A Cleaned version of processed.cleveland.data

In order to obtain the provided cleaned version of the data, I started from the original dataset

processed.cleveland.data, which can be found in the Data Folder from the link above. You

should not use this dataset!

In the original processed.cleveland.data dataset, some of the 6 examples have one feature

with the value ‘?’. This is a missing value. There are various ways to deal with missing values,

but an easy way (when not many values are missing) is to simply discard those examples; this is

what I did. Also, the last attribute/feature (the label) takes values {0, 1, 2, 3, 4}. I formed a binary

classification task by:

• keeping the label 0 as is;

• grouping together the labels in the set {1, 2, 3, 4} by changing their label to 1. Note that all

of these non-zero labels indicate some level of heart disease.

联系我们

- QQ：99515681
- 邮箱：99515681@qq.com
- 工作时间：8:00-23:00
- 微信：codinghelp2

- Tsp课程作业代写、代做algorithms留学生作业、代做java，C/C 2020-06-23
- Kit107留学生作业代做、C++编程语言作业调试、Data课程作业代写、代 2020-06-23
- Sta302h1f作业代做、代写r课程设计作业、代写r编程语言作业、代做da 2020-06-22
- 代写seng 474作业、代做data Mining作业、Python，Ja 2020-06-22
- Cmpsci 187 Binary Search Trees 2020-06-21
- Comp226 Assignment 2: Strategy 2020-06-21
- Math 504 Homework 12 2020-06-21
- Math4007 Assessed Coursework 2 2020-06-21
- Optimization In Machine Learning Assig... 2020-06-21
- Homework 1 – Math 104B 2020-06-20
- Comp1000 Unix And C Programming 2020-06-20
- General Specifications Use Python In T... 2020-06-20
- Comp-206 Mini Assignment 6 2020-06-20
- Aps 105 Lab 9: Search And Link 2020-06-20
- Aps 105 Lab 9: Search And Link 2020-06-20
- Mech 203 – End-Of-Semester Project 2020-06-20
- Ms980 Business Analytics 2020-06-20
- Cs952 Database And Web Systems Develop... 2020-06-20
- Homework 4 Using Data From The China H... 2020-06-20
- Assignment 1 Build A Shopping Cart 2020-06-20