INFS4203/7203 Assignment

Semester 2/2019

Marks: 100 marks (20%)

Due Date: 11th October 2019, 23:59

What to Submit: See deliverables part

Where to Submit: Electronic submission via Blackboard

The goal of this project is to gain practical experience in applying clustering and

classification to real data. You must work on this project individually. The

standard academic honesty rules apply. You must use R for this project.

There are three main tasks: Data Preparation, Clustering, and Classification.

Please read this document carefully until the end. Since some parts of your code

require a random seed, you need to pre-set the seed for reproducible results. Your

seed value is the first 2 digits of your student ID. (For example: If your student ID

is 12345678, then the seed value will be 12).
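
A minimal sketch of the seed rule above, assuming the student ID is held as a string. The variable names `student_id` and `seed` are hypothetical, and the ID is the example from the text.

```r
# Derive the seed from the first two digits of the student ID
# (example ID from the text; variable names are illustrative only).
student_id <- "12345678"
seed <- as.integer(substr(student_id, 1, 2))

# Re-setting the seed before each random call makes the draw repeatable.
set.seed(seed)
first_draw <- sample(1:100, 5)
set.seed(seed)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)  # TRUE: same seed, same sample
```

Calling `set.seed()` immediately before each random operation, rather than once at the top of a script, keeps results stable even if earlier code is reordered.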

Dataset:

You will be using the “ILPD (Indian Liver Patient Dataset) Data Set” data [1]. The

data has 583 observations (rows) and 11 attributes (columns). The attribute in the

last column is a binary class variable with value 1 for “patient” and 2 for

“non_patient”. You may find and learn more detailed information about the data in

the data description page [2].

[1] Data Folder: https://archive.ics.uci.edu/ml/machine-learning-databases/00225/

[2] Data Description: https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset)

Task 1 - Data Preparation:

Write the necessary code to pre-process the data. That pre-processing stage

includes the following tasks:

1.1. Extract the data into an R data frame.

1.2. Assign the following names to the 11 columns in your dataset:

Column Number   Column Name   Meaning
1               Age           Age of the patient
2               Gender        Gender of the patient
3               TB            Total Bilirubin
4               DB            Direct Bilirubin
5               Alkphos       Alkaline Phosphatase
6               Sgpt          Alanine Aminotransferase
7               Sgot          Aspartate Aminotransferase
8               TP            Total Proteins
9               Albumin       Albumin
10              AG_Ratio      Albumin and Globulin Ratio
11              Class         Indicator for patient and non_patient

1.3. There are some missing values in the column AG_Ratio; fill them with the

median of the column.

1.4. Replace all “2” in the “Class” column with “0” to indicate “non_patient”.

1.5. Notice that R might define the Class column as integer. In that case, change

its type from integer to factor.

1.6. Save the dataframe into a file with filename ilpd_processed.Rda.
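
The steps in Task 1 can be sketched as follows. This is an illustration under stated assumptions, not a reference solution: the four input rows are made-up values written to a temporary file as a stand-in for the real headerless CSV, and `raw_path`, `out_path`, and `ilpd` are hypothetical names. Only base R functions are used.

```r
# Toy stand-in for the raw file: headerless CSV, 11 columns,
# made-up values, one missing AG_Ratio (third row).
raw_lines <- c(
  "60,Female,0.8,0.2,180,20,25,6.5,3.0,0.9,1",
  "52,Male,9.5,4.8,650,70,95,7.1,3.1,0.74,1",
  "41,Female,1.0,0.3,190,24,30,6.9,3.8,,1",
  "19,Male,0.9,0.3,205,21,18,7.3,4.0,1.2,2"
)
raw_path <- tempfile(fileext = ".csv")
writeLines(raw_lines, raw_path)

# 1.1 Read into a data frame (the file has no header row).
ilpd <- read.csv(raw_path, header = FALSE)

# 1.2 Assign the column names from the table.
names(ilpd) <- c("Age", "Gender", "TB", "DB", "Alkphos", "Sgpt",
                 "Sgot", "TP", "Albumin", "AG_Ratio", "Class")

# 1.3 Fill missing AG_Ratio values with the column median.
ag_median <- median(ilpd$AG_Ratio, na.rm = TRUE)
ilpd$AG_Ratio[is.na(ilpd$AG_Ratio)] <- ag_median

# 1.4 Recode class 2 (non_patient) as 0.
ilpd$Class[ilpd$Class == 2] <- 0

# 1.5 Store Class as a factor rather than a number.
ilpd$Class <- factor(ilpd$Class)

# 1.6 Save the processed data frame (temporary directory used here).
out_path <- file.path(tempdir(), "ilpd_processed.Rda")
save(ilpd, file = out_path)
```

The median is computed with `na.rm = TRUE` before imputation; without it, `median()` returns `NA` whenever the column has a missing value.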

Task 2 - Clustering:

Apply K-Means and Hierarchical clustering to cluster the data as follows.

2.1. Load the preprocessed data file from Task 1 into a data frame. Please note

that Task 2 excludes the Age, Gender, AG_Ratio, and Class attributes,

since they are in different units from the others.

2.2. Rescale the values of every column to the range [0, 1]. Reference link [3].

2.3. Cluster the data into 2 clusters (i.e. k = 2) with K-Means clustering, using

the default parameters for the function. Plot the results of the

clusters as a 2D plot where the x-axis is Alkphos and the y-axis is TP.

2.4. Plot another 2D plot with the same dimensions above, but color the points

according to the Class column.

2.5. Compare the 2 plots obtained in the previous two tasks (tasks 2.3. and 2.4.)

– do the clusters visually represent the “patient” vs “non_patient” classes?

2.6. Cluster the data into more than 2 clusters (i.e., k = 3, 4, 5) using K-Means

clustering and plot all the clustering results.

2.7. Compare the plots and Sum of Squared Error (SSE) obtained in the

previous task and provide your comments on the quality of clustering.

2.8. Apply hierarchical clustering to the data using the function with

default parameters and plot the corresponding dendrogram. In particular,

cut the dendrogram into 2, 3, 4, and 5 clusters and plot all of them.

2.9. Compare the plots obtained in the tasks 2.3., 2.4., 2.6. and 2.8. and provide

your observations on the achieved clusters - do they suggest a new subtype

of the disease?

2.10. Try different agglomeration methods in hierarchical clustering (i.e.,

). Plot the resulting dendrograms

and provide your comments on the quality of clustering - is the data

sensitive to the used agglomeration method? Based on your results, what

do you think is the default agglomeration method used in Task 2.8.?
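
The clustering steps in Task 2 can be sketched as below. The brief does not name the functions, so this sketch assumes base R's `kmeans()` and `hclust()` from the `stats` package. The data are random toy values, not the ILPD data, and the seed is the example seed from the text.

```r
set.seed(12)  # example seed from the text (first two digits of 12345678)

# Toy data: two made-up numeric columns already in the range [0, 1].
toy <- data.frame(Alkphos = runif(60), TP = runif(60))

# K-Means for k = 2..5; tot.withinss is the total within-cluster SSE.
ks <- 2:5
sse <- sapply(ks, function(k) {
  set.seed(12)  # reset so each run is repeatable
  kmeans(toy, centers = k)$tot.withinss
})
names(sse) <- paste0("k=", ks)
print(sse)

# Hierarchical clustering on Euclidean distances with default settings,
# then cut the tree into 2, 3, 4, and 5 groups (one column per k).
hc <- hclust(dist(toy))
groups <- cutree(hc, k = ks)
table(groups[, "2"])  # cluster sizes for the 2-cluster cut
```

`kmeans()` starts from random centres, which is why the seed is reset before every call. A different agglomeration method can be tried through the `method` argument of `hclust()`.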

[3] https://stackoverflow.com/a/15468888/6350054
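
For the rescaling step in Task 2.2, a minimal min-max sketch follows. The helper name `range01` and the toy values are illustrative assumptions, and the function assumes each column has at least two distinct values (a constant column would divide by zero).

```r
# Min-max rescaling: maps the smallest value to 0 and the largest to 1.
range01 <- function(x) {
  lo <- min(x, na.rm = TRUE)
  hi <- max(x, na.rm = TRUE)
  (x - lo) / (hi - lo)
}

# Toy columns with made-up values on very different scales.
toy <- data.frame(Alkphos = c(150, 200, 400, 650),
                  TP      = c(5.5, 6.0, 7.2, 8.0))

# Apply the function column by column and keep the data frame shape.
scaled <- as.data.frame(lapply(toy, range01))
sapply(scaled, range)  # each column now spans 0 to 1
```

Rescaling before distance-based clustering stops a column with large raw values, such as Alkphos, from dominating the Euclidean distance.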

Task 3 - Classification:

Apply binary classification using decision tree and K-NN techniques.

3.1. Load the preprocessed data file from Task 1 into a data frame. Divide the

dataset into “training” and “test” subsets randomly (70% and 30%

respectively). We use all attributes in Task 3.

3.2. Learn a classification tree from the training data using the default

parameters of the function from the library. Plot that

classification tree and provide your comments on its structure (e.g., what

are the important/unimportant variables? Is there any knowledge we can

infer from the tree representation that helps in differentiating between the

classes?). Using the learned tree, predict the class labels of the test data.

Calculate the accuracy, precision, and recall.

3.3. Try building your classification tree again via the function but using

parameters that are different from the default settings. Can you achieve

better accuracy or more meaningful representation by tuning some

parameters? (Note that in the function from library, you can

modify parameters. Execute from RStudio Console for

the detailed documentation.)

3.4. Apply K-NN classification to predict the labels in the test subset and

calculate the accuracy, precision and recall. In particular, try different values

of K (e.g. K = 1, 2, 3, 4, 5), and report your observations on the achieved

classification.
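
The split and evaluation steps in Task 3 can be sketched as below. Several things here are assumptions rather than requirements of the brief: the sketch uses `knn()` from the `class` package, it substitutes two species of the built-in `iris` data for the ILPD data, it treats "1" as the positive class, and `metrics` is a hypothetical helper. The decision tree is not shown.

```r
library(class)  # assumption: knn() from the 'class' package

set.seed(12)    # example seed from the text

# Stand-in binary data: two iris species, relabelled 1 / 0.
keep <- iris$Species != "setosa"
dat <- iris[keep, 1:4]
dat$Class <- factor(ifelse(iris$Species[keep] == "virginica", 1, 0))

# 70% / 30% random split.
n <- nrow(dat)
train_idx <- sample(n, size = round(0.7 * n))
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Accuracy, precision and recall with "1" as the positive class.
metrics <- function(actual, predicted, positive = "1") {
  tp <- sum(predicted == positive & actual == positive)
  fp <- sum(predicted == positive & actual != positive)
  fn <- sum(predicted != positive & actual == positive)
  c(accuracy  = mean(predicted == actual),
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn))
}

# K-NN for K = 1..5; only the predictor columns are passed to knn().
results <- sapply(1:5, function(k) {
  pred <- knn(train[, 1:4], test[, 1:4], cl = train$Class, k = k)
  metrics(test$Class, pred)
})
colnames(results) <- paste0("K=", 1:5)
round(results, 3)
```

`knn()` works on numeric predictors, so a categorical column such as Gender would need a numeric encoding before it can be used. The same `metrics` helper can be applied to predictions from any other classifier.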

Deliverables: R project with your student number as the project’s name (e.g.

12345678.Rproj). The project should have the following folders and files inside the

folders:

1) Folder - Code:

● : Code to complete Task 1.

● : Code to complete Task 2.

● : Code to complete Task 3.

Provide the appropriate header in each file (your identity and file description) and

give meaningful comments in the script.

2) Folder - Data:

● : original dataset.

● : preprocessed data output from Task 1.

3) Folder - Plot:

● All plots in jpg format generated in Tasks 2.3., 2.4., 2.6., 2.8., 2.10.,

3.2. and 3.3.

4) Report: your report should include the following:

● Brief description for the main functions in your source code and any

assumptions or special settings of those functions.

● Plots, evaluations, and your comments on the observed results.

Please note that your preprocessed data, plots, and results should be

reproducible. That is, we should be able to delete them and regenerate them by running

your code. Hence, remember to set the seed before any function that requires a

random value. That seed is the first 2 digits of your student number.

Marking: Your total mark earned for this assignment is based on:

● Report: accurate statistics and clear presentation;

● Code and reproducible results.

● Demo: one-on-one demo presentation, if needed.

Submit one archive (zip) file with your student number as the file name (e.g.

12345678.zip) with all the files and folders mentioned above. The project is due

11th October 2019, 23:59. Submission is through Blackboard, and no late submissions are allowed.
