Coursework Assignment 2

The ‘Diabetes’ data set (provided in .arff format and available on Blackboard) contains information about patients and whether they are affected by diabetes. The task is to predict whether or not each patient has diabetes (Diabetes: Yes or No).

Each instance represents an individual patient, described by various medical attributes together with a diabetes classification.

Number of Instances: 768

Number of Attributes: 9

1 Pregnancies: Number of pregnancies
2 PG Concentration: Plasma glucose at 2 hours in an oral glucose tolerance test
3 Diastolic BP: Diastolic Blood Pressure (mm Hg)
4 Tri Fold Thick: Triceps Skin Fold Thickness (mm)
5 Serum Ins: 2-Hour Serum Insulin (mu U/ml)
6 BMI: Body Mass Index (weight in kg / (height in m)^2)
7 DP Function: Diabetes Pedigree Function
8 Age: Age (years)
9 Diabetes: Whether or not the person has diabetes

You should use the Weka data mining package, which is installed on the university computers and is also available for download from: http://www.cs.waikato.ac.nz/~ml/weka/

You should hand in a report covering the following:

a) Select a suitable tree-building algorithm and build a model. Describe the validation method you are using (how the data is split into training and test sets). Interpret the output (the accuracy rates and other metrics, which attributes were used to make predictions, and how many nodes and leaves the tree has).
b) Give a detailed technical description of the classification model (which algorithm is used, the tree induction method, which attribute selection criterion is used and how it is applied). Include a diagram showing the structure of the model that you built.
c) Vary the following parameters of the algorithm and report the changes in the tree structure and accuracy rates:
- Set the ‘REP’ (Reduced Error Pruning) parameter to ‘TRUE’. Explain the meaning of this operation. Report and discuss any change in the model structure and accuracy.
- Change the confidence factor to 15%. Report and discuss any impact.
- Set the ‘unpruned’ parameter to ‘TRUE’. Report and discuss the impact. Discuss the pruning method used by this algorithm.
d) Use two other models of your choice (for example, neural networks or SVM) to predict the diabetes class. Compare the results and discuss possible reasons for better or worse performance.
e) Show a confusion matrix for the model and interpret it. Show a ROC curve and a Lift chart and interpret them.
f) Convert the path of the decision tree that passes through the following attributes into a set of rules: Plasma – Mass – Age – Plasma – Pedigree – Class ‘Yes’.
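
As an aid to interpreting the confusion matrix asked for in e), the sketch below shows how the usual summary metrics follow from its four cells. It is not part of the assignment and does not use Weka. The class name `ConfusionMatrix` and the counts are made up for illustration, and ‘Yes’ (has diabetes) is assumed to be the positive class.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    """Cell counts for a binary classifier, with 'Yes' assumed to be the positive class."""
    tp: int  # actual Yes, predicted Yes
    fn: int  # actual Yes, predicted No
    fp: int  # actual No, predicted Yes
    tn: int  # actual No, predicted No

    def accuracy(self) -> float:
        # Fraction of all instances that were classified correctly.
        return (self.tp + self.tn) / (self.tp + self.fn + self.fp + self.tn)

    def recall(self) -> float:
        # True positive rate (sensitivity): the y-axis of a ROC curve.
        return self.tp / (self.tp + self.fn)

    def false_positive_rate(self) -> float:
        # The x-axis of a ROC curve.
        return self.fp / (self.fp + self.tn)

    def precision(self) -> float:
        # Fraction of predicted 'Yes' instances that are actually 'Yes'.
        return self.tp / (self.tp + self.fp)


# Hypothetical counts for illustration only; NOT results from the Diabetes data set.
cm = ConfusionMatrix(tp=30, fn=20, fp=10, tn=40)
print(f"accuracy  = {cm.accuracy():.2f}")
print(f"recall    = {cm.recall():.2f}")
print(f"FPR       = {cm.false_positive_rate():.2f}")
print(f"precision = {cm.precision():.2f}")
```

Each point on a ROC curve is one (false positive rate, recall) pair obtained at a particular decision threshold, so a single confusion matrix corresponds to a single point on the curve. The sketch does not guard against empty rows or columns, which would cause a division by zero.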