ADM3308: Business Data Mining
Data Mining Project Using IBM SPSS Modeler 
(Team work)
_____________________________________________________________________________________
Weight: 25% of the final mark.        This is a team project (only one submission per team).
_____________________________________________________________________________________
Important Note: Read the following academic integrity statement, type in your full name and student ID, and include a copy in your submission. Electronic submission of this form by the team representative is considered a signature by BOTH members of the team.
Personal Ethics & Academic Integrity Statement
By typing in my name and student ID on this form and submitting it electronically, I am attesting to the fact that I have reviewed not only my own work, but the work of my team member, in its entirety.
I attest to the fact that my own work in this project adheres to the fraud policies as outlined in the Academic Regulations in the University’s Undergraduate Studies Calendar. I further attest that I have knowledge of and have respected the “Beware of Plagiarism” brochure found on the Telfer School of Management’s doc-depot site. To the best of my knowledge, I also believe that each of my group colleagues has met the aforementioned requirements and regulations. I understand that if my group assignment is submitted without a completed copy of this Personal Work Statement from each group member, the school will interpret any missing student name as confirmation that the student did not participate in the required work.
We, by typing in our names and student IDs on this form and submitting it electronically:
warrant that the work submitted herein is the work of our own group members and not the work of others;
acknowledge that we have read and understood the University Regulations on Academic Misconduct; and
acknowledge that it is a breach of University Regulations to give or receive unauthorized and/or unacknowledged assistance on a graded piece of work.
IBM SPSS Modeler is a commercial data mining package from IBM that supports data mining tasks, including predictive and descriptive modeling, through a user-friendly interface. It is available on the computers in the lab. Tutorials on using IBM SPSS Modeler for data mining will be presented in class. Students are also required to consult online resources to learn more about IBM SPSS Modeler.
For this project, you are required to complete two parts: 
Part-1 (100 points): Data mining modeling project using a dataset selected from Table-1.
Part-2 (30 points): Clean and pre-process the raw dataset provided to you (Unclean-Bank-Data.Xlsx) using IBM SPSS Modeler nodes.
PART-1
(A) Dataset Selection:
Each team must select one of the datasets listed in Table-1 (or one from another recommended repository, with the professor's pre-approval) and announce it on the “Discussion Board” in the Forum named “Announcing Dataset Selection”. Post your name, your team member’s name, and the dataset selected. Once a team has claimed a dataset on the Forum, no other team may select it. I therefore recommend that you select your dataset and announce it on the Discussion Board as early as possible.
NOTE: You may choose a dataset that is not listed in Table-1 only with the professor's prior approval. If you would like to analyze such a dataset, please email me its details for review (e.g. the source of the data, the number of records, the number of attributes).
(B) Data Analysis and Model Building: 
You are required to import the data and perform pre-processing tasks if needed (such as reformatting the data, normalizing it, and dealing with missing values and outliers), followed by two or more modeling tasks such as classification (decision tree, Bayesian, KNN, neural networks, etc.), clustering (K-means, agglomerative), and association rule mining.
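For illustration only, the sketch below shows in plain Python what one of the pre-processing steps named above, min-max normalization, computes. It is not part of the deliverables, since the project itself must be carried out in IBM SPSS Modeler. The function name and the sample values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical attribute values, for illustration only.
ages = [20, 30, 40, 60]
print(min_max_normalize(ages))  # [0.0, 0.25, 0.5, 1.0]
```

Rescaling of this kind matters most for distance-based techniques such as KNN and K-means, where an attribute with a large numeric range would otherwise dominate the distance calculation.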
(C) Project Report for Part-1:
Your report for this part of the project should include:
Explaining the data you selected for your project (attributes, instances, etc.)
Explaining your pre-processing tasks if any (cleaning, transforming, normalizing, etc.)
Explaining the data mining modeling techniques you performed on the data (at least two techniques)
Presenting the graphs/tables of the results produced by the techniques
Interpretation of the modeling results: useful patterns, predicted values, significance of the features, what actions you might suggest based on your findings
Concluding remarks, your recommendations, actionable discoveries, and future trends/studies you would recommend 
Overall, your report for this part of the project should be 15 to 25 pages long (including graphs). Use 12pt Times New Roman font with 1.5 line spacing. Leave a 1-inch margin on all sides of the page.
Rubrics for Part-1
Your report for Part-1 of the project will be evaluated as follows: 
Components of the Report (Part-1)	Points
Abstract or executive summary	10
Explanation of the dataset and of any pre-processing tasks used to prepare the data	10
Explanation of at least two data mining tasks you performed on the data, and of why you chose those tasks for your dataset	20
Relevant graphs showing the output results of the techniques you applied	20
Interpretation of the modeling results: useful patterns, predicted values, significance of the features, what actions you might suggest based on your findings	10
A conclusion section summarizing your findings, discussing the results and your understanding of them, and stating your recommendations and any useful patterns, rules, predictions, or future trends you infer from the data	10
Overall organization of the paper, its soundness and readability, and quality of the presentation	20
Total (Part-1)	100
(D) List of Datasets:
Select one of the following datasets, then post a message on the Discussion Board on Brightspace to claim your dataset. 
Table-1: List of datasets for Part-1 of the project
Note: These datasets are available at the UCI Machine Learning Repository. For more information, visit http://archive.ics.uci.edu/ml/datasets.html
#	Name	Number of features	Number of Samples	Comments
1	Waveform Database Generator (version 2)	40	5000	Use the dataset without Noise
2	Statlog (Landsat Satellite)	36	6435	Training and Testing datasets are different
3	seismic-bumps	19	2584	
4	Image Segmentation	19	2310	Use only the testing dataset
5	Bank Marketing	17	45211	
6	Pen-Based Recognition of Handwritten Digits	16	10992	Training and Testing datasets are different
7	Student Performance	33	649	
8	Adult	14	48842	Training and Testing datasets are different
9	Statlog (Shuttle)	9	58000	
10	Abalone	8	4177	
11	Nursery	8	12960	
12	Yeast	8	1484	
13	One-hundred plant species leaves data set	64	1600	Use just-data_Mar_64.txt
14	Spambase	57	4601	
15	Cardiotocography	23	2126	
16	Statlog (German Credit Data)	20	1000	
17	Letter Recognition	16	20000	
18	EEG Eye State	15	14980	
19	Page Blocks Classification	10	5473	
20	Contraceptive Method Choice	9	1473	
21	Weight lifting exercises monitored	10	39242	Use the following features: roll_belt, pitch_belt, yaw_belt, gyros_belt_x, gyros_belt_y, gyros_belt_z, accel_belt_y, accel_belt_z, magnet_belt_x, magnet_belt_y (with class as the output)
22	Connect-4	42	67557	
23	Mushroom	22	8124	
24	Default of credit card clients	24	30000	
25	Autism Screening Adult	21	704	
26	Drug consumption (quantified)	32	1885	
27	Polish companies bankruptcy data	64	10503	
PART-2
In this part of the project, all teams will use the dataset Unclean-Bank-Data.Xlsx posted on the “Project Description” page of the course website.
This dataset includes missing values, invalid values, and outliers. Use IBM SPSS Modeler nodes to pre-process and clean the data.
Do not remove a record if it has only one missing value. Instead, use IBM SPSS Modeler to fill in the missing value using a method of your choice.
Similarly, do not remove a record if it has only one invalid value. Instead, use IBM SPSS Modeler to replace the invalid value using a method of your choice.
If a record has more than one missing value, or more than one invalid value, you may either remove the record or use IBM SPSS Modeler to fill in or replace those values.
If you detect an outlier, you may delete the entire record that contains it.
You may also want to perform other pre-processing tasks such as data normalization, binning, etc.
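The rules above can be sketched in plain Python to make the logic concrete. This is an illustration only, not the required method: the cleaning itself must be done with IBM SPSS Modeler nodes. The field names, validity checks, sample records, mean imputation, and the 1.5 × IQR outlier rule are all assumptions chosen for the example, and they do not describe the actual contents of Unclean-Bank-Data.Xlsx.

```python
from statistics import mean, quantiles

# Hypothetical validity checks; the real rules depend on the bank dataset.
IS_VALID = {
    "age": lambda v: 0 <= v <= 120,
    "income": lambda v: v >= 0,
}


def triage(records):
    """Drop records with more than one missing or more than one invalid
    value; blank out a single invalid value so it can be imputed later."""
    kept = []
    for rec in records:
        missing = [f for f, v in rec.items() if v is None]
        invalid = [f for f, v in rec.items()
                   if v is not None and not IS_VALID[f](v)]
        if len(missing) > 1 or len(invalid) > 1:
            continue  # removal is permitted in this case
        kept.append({f: (None if f in invalid else v) for f, v in rec.items()})
    return kept


def impute_mean(records, field):
    """Fill gaps in one field with the mean of its observed values.
    Assumes the field has at least one observed value."""
    fill = mean(r[field] for r in records if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]}
            for r in records]


def drop_outliers(records, field):
    """Remove records whose value lies outside the 1.5 * IQR fences."""
    q1, _, q3 = quantiles([r[field] for r in records], n=4, method="inclusive")
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [r for r in records if lo <= r[field] <= hi]


# Hypothetical records; None marks a missing value.
raw = [
    {"age": 30, "income": 50},
    {"age": None, "income": 60},    # one missing value: keep and impute
    {"age": -5, "income": 55},      # one invalid value: keep and replace
    {"age": None, "income": None},  # two missing values: may be removed
    {"age": 40, "income": 52},
    {"age": 50, "income": 1000},    # outlier on income: may be removed
]

cleaned = triage(raw)
for name in ("age", "income"):
    cleaned = impute_mean(cleaned, name)
cleaned = drop_outliers(cleaned, "income")
print(cleaned)
```

In this sketch the imputation runs before the outlier check, so the outlying record still contributes to the mean. Whether to reverse that order is a design choice you should justify in your report.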
Deliverables for Part-2:
1- Include in your project report a short explanation of three different cleaning and pre-processing tasks you applied to the data using IBM SPSS Modeler.
2- Also include the clean dataset (name it “Clean-Bank-Data.xlsx”) in your submission together with your project report (you may submit everything in one zip file).
Rubrics for Part-2
Your report for Part-2 of the project will be evaluated as follows: 
Components of the Report (Part-2)	Points
Explanation of three cleaning and pre-processing tasks applied to the data; explanation of the results after pre-processing; inclusion of Clean-Bank-Data.xlsx with your submission	3 × 10
Total (Part-2)	30