CMPT 459 Spring 2021 Data Mining

CMPT 459 Spring 2021 Data Mining [Total Marks:100]

The aim of this assignment is to implement DBSCAN, which is a density-based clustering algorithm, using Python programming language, and to test it on a household power consumption dataset. DBSCAN pseudocode is provided on page

268

Household Dataset This dataset contains 525600 measurements of electric power consumption in one house located in Sceaux (7km from Paris, France) in 2007. Attributes are as follows: ● date: Date in format dd/mm/yyyy ● time: time in format hh:mm:ss ● global_active_power: household global minute-averaged active power (in kilowatt) ● global_reactive_power: household global minute-averaged reactive power (in kilowatt) ● voltage: minute-averaged voltage (in volt) ● global_intensity: household global minute-averaged current intensity (in ampere) ● sub_metering_1: energy sub-metering No. 1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered). ● sub_metering_2: energy sub-metering No. 2 (in watt-hour of active energy). It corresponds to the laundry room, containing a washing-machine, a tumble-drier, a refrigerator and a light.

● sub_metering_3: energy sub-metering No. 3 (in watt-hour of active energy). It corresponds to an electric water-heater and an air-conditioner. Tasks a. Preprocessing [15 marks] The Household dataset contains missing values that you need to handle. Moreover, you should normalize the data. You can also apply any other preprocessing that you think is helpful. Please note that you should explain all the preprocessing you have done in your report file to receive full marks. b. Implementation of DBSCAN [45 marks] Implement the DBSCAN algorithm. If an object is density-reachable from two clusters, then it should be assigned to both clusters [10 marks]. The implementation takes the dataset as input (along with other necessary attributes) and returns a file with an additional attribute “cluster label” [10 marks]. Cluster labels should start from 0, and noise objects should be labelled as -1 [5 marks]. Your implementation has to include a method, named “fit”, which gets input data and returns a list of cluster labels. You will not get the marks for this part if your implementation does not have a method with exact same name and same functionality [20 marks]. c. Test your DBSCAN implementation on the household dataset. [40 marks] Use the heuristic approach taught in the class or some other approach to determine reasonable values for the parameters epsilon and MinPts. Explain how you chose the parameters and include your k-distance-diagram in your report file [25 marks]. Provide the statistics of your resulting clustering: how many clusters, how many objects per cluster. Discuss how good the resulting clustering is [15 marks]. Hints: ● Running DBSCAN algorithm on this dataset takes approximately 18 minutes on a Intel core i5 processor.

● You may want to use a smaller subset of this data for debugging purposes, and then run your implementation on the whole dataset only after you made sure it works correctly. Note that this is only for debugging and your report has to be based on the whole data. ● You may want to remove Date and Time columns. Or you may want to become more creative! It’s your choice. [IMPORTANT] You should submit two files: ● A report file: [studentID].pdf ● A Python file: [studentID].py Libraries: You can use libraries including math, numpy, scipy, random, etc. You MUST provide YOUR OWN code for the DBSCAN algorithm and for all the tasks specified in this assignment. These MUST be implemented from scratch i.e. not using scikit-learn or other libraries. You will be marked on the correctness of your implementation.