6COM2005 Practical Assignment: Data Mining

Semester AB 2024/2025

There are 45 marks to achieve, each translating to 1% of your overall module grade.

There are three main tasks for this assignment plus an evaluation. For each main task, you have two options to choose from. Each option gives you the same number of marks and is fully compatible with the later tasks. Pick the option you feel more comfortable with. You do not get additional marks for handing in both options. (If you do, I will mark one of them, chosen at random.)

Submission requirements

You may discuss your general ideas and thoughts with peers, but the work handed in must be distinctly your own. The following documents must be submitted through Canvas as individual files, not as a directory.

(a) Cleaned and reduced data set as a csv-file

(b) Python implementation of the clustering algorithm as a .py file

(c) Python implementation of the classification model as a .py file

(d) Training and test splits as csv-files

(e) Your report in PDF format

1 Task: Prepare the data set [9 marks]

Choose between the two data sets that are uploaded to Canvas. Download the data set of your choice.

Use both numerosity reduction and feature reduction so that your data set has only 3 features (the class column does not count towards these) and 1200 entries.

In the report, explain how you chose the data to keep and justify the choices using concepts from the lecture (max 500 words). Focus on the main ideas and how your process employs these. [3 marks for the methods, 3 for the justification.]

Save this data set as a csv-file for further processing and submission [3 marks].
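As a rough illustration of the two reduction steps, the sketch below uses simple random sampling for numerosity reduction and keeps the three features with the largest variance after min-max scaling. Both methods are assumptions chosen for brevity, not the required approach; the lecture covers other techniques that may be easier to justify. The function names, the synthetic data and the column names in the commented-out call are hypothetical.

```python
import csv
import numpy as np

def reduce_dataset(X, y, n_rows=1200, n_features=3, seed=0):
    """Return reduced features, matching labels and the kept column indices."""
    rng = np.random.default_rng(seed)
    # Numerosity reduction: simple random sampling without replacement.
    rows = rng.choice(X.shape[0], size=n_rows, replace=False)
    X_s, y_s = X[rows], y[rows]
    # Feature reduction: min-max scale each column, then keep the columns
    # with the largest variance (an assumed criterion, not a prescribed one).
    span = X_s.max(axis=0) - X_s.min(axis=0)
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    scaled = (X_s - X_s.min(axis=0)) / span
    keep = np.sort(np.argsort(scaled.var(axis=0))[::-1][:n_features])
    return X_s[:, keep], y_s, keep

def save_csv(path, X, y, header):
    """Write the features and the class column to a csv-file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for features, label in zip(X, y):
            writer.writerow(list(features) + [label])

# Synthetic stand-in for the downloaded data: 5000 rows, 6 features, 3 classes.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(5000, 6))
y_demo = rng.integers(0, 3, size=5000)
X_red, y_red, kept = reduce_dataset(X_demo, y_demo)
print(X_red.shape, kept)
# save_csv("reduced.csv", X_red, y_red, ["f1", "f2", "f3", "class"])  # hypothetical names
```

Fixing the seed makes the sample reproducible, which helps when the report has to explain exactly which rows were kept.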

2 Task: Clustering [15 marks]

Choose one of the following Clustering algorithms: k-means or DBSCAN. Make sure you remove the class column from the data set before clustering.

2.1 K-Means

Implement the K-means algorithm to work with the cleaned data set. You can use parts of your implementation you created during the practical but will need to adapt it to work with 3-dimensional data [5 marks]. If you use a library for the core algorithm you will not get the marks for the implementation but can still achieve the marks for the results and evaluation.

Choose the number of centroids for your data set and justify your choice in the report [3 marks].

Run the algorithm 3 times and store the results in one or multiple csv-files for submission, so that it is clear which point belongs to which centroid [3 marks].

In the report, create a section for your results. Add a table for each run that contains the final position of each centroid and shows the count of data points assigned to each cluster after the run [4 marks].
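For orientation, a minimal sketch of K-means on 3-dimensional data could look like the following. It assumes Euclidean distance, random initialisation from the data points and a different seed per run; the synthetic blobs, the helper names and the csv column names are hypothetical and only stand in for your cleaned data set.

```python
import numpy as np

def assign(X, centroids):
    """Index of the nearest centroid for every point (Euclidean distance)."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, seed, max_iter=100):
    rng = np.random.default_rng(seed)
    # Initialise the centroids as k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        labels = assign(X, centroids)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign(X, centroids)

def save_assignments(path, X, labels):
    """Store every point together with the index of its centroid."""
    np.savetxt(path, np.column_stack([X, labels]), delimiter=",",
               fmt=["%.6f", "%.6f", "%.6f", "%d"],
               header="f1,f2,f3,cluster", comments="")  # hypothetical column names

# Synthetic stand-in for the cleaned data: three blobs in 3 dimensions.
rng = np.random.default_rng(0)
centres = np.array([[0, 0, 0], [6, 6, 6], [0, 6, 0]], dtype=float)
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(100, 3)) for c in centres])

k = 3
runs = []
for seed in (1, 2, 3):
    centroids, labels = kmeans(X_demo, k, seed)
    counts = np.bincount(labels, minlength=k)
    runs.append((centroids, labels, counts))
    print(f"run with seed {seed}")
    for j in range(k):
        print(j, np.round(centroids[j], 3), counts[j])
```

The printed rows (centroid index, final position, point count) correspond to one results table per run.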

2.2 DBSCAN

Implement the DBSCAN algorithm to work with the cleaned data set. You can use parts of your implementation you created during the practical but will need to adapt it to work with 3-dimensional data [5 marks]. If you use a library for the core algorithm you will not get the marks for the implementation but can still achieve the marks for the results and evaluation.

Choose 3 sets of parameters (ϵ and MinPoints) to run the algorithm with. State these in your report and justify why you chose these specific values [3 marks].

Run the algorithm 3 times with the different parameter sets and store the results in one or multiple csv-files for submission [3 marks].

In the report, create a section for your results. Add a table for each run that contains the count of core, border and noise points after the run. In a separate table, list the count of data points assigned to each cluster after the run. Note that the number of clusters may differ from run to run [4 marks].
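A minimal sketch of DBSCAN with the core, border and noise counts is given below. It assumes Euclidean distance and that a point's ϵ-neighbourhood includes the point itself when comparing against MinPoints; check this against the convention used in the lecture. The toy data (two small lattices, one point just outside the first lattice and one far outlier) and all names are hypothetical.

```python
import itertools
import numpy as np

NOISE = -1

def dbscan(X, eps, min_pts):
    n = len(X)
    # Full pairwise distance matrix; acceptable for about 1200 points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_pts for nb in neighbours])
    labels = np.full(n, NOISE)
    cluster = 0
    for i in range(n):
        if labels[i] != NOISE or not is_core[i]:
            continue
        # Grow a new cluster from the unassigned core point i.
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()  # p is always a core point
            for q in neighbours[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster
                    if is_core[q]:  # only core points expand the cluster
                        frontier.append(q)
        cluster += 1
    return labels, is_core

# Toy data: two 3x3x3 lattices with spacing 0.1, one point just outside the
# first lattice and one far outlier.
lattice = np.array(list(itertools.product([0.0, 0.1, 0.2], repeat=3)))
X_demo = np.vstack([lattice, lattice + 5.0,
                    [[-0.12, 0.0, 0.0]], [[20.0, 20.0, 20.0]]])

labels, is_core = dbscan(X_demo, eps=0.15, min_pts=4)
n_core = int(is_core.sum())
n_border = int(((labels != NOISE) & ~is_core).sum())
n_noise = int((labels == NOISE).sum())
cluster_sizes = np.bincount(labels[labels != NOISE])
print("core", n_core, "border", n_border, "noise", n_noise)
print("cluster sizes", cluster_sizes.tolist())
```

A border point is counted here as any point that belongs to a cluster without being a core point; noise points keep the label -1.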

3 Task: Classification [15 marks]

Choose one of the following Classification Algorithms: K-nearest neighbour or Gaussian naive Bayes. Make sure to store the class column in a separate variable to be used as labels for the algorithm.

3.1 K-Nearest Neighbour

Implement the K-Nearest Neighbour algorithm to work with the cleaned data set. You can use parts of your implementation you created during the practical but will need to adapt it to work with 3-dimensional data [5 marks]. If you use a library for the core algorithm you will not get the marks for the implementation but can still achieve the marks for the results and evaluation.

Choose 3 different values for k to create the model with. Split your data into training and test data. State these in your report and justify why you chose these specific values [4 marks].

Create the 3 models with the different k using your training data [3 marks].

Use the test data to evaluate your resulting classifier using the confusion matrix and accuracy [3 marks].
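The evaluation step can be sketched as follows, assuming Euclidean distance, majority voting and a single random 75/25 split. The values of k, the split ratio, the tie-breaking rule and the synthetic two-class data are illustrative assumptions, not recommendations; all function names are hypothetical.

```python
import numpy as np

def train_test_split(X, y, test_fraction, seed):
    """Shuffle the rows and hold out a fraction of them as test data."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test, train = order[:n_test], order[n_test:]
    return X[train], y[train], X[test], y[test]

def knn_predict(X_train, y_train, X_test, k):
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]  # indices of the k closest points
    preds = []
    for row in nearest:
        values, counts = np.unique(y_train[row], return_counts=True)
        preds.append(values[counts.argmax()])  # ties go to the smallest label
    return np.array(preds)

def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    cm = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    return cm

def accuracy(cm):
    return np.trace(cm) / cm.sum()

# Synthetic stand-in for the cleaned data: two well-separated classes in 3D.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, size=(200, 3)),
                    rng.normal(10, 1, size=(200, 3))])
y_demo = np.array([0] * 200 + [1] * 200)

X_tr, y_tr, X_te, y_te = train_test_split(X_demo, y_demo, test_fraction=0.25, seed=1)
classes = np.unique(y_demo)
results = {}
for k in (1, 5, 9):
    cm = confusion_matrix(y_te, knn_predict(X_tr, y_tr, X_te, k), classes)
    results[k] = (cm, accuracy(cm))
    print(k, cm.tolist(), results[k][1])
```

K-nearest neighbour has no separate training phase, so "creating a model" here amounts to storing the training data together with the chosen k.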

3.2 Gaussian Naive Bayes

Implement the Gaussian Naive Bayes algorithm to work with the cleaned data set. You can use parts of your implementation you created during the practical but will need to adapt it to work with 3-dimensional data [5 marks]. If you use a library for the core algorithm you will not get the marks for the implementation but can still achieve the marks for the results and evaluation.

Choose 2 different ways to split your data into training and test data. State these in your report and justify why you chose these specific values [3 marks].

Create the 2 models with the different training data sets [3 marks].

Use the different test data to evaluate your resulting classifiers using the confusion matrix and accuracy. Make sure you use the correct test set for the classifier [4 marks].
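A minimal sketch of Gaussian naive Bayes with two different splits is shown below. It assumes per-class feature means and variances, class priors estimated from the training data and log-probabilities for numerical stability. The 70/30 and 80/20 ratios, the synthetic data and the function names are hypothetical examples. Each model is evaluated only on the test set that belongs to its own split.

```python
import numpy as np

def train_test_split(X, y, test_fraction, seed):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test, train = order[:n_test], order[n_test:]
    return X[train], y[train], X[test], y[test]

def fit_gnb(X, y):
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # A small constant keeps every variance strictly positive.
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gnb(model, X):
    classes, priors, means, variances = model
    diff = X[:, None, :] - means[None, :, :]
    # Log Gaussian density, summed over the (assumed independent) features.
    log_terms = np.log(2 * np.pi * variances)[None, :, :] + diff**2 / variances[None, :, :]
    log_post = np.log(priors)[None, :] - 0.5 * log_terms.sum(axis=2)
    return classes[log_post.argmax(axis=1)]

def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    cm = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    return cm

# Synthetic stand-in for the cleaned data: two well-separated classes in 3D.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, size=(200, 3)),
                    rng.normal(10, 1, size=(200, 3))])
y_demo = np.array([0] * 200 + [1] * 200)

splits = {"70/30": 0.30, "80/20": 0.20}
results = {}
for name, frac in splits.items():
    X_tr, y_tr, X_te, y_te = train_test_split(X_demo, y_demo, frac, seed=1)
    model = fit_gnb(X_tr, y_tr)
    cm = confusion_matrix(y_te, predict_gnb(model, X_te), model[0])
    results[name] = (cm, np.trace(cm) / cm.sum())
    print(name, cm.tolist(), results[name][1])
```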

4 Task: Comparison and Discussion [6 marks]

Lastly, compare your clustering results with the classification results.

This is a very general task and there are many things you can notice and discuss. For example, you could discuss the choice of parameters, what happens when the number of clusters matches the number of classes (or not), or specific data points that are difficult to cluster or classify, and why. There are many more possibilities.
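One way to put numbers behind such a comparison is to cross-tabulate the cluster assignments against the class labels. The sketch below builds that table and computes cluster purity. Purity is an assumed measure that this brief does not name, and the tiny label arrays are made-up values for illustration only.

```python
import numpy as np

def contingency(cluster_labels, class_labels):
    """Rows are clusters, columns are classes, entries are point counts."""
    clusters = np.unique(cluster_labels)
    classes = np.unique(class_labels)
    table = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            table[i, j] = np.sum((cluster_labels == c) & (class_labels == k))
    return clusters, classes, table

def purity(table):
    """Fraction of points that fall into the majority class of their cluster."""
    return table.max(axis=1).sum() / table.sum()

# Made-up labels for nine points: three clusters, three classes.
cluster_labels = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2])
class_labels = np.array([0, 0, 1, 1, 1, 1, 0, 2, 2])

clusters, classes, table = contingency(cluster_labels, class_labels)
score = purity(table)
print(table.tolist())
print(round(score, 3))
```

Rows with a single dominant entry indicate clusters that line up with one class; rows spread across several columns point to regions where classes overlap.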

My expectation here is to see three points discussed within 200-300 words total. If you can fit four points in without getting superficial, great. If you only cover two but in depth, also great. Just stay within the 200-300 word range.