COMP20008 Project 2
V1.0: September 6, 2019
DUE DATE
The assignment is worth 20 marks (20% of the subject grade) and is due 11:59PM on
Thursday 26 September 2019. Submission is via the LMS. Please ensure you get a submission receipt via email. If you do not receive a receipt via email, your submission
has not been received and hence cannot be marked. Late penalty structure is described at
the end of this document.
This project may be done individually or in groups of two or three. Groups
must be registered by Friday 13th September using the following form: https://forms.gle/
Q9TeXZchamfPQVhJ6, or we will assume you are completing the project individually.
Introduction
For this project, you will perform a data linkage on two real-world datasets and explore different classification algorithms.
The project is to be coded in Python 3. Seven (7) relevant data sets can be downloaded from
LMS:
• datasets for Data Linkage
– amazon.csv
– google.csv
– amazon_google_truth.csv
– amazon_small.csv
– google_small.csv
– amazon_google_truth_small.csv
• dataset for Classification: all_yeast.csv
Part 1 - Data Linkage (7 marks)
Amazon and Google both have product databases. They would like to link the same products
in order to perform joint market research.
Naïve data linkage without blocking (4 marks)
For this part, data linkage without blocking is performed on two smaller data sets: amazon_small.csv
and google_small.csv.
The Tasks: Using amazon_small.csv and google_small.csv, implement the linkage between the two data sets and measure its performance. Comment on your choice of similarity
functions, method of deriving a final score, and the threshold for determining if a pair is a
match.
The performance is evaluated in terms of recall and precision. Ground truth (true matches)
is given in amazon_google_truth_small.csv.
recall = tp/(tp + fn)
precision = tp/(tp + fp)
where tp (true positive) is the number of true positive pairs, fp the number of false positive pairs, tn the number of true negatives, and fn the number of false negative pairs. The
four numbers should sum up to the total number of all possible pairs from the two datasets
(n = fp + fn + tp + tn).
Note: The python package textdistance implements many similarity functions for strings
(https://pypi.org/project/textdistance/). You can use this package for the similarity
calculations for strings. You may also choose to implement your own similarity functions.
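As a rough starting point only, the sketch below links records on a single assumed title column using textdistance's Jaro-Winkler similarity and a fixed threshold, then computes recall and precision against the ground truth. The column names (title, idAmazon, idGoogleBase), the choice of similarity function and the threshold value are illustrative assumptions; you are expected to choose and justify your own.

import pandas as pd
import textdistance

# Column names below (title, idAmazon, idGoogleBase) are placeholders --
# check the actual CSV headers before running.
amazon = pd.read_csv('amazon_small.csv')
google = pd.read_csv('google_small.csv')
truth = pd.read_csv('amazon_google_truth_small.csv')

THRESHOLD = 0.85  # illustrative value; tune it on the small datasets

# Compare every Amazon record with every Google record; this all-pairs
# comparison is feasible here only because the small datasets are used.
matches = set()
for _, a in amazon.iterrows():
    for _, g in google.iterrows():
        # Jaro-Winkler similarity of the lower-cased product names
        score = textdistance.jaro_winkler.normalized_similarity(
            str(a['title']).lower(), str(g['title']).lower())
        if score >= THRESHOLD:
            matches.add((a['idAmazon'], g['idGoogleBase']))

true_pairs = set(zip(truth['idAmazon'], truth['idGoogleBase']))
tp = len(matches & true_pairs)
fp = len(matches - true_pairs)
fn = len(true_pairs - matches)

recall = tp / (tp + fn)
precision = tp / (tp + fp)
print(f'recall = {recall:.3f}, precision = {precision:.3f}')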
Your output for this question should include:
1. Your implemented code that generates matches and the two performance measures. (3
marks)
2. A discussion on the linkage method and its performance. Comments should include your
choices of similarity functions, final scoring function and threshold, as well as comments
on the overall performance of your method. (1 mark)
Blocking for efficient data linkage (3 marks)
Blocking is a method to reduce the computational cost for record linkage.
The Tasks: Implement a blocking method for the linkage of the amazon.csv and google.csv
data sets and report on your proposed method and the quality of the results of the blocking.
Ground truths are given in amazon_google_truth.csv.
To measure the quality of blocking, we assume that when comparing a pair of records, we can
always determine if the pair are a match with 100% accuracy. With this assumption, note
the following:
• a record-pair is a true positive if the pair is found in the ground truth set and also the
pair are in the same block.
• a record-pair is a false positive if the pair co-occur in some block but are not found in
the ground truth set.
• a record-pair is a false negative if the pair do not co-occur in any block but are
found in the ground truth set.
• a record-pair is a true negative if the pair do not co-occur in any block and are also
not found in the ground truth set.
Then, the quality of blocking can be evaluated using the following two measures:
Pair completeness (PC) = tp/(tp + fn)
Reduction ratio (RR) = 1 − (tp + fp)/n
where n is the total number of all possible record pairs from the two data sets
(n = fp + fn + tp + tn).
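For illustration, the sketch below uses the first word of the lower-cased product name as the blocking key and then computes PC and RR as defined above. The blocking key and the column names (id, title, idAmazon, idGoogleBase) are assumptions, not part of the specification; any reasonable blocking scheme can be substituted.

import pandas as pd
from collections import defaultdict

# Column names (id, title, idAmazon, idGoogleBase) are placeholders --
# adjust them to the real headers of the full datasets.
amazon = pd.read_csv('amazon.csv')
google = pd.read_csv('google.csv')
truth = pd.read_csv('amazon_google_truth.csv')

def block_key(title):
    # Example blocking key: the first word of the lower-cased product name.
    words = str(title).lower().split()
    return words[0] if words else ''

blocks = defaultdict(lambda: ([], []))  # key -> ([amazon ids], [google ids])
for _, row in amazon.iterrows():
    blocks[block_key(row['title'])][0].append(row['id'])
for _, row in google.iterrows():
    blocks[block_key(row['title'])][1].append(row['id'])

# Candidate pairs are all Amazon-Google pairs that co-occur in some block.
candidates = set()
for a_ids, g_ids in blocks.values():
    candidates.update((a, g) for a in a_ids for g in g_ids)

true_pairs = set(zip(truth['idAmazon'], truth['idGoogleBase']))
n = len(amazon) * len(google)  # all possible record pairs
tp = len(candidates & true_pairs)
fp = len(candidates - true_pairs)
fn = len(true_pairs - candidates)

pc = tp / (tp + fn)            # pair completeness
rr = 1 - (tp + fp) / n         # reduction ratio
print(f'PC = {pc:.3f}, RR = {rr:.3f}')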
Your output for this question should include
1. Your implemented code that generates blocks and the two quality measures. (2 marks)
2. A discussion on the blocking method and its quality. Comments should include your
choice of blocking method and how the method relates to the two quality measures.
(1 mark)
Part 2 - Classification (12 marks)
A biological sciences research team is interested in predicting the cellular localization sites
of proteins in yeast, based on particular attributes. In particular, they wish to separate cytoplasm proteins (CYT) from other proteins (non-CYT). They have collected some (somewhat
noisy) data from their current yeast samples. The dataset includes a Class column specifying
whether a particular sample relates to a CYT or non-CYT protein, along with other features
related to the sample. As data scientists, we wish to help them build and assess a classifier
for performing this task.
Pre-processing (2 marks)
• Impute missing values: Use replacement by mean and replacement by median to fill
in missing values. Display the min, median, max, mean and standard deviation of the
data after each imputation. Justify which imputation method is more suitable.
• Scale the features: Explore the use of two transformation techniques (mean centering
and standardisation) to scale all attributes after median imputation, and compare their
effects. Display the min, median, max, mean and standard deviation for both types
of data scaling. A minimal sketch of both pre-processing steps follows this list.
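The sketch below assumes all_yeast.csv contains the numeric feature columns mcg through nuc (adjust the column slice if the headers differ) and uses only pandas.

import pandas as pd

df = pd.read_csv('all_yeast.csv')
# The spec names the first 8 features as mcg-nuc; adjust the slice if the
# headers differ in the actual file.
features = df.loc[:, 'mcg':'nuc']

# Two imputation candidates: replace missing values by the column mean or median.
mean_imputed = features.fillna(features.mean())
median_imputed = features.fillna(features.median())

def summarise(data, label):
    # min, median, max, mean and standard deviation for each feature
    print(f'--- {label} ---')
    print(data.agg(['min', 'median', 'max', 'mean', 'std']).round(3))

summarise(mean_imputed, 'mean imputation')
summarise(median_imputed, 'median imputation')

# Two scalings of the median-imputed data: mean centring and standardisation.
mean_centred = median_imputed - median_imputed.mean()
standardised = (median_imputed - median_imputed.mean()) / median_imputed.std()

summarise(mean_centred, 'mean centred')
summarise(standardised, 'standardised')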
Comparing Classification Algorithms (3 marks)
We wish to compare the performance of the following three classification algorithms on the
all_yeast dataset: k-NN (with k=5 and k=10) and a decision tree.
Use the all_yeast data as transformed in your pre-processing (mean centered, median imputed). Train the classifiers using 2/3 of the data and test them by applying them to
the remaining 1/3 of the data. Use each classification algorithm to predict the Class feature
of the data (CYT or non-CYT) using the first 8 features (mcg-nuc).
What are the algorithms’ accuracies on the test data? Which algorithm performed best on
this dataset? Explain your experiments and the results.
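One possible outline of this experiment using scikit-learn is sketched below; the column slice, the random_state values and the use of train_test_split are illustrative choices, not requirements.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv('all_yeast.csv')
X = df.loc[:, 'mcg':'nuc']          # the first 8 features, as named in the spec
y = df['Class']

# Pre-processing as above: median imputation, then mean centring.
X = X.fillna(X.median())
X = X - X.mean()

# 2/3 of the data for training, the remaining 1/3 for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=2/3, random_state=42)

classifiers = {
    'k-NN (k=5)': KNeighborsClassifier(n_neighbors=5),
    'k-NN (k=10)': KNeighborsClassifier(n_neighbors=10),
    'decision tree': DecisionTreeClassifier(random_state=42),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(f'{name}: test accuracy = {clf.score(X_test, y_test):.3f}')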
Feature Engineering (7 marks)
In order to achieve higher prediction accuracy for k-NN, one can investigate the use of feature
generation and selection to predict the Class feature of the data. Feature generation involves
the creation of additional features. Two methods are:
• Interaction term pairs. Given a pair of features f1 and f2, create a new feature f12 =
f1 × f2 or f12 = f1 ÷ f2. All possible pairs can be considered.
• Clustering labels: apply k-means clustering to the all_yeast data and then use
the resulting cluster labels as the values for a new feature f_cluster_label. At test time, a
label for a testing instance can be created by assigning it to its nearest cluster. A sketch
covering both generation methods follows this list.
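The sketch below illustrates both generation methods under some assumptions: the feature slice mcg through nuc, median imputation, k = 3 clusters, and a simple 2/3 "training portion" for fitting k-means. In practice, fit k-means on the same training split used for the classifier and treat k as a parameter to experiment with.

import itertools
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv('all_yeast.csv')
X = df.loc[:, 'mcg':'nuc'].copy()     # the 8 original features, as named in the spec
X = X.fillna(X.median())              # median imputation, as in pre-processing

# Interaction term pairs: one new feature per product of two original features.
for f1, f2 in itertools.combinations(X.columns, 2):
    X[f'{f1}*{f2}'] = X[f1] * X[f2]

# Clustering labels: fit k-means on the training portion only, then assign
# every row (training and test) to its nearest cluster.
train_portion = X.iloc[:len(X) * 2 // 3]    # illustrative; use your actual split
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(train_portion)
X['cluster_label'] = kmeans.predict(X)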
Given a set of N features (the original features plus generated features), feature selection
involves selecting a smaller set of n features (n < N). Here one computes the mutual information between each feature and the class label (on the training data), sorts the features
from highest to lowest mutual information value, and then retains the top n features from
this ranking, to use for classification with k-NN.
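A minimal sketch of this selection step on the original eight features is given below; the split, the random_state values and the values of n tried are assumptions, and the same procedure applies once generated features have been added.

import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv('all_yeast.csv')
features = df.loc[:, 'mcg':'nuc']          # assumed feature columns
X = features.fillna(features.median())
y = df['Class']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=2/3, random_state=42)

# Mutual information between each feature and the class label, on training data only.
mi = mutual_info_classif(X_train, y_train, random_state=42)
ranking = pd.Series(mi, index=X_train.columns).sort_values(ascending=False)

# Retain the top n features and evaluate k-NN; n is a parameter to experiment with.
for n in (2, 4, 6, 8):
    top = list(ranking.index[:n])
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train[top], y_train)
    print(f'top {n} features {top}: accuracy = {knn.score(X_test[top], y_test):.3f}')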
Your task in this question is to evaluate whether the above methods for feature generation
and selection can deliver a boost in prediction accuracy compared to using k-NN on just the
original features in all_yeast. You should:
• Implement the above two methods for feature generation. You should experiment with
different parameter values, including k and different numbers of generated features.
• Implement feature selection using mutual information. Again you should experiment
with different parameter values, such as how many features to select.
Your output for this question should include
1. Your implemented code that generates accuracies or plots for classification using interaction term pairs (3 marks)
2. Your implemented code that generates accuracies or plots for classification using clustering labels (1 mark)
3. A discussion (2-3 paragraphs) including:
• Which parameter values you selected and why. (1 mark)
• Whether feature selection+generation with interaction term pairs can deliver an
accuracy boost, based on your evidence from part 1. (1 mark)
• Whether feature generation with clustering labels can deliver an accuracy boost,
based on your evidence from part 2. (1 mark)
Marking scheme
Correctness (19 marks): For each of the questions, marks will be allocated for the level of correctness (does it provide the right answer, is the logic right), according to the number in parentheses next to each question. Note that your code should work for any data input formatted in
the same way as all_yeast.csv, amazon.csv, google.csv, and amazon_google_truth.csv.
E.g. if a random sample was taken from all_yeast.csv, your code should provide a correct
answer if this was instead used as the input.
Correctness will also take into account the readability and labelling provided for any plots
and figures (plots should include a title, labels/scale on axes, names of axes, and legends for
colours/symbols where appropriate).
Coding style (1 mark): A mark will be allocated for coding style. In particular, the following
aspects will be considered:
• Formatting of code (e.g. use of indentation and overall readability for a human)
• Code modularity and flexibility. Use of functions or loops where appropriate, to avoid
highly redundant or excessively verbose definitions of code.
• Use of Python library functions (you should avoid reinventing logic if a library function
can be used instead)
• Code commenting and clarity of logic. You should provide comments about the logic of
your code for each question, so that it can be easily understood by the marker.
Resources
The following are some useful resources, for refreshing your knowledge of Python, and for
learning about functionality of pandas.
• Python tutorial
• Python beginner reference
• pandas 10min tutorial
• Official pandas documentation
• Official matplotlib tutorials
• Python pandas Tutorial by Tutorialspoint
• pandas: A Complete Introduction by Learn Data Sci
• pandas quick reference sheet
• sklearn library reference
• NumPy library reference
• Textdistance library reference
• Python Data Analytics by Fabio Nelli (available via University of Melbourne sign on)
Submission Instructions
All code and answers to discussion questions should be contained within a single Jupyter
Notebook file, which is to be submitted to the LMS by the due date. The first line of your
file should clearly state the full names, login names and Student IDs of each member of your
group. If you are in a group, please ensure only one student submits on behalf of the group.
Other
Extensions and Late Submission Penalties: All requests for extensions must be made by email,
sent to the head tutor Chris Ewin and CC’d to the lecturer Pauline Lin. If requesting an
extension due to illness, please attach a medical certificate or other supporting evidence. All
extension requests must be made at least 24 hours before the due date. Late submissions
without an approved extension will attract the following penalties:
• 0 < hours late <= 24: (2 marks deduction)
• 24 < hours late <= 48: (4 marks deduction)
• 48 < hours late <= 72: (6 marks deduction)
• 72 < hours late <= 96: (8 marks deduction)
• 96 < hours late <= 120: (10 marks deduction)
• 120 < hours late <= 144: (12 marks deduction)
• 144 < hours late: (20 marks deduction)
where hours late is the elapsed time in hours (or fractions of hours).
This project is expected to require 20-25 hours of work.
Academic Honesty
You are expected to follow the academic honesty guidelines on the University website
https://academichonesty.unimelb.edu.au
Further Information
A project discussion forum has also been created on the subject LMS. Please use this in the
first instance if you have questions, since it will allow discussion and responses to be seen by
everyone. The teaching team will support the discussions on the forum until 24 hours prior
to the project deadline. There will also be a list of frequently asked questions on the project
page.