COMP SCI 4094/4194/7094 - Distributed Databases and Data Mining
Assignment 2
Important Notes
• Handins:
– You must do this assignment individually and make individual submissions.
– Your program should be coded in C++ and pass test runs on 3 test files. The sample
input and output files can be downloaded from “Assignments” on the course home page
(https://myuni.adelaide.edu.au/courses/54718/assignments/176864/).
– You need to use svn to upload and run your source code in the web submission system
following “Web-submission instructions” stated at the end of this sheet. You should
attach your name and student number in your submission.
– Late submissions will attract a penalty: the maximum mark you can obtain will be
reduced by 25% per day (or part thereof) past the due date or any extension you are
granted.
• Marking scheme:
– 12 marks for testing on 3 standard tests: 4 marks per test.
– 3 marks for the code structure.
– Note: If your code is found not to implement the computation tasks required
in this assignment, you will receive zero marks regardless of the correctness of the
test output.
If you have any questions, please send them to the student discussion forum. This way you
can all help each other and everyone gets to see the answers.
The assignment
In this assignment you are required to code a traffic packet clustering engine that clusters raw
network packets by application, such as http or smtp. To accomplish this, you should
implement a data preprocessing module and a clustering module. The structure is
illustrated below:
You have two input files, and you should write two output files.
Input file 1 contains a distance threshold and the raw network packet information, that is,
seven attributes of each packet: source address, source port, destination address, destination port,
protocol, arrival time, and packet length. See the sample traffic flow information below.
Input file 2 contains a number K on the first line and, on the next line, K integers that give the
indices of an initial set of K medoids.
In the data preprocessing module, your program should prepare the flow data for clustering
from the raw packet data. Two steps are involved. First, merge the packets into flows
using this rule: a network flow consists of at least TWO packets with the same source address, source
port, destination address, destination port, and protocol. Then calculate two clustering features
for each flow: the average transferring time and the average packet length.
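The two preprocessing steps can be sketched as below. All names (Packet, Flow, buildFlows) are hypothetical. The sketch assumes that packets are listed in arrival-time order, as in the sample, and that flows are indexed by the order in which their first packet appears. Under the first assumption the gaps between consecutive packets add up to (last arrival - first arrival), so only those two times need to be kept.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// One raw packet: the seven attributes from the input file.
struct Packet {
    std::string srcAddr;
    int srcPort;
    std::string dstAddr;
    int dstPort;
    int protocol;
    long long arrivalTime;  // microseconds
    int length;
};

// The two clustering features of one flow.
struct Flow {
    double avgTransferTime;
    double avgLength;
};

// Merges packets sharing the same 5-tuple and keeps groups of >= 2 packets.
std::vector<Flow> buildFlows(const std::vector<Packet>& packets) {
    using Key = std::tuple<std::string, int, std::string, int, int>;
    struct Acc {
        long long first;        // arrival time of the first packet seen
        long long last;         // arrival time of the latest packet seen
        long long totalLength;  // sum of packet lengths
        std::size_t count;      // number of packets
    };

    std::map<Key, std::size_t> slotOf;  // 5-tuple -> position in groups
    std::vector<Acc> groups;            // kept in first-appearance order

    for (const Packet& p : packets) {
        Key key(p.srcAddr, p.srcPort, p.dstAddr, p.dstPort, p.protocol);
        auto it = slotOf.find(key);
        if (it == slotOf.end()) {
            slotOf.emplace(key, groups.size());
            groups.push_back(Acc{p.arrivalTime, p.arrivalTime, p.length, 1});
        } else {
            Acc& g = groups[it->second];
            g.last = p.arrivalTime;
            g.totalLength += p.length;
            ++g.count;
        }
    }

    std::vector<Flow> flows;
    for (const Acc& g : groups) {
        if (g.count < 2) {
            continue;  // a single packet does not form a flow
        }
        Flow f;
        f.avgTransferTime = static_cast<double>(g.last - g.first) / (g.count - 1);
        f.avgLength = static_cast<double>(g.totalLength) / g.count;
        flows.push_back(f);
    }
    return flows;
}
```

If the real input is not sorted by arrival time, each group would need to be sorted before the first and last times are taken.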
In the clustering module, you need to apply the k-medoids algorithm (course slides Chapter 10,
not the book’s random method) to find the minimum number of clusters for which the sum of the
distances from each flow to its medoid is less than the given threshold. Note: the clustering features
come from the data preprocessing module, and the distance measure is Manhattan distance.
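To make the quantities concrete, here is a generic swap-based (PAM-style) sketch. It is not the course-slide procedure, which is not reproduced in this text, and it does not search for the minimum number of clusters against the threshold. The names Point, manhattan, absoluteError, and refineMedoids are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// x: average transferring time, y: average packet length.
struct Point {
    double x;
    double y;
};

double manhattan(const Point& a, const Point& b) {
    return std::fabs(a.x - b.x) + std::fabs(a.y - b.y);
}

// Absolute error: every point contributes its distance to the nearest medoid.
double absoluteError(const std::vector<Point>& pts,
                     const std::vector<std::size_t>& medoids) {
    assert(!medoids.empty());
    double total = 0.0;
    for (const Point& p : pts) {
        double best = manhattan(p, pts[medoids[0]]);
        for (std::size_t m : medoids) {
            best = std::min(best, manhattan(p, pts[m]));
        }
        total += best;
    }
    return total;
}

// Generic greedy refinement (an assumption, not the required method):
// accept any medoid/non-medoid swap that strictly lowers the absolute
// error, and stop once a full pass finds no improving swap.
std::vector<std::size_t> refineMedoids(const std::vector<Point>& pts,
                                       std::vector<std::size_t> medoids) {
    double bestCost = absoluteError(pts, medoids);
    bool improved = true;
    while (improved) {
        improved = false;
        for (std::size_t i = 0; i < medoids.size(); ++i) {
            for (std::size_t cand = 0; cand < pts.size(); ++cand) {
                if (std::find(medoids.begin(), medoids.end(), cand) != medoids.end()) {
                    continue;  // cand is already a medoid
                }
                const std::size_t previous = medoids[i];
                medoids[i] = cand;
                const double cost = absoluteError(pts, medoids);
                if (cost < bestCost) {
                    bestCost = cost;
                    improved = true;
                } else {
                    medoids[i] = previous;  // undo a non-improving swap
                }
            }
        }
    }
    return medoids;
}
```

Because a swap is accepted only when the error strictly decreases, the loop terminates. The order in which swaps are tried affects which medoids are chosen on ties, so the slides' procedure should take precedence wherever it differs.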
For your convenience, below is the framework of the k-medoids algorithm which you should
follow:
Example
Sample traffic flow information
src addr         src port  dst addr       dst port  protocol  arrival time  packet length
202.234.224.254  49880     31.65.181.210  80        6         115258        52
202.234.224.254  49880     31.65.181.210  80        6         115307        52
202.234.35.144   55256     74.39.124.220  443       6         115310        46
119.188.179.82   50592     150.79.7.129   80        6         115314        40
202.234.224.254  49880     31.65.181.210  80        6         115341        52
119.188.179.82   50592     150.79.7.129   80        6         115350        40
119.188.179.82   50592     150.79.7.129   80        6         115363        40
Data preprocessing module
In the traffic flow information above there are two flows. The first, second, and fifth packets
belong to the first flow (index 0); the fourth, sixth, and seventh packets belong to the second
flow (index 1). The third packet shares its five identifying attributes with no other packet, so it
does not form a flow.
Arrival times are in microseconds (µs).
The average transferring time of the first flow = ((arrival time of fifth packet - arrival
time of second packet) + (arrival time of second packet - arrival time of first packet))
÷ (3 - 1) = ((115341 - 115307) + (115307 - 115258)) ÷ 2 = 41.5.
The average length of the first flow = (Σ packet length) ÷ 3 = (52 + 52 + 52) ÷ 3 = 52.
Similarly, the average transferring time of the second flow = ((115363 - 115350) +
(115350 - 115314)) ÷ 2 = 24.5, and the average length of the second flow = 40.
Clustering module
We use Manhattan distance to measure the distance between flows. In our sample, the distance
between the two flows is |41.5 − 24.5| + |52 − 40| = 17 + 12 = 29.
Example input initial_medoids.txt (initial K medoids)
1 (K = 1)
0 (the initial medoid is the flow with index 0)
Example Output
First, output the flows produced by the data preprocessing module, one flow per line: the flow
index (ID), the average transferring time (X), and the average length (Y).
ID X Y
In this case, flow.txt should contain:
0 41.50 52.00
1 24.50 40.00
Round X and Y to 2 decimal places. You can use:
cout << fixed << setprecision(2) << 3.1415926;
or
printf("%0.2f", 3.1415926);
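As an illustration of the first option, the flow lines could be written like this. The names Flow and writeFlows are hypothetical. The sketch assumes single spaces between columns and no "ID X Y" header line, as in the sample; check both against the sample output files.

```cpp
#include <cstddef>
#include <iomanip>
#include <ostream>
#include <vector>

// The two clustering features of one flow.
struct Flow {
    double avgTransferTime;  // X
    double avgLength;        // Y
};

// Writes one "index X Y" line per flow, with X and Y at 2 decimal places.
void writeFlows(std::ostream& out, const std::vector<Flow>& flows) {
    out << std::fixed << std::setprecision(2);  // affects only the doubles
    for (std::size_t i = 0; i < flows.size(); ++i) {
        out << i << ' ' << flows[i].avgTransferTime << ' '
            << flows[i].avgLength << '\n';
    }
}
```

Passing a std::ofstream opened on the output file writes the lines to disk; passing std::cout is handy for checking the format by eye.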
After running k-medoids, you will have K clusters. The clustering output has K+2 lines. The first
line is the absolute-error criterion. The second line lists the indices of the K medoids. Each of
the following K lines lists the indices of the flows that belong to one medoid's cluster.
29.00 (absolute error of the clustering, 2 decimal places)
0 (the medoid is flow 0)
0 1 (this cluster includes 2 flows, index 0 and index 1)
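The K+2 line layout could be produced along these lines. The names writeIndexLine and writeClusters are hypothetical. The sketch assumes indices are separated by single spaces with no trailing space, and that members[i] holds the flows assigned to medoids[i].

```cpp
#include <cstddef>
#include <iomanip>
#include <ostream>
#include <vector>

// Writes indices separated by single spaces, followed by a newline.
void writeIndexLine(std::ostream& out, const std::vector<std::size_t>& indices) {
    for (std::size_t i = 0; i < indices.size(); ++i) {
        if (i > 0) {
            out << ' ';
        }
        out << indices[i];
    }
    out << '\n';
}

// Line 1: absolute error; line 2: medoid indices; then one line per cluster.
void writeClusters(std::ostream& out, double absoluteError,
                   const std::vector<std::size_t>& medoids,
                   const std::vector<std::vector<std::size_t>>& members) {
    out << std::fixed << std::setprecision(2) << absoluteError << '\n';
    writeIndexLine(out, medoids);
    for (const std::vector<std::size_t>& cluster : members) {
        writeIndexLine(out, cluster);
    }
}
```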
Web-submission instructions
• First, type the following command, all on one line (replacing xxxxxxx with your student
ID):
svn mkdir --parents -m "DDDM"
https://version-control.adelaide.edu.au/svn/axxxxxxx/2020/s2/dddm/assignment2
• Then, check out this directory and add your files:
svn co https://version-control.adelaide.edu.au/svn/axxxxxxx/2020/s2/dddm/assignment2
cd assignment2
svn add KMedoids.cpp
· · ·
svn commit -m "assignment2 solution"
• Next, go to the web submission system at:
https://cs.adelaide.edu.au/services/websubmission/
Navigate to 2020, Semester 2, Distributed Databases and Data Mining, Assignment 2.
Then, click the “Make Submission” tab for this assignment and indicate that you agree to the
declaration. The automark script will then check whether your code compiles. You can
make as many resubmissions as you like. If your final solution does not compile, you won’t
get any marks for this solution.
• Note:
1. Please follow the format of the sample output files.
2. Local file paths will not work with our web-submission system.
3. We have prepared ten test files in the web-submission system; when you submit your program,
random test files will be allocated to you.
4. The auto-marker script compiles and runs the file named "KMedoids.cpp" using the following
commands:
g++ -std=c++11 KMedoids.cpp -o runKMedoids
./runKMedoids network_packets.txt initial_medoids.txt
In this assignment, you need to read two files, network_packets.txt (network packet
traffic information) and initial_medoids.txt (initial medoids), which are generated
randomly by the system.
You should write two output files named med Flow.txt (flow data after preprocessing)
and KMedoidsClusters.txt (k-medoids clustering results), as shown in the following
two samples:
Example1
input:File1.txt
src addr         src port  dst addr       dst port  protocol  arrival time  packet length
202.234.224.254  49880     31.65.181.210  80        6         115258        52
202.234.224.254  49880     31.65.181.210  80        6         115307        52
202.234.35.144   55256     74.39.124.220  443       6         115310        46
119.188.179.82   50592     150.79.7.129   80        6         115314        40
202.234.224.254  49880     31.65.181.210  80        6         115341        52
119.188.179.82   50592     150.79.7.129   80        6         115350        40
119.188.179.82   50592     150.79.7.129   80        6         115363        40
input:File2.txt