CS434 — Implementation Assignment 4 — Due 11:59PM -, 2020

General instructions.

1. Please build on the Python starter code (Python 3.6+). Note: you may need to install several packages
(pip install --user), but try not to require anything else. I'll run it with the packages already there,
and you should have everything you need. (Standard-library modules such as 'os' are fine, since they
ship with Python 3.6.)

2. You can work in a team of up to 3 people. Each team will only need to submit one copy of the source

code and report. You need to explicitly state each member’s contribution in percentages (a rough

estimate).

3. Your source code and report will be submitted through Canvas.

4. You need to submit a readme file that contains the commands to run your code for the requested
experiments (e.g. python main.py args).

5. Please make sure that you can run your code remotely on the class server (vm-cs434-1).

6. Be sure to answer all the questions in your report. You will be graded based on your code as well as
the report. In particular, the clarity and quality of the report will be worth 10 pts, so please
write your report in a clear and concise manner. Clearly label your figures, legends, and tables.

7. In your report, the results should always be accompanied by a discussion of the results. Do
the results match your expectations? Are there any surprises? What explanation can you provide?


Unsupervised Learning for Human Activity Recognition

(total points: 80 pts + 10 report pts + 10 result pts)

In this assignment we will work with the Samsung Human Activity Recognition dataset to practice
unsupervised learning and dimensionality reduction methods. More specifically, we are interested in
applying k-means clustering and Principal Component Analysis (PCA).

The data for this assignment comes from the accelerometers and gyroscopes of Samsung Galaxy S3 mobile phones.
Each instance in the dataset is a set of feature values extracted from the sensor signals and pre-processed to
classify the activity a person is performing while carrying the phone in a pocket. There are six activities:
WALKING, WALKING UPSTAIRS, WALKING DOWNSTAIRS, SITTING, STANDING, LAYING.

The label of each instance (the type of activity) is given; however, we assume the labels are unknown and
apply unsupervised k-means clustering to group the instances based on the available features. We only use
the labels to evaluate the clusters with the purity measure.
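
As an illustration of the purity measure mentioned above, here is a minimal sketch. It assumes the common definition (each cluster is credited with its most frequent true label, and purity is the fraction of instances so credited); the function name purity and its arguments are hypothetical and are not part of the starter code.

```python
from collections import Counter


def purity(cluster_ids, true_labels):
    """Fraction of samples that carry the majority label of their cluster.

    Assumed definition: sum over clusters of the largest per-label count,
    divided by the total number of samples.
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if len(true_labels) == 0:
        raise ValueError("purity is undefined for an empty input")

    # Count how often each true label occurs inside each cluster.
    counts_by_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        counts_by_cluster.setdefault(cluster, Counter())[label] += 1

    # Each cluster contributes the size of its most frequent label.
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(true_labels)
```

Under this definition, purity is 1.0 when every cluster contains a single activity, and it can only stay the same or increase as clusters are split further, which is worth keeping in mind when reading a purity-versus-k plot.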

Data. The specifications of the datasets are as follows:

1. (x_train.txt): Contains 7352 rows (samples), each with 561 features.

2. (y_train.txt): Contains 7352 rows of labels corresponding to the rows in (x_train.txt).
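
A minimal loading sketch follows, in case it helps with sanity-checking the shapes listed above. It assumes the files are whitespace-delimited numeric text and are named x_train.txt and y_train.txt; the helper load_har is hypothetical, and the starter code may already load the data for you.

```python
import numpy as np


def load_har(x_path="x_train.txt", y_path="y_train.txt"):
    """Load features and labels from whitespace-delimited text files.

    The file names and the text format are assumptions; pass your own
    paths if the starter code stores the data elsewhere.
    """
    X = np.loadtxt(x_path, ndmin=2)             # shape (n_samples, n_features)
    y = np.loadtxt(y_path, ndmin=1).astype(int)  # shape (n_samples,)
    if X.shape[0] != y.shape[0]:
        raise ValueError("feature and label files have different row counts")
    return X, y
```

With the dataset described above, X.shape should come out as (7352, 561) and y.shape as (7352,).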

Important Guidelines. Please note that skeleton (starter) code is provided, and you should only fill in
the code where specified.


Part 1 (50 pts) : k-means clustering. For this part please implement the following steps in the starter

skeleton:

1. Please fill in the places marked in the “clustering” module to complete the k-means class. The method “init-center”
initializes the cluster centers. You don’t need to implement a sophisticated initialization method; a random
selection from the input instances is sufficient. If you implement the method “predict” efficiently (using
vectorized operations), the running time will be reduced significantly.

2. The k-means class will be used in the main module. In this assignment we consider at most 10 clusters
(k ≤ 10). The maximum number of iterations is set to 20 by default; however, more iterations might be needed
before k-means converges. Please update the function “apply-kmeans” in main as follows:

• To reduce the sensitivity of the k-means to the initialization, update this function to produce the average

of SSE and purity vs k and iterations for 5 different runs of k-means. Note that each run should generate

different initial centers in method “init-center”.

• Plot the average (over 5 runs) of SSE versus iterations for k = 6. You may use the plot functions
provided in main, or change them if needed to present the results more clearly.

• Plot the average (over 5 runs) of SSE versus k for k ∈ 1 . . . 10. Apply the elbow method to the curve of SSE
versus k for k ∈ 1 . . . 10 to select the best k. Please report the best k you found.

• Plot the average of purity versus k for k ∈ 1 . . . 10 for the train set and comment on what you observe.
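
The steps of Part 1 can be sketched as follows. This is only an illustration under stated assumptions, not the starter code's interface: the function names assign, kmeans and average_sse are hypothetical, centers are initialized by sampling distinct input rows, and an empty cluster simply keeps its previous center. The assignment step is vectorized with NumPy broadcasting instead of looping over samples.

```python
import numpy as np


def assign(X, centers):
    """Vectorized assignment step: nearest center per row and the total SSE."""
    # Squared Euclidean distances via ||x||^2 - 2 x.c + ||c||^2, shape (n, k).
    d2 = (
        (X ** 2).sum(axis=1, keepdims=True)
        - 2.0 * X @ centers.T
        + (centers ** 2).sum(axis=1)
    )
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative round-off
    labels = d2.argmin(axis=1)
    sse = d2[np.arange(X.shape[0]), labels].sum()
    return labels, sse


def kmeans(X, k, max_iter=20, rng=None):
    """Plain k-means; returns final labels, centers, and SSE per iteration."""
    rng = np.random.default_rng(rng)
    # Random initialization: pick k distinct input rows as the first centers.
    centers = X[rng.choice(X.shape[0], size=k, replace=False)].astype(float)
    sse_history = []
    for _ in range(max_iter):
        labels, sse = assign(X, centers)
        sse_history.append(sse)
        for j in range(k):
            members = X[labels == j]
            if members.shape[0] > 0:  # an empty cluster keeps its old center
                centers[j] = members.mean(axis=0)
    labels, _ = assign(X, centers)  # labels consistent with the final centers
    return labels, centers, np.array(sse_history)


def average_sse(X, k, runs=5, max_iter=20, seed=0):
    """Average the per-iteration SSE over several differently seeded runs."""
    histories = [kmeans(X, k, max_iter, rng=seed + r)[2] for r in range(runs)]
    return np.mean(histories, axis=0)
```

For example, average_sse(X, k=6) gives a curve that could be plotted against the iteration index, and its last entry for each k in 1 . . . 10 gives one point of an SSE-versus-k curve. Using a different seed per run is one simple way to make sure each run starts from different initial centers.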

Part 2 (40 pts) : dimension reduction (PCA). In this part we are interested in reducing the dimension

of the data (which is currently 561) to a smaller number. Please implement the following steps:

1. Fill in the marked area in the “decompose” module to complete the PCA class. In your implementation,
please use the mean, cov and eig functions provided in the class.

2. In the main module, complete the “visualize” function to plot the data points along the first two principal
component directions, and color each class with a distinct color.

3. The retain ratio r is the fraction of the total variance we want to keep, and is defined as follows:

$$ r = \frac{\sum_{i=1}^{d} \lambda_i}{\sum_{i=1}^{m} \lambda_i} $$

where \lambda_i is the i-th largest eigenvalue of the data covariance matrix, and d and m are the reduced and
original dimensions respectively, with d ≤ m. By default r is set to 0.9. Please report the d you find for
this ratio.

4. Apply k-means for k ∈ 1 . . . 10 as described in Part 1 (averaged over 5 runs) to the reduced-dimension data
for retain ratio r = 0.9. Plot the purity on the train set for this experiment. Do you observe a harmful effect
due to dimension reduction? If it hurts the purity, please increase r to higher values (2 or 3 trials) and
report the best r that still reduces the dimension but does not hurt the performance.
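
The PCA steps of Part 2 can be sketched as follows, under stated assumptions. The starter code provides its own mean, cov and eig helpers whose signatures are not shown here, so this sketch substitutes NumPy's mean, np.cov and np.linalg.eigh (suitable because a covariance matrix is symmetric). The function names pca_fit and pca_transform are hypothetical. The reduced dimension d is taken to be the smallest number of leading eigenvalues whose sum reaches the fraction retain_ratio of the total variance.

```python
import numpy as np


def pca_fit(X, retain_ratio=0.9):
    """Return (mean, components, d) for the smallest d reaching retain_ratio."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)    # (m, m) covariance of the features
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues

    # Sort eigenpairs by decreasing eigenvalue (largest variance first).
    order = np.argsort(eigvals)[::-1]
    eigvals = np.clip(eigvals[order], 0.0, None)  # drop tiny negative round-off
    eigvecs = eigvecs[:, order]

    # Cumulative fraction of variance explained by the first d directions.
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    d = int(np.searchsorted(cumulative, retain_ratio)) + 1
    d = min(d, eigvals.shape[0])
    return mean, eigvecs[:, :d], d


def pca_transform(X, mean, components):
    """Project centered data onto the retained principal directions."""
    return (X - mean) @ components
```

For the visualization step, the first two columns of the projected data would serve as the two plotting coordinates. For the clustering experiment, the projected data takes the place of the original 561-dimensional features.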

Submission. Your submission should include the following:

1. The modified source code, with short instructions on how to run the code for your experiments in a readme.txt.

2. Your report (only in PDF format), which begins with a general introduction section, followed by one section

for each part of the assignment.

3. Please note that all the files should be in one folder, compressed as a single .zip archive.