Homework 1
Part I: Basic Concepts
1. There are two vectors, x1 and x2
What is the distance between x1 and x2 ? (Please show the steps of the calculations)
(1) if the distance measure is based on L2 norm (a.k.a Euclidean norm)
(2) if the distance measure is based on L1 norm
(3) if the distance measure is based on L∞ norm (a.k.ainfinity norm)
In class, we studied the customer segmentation example and tried to find the most valuable customers who have good income but low spend. There are two feature components in this application.
Assume that we use a clustering method similar to k-mean, and this method could use any type of vector norms as distance measure. Then, does the L∞ norm-based distance measure make sense for this application?
2. We define a scalar valued function of a vector variable
Here,x is a column vector, xT is the transpose of x, and A is asymmetric matrix
To simplify this question, let's assume x has only two elements and
The derivative of f with respect to x is a vector defined by
Show that dx/df = 2Ax
Hint: calculate f(x), 2Ax, da/df and dβ/df
K-means clustering (10 points)
3. Briefly describe the two key steps in one iteration of the k-means algorithm. (1 point)
4. What is the distance measure used ink-means (implemented in sk-learn)? (1 point)
5. The k-means algorithm can always converge in a finite number of iterations. Why? (1 point)
6. The clustering result of k-means could be random. Why? (1 point)
7. The minimum value of the loss function is zero for any dataset. What is the clustering result when the loss function is zero? - assuming that the dataset has millions of different samples. (1 point)
Note: for questions 3,4,5,6,7, you only need to write a few words (bullet points) for each one.
8. find the optimal centers by following the steps below when cluster labels are given. (5 points)
The loss function is as defined in the lecture notes.
First, we calculate , where k could be 1, 2, 3, … , K.
Then, we set and we obtain the optimal center
What are (A), (B), (C), (D), and (E) in the above equations ?
Note: (E) is a variable, not a word or sentence
Part 2: Programming
Complete the tasks in the files:
H1P2T1_kmeans.ipynb
If you want to get some bonus points, try this task:
H1P2T2_kmeans_compression.ipynb
Grading: the number of points
|
Undergraduate Student
|
Graduate Student
|
Basic Concepts
|
10
|
10
|
K-means clustering
|
10
|
10
|
H1P2T1
|
30
|
30
|
H1P2T2
|
5 (bonus)
|
5 (bonus)
|
Total number of points
|
50 + 5
|
50 + 5
|
The following rules are used for every homework assignment.
Each homework assignment is an individual assignment, NOT a group assignment.
For part 1: You may use MS-word to write the answers, convert the file to PDF, and upload it to Blackboard. You may write the answers on a piece of paper, take a photo using your cellphone, and upload the picture to Blackboard. Make sure that your handwriting is human/TA-readable, otherwise you may lose points.
For part 2: complete the ipynb files. Do NOT convert ipynb files topdfor py or another format.
Upload your files to Blackboard and do not miss any files.
Before you submit homework files, make sure you run each and every code cell of your program files. If a code cell is supposed to generate some output (e.g., figure or text) and nothing shows up below the cell because you forget to run the cell, you will lose the points of that cell.
Do NOT use ChatGPT and Copilot.