
MA429 Algorithmic Techniques for Data Mining
Exercises 9: Homework is summative
Submission
Submit the summative solutions by 11.59pm of Thursday 2nd April (London time) to
the Moodle submission link. This homework is worth 10% of the final course mark. (This
is increased from the 5% planned, in order to decrease the weight on the final exam due to
coronavirus disruption.)
Submit a single pdf that contains the solutions to all questions. (This can include scanned
versions of handwritten arguments, although typesetting is preferred.) Submit your R code for
the programming question as an R script or Rmd file; if Rmd, please include the report that
includes the code’s output.
Your files should be named with your candidate exam number, which should also appear
within the files to be safe. Do not show your name or registration number anywhere: these
submissions are anonymous.
Details of whether you need to submit a plagiarism/receipt form, and how, are under
discussion within Mathematics and will follow in due course.
You are to do this homework individually. You are welcome to use the course materials
provided and recommended, including all textbooks mentioned. You may also use generic online
reference sources such as Wikipedia pages, but you must not search specifically for solutions
to these problems (though I doubt that it would help anyway).
1. The figure below shows some data points in two dimensions.
(a) What are the two principal component (PC) directions?

(b) If the coordinates of the red point are (3,−1), roughly what would be its coordinates
in the principal component space?
2. You wish to cluster the points shown below into two clusters, hoping for the clusters
indicated here by colour. We compare K-means with K = 2, DBSCAN, and spectral
clustering. What clusters would you expect each to produce? Which method(s) would
find the right clusters, as colour-coded here?
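Since the figure is not reproduced here, the R sketch below runs the three methods on a
simulated stand-in point set (two interlocking half-moons); the simulated data, the eps and
minPts values, and the use of the dbscan and kernlab packages are assumptions of the sketch,
not part of the exercise.

    # Sketch only: the exercise's actual points are in the figure.
    # Two interlocking half-moons are simulated as a stand-in example.
    set.seed(1)
    n  <- 100
    t1 <- runif(n, 0, pi); t2 <- runif(n, 0, pi)
    moons <- rbind(cbind(cos(t1), sin(t1)),            # upper half-moon
                   cbind(1 - cos(t2), 0.5 - sin(t2)))  # lower, interlocking half-moon
    moons <- moons + matrix(rnorm(4 * n, sd = 0.05), ncol = 2)

    km <- kmeans(moons, centers = 2, nstart = 20)        # K-means with K = 2
    db <- dbscan::dbscan(moons, eps = 0.25, minPts = 5)  # DBSCAN (package 'dbscan')
    sc <- kernlab::specc(moons, centers = 2)             # spectral clustering (package 'kernlab')

    par(mfrow = c(1, 3))
    plot(moons, col = km$cluster,     pch = 19, main = "K-means")
    plot(moons, col = db$cluster + 1, pch = 19, main = "DBSCAN (0 = noise)")
    plot(moons, col = sc@.Data,       pch = 19, main = "Spectral")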
3. You begin with standardised data with n data points in p dimensions.
(a) What is the sum of squares of all n × p scores, and why?
(b) You perform PCA on this data, getting p score vectors, each of dimension n. What
do they represent?
(c) What is the sum of squares of all pn entries of these PCA score vectors, and why?
(d) Is the variance of entries of the first score vector more or less than 1, and why?
(e) What is the interpretation of the variance of the first score vector divided by p?
(f) The biplot below is for the iris dataset, with 4 features. Explain the meaning of the
“42” near the top left corner, near coordinates (−2, 2.3).
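For orientation on parts (a)-(f), the quantities involved can be computed in R on the built-in
iris data; the sketch below is illustrative only (it assumes the biplot was produced from the
standardised measurements via prcomp), and the written justifications are still required.

    # Illustrative sketch: PCA on the standardised iris measurements (p = 4 features).
    x  <- scale(iris[, 1:4])        # standardise each column to mean 0, variance 1
    pc <- prcomp(x)                 # pc$x holds the n x p matrix of score vectors

    sum(x^2)                        # sum of squares of all n * p standardised entries
    sum(pc$x^2)                     # sum of squares of all n * p PCA scores
    var(pc$x[, 1])                  # variance of the first score vector
    var(pc$x[, 1]) / ncol(x)        # that variance divided by p

    biplot(pc)                      # observation labels such as "42" are row numbers
    pc$x[42, 1:2]                   # (PC1, PC2) scores of the observation labelled 42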
(e) What does K-means give for K = 2 in the two cases? For both r = 1 and r = 2,
answer the following question: Is it true that you will always obtain the two natural
clusters for any choice of the initial centres?
6. 10 marks. Spectral clustering uses K-means on a different representation of the point
set via eigenvectors obtained from the normalised Laplacian matrix. Instead, we could try
using the principal components obtained by PCA and run K-means clustering on them;
let us call this method PCA clustering. Consider the example in the picture:
Let K = 3. We have seen in the lecture that spectral clustering would successfully identify
the three clusters corresponding to the three rings. Would PCA clustering also succeed in
identifying these rings? Justify your answer. (Hint: What are the two principal component
directions? What would the points look like in that coordinate system?)
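Since the picture is not reproduced here, the R sketch below simulates three concentric rings
as a stand-in and runs the PCA-clustering procedure defined above on them; the simulated
data and parameter choices are assumptions of the sketch, and a written justification is still
what the question asks for.

    # Illustrative sketch of "PCA clustering": K-means on the principal component scores.
    set.seed(429)
    ring <- function(n, r) {        # n noisy points on a circle of radius r
      t <- runif(n, 0, 2 * pi)
      cbind(r * cos(t), r * sin(t)) + matrix(rnorm(2 * n, sd = 0.05), ncol = 2)
    }
    x     <- rbind(ring(100, 1), ring(100, 2), ring(100, 3))
    truth <- rep(1:3, each = 100)

    pc     <- prcomp(x)                                       # 2-d data: two PC directions
    pca_km <- kmeans(pc$x[, 1:2], centers = 3, nstart = 20)   # K-means on the PC scores

    table(truth, pca_km$cluster)    # compare the PCA-clustering labels with the ring labels
    plot(pc$x, col = truth, pch = 19, xlab = "PC1", ylab = "PC2")  # the rings in PC coordinates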
7. 15 marks.
(a) Generate a simulated data set with 20 observations in each of three classes (i.e. 60
observations total), and 50 variables. The three classes should be clearly separated
from one another. (Hint: there are many ways to obtain such a data set. For
example you can generate the variables randomly, and then add different shifts to
certain variables for the different classes.)
(b) Perform PCA on the 60 observations and plot the first two principal component score
vectors. Use a different color to indicate the observations in each of the three classes.
If the three classes appear separated in this plot, then continue on to part (c). If not,
then return to part (a) and modify the simulation so that there is greater separation
between the three classes.
(c) Perform K-means clustering of the observations with K = 3. How well do the clusters
that you obtained in K-means clustering compare to the true class labels?
(d) Now perform PCA clustering: K-means clustering with K = 3 on the first two
principal component score vectors, rather than on the raw data. That is, perform K-
means clustering on the 60 × 2 matrix of which the first column is the first principal
component score vector, and the second column is the second principal component
score vector. How do the clusters compare to the true class labels?
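One possible construction for parts (a)-(d) is sketched below in R; the +5/-5 shift on the first
10 variables, the seed, and the plotting choices are arbitrary illustrative assumptions, and any
simulation giving clearly separated classes is acceptable.

    # (a) 60 observations (20 per class) on 50 variables, separated by mean shifts.
    set.seed(1)
    x <- matrix(rnorm(60 * 50), nrow = 60, ncol = 50)
    x[ 1:20, 1:10] <- x[ 1:20, 1:10] + 5    # class 1: shift the first 10 variables up
    x[41:60, 1:10] <- x[41:60, 1:10] - 5    # class 3: shift them down (class 2 unshifted)
    true_class <- rep(1:3, each = 20)

    # (b) PCA; plot the first two principal component score vectors, coloured by class.
    pc <- prcomp(x)
    plot(pc$x[, 1:2], col = true_class, pch = 19, xlab = "PC1", ylab = "PC2")

    # (c) K-means with K = 3 on the raw data, compared with the true labels.
    km_raw <- kmeans(x, centers = 3, nstart = 20)
    table(true_class, km_raw$cluster)

    # (d) PCA clustering: K-means with K = 3 on the 60 x 2 matrix of PC1 and PC2 scores.
    km_pca <- kmeans(pc$x[, 1:2], centers = 3, nstart = 20)
    table(true_class, km_pca$cluster)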
