Instructions: Submit two files. One should be a write-up of all solutions and observations, as LastnameFirstnameSolution.pdf.
The second should be an archive LastnameFirstnameCode.zip containing code and any results
files.
1 [5 pts.] Irreducible data example
In class we discussed that not all datasets’ dimensionality can be successfully reduced using PCA.
(a)[1 pts] Discuss the cases when PCA will fail.
(b)[2 pts] How do we quantify that it fails?
(c)[2 pts] Provide an example dataset of 2D points (specify the points as vectors of numbers) in which PCA will
not work well for dimensionality reduction. Explain why. Hint: Think of 2D points and reduction to 1D.
2 [40 pts] Dimensionality reduction
For this question you can use the cloud.data.pdf from the UCI ML repository: https://archive.ics.uci.edu/ml/
datasets/Cloud. Read about it to get familiar with what is measured. Within the data, there are two datasets: DB
#1 and DB #2. For this homework, just use the 1024 vectors in DB #1. Use python for all your programming. You
will have to submit your code in LastnameFirstnameCode.zip together with the relevant write-up in the main
solution file LastnameFirstnameSolution.pdf.
(a)[5 pts] Load the data into a python program and center it. Note: there should be a function called center()
in your code that achieves this.
(b)[5 pts] Compute the covariance matrix of the data Σ. Hint: by using the definition of sample covariance, as a
matrix product or as a sum of outer products. See book for details. Use Numpy for linear algebra computations
(https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.linalg.html).As a result you should have a function
covar() in your code which does not use the built-in covariance functions.
(c)[5 pts] Compute the eigenvectors and eigenvalues of Σ. The numpy linear algebra module referenced above has a
function that can help.
(d)[10 pts] Plot the percentage of retained variance as a function of the number of principle components used.
Determine the number of principal components (PCs) r that will ensure 90% retained variance? How did you
compute this? Provide a function in your code that determines r based on an arbitrary percentage α of retained
variance.
(e)[10 pts] Compute the reduced dimension data matrix A with two dimensions by projection on the first two PCs.
Plot the points using a scatter plot (two dimensional diagram that places each sample i according to its new
dimensions ai1, ai2). Discuss the observations. Are there clusters of close-by points? Argue for or against whether
1
these are sufficient dimensions.
(f)[5 pts] Study the PCA implementation in pythons’ sklearn library https://scikit-learn.org/stable/
modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA. Do PCA using the library
on the same data. Do the eigenvalues approximately match to what you computed above?
3 [20 pts.] Kernel methods
Consider the problem of finding the most dissimilar diametric pair (MDDP): this is a pair of data points that are
dissimilar from the mean and also dissimilar from each other. Below is an algorithm that would find such a pair
given a data matrix D:
Algorithm 1: MDDP(D)
Result: a, b - the most dissimilar diametric points in D
Compute the data mean µ = mean(D);
s = +∞;
for i in (1 . . . n) do
for j in (i + 1 . . . n) do
temp = x
T
i µ + x
T
j µ + x
T
i xj ;
if temp < s then
s = temp;
a = xi
;
b = xj ;
end
end
end
The algorithm computes the sum of inner products x
T
i µ + x
T
j µ + x
T
i xj for each pair of points and returns the
pair with the lowest such quantity.
(a)[5 pts] Demonstrate the execution of this algorithm on the following data matrix of 2D instances: D=
0 1
1 3
5 0
2 4
.
Show the steps and the resulting MDD pair of points.
(b)[15 pts] As we discussed in class sometimes we would like to kernelize methods to handle non-linearity in data.
Provide a pseudo-code for a kernel version of the MDDP algorithm above. The goal is to kernelize the algorithm
for an arbitrary kernel Hint: Assume that you can compute the kernel matrix K, corresponding to
some mapping φ() and then use the basic kernel operations we discussed in class and also in the
book, to derive the steps of MDDP in terms of elements in K.
4 [10 pts.] Orthogonality of Error in Regression:
Prove that Yˆ T
= 0, where Yˆ is the predicted response and = Y −Yˆ is the error between the actual and predicted
response. Hint: Use the solution for the predicted response as a transformation of Y through the
hat matrix.
5 [25 pts.] Regression to understand CO2 pollution:
For this task, you will use data about CO2 emissions in European cities included as an excel sheet. The data has
total emissions per city (Column E) as well as other variables including number of airports, buildings, etc
(columns G and later). Use only data for cities (Column C: admin level=8). Feel free to prepare the data into
simpler text format before you use code.
The goal will be to train linear regression models of the total CO2 as a function of the predictor variables in
columns G and on.
2
(a) [7pts] Use scikitlearn’s linear regression: linear model.LinearRegression to learn an unregularized model.
Report the corresponding coefficient for each predictor variable.
(b) [7pts] Use scikitlearn’s ridge regression: linear model.Ridge to learn a regularized model. Try 2 values for
alpha: 1 and 10. Report the coefficients of the predictor variables for the two models.
(c) [11pts] Discuss your findings. Which are the most important factors determining the CO2 emissions? How
do the loading coefficients for factors differ between regularized and non-regularized models? How does the
residual error differ? Use visuazlizations as necessary to discuss your findings.