MATH5855: Multivariate Analysis
Assignment 2
Due date: 5 pm on Tuesday October 25, 2022
Instructions:
Assignment 2 contains 3 questions worth a total of 100 points, which count
towards 15% of the final mark for the subject.
Use tables, graphs and concise text explanations to support your answers. Unclear answers
may not be marked, at your own cost. All tables and graphs must be clearly identified and
commented.
You may either submit two files (a PDF file of the answers and the R Markdown file
containing the R code) or answer all the questions in a single R Markdown file.
Questions
Question 1. (Test and confidence region for mean) [20 Marks] Municipal wastewater treatment
plants release their discharges into rivers and streams and they are required to test the biochemical
oxygen demand (BOD) and suspended solids (SS) of their discharges on a regular basis. There are
some concerns about the reliability of the results provided. So, to confirm the results, a study was
conducted and n = 11 samples of effluent were divided and sent to two laboratories for testing.
One-half of each sample was sent to the Wisconsin State Laboratory of Hygiene, and one-half was
sent to a private commercial laboratory routinely used in the monitoring program. The data are
displayed in Table 1.
Assume the data follow a multivariate normal distribution. We want to determine whether there
is enough statistical evidence that the two laboratories' analysis procedures differ, in
the sense that they produce systematically different results.
(a) Use R to find the p-value for testing the hypothesis H0 : μ1 = μ2 versus H1 : μ1 ≠ μ2,
where μ1 and μ2 are the mean vectors for measurements from the commercial and state
labs, respectively. You can write your own function or use a predefined function in R.
          Commercial lab     State lab
Sample    BOD     SS         BOD     SS
Table 1: Effluent Data.
(b) Use R to find and draw the T² confidence region for μ1 − μ2 at confidence level 95%. Does
this confidence region confirm the result obtained in part (a)?
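
As a rough illustration of the paired-comparison approach this question points to, the R sketch below runs Hotelling's T² test on the differences between the two labs and traces the matching 95% confidence ellipse for the mean difference. It is only a sketch under stated assumptions: the measurements are simulated stand-ins (not the Table 1 values), and the object names (commercial, state, zero_inside, ell) are hypothetical.

```r
# Simulated stand-in data (NOT the Table 1 values): 11 split samples, 2 variables
set.seed(1)
n <- 11; p <- 2
commercial <- cbind(BOD = rnorm(n, 25, 10), SS = rnorm(n, 45, 15))
state      <- cbind(BOD = rnorm(n, 35, 10), SS = rnorm(n, 33, 15))

d    <- commercial - state           # paired differences, one row per sample
dbar <- colMeans(d)                  # sample mean of the differences
S    <- cov(d)                       # sample covariance of the differences

# Hotelling's T^2 for H0: mean difference = (0, 0)
T2    <- drop(n * t(dbar) %*% solve(S) %*% dbar)
Fstat <- (n - p) / (p * (n - 1)) * T2           # F(p, n - p) under H0
pval  <- pf(Fstat, p, n - p, lower.tail = FALSE)

# 95% region: all delta with n (dbar - delta)' S^{-1} (dbar - delta) <= c2
c2 <- p * (n - 1) / (n - p) * qf(0.95, p, n - p)
zero_inside <- T2 <= c2              # TRUE exactly when pval >= 0.05

# Boundary of the ellipse: dbar + sqrt(c2 / n) * V Lambda^{1/2} (cos t, sin t)'
e     <- eigen(S)
theta <- seq(0, 2 * pi, length.out = 200)
ell   <- t(dbar + sqrt(c2 / n) *
             e$vectors %*% (sqrt(e$values) * rbind(cos(theta), sin(theta))))
plot(ell, type = "l", xlab = "BOD difference", ylab = "SS difference")
points(0, 0, pch = 4)                # the null value (0, 0)
```

Because the test and the region are built from the same F quantile, the origin falls outside the ellipse exactly when the p-value is below 0.05, which is the link between parts (a) and (b).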
Question 2. (Principal Component Analysis) [45 Marks] The dataset "consum2007.dat" contains
information about per capita consumption expenditures of urban households in 31 regions
in China in 2007 (Lang and Jin, 2021). The variables are the consumption expenditures on food
(Food), clothing (Cloth), residence (Resid), household facilities, articles and services (HousF),
health care and medical services (Health), transport and communication (TranC), education,
culture and recreation (Educ) and miscellaneous goods (Miscel).
(a) Use R to calculate the correlations between the variables. Between which variables is
the correlation different from zero?
(b) For principal component analysis, do you suggest using the covariance matrix or the
correlation matrix? Why?
(c) Use R to perform the principal component analysis using the suggested matrix in part (b).
(d) What percentage of the variability of the data does each principal component explain? Also
compute the cumulative percentages of variance and draw a screeplot for these data.
(e) Give explicitly the linear combinations of the original data to create the first and second
principal components and give an interpretation of these linear combinations, describing
which variables play the biggest roles in the construction of those two PCs.
(f) Draw the biplot for the first 2 principal components. Describe what you can extract from the
plot.
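
The workflow in parts (a) to (f) could be sketched in R roughly as follows. This is an illustration only: the data are simulated stand-ins that reuse the eight variable names, the choice of the correlation matrix (scale. = TRUE) is shown as one option rather than as the answer to part (b), and the pairwise t-test for zero correlation assumes bivariate normality.

```r
# Simulated stand-in for consum2007.dat (31 regions x 8 expenditure columns).
# With the real file, read.table("consum2007.dat", header = TRUE) might work,
# but the file layout is an assumption.
set.seed(2)
vars <- c("Food", "Cloth", "Resid", "HousF", "Health", "TranC", "Educ", "Miscel")
base <- rnorm(31)                                 # shared "overall spending" factor
X <- sapply(c(900, 250, 200, 120, 150, 300, 280, 80),
            function(s) s * (5 + base + rnorm(31, sd = 0.5)))
colnames(X) <- vars
n <- nrow(X)

R <- cor(X)                                       # 8 x 8 correlation matrix
r <- R[upper.tri(R)]                              # the 28 distinct pairs
# t-test of H0: rho = 0 for each pair (two-sided p-values)
pvals <- 2 * pt(-abs(r) * sqrt((n - 2) / (1 - r^2)), df = n - 2)

pca  <- prcomp(X, center = TRUE, scale. = TRUE)   # PCA on the correlation matrix
prop <- pca$sdev^2 / sum(pca$sdev^2)              # share of variance per PC
cum  <- cumsum(prop)                              # cumulative share
screeplot(pca, type = "lines")

# Coefficients of the first two PCs in terms of the standardised variables
loadings12 <- pca$rotation[, 1:2]
biplot(pca, choices = 1:2)
```

Dropping scale. = TRUE gives the covariance-matrix version, in which variables with larger variances dominate the leading components.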
Question 3. (Canonical Analysis) [35 Marks]
(a) Let X and Y be p-variate and q-variate random vectors, respectively.
Let X* = AᵀX + u and Y* = BᵀY + v, where A and B are non-singular matrices with
properly defined dimensions. Show that the first canonical correlation between X* and Y*
is the same as the first canonical correlation between X and Y, and that the canonical
correlation vectors are given by a* = A⁻¹a and b* = B⁻¹b, where a and b are the
first canonical correlation vectors of X and Y, respectively.
(b) Consider the provided data in Question 2. Let X and Y denote the set of variables {Food,
Cloth, Resid, HousF, Miscel} and {Health, TranC, Educ}, respectively. Calculate the canonical
correlations between X and Y and write the first canonical variables in explicit form.
Do they have a clear interpretation?
(c) Test the significance of correlation between the two sets; comment on the results. How many
canonical correlations are significant?
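
A possible R sketch of the canonical correlation steps is given below, again on simulated stand-in data rather than the consumption dataset. It uses stats::cancor, applies Bartlett's large-sample chi-square approximation sequentially (one common choice, and only approximate for n = 31), and ends with a numerical check of the invariance property from part (a). The names X, Y, A, B, Xs and Ys are hypothetical, and the check does not replace the proof.

```r
# Simulated stand-in data: 31 rows, 5 X-columns and 3 Y-columns
set.seed(3)
n <- 31; p <- 5; q <- 3
z <- rnorm(n)                                  # shared factor inducing correlation
X <- sapply(1:p, function(j) z + rnorm(n))
Y <- sapply(1:q, function(j) z + rnorm(n))

cc  <- cancor(X, Y)                            # centres both sets by default
rho <- cc$cor                                  # canonical correlations, decreasing
a1  <- cc$xcoef[, 1]                           # first canonical variable: a1' x
b1  <- cc$ycoef[, 1]                           # first canonical variable: b1' y

# Bartlett's sequential test: H0_k says correlations k+1, ..., s are all zero
s <- min(p, q)
bartlett <- t(sapply(0:(s - 1), function(k) {
  stat <- -(n - 1 - (p + q + 1) / 2) * sum(log(1 - rho[(k + 1):s]^2))
  df   <- (p - k) * (q - k)
  c(k = k, chisq = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}))

# Numerical check of invariance under X* = A'X + u, Y* = B'Y + v
A  <- matrix(rnorm(p * p), p)                  # non-singular with probability 1
B  <- matrix(rnorm(q * q), q)
Xs <- sweep(X %*% A, 2, rnorm(p), "+")         # each row is x' A + u'
Ys <- sweep(Y %*% B, 2, rnorm(q), "+")
ccs <- cancor(Xs, Ys)
same_cor <- isTRUE(all.equal(ccs$cor, rho))    # canonical correlations unchanged
# Coefficient vectors should agree with A^{-1} a1 up to sign
coef_match <- all.equal(abs(ccs$xcoef[, 1]), abs(drop(solve(A, a1))))
```

Each row of bartlett tests whether the remaining canonical correlations are jointly zero, so the number of significant correlations is read off from the first row whose p-value exceeds the chosen level.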