7SSGN110 Methods in Environmental Research
Practical 9: Ordination & Classification
This practical runs through the ordination techniques taught in the lecture and provides a classification example using K-means clustering in R.
1. Unconstrained Ordination (Indirect Gradient Analysis)
Recapping our lecture, Indirect Gradient Analysis (unconstrained ordination) seeks to find underlying patterns in multivariate environmental data (e.g. species or environmental measurements). Unconstrained ordinations do not attempt to explain community composition or other responses using environmental data; they merely detect the general variations within the data.
Introduction to the dataset
To investigate variability in chemical concentrations across the Catskill Mountain area of New York, USA, Lovett et al. (2000) measured water chemistry (10 chemical concentrations) of 39 streams.
The data are available in ‘Lovett.csv’ – this data is from Table 1 in Lovett et al. (2000; explanation of the variables used can be found in the paper). Spend some time familiarising yourself with the data set and how it is presented.
1.1. Inspecting the data
i. First import the Lovett.csv data into RStudio.
What command do you use to read csv data?: read.csv(). Remember to set the working directory first and to give the correct .csv file name in the brackets. Make sure R knows the first row is a header. How?: by adding header = TRUE. You will also need to add row.names = 1 in the brackets so that R treats the first column as the row names (in this case, the site name for each row).
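Putting those hints together, the import might look like the following (a minimal sketch, assuming Lovett.csv is in your working directory and that you name the data frame lovett, as the later steps do):
lovett <- read.csv("Lovett.csv", header = TRUE, row.names = 1) # first column holds the site names
head(lovett) # quick check that the import looks right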
To understand some of the underlying patterns of the data, we can first plot a scatterplot matrix using the pairs() function. Note that this initial inspection is not necessary for an ordination; we are merely trying to show that there are clear environmental patterns in the data.
ii. Create a scatterplot matrix using the pairs() function now. You should end up with a scatterplot matrix like below. Stuck?: use pairs(lovett)
Ø Which variables are strongly correlated? How might you expect these to be demonstrated in the ordination?
Rather than working with very complex multivariate data (one dimension for each measurement, i.e. 10 dimensions – beyond easy comprehension), we will use an unconstrained ordination to reduce this large multivariate dataset to fewer dimensions (e.g. 2, enabling us to plot the site variation on a 2D plot).
In doing so we will be able to examine how concentrations of different chemicals are related (i.e. how these variables ‘load’ similarly on the primary axes we produce) and see how the different streams are related to one another in terms of those chemical concentrations.
Recall from the lecture that the unconstrained analysis we perform depends upon the environmental gradient length and whether any undesirable artefacts are present in the dataset. Thus, we can first run a DCA to investigate the gradient lengths.
1.2. Detrended Correspondence Analysis (DCA)
The vegan library is full of useful ecological analysis tools.
i. Start by installing and loading the library with:
install.packages("vegan")
library(vegan)
ii. A DCA can be run in R very easily using the decorana() function included in the vegan library. This should take the form of:
lovettDCA <- decorana(Data)
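For example, with the data frame imported earlier (a sketch, assuming you named it lovett):
lovettDCA <- decorana(lovett) # DCA of the stream chemistry data; decorana() requires non-negative values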
iii. After you have successfully run the DCA, print and plot the results:
print(lovettDCA)
plot(lovettDCA)
Recapping from the lecture, if the primary axis length is <3 S.D., we can use a linear method (PCA). If the length is >4 S.D., we should use unimodal methods (CA or DCA). The printed axis lengths are in S.D. units.
Ø What is the primary axis length and what type of analysis should we use?
1.3. Principal Components Analysis (PCA)
PCA can be run with the prcomp() function, which is built into R.
i. Run the PCA now using a command similar to the below:
lovettPCA <- prcomp(YourDataName, scale = TRUE)
As noted in the help page for prcomp (viewable with help(prcomp)), scale is "a logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place. The default is FALSE for consistency with S, but in general scaling is advisable".
ii. To view the results of the PCA, use the following code:
biplot(lovettPCA)
summary(lovettPCA)
iii. Now rerun the PCA but with scale = FALSE, then plot the results.
For clarification, scaling standardises each variable by its variance. Without scaling, because DOC has much higher absolute values, the majority of the plot is taken up by minor relative variations in DOC, rather than the plot being standardised for the relative differences in each chemical between sites.
1.3.1. How many components are useful?
The first step in our analysis is to assess how useful the components are for describing variation across multiple variables. We can use the eigenvalues and a screeplot for this.
i. To get the eigenvalues from the prcomp object created above, use:
lovett_ev <- lovettPCA$sdev^2 # the eigenvalues are the squared standard deviations of the PCs
print(lovett_ev)
ii. To create a screeplot use:
screeplot(lovettPCA, type = "lines")
From these results answer the following question:
Ø How many components are ‘useful’? Remember what was defined as ‘useful’ in the lecture. Hint: any eigenvalue above 1 is deemed “useful”.
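You can check this directly against the eigenvalues computed above (a minimal sketch using the lovett_ev object):
sum(lovett_ev > 1) # counts the components with an eigenvalue above 1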
For the first three principal components from the results output so far, complete the following table (use summary(lovettPCA)):
Component | Eigenvalue | % of Variance Explained | Cumulative % of Variance Explained
1 | | |
2 | | |
3 | | |
1.3.2. Which variables are important and how are they grouped on the principal components?
Now that we have assessed which of the principal components are useful, we should check which variables are important for these components and how they are grouped (i.e. loaded) on those components.
i. To get the component loadings use:
lovett_cl <- lovettPCA$rotation
print(lovett_cl)
Larger absolute values indicate that the environmental variable has a greater influence on that PC axis.
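A quick way to rank the contributions is to sort the absolute loadings for a given axis (a sketch using the lovett_cl object created above):
sort(abs(lovett_cl[, "PC1"]), decreasing = TRUE) # largest absolute loadings on PC1 first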
Ø Which variables make the greatest contribution to PC1?
Ø Which variables make the greatest contribution to PC2?
Ø Which variables make the greatest contribution to PC3?
ii. To get the component scores for each site we can use:
lovett_cs <- lovettPCA$x
print(lovett_cs)
Ø Referring back to the component loading and biplot, how do different variables load similarly or differently for PC1?
Ø Referring back to the component loading and biplot, how do different variables load similarly or differently for PC2?
Ø From answering the above questions, make some general statements about what aspects of chemistry the two principal components represent (you may want to refer to Lovett et al., 2000).
PC1:
PC2:
1.4. Correspondence Analysis (CA)
Let’s imagine we observe a horseshoe pattern within our PCA plot (refer back to the lecture to recap what this is). As both our DCA and PCA results indicated a short primary axis (<2 S.D.), we should attempt a CA before resorting to a DCA, as CA has stronger statistical power.
We must first call two necessary libraries: FactoMineR and factoextra (you may need to install these first).
i. Load the libraries as necessary (i.e. library(NecessaryLibrary) ). You may need to install a package first (i.e. install.packages("NecessaryLibrary") – note the quotation marks).
ii. We can plot a Correspondence Analysis using:
nameOfResults <- CA(YourDataset)
This will automatically plot the results. Remember to inspect the plot for the ‘arch effect’ artefact. Assuming the PCA and CA were unsuitable because of unwanted data artefacts, we would revert to using a DCA (which we had run previously).
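As a concrete sketch for our stream data (assuming the lovett data frame from section 1):
lovettCA <- CA(lovett) # FactoMineR's CA() plots the rows (sites) and columns (chemicals) automatically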
iii. Continuing with our CA, print the output to understand the various elements this creates:
print(nameOfResults)
iv. We can inspect individual elements by adding a $ immediately after the results object name in the print command, e.g. to print the eigenvalues, we can use:
print(nameOfResults$eig)
1.5. Factor Analysis (FA)
To complete our exploration of unconstrained ordination techniques, we will finish with an example of running Factor Analysis. Factor analysis is built into R (factanal()).
Factor Analysis attempts to identify distinct factors underlying the patterns in the data. It does so by determining which environmental variables are correlated with which others, and which are irrelevant to particular axes. This is a little more complicated, so let’s run through an example.
i. Run:
FA_output <- factanal(lovett, 3, rotation="none")
ii. Use help(factanal) to identify what the various parts of the command mean.
The number we plug in (in this case 3) is the number of factors we attempt to create. “Factors” can be thought of a little like Principal Components. However, unlike PCs, not all variables are included in all factors. You can think of factors as pseudo-environmental variables. For example, as conductivity and salinity are often highly correlated in aquatic environments, a factor might merge these and other similarly related (collinear) parameters. Dissolved Oxygen, however, might be unrelated, so might be left out of that factor.
Ø Print the factor analysis result. How many parameters do Factors 1, 2 & 3 comprise?
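Hint: the print method for loadings can hide small values, which makes each factor’s membership easier to read (0.3 is a common, if arbitrary, cutoff):
print(FA_output$loadings, cutoff = 0.3) # loadings below 0.3 are blanked out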
iii. To plot the analysis, we must first extract the loadings of the primary and secondary axes:
load <- FA_output$loadings[, 1:2] # loadings on Factors 1 and 2
plot(load, type = "n") # set up an empty plot
text(load, labels = rownames(load), cex = 0.7) # label each variable at its loading position
iv. Finally, repeat all the Factor Analysis steps with the setting rotation="varimax"
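As a sketch, the varimax run mirrors the steps above; only the rotation argument changes:
FA_varimax <- factanal(lovett, 3, rotation = "varimax")
load_v <- FA_varimax$loadings[, 1:2] # loadings on Factors 1 and 2
plot(load_v, type = "n")
text(load_v, labels = rownames(load_v), cex = 0.7)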
Ø What do you notice about the different plots and what variables are included for each factor? What do you think varimax means?
2. Constrained Ordination/Direct Gradient Analysis
Direct Gradient Analysis (constrained ordination) examines patterns in one dataset (e.g. an ecological community) and seeks to explain the variation using a second, linked dataset (e.g. environmental parameters).
We will demonstrate constrained ordination with paired macrophyte (i.e. aquatic plant) and environmental data taken from ponds in the UK. Familiarise yourself with Mac2020.csv and EnvParams.csv.
2.1. Determining which constrained analysis
Before running our constrained analysis, we first must run a DCA to determine the gradient length (and therefore which type of analysis is appropriate to run).
i. First, let’s import the two related datasets:
Species <- read.csv("Mac2020.csv", header = TRUE, row.names = 1)
EnvParams <- read.csv("EnvParams.csv", header = TRUE, row.names = 1)
ii. Next, run the DCA on the species data (only). Recall that the command for a DCA is decorana()
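A minimal sketch of this step (using the Species object created above):
pondDCA <- decorana(Species) # DCA on the macrophyte community data only
print(pondDCA) # the axis lengths (in S.D. units) appear in the printout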
Ø What is the length of the longest axis? What does this mean for whether the data should adopt a linear or unimodal analysis?
2.2. Canonical Correspondence Analysis (CCA)
Given the long primary axis (i.e. the long environmental gradient of our data), we can safely assume a unimodal analysis is applicable (i.e. that we can/should use a CCA).
i. A CCA can now be run using the cca() command (which is part of the vegan package):
pond_cca <- cca(Species, EnvParams)
You should notice that, unlike unconstrained ordinations, constrained ordinations require two input datasets: 1) species community data and 2) environmental parameters. Again, constrained ordinations seek to explain the variation in one dataset (species) using potential explanators from a second dataset (environment).
ii. We can construct a CCA plot using:
plot(pond_cca)
iii. We can also be selective in what we show, by amending what is "display[ed]" (i.e. removing one or more of the options in speech marks below):
plot(pond_cca,display = c("sp", "wa", "cn"))
Ø Amend the above to remove or show each of sp, wa and cn. What do they each correspond to?
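For example, dropping "cn" should leave only the species scores and the site (weighted-average) scores, with no environmental arrows:
plot(pond_cca, display = c("sp", "wa"))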
iv. Use print(pond_cca) to inspect the results of our CCA analysis. You will notice Eigenvalues for both constrained and unconstrained axes.
Ø How many constrained axes are provided? Why are there this number of constrained axes?
v. We can use summary(pond_cca) to print all the data associated with the CCA.
2.3. Redundancy Analysis (RDA)
To perform and plot an RDA, the steps are identical except that you use the rda() function in place of cca(). Repeat the constrained ordination above using an RDA. Note that, as we established from our DCA, our data have a long gradient (>4 S.D.), so we should not really be using a linear method (such as RDA). We are therefore merely including this step to show how you can perform an RDA on your data.
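A minimal sketch of the RDA equivalent (same two datasets as the cca() call above):
pond_rda <- rda(Species, EnvParams) # the linear counterpart of cca()
plot(pond_rda)
print(pond_rda)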
Ø How do the plots of RDA and CCA differ?
Ø As a reminder, RDA is a linear method whilst CCA is a unimodal method. How might these transformations spread the data and contribute to the observed difference?
3. Cluster Analysis
Our ordination plots do not visually show any clear groupings for sites or species. We can however attempt to use statistical analysis and machine learning to identify clusters.
3.1. K-means (Machine Learning Partitioning Analysis)
As an example of clustering, let’s use a very common machine learning technique for identifying clusters: K-means.
i. Start by calling the factoextra library (NB: may need installing first):
library(factoextra)
ii. As discussed in the lecture, the starting point has an important bearing on the result of the clustering, so it is important to note or set the start point (or ‘seed’) of the random number generator. We can do this with the command:
set.seed(123)
iii. We are now ready to run k-means on our data. For now, let’s set the number of groups (k) to 4:
kmeansEnv <- kmeans(EnvParams, 4)
iv. We can visualise our grouping using:
fviz_cluster(kmeansEnv, data = EnvParams,
             palette = c("#00AFBB", "#2E9FDF", "#E7B800", "#FC4E07"),
             ggtheme = theme_minimal())
v. Using print(kmeansEnv) will show the results of the analysis, including the number of sites in each grouping, the mid-points (centres) of each cluster, which group each site is a member of (in order) and, importantly, the within-cluster sum of squares. Do this now and inspect the results.
The relatively high sum of squares indicates that the data show quite sizeable variation from the fitted cluster centres. This should be visually apparent from the considerable overlap of the groups.
vi. If the plot looks too cluttered, we can choose to plot only the points by adding the argument below within the brackets of fviz_cluster:
… geom = c("point"),…
vii. We can also bind the identified cluster to our original dataset as a new column using:
BoundDataset <- cbind(EnvParams, cluster = kmeansEnv$cluster)
print(BoundDataset)
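Before we finish, one optional aside: if you want to probe whether 4 was a sensible number of clusters, factoextra also provides fviz_nbclust(), which plots the total within-cluster sum of squares against k (the ‘elbow’ method). This goes beyond the steps above, so treat it as an extra:
fviz_nbclust(EnvParams, kmeans, method = "wss") # look for the 'elbow' where the curve flattens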
Congratulations for conducting cluster analysis in R!
As there are many, many options for ordination and cluster analysis, we will leave the practical here. The above, however, covers the most common methods of ordination. In the R script uploaded to KEATS, I have included examples of how to run AGNES (Agglomerative Nesting) and DIANA (Divisive Analysis), which you can try should you wish.
References Cited
Lovett, G. M., Weathers, K. C., & Sobczak, W. V. (2000). Nitrogen saturation and retention in forested watersheds of the Catskill Mountains, New York. Ecological Applications, 10(1), 73–84. https://doi.org/10.1890/1051-0761(2000)010[0073:NSARIF]2.0.CO;2