This appendix explains the hierarchical clustering analyses for Rillig, Ryo, & Lehmann (2021) Classifying human influences on terrestrial ecosystems in Global Change Biology.
The author of this document and R script is Masahiro Ryo.
The data and R script are available here
In Section 2, we read the R packages and dataset.
In Section 3, we build a classification tree for 30 factors based on 29 ‘traits’ we identified as an expert opinion.
In Section 4, we build two classification trees for 10 factors studied in Rillig et al. (2019) in Science: One based on the traits and the other built based on the effects of the factors on several soil ecosystem properties. We compare these trees using a tanglegram approach and correlation test.
In Section 5, we show how to assess the importance of trait groups.
We use the following R packages. If you have not installed any of them, please do so before the next step.
library(knitr)
library(tidyverse)
library(ggplot2)
library(ggdendro)
library(dendextend)
library(splitstackshape)
We use the following dataset. The dataset is available at the same repository. The dataset’s format is xlsx. The file contains three sheets (Description, Traits and Experiment).
df01 <- readxl::read_excel("Rillig_etal_2021_Classifying human influences.xlsx",sheet="Traits") %>%
column_to_rownames("Factor")
df_traits <- df01
# Show the header of the data
kable(df_traits[1:5, 1:7], caption="Traits of human influential factors classified based on an expert opinion")
Nature of factor physical | Nature of factor chemical | Nature of factor biological | Resource plant | Resource soil microbe | Resource soil animal | Prox effect nomin positive plants | |
---|---|---|---|---|---|---|---|
Warming | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
Nitrogen deposition or nutrient enrichment | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
Water - reduction (drought) | 0 | 1 | 0 | 1 | 1 | 1 | 0 |
Elevated CO2 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
Microplastic -fibers, films | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
df02 <- readxl::read_excel("Rillig_etal_2021_Classifying human influences.xlsx",sheet="Experiment")
df_experiment <- df02
# summarizing the data (taking the mean among replicates and then rescale)
df_experiment_summary <-
df_experiment %>%
group_by(remark) %>%
summarise_each(mean) %>%
data.frame() %>%
column_to_rownames(var="remark") %>%
scale() # note that rescaling is needed for making a distance matrix with equally weighted variables
## Warning: `summarise_each_()` is deprecated as of dplyr 0.7.0.
## Please use `across()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
# Show the header of the data
kable(df_experiment_summary[1:5, 1:7], caption="Experimental result about the impacts of 10 factors on a soil ecosystem")
WSA | wdpt.sec. | Percent.decomposition | CO2_ppm_ThirdWeek | Rich.ASV | Comm.Comp1 | Comm.Disp | |
---|---|---|---|---|---|---|---|
Biocide - Antibiotic | 0.6761036 | -0.2391625 | -0.0144373 | 0.6108439 | 0.7175791 | -0.0613247 | -0.0855631 |
Biocide - Fungicide | 0.0056720 | 0.0023461 | 0.1241066 | 0.1024033 | -1.4543221 | 0.8041317 | 0.7124361 |
Biocide - Herbicide | 1.3278689 | -0.1011576 | 0.0933458 | -0.1558522 | 0.8027517 | -0.0737572 | 0.7796464 |
Biocide - Insecticide | -0.0119032 | -0.4116687 | 0.9317825 | -0.0670768 | 0.3555956 | -0.4810188 | 1.7154769 |
Heavy metals - copper | -0.0231748 | -0.4116687 | -1.0474064 | -0.9225483 | -0.2831989 | 0.7361235 | -1.9056214 |
The following functions are generated for making ggplot2-based dendrogram figures
(esp. for coloring different clusters if the number of clusters is given)
1) dendro_data_k
2) set_labels_params
3) plot_ggdendro
[Note] This part is lengthy and fully reproducible in the main R script and from this link. Therefore, this part is not shown in this document.
# Here some functions are defined for visualization purposes
The dataset is used for hierarchical clustering (hc) based on the dissimilarity matrix (euclidean distance). The parameter with dendro_data_k() can change how many clusters are colored differently.
hc <- df_traits %>%
dist(method='euclidean') %>%
hclust(method='average')
# Visualization
p <- plot_ggdendro(hc %>%
dendro_data_k(6), # how many colors used for clustering visualization
direction = "lr", # visualization formats
label.size = 3.5,
branch.size = 0.1,
expand.y = 3)
p <- p + theme_void() + expand_limits(x = c(-1, 32))
p
In this part we use the dataset of Rillig et al. (2019) in Science. The study investigated the effects of 10 different global change factors on soil microbial biodiversity and ecosystem functions. To test if the expert-opinion-based factor similarity can explain the effect similarity of the 10 factors, we apply a tanglegram with correlation analysis.
Building two dendrograms.
# Hierarchical clustering of the 10 factors
hc_experiment <-
df_experiment_summary %>%
dist(method = 'euclidean') %>%
hclust(method = 'average') %>%
as.dendrogram ()
# Extract the 10 factors from the 30 factors
gP <- c(4,6,7,8,11,12) # plant-related variables to be eliminated from the tanglegram analysis
# because in the soil experiment there was no plant included
hc_traits <-
df_traits[c(1,2,3,6,13,14,18,19,20,21),-gP] %>% # these rows are the corresponding 10 factors
dist(method = 'euclidean') %>%
hclust(method = 'average') %>%
as.dendrogram ()
The two dendrograms are now compared as a tanglegram.
dendlist(hc_experiment, hc_traits) %>%
untangle(method = "step1side") %>% # Find the best alignment layout
tanglegram(lwd = 0.5,
columns_width = c(1, 0.3, 1),
highlight_distinct_edges=F,
highlight_branches_lwd=F,
margin_inner=15) # Draw the two dendrograms
The left dendrogram is based on the experimental results. The right one is based on the expert-based traits. We can observe that the set of Salinity, Heavy metals - copper, and Water - reduction (drought) are connected between the dendrograms in red. This indicates that they belong to the exact same set of clustering in both dendrograms.
We also estimate the Cophenetic correlation coefficient. For estimating the confidence intervals, Masahiro Ryo developed an algorithm that conducts a bootstrap resampling on the replicated samples from the experiment dataset.
# N-times iteration
N <- 1:1000
# Correlation coefficients
cor.cophenetic = vector()
cor.bakers.gamma = vector()
# bootstrap coefficient estimate
for(i in N){
# set seed
set.seed(i)
# stratified bootstrap resampling for each global change factor (8 replicates for each factor)
boot.df <- stratified(df_experiment[,1:7], "remark", 8,replace=T) %>%
group_by(remark) %>%
summarise_each(mean) %>%
data.frame() %>%
column_to_rownames(var="remark")%>%
scale()
# dendrogram
hc_experiment.boot <- dist(boot.df, method = 'euclidean') %>%
hclust(method="average") %>%
as.dendrogram()
# Cophenetic correlation coefficient
cor.cophenetic <- c(cor.cophenetic, cor_cophenetic(hc_traits, hc_experiment.boot))
}
# mean, se, and 95confidence intervals for correlation coefficient
ci <- function(x){c(ci.lower=quantile(x, 0.05), se.lower=mean(x)-sd(x), mean=mean(x), se.upper=mean(x)+sd(x),
ci.upper=quantile(x, 0.95))}
# Cophenetic correlation coefficient
ci(cor.cophenetic)
## ci.lower.5% se.lower mean se.upper ci.upper.95%
## 0.0669489 0.1721592 0.2880399 0.4039206 0.4776521
Here we very shortly explain what was done for dropping out some traits. Drop-out is a popular approach to test the importance of the target variables. If dropping out some variables decreases the correlation strength, it suggests that the dropped out variables are important.
Here is the list of dropped out grops.
# data subset: If we want to drop-out some traits to build the dendrogram.
g1 <- c(1:3,24:29) # Nature of factor (group1)
g2 <- c(7:12) # Proximate Effect Direction (group2)
g3 <- c(4:6,15:19) # Effect mechanism (group1)
g4 <- c(13:14,20:23) # Nature of factor (group1)
Each group can be dropped out before conducting the correlation coefficient evaluation. For instance, if we want to assess the importance of the traits associated with Nature of factor (g1), we just drop them out as follows and conduct the correlation test.
df_traits <- df_traits[-g1]