Appendix: Rillig et al. 2021 Classifying human influences on terrestrial ecosystems

1. Introduction

This appendix explains the hierarchical clustering analyses for Rillig, Ryo, & Lehmann (2021) Classifying human influences on terrestrial ecosystems in Global Change Biology.
The author of this document and R script is Masahiro Ryo.
The data and R script are available here
In Section 2, we read the R packages and dataset.
In Section 3, we build a classification tree for 30 factors based on 29 ‘traits’ we identified as an expert opinion.
In Section 4, we build two classification trees for 10 factors studied in Rillig et al. (2019) in Science: One based on the traits and the other built based on the effects of the factors on several soil ecosystem properties. We compare these trees using a tanglegram approach and correlation test.
In Section 5, we show how to assess the importance of trait groups.

2. R packages and dataset

2.1. Reading R packages

We use the following R packages. If you have not installed any of them, please do so before the next step.

library(knitr)
library(tidyverse)
library(ggplot2)
library(ggdendro)
library(dendextend)
library(splitstackshape)

2.2. Reading the dataset

We use the following dataset. The dataset is available at the same repository. The dataset’s format is xlsx. The file contains three sheets (Description, Traits and Experiment).

2.2.1. Factors and traits based on expert opinion (qualitative)

df01 <- readxl::read_excel("Rillig_etal_2021_Classifying human influences.xlsx",sheet="Traits") %>%
      column_to_rownames("Factor")

df_traits <- df01

# Show the header of the data
kable(df_traits[1:5, 1:7], caption="Traits of human influential factors classified based on an expert opinion")

Traits of human influential factors classified based on an expert opinion
	Nature of factor physical	Nature of factor chemical	Resource plant	Resource soil microbe	Resource soil animal	Prox effect nomin positive plants
Warming	1	0	0	0	0	1
Nitrogen deposition or nutrient enrichment	0	1	1	1	0	1
Water - reduction (drought)	0	1	1	1	1	0
Elevated CO2	0	1	1	0	0	1
Microplastic -fibers, films	1	1	0	0	0	0

2.2.2. The factors tested in the experiment (quantitative)

df02 <- readxl::read_excel("Rillig_etal_2021_Classifying human influences.xlsx",sheet="Experiment") 
df_experiment <- df02

# summarizing the data (taking the mean among replicates and then rescale)
df_experiment_summary <- 
  df_experiment %>%
  group_by(remark) %>%
  summarise_each(mean)  %>%
  data.frame() %>%
  column_to_rownames(var="remark") %>%
  scale() # note that rescaling is needed for making a distance matrix with equally weighted variables

## Warning: `summarise_each_()` is deprecated as of dplyr 0.7.0.
## Please use `across()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

# Show the header of the data
kable(df_experiment_summary[1:5, 1:7], caption="Experimental result about the impacts of 10 factors on a soil ecosystem")

Experimental result about the impacts of 10 factors on a soil ecosystem
	WSA	wdpt.sec.	Percent.decomposition	CO2_ppm_ThirdWeek	Rich.ASV	Comm.Comp1	Comm.Disp
Biocide - Antibiotic	0.6761036	-0.2391625	-0.0144373	0.6108439	0.7175791	-0.0613247	-0.0855631
Biocide - Fungicide	0.0056720	0.0023461	0.1241066	0.1024033	-1.4543221	0.8041317	0.7124361
Biocide - Herbicide	1.3278689	-0.1011576	0.0933458	-0.1558522	0.8027517	-0.0737572	0.7796464
Biocide - Insecticide	-0.0119032	-0.4116687	0.9317825	-0.0670768	0.3555956	-0.4810188	1.7154769
Heavy metals - copper	-0.0231748	-0.4116687	-1.0474064	-0.9225483	-0.2831989	0.7361235	-1.9056214

3. Hierarchical clustering for the 30 factors using the 29 traits

3.1. Defining functions

The following functions are generated for making ggplot2-based dendrogram figures
(esp. for coloring different clusters if the number of clusters is given)
1) dendro_data_k
2) set_labels_params
3) plot_ggdendro

[Note] This part is lengthy and fully reproducible in the main R script and from this link. Therefore, this part is not shown in this document.

# Here some functions are defined for visualization purposes

3.2. Hierarchical clustering

The dataset is used for hierarchical clustering (hc) based on the dissimilarity matrix (euclidean distance). The parameter with dendro_data_k() can change how many clusters are colored differently.

hc <- df_traits %>%
      dist(method='euclidean') %>%
      hclust(method='average')

# Visualization
p <- plot_ggdendro(hc  %>% 
                   dendro_data_k(6),    # how many colors used for clustering visualization
                   direction   = "lr",  # visualization formats
                   label.size  = 3.5,
                   branch.size = 0.1,
                   expand.y    = 3)

p <- p + theme_void() + expand_limits(x = c(-1, 32))
p

4. Hierarchical clustering for the 10 factors: Traits vs Experimental results

In this part we use the dataset of Rillig et al. (2019) in Science. The study investigated the effects of 10 different global change factors on soil microbial biodiversity and ecosystem functions. To test if the expert-opinion-based factor similarity can explain the effect similarity of the 10 factors, we apply a tanglegram with correlation analysis.

4.1. Building the tanglegram

Building two dendrograms.

# Hierarchical clustering of the 10 factors
hc_experiment <- 
  df_experiment_summary %>%
  dist(method = 'euclidean') %>%
  hclust(method = 'average') %>%
  as.dendrogram ()

# Extract the 10 factors from the 30 factors

gP <- c(4,6,7,8,11,12) # plant-related variables to be eliminated from the tanglegram analysis 
                       # because in the soil experiment there was no plant included

hc_traits <- 
  df_traits[c(1,2,3,6,13,14,18,19,20,21),-gP] %>%  # these rows are the corresponding 10 factors
  dist(method = 'euclidean') %>%
  hclust(method = 'average') %>%
  as.dendrogram ()

The two dendrograms are now compared as a tanglegram.

dendlist(hc_experiment, hc_traits) %>%
  untangle(method = "step1side") %>%   # Find the best alignment layout
  tanglegram(lwd = 0.5, 
             columns_width = c(1, 0.3, 1),
             highlight_distinct_edges=F,
             highlight_branches_lwd=F,
             margin_inner=15)                       # Draw the two dendrograms

The left dendrogram is based on the experimental results. The right one is based on the expert-based traits. We can observe that the set of Salinity, Heavy metals - copper, and Water - reduction (drought) are connected between the dendrograms in red. This indicates that they belong to the exact same set of clustering in both dendrograms.

4.2. Correlation coefficient

We also estimate the Cophenetic correlation coefficient. For estimating the confidence intervals, Masahiro Ryo developed an algorithm that conducts a bootstrap resampling on the replicated samples from the experiment dataset.

# N-times iteration
N <- 1:1000
# Correlation coefficients
cor.cophenetic = vector()
cor.bakers.gamma = vector()

# bootstrap coefficient estimate
for(i in N){
  # set seed
  set.seed(i)
  # stratified bootstrap resampling for each global change factor (8 replicates for each factor)
  boot.df <- stratified(df_experiment[,1:7], "remark", 8,replace=T) %>%
            group_by(remark) %>%
            summarise_each(mean) %>%
            data.frame() %>%
            column_to_rownames(var="remark")%>%
            scale() 
    
  # dendrogram
  hc_experiment.boot   <- dist(boot.df, method = 'euclidean') %>%
                          hclust(method="average") %>%
                          as.dendrogram()
  
  # Cophenetic correlation coefficient
  cor.cophenetic   <- c(cor.cophenetic, cor_cophenetic(hc_traits, hc_experiment.boot))    

  
}

# mean, se, and 95confidence intervals for correlation coefficient
ci <- function(x){c(ci.lower=quantile(x, 0.05), se.lower=mean(x)-sd(x), mean=mean(x), se.upper=mean(x)+sd(x),
                    ci.upper=quantile(x, 0.95))}

# Cophenetic correlation coefficient
ci(cor.cophenetic)

##  ci.lower.5%     se.lower         mean     se.upper ci.upper.95% 
##    0.0669489    0.1721592    0.2880399    0.4039206    0.4776521

5. Drop-out approach

Here we very shortly explain what was done for dropping out some traits. Drop-out is a popular approach to test the importance of the target variables. If dropping out some variables decreases the correlation strength, it suggests that the dropped out variables are important.

Here is the list of dropped out grops.

# data subset: If we want to drop-out some traits to build the dendrogram.
g1 <- c(1:3,24:29)     # Nature of factor (group1)
g2 <- c(7:12)          # Proximate Effect Direction (group2)
g3 <- c(4:6,15:19)     # Effect mechanism (group1)
g4 <- c(13:14,20:23)   # Nature of factor (group1)

Each group can be dropped out before conducting the correlation coefficient evaluation. For instance, if we want to assess the importance of the traits associated with Nature of factor (g1), we just drop them out as follows and conduct the correlation test.

df_traits <- df_traits[-g1]