Session 1.10. Describing microbial community structure#
Goals#
The goal of this exercise session is to practice some of the concepts for describing the taxonomic structure of microbial communities.
You will be guided through an exercise that will cover the main concepts. Embeded in the exercise you will find specific Questions and Tasks that you will need to answer.
Questions should be answered by interpreting the plots or data produced during the exercise.
Tasks will consist on more complex questions for which you will need to produce new plots or analyses. This should be easy by re-using the code used through the exercise.
General procedure for the exercise#
You will need to open RStudio and copy and paste into the Console the pieces of code that you will find when following the exercise.
The data#
You will use an example dataset from the Human Microbiome Project. You will analyze a dataset composed of 2,710 samples corresponding to the microbiome composition of different body sites of 176 subjects.
As the main input information we will use the OTU table that resulted from the sequence processing of 16S amplicon data. It is a text file called otutab.tsv. It consists on a tab-delimited table with the number of reads retrieved for each OTU (columns) in each sample (rows). We will also use a tab-delimited text file containing auxiliary information for the samples (rows), called auxildata.tsv.
Set some default R settings#
The first thing we need to do is to set some R settings for the session. Although the vast majority of standard computations are included in R as we open it, there exists a huge repository of additional functionalities that may be added to it. It is organized in ‘packages’ that contain extra functions related to a very huge list of disciplines. You may find the whole list here. We will exploit the functionalities of several packages. These packages have to be launched for every R session through the library() function.
library(vegan)
library(tidyverse)
library(data.table)
Read the data#
In order to read input data as tables in text format we use the fread()
function.
otutab<-fread("../../masterdata/HMP_data/otutab_min1000reads.tsv",sep="\t",header=T,data.table = F)
auxiltab<-fread("../../masterdata/HMP_data/auxildata_min1000reads.tsv",sep="\t",header=T,stringsAsFactors = T,data.table = F)
rownames(otutab)<-auxiltab$SampleID # Set the rownames
otutab<-otutab[,-1] # Delete the first column
auxiltab$SampleID<-as.factor(auxiltab$SampleID)
auxiltab$Subject<-as.factor(auxiltab$Subject)
auxiltab$VisitNo<-as.factor(auxiltab$VisitNo)
Once we read the two files we can have a look to the top of each data frame with the function head()
or to the their structure with str()
. The auxiltab data frame is composed of 7 variables:
SampleID: the sample ID.
Subject: the subject ID.
VisitNo: the visit nomber for each subject.
Sex: the subject’s sex.
Bodysite: broad body site categories (OR=oral, SK=skin, ST=stool and VG=vagina).
Bodysubsite: fine body site categories.
Bodysubsite.simpl: simplified fine body site categories (Left_Antecubital_fossa and Right_Antecubital_fossa merged; Left_Retroauricular_crease and Right_Retroauricular_crease merged; Supragingival_plaque and Subgingival_plaque merged).
head(auxiltab)
str(auxiltab)
Alpha-diversity (or local diversity)#
In this section we will explore the alpha-diversity, that is the local or sample-specific diversity. This is in contrast to the beta-diversity, which is the diversity that occurs among samples (which we will explore later).
Sequencing depth#
The first thing we may look at is the number of reads of each sample (also called sequencing depth). As our samples corresponds to the rows of the otutab data frame, we just need to get the sum for each row. We do that with the rowSums()
function. We may want to visualize it as an histogram.
seq.depth<-rowSums(otutab)
plot.df<-data.frame(seq.depth,auxiltab)
ggplot(data=plot.df,aes(x=seq.depth)) +
geom_histogram(bins=100)
It is pretty obvious that there is a very variable number of reads/sample. That is, some samples were highly sequenced while others have a small sequencing depth. That may be problematic for studying alpha diversity. We’ll explore this in the following steps.
Species richness and rarefaction#
Species richness is the total number of species of a community of organisms. However, we only have access to a sample representing a fraction of the entire community. The higher the sequencing depth, the better the sample is going to represent the true richness of the community: the samples with a higher number of reads may contain more OTUs just because we sequenced more, and thus, had a greater chance of detecting low-abundance OTUs; these OTUs may be also present in the samples with low sequencing depth but remain undetected.
The easiest estimate of the species richness is the observed species richness, that is, to simply count the number of different OTUs in each of our samples. This will be highly affected by the uneven sequencing depth between samples.
It is possible to use species richness estimators, such as Chao1, which try to correct for the unobserved species due to insufficient sequencing.
We can get both with estimateR()
. We will find them as two different variables (S.obs and S.chao1):
richness<-t(estimateR(otutab))
plot.df<-data.frame(richness,seq.depth,auxiltab)
ggplot(data=plot.df,aes(x=S.obs)) +
geom_histogram(bins=100)
ggplot(data=plot.df,aes(x=S.chao1)) +
geom_histogram(bins=100)
We can explore better the dependence between richness estimates and the sequencing depth with a technique called rarefaction curve. It consists on recording the increasing number of species within a sample as we add more and more reads. We use the rarecurve()
function.
rarecurve(otutab[c(1,2,3,4,5,12,52,100),],step=500,cex=0.4,label = T)
Rarefaction curves are useful for checking whether we used enough sequencing depth in order to detect the diversity within each sample. If sufficient sequencing depth was used (i.e. we sequenced enough) all curves reach an almost flat asymptote where the addition of extra sequences do not translate to an increase in the number of OTUs. This is the case of samples 700106065 and 700110831. However, this is not the case for most of our dataset and it is rarely the case in natural complex communities.
We may note, however, that there are samples with higher richness compared to others just because they contain more reads. This is the case of sample 700016653 compared to sample 700014995. That is what may make the observed number of OTUs a biased estimate of the sample’s richness.
The most common procedure to correct this bias is to either to use the above mentioned species richness estimators or to perform a rarefaction or sub-sampling: to randomly select for all the samples a fixed number of reads. Commonly the minimum number of reads/sample is taken.
You can do that with the rrarefy()
function. However, you will need to load a pre-computed table that we prepared in advance:
otutab.rr<-fread("../../masterdata/HMP_data/otutab.rr.tsv",sep="\t",header=T,data.table = F)
auxiltab.rr<-fread("../../masterdata/HMP_data/auxildata.rr.tsv",sep="\t",header=T,stringsAsFactors = T,data.table = F)
rownames(otutab.rr)<-auxiltab.rr$SampleID # Set the rownames
otutab.rr<-otutab.rr[,-1] # Delete the first column
- Question 1:
How many reads/sample has this new OTU table that we just loaded?
Now, we can recompute the number of observed OTUs in this new sub-sampled OTU table. We will store it as a variable called S.rr. This observed richness underestimates the real richness in the communities as we randomly select only a small amount of reads for each sample (many OTUs will remain undetected). However, as the number of selected reads is equal in all samples, this is the best way of fairly comparing richness values between samples.
richness.rr<-t(estimateR(otutab.rr))
plot.df<-data.frame(richness,S.rr=richness.rr[,1],seq.depth,auxiltab.rr)
An interesting question may be whether we observe differences in the richness of microbial communities between body sites.
- Task 1:
What is the body site with the highest and lowest median observed richness? And with the highest and lowest median richness estimeted with the Chao1 estimator? And with the highest and lowest median richness after rarefaction?
Suggestion*:You should be able to solve it by plotting.
We can also test whether the observed differences are statistically significant. We apply here a Kruskal-Wallis test, which is a non-parametric test analog to ANOVA. The richness values are rarely normally distributed and thus a non-parametric test would be preferable.
kruskal.test(S.rr~Bodysite,data = plot.df)
pairwise.wilcox.test(plot.df$S.rr,plot.df$Bodysite,p.adjust.method="fdr")
There seem to be differences in richness between body sites which are statistically significant.
- Task 2:
Are there differences between males and females in their microbiome’s richness? Could you assign these differences to a specific body site or does richness differ between males and females for all body sites?
Suggestion: Use the observed richness after rarefaction (S.rr).
Eveness#
The richness is only one component of the diversity. Specifically, it is important to consider how abundance varies among species, that is, evenness. One of the most common ways to visualize evenness is with a rank-abundance curve (RAC). A RAC can be constructed by ranking species from the most-to-least abundant. Once ordered a RAC consists on plotting the rank and the relative abundance of each OTU within a sample. Let’s plot a RAC for two samples:
rankabundance<-function(otutab){
ra<-function(x){
x<-x[x>0]
R<-rank(x,ties.method = "random")
R<-(max(R)-R)+1
data.frame(rank=R,abundance=x/sum(x),log.rank=log10(R),log.abundance=log10(x/sum(x)),OTU=names(x))
}
res<-NULL
for (i in 1:nrow(otutab)){
res<-rbind(res,data.frame(ra(unlist(otutab[i,])),sample=rownames(otutab)[i]))
}
res
}
ra.df<-rankabundance(otutab.rr[c(102,1943),])
ggplot(ra.df,aes(x=rank,y=abundance,col=sample)) +
geom_point()
- Question 2:
Which of the two samples is more even and which is more uneven? How many OTUs in each sample have at least a 1% of the total reads?
Several metrics have been developed in ecology to quantify evenness. Most of them share two desirable features: to be bound between 0 an 1 and to be independent of species richness. We will compute two metrics for evenness: the Simpson’s eveness and the Pielou evenness.
SimpE<-function(x,zeros=F){
if (zeros==F) x <- as.vector(x[x>0])
S <- length(x)
x = as.data.frame(x)
D <- diversity(x, "inv")
E <- (D)/S
E}
PielouE<-function(x,zeros=F){
if (zeros==F) x <- as.vector(x[x>0])
H <- diversity(x)
S <- length(x)
J <- H/log(S)
J
}
evenness<-function(x){
res<-rbind(apply(x,1,SimpE),apply(x,1,PielouE))
colnames(res)<-rownames(x)
rownames(res)<-c("SimpE","PielouE")
res
}
even.metrics<-t(evenness(otutab.rr))
plot.df<-data.frame(richness,S.rr=richness.rr[,1],even.metrics,seq.depth,auxiltab.rr)
ggplot(data=plot.df,aes(x=Bodysite,y=SimpE,fill=Bodysite)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle=90,hjust=1,vjust=0.5))
- Question 3:
Explore the eveness metrics that we just computed. Do they agree with your answer for the previous question?
Diversity#
What ecologists call the diversity of a community is a combined measure of richness and evenness. There exist many indices of diversity and their development has been a very active field in ecology for decades. We will calculate two of them:
Shannon’s Diversity (or Shannon’s entropy): Shannon’s diversity metric is derived from Shannon’s information entropy, and is essentially a measure of uncertainty: it measures the chance that the next sampled individual will belong to a different species.
Simpson’s Diversity (or Dominance): represents the probability that two individuals randomly selected from a sample will belong to different species.
plot.df$shan<-diversity(otutab.rr,"shannon")
plot.df$simp<-diversity(otutab.rr,"simpson")
- Task 3:
Are there differences in diversity between body sites? And between the body sub-sites?
Provide graphical and statistical support to your answers.
Suggestion: You can re-use pieces of the code previosuly used to analyze the species Richness.
Beta-diversity (or among sites diversity)#
The beta-diversity is defined as the diversity that occurs among sites.
Dissimilarity indices#
Most of the measures of similarity/dissimilarity may be computed with the vegdist()
function by selecting the appropriate method. The vegdist()
function computes a dissimilarity measure by default (that is, the higher te value the lower the similarity).
Computing dissimilarity matrices for big datasets is computationally demanding. For convinience we will use tw pre-computed distance matrices: Bray-Curtisand Sörensen. We will all need to load a pre-computed rarefied dataset:
bc.rr<-fread("../../masterdata/HMP_data/bc.rr.tsv",sep="\t",header=T,data.table = F)
rownames(bc.rr)<-bc.rr[,1]
bc.rr<-bc.rr[,-1]
bc.rr<-as.dist(bc.rr)
sor.rr<-fread("../../masterdata/HMP_data/sor.rr.tsv",sep="\t",header=T,data.table = F)
rownames(sor.rr)<-sor.rr[,1]
sor.rr<-sor.rr[,-1]
sor.rr<-as.dist(sor.rr)
Ordination#
The primary aim of ordination is to represent multiple objects in a reduced number of orthogonal (i.e., independent) axes. Ordination plots are particularly useful for visualizing the similarity among objects. For example, in the context of beta-diversity, samples that are closer in ordination space have species assemblages that are more similar to one another than samples that are further apart in ordination space.
There are various ordination techniques that can be applied to multivariate biodiversity data. Common methods include: Principal Components Analysis (PCA), Correspondence Analysis (CA), Principal Coordinates Analysis (PCoA), Factor Analysis (FA), and Nonmetric Multidimensional Scaling (NMDS). A thorough free book describing (with an affordable language) most of the ordination techniques jointly with R tutorials may be found here applied to tree communities.
Principal components analysis (PCA) is one of the oldest ordination techniques, which you may already know. It provides graphs that show the Euclidean distance between sites. No other distances can be investigated with PCA. This ordination method is not ideal for analysis of information on species abundances because of the limitations of the Euclidean distance with species abundance data.
NMDS#
While only euclidean distances may be used with PCA, any kind of dissimilarity measure is suitable for the Non-metric multidimensional scaling (NMS or NMDS). Its use is similar to the use of PCA, however, the positions of sites in the ordination are chosen so that rank order only of intersite distances is represented. This means that it should not to be interpreted in terms of two different axes with different degree of importance, but the NMDS plot may be rotated in any manner. The closer two samples are in the plot (in any direction!), the more similar they are.
We can perform and NMDS with the monoMDS()
function by setting the dissimilarity matrix. Be patient: we are dealing with 2,710 samples and ~43,000 OTUs. This may take some time to compute (~2min)
nmds<-monoMDS(bc.rr,pc = T,k=4)
toplot.df<-data.frame(nmds$points,auxiltab.rr)
ggplot(data=toplot.df,aes(x=MDS1,y=MDS2,col=Bodysite)) +
geom_point()
Body sites seem to have clearly different microbial composition as they are well separated in the NMDS plot.
Hypothesis testing:#
While the ordination techniques above mentioned are very useful and are the logical first step to explore high-dimensional data, no statistically supported conclusions can be obtained from them. Ordination technics often are supplemented by hypothesis testing techniques (i.e. statistical tests). The main hypothesis to be tested with sample-by-species matrices is wether the similarity between samples is organized in pre-defined groups. That is, if different clusters of samples exists based on their similarity.
Permutational MANOVA#
Permutational MANOVA (through the adonis()
function) is a technique analog to ANOVA but applicable to multidimensional data, that is, when our response variable is not a single variable but an array of many variables (as is our case in which every sample is characterized by the abundance of several OTUs). We can test, thus, if the difference in composition of the microbial communities is explained by the grouping of these samples in different categories.
For example, you can test if the body site explains a significant portion of between-samples disimilarity. You can also test more than one factor, like body site and the visit number. Be patient. This may also take some time to compute
adonis(bc.rr~auxiltab.rr$Bodysite)
adonis(bc.rr~auxiltab.rr$Bodysite+auxiltab.rr$VisitNo)
We can clearly see that microbial communities differ between body sites and between successive visits (both P-value < 0.05). However, altough significant, differences between visits account for a very small fraction of the variance (R2 = 0.23%) in the dissimilarity matrix compared to the proportion explained by the body site (R2 = 14.09%).
Once identified the main explanatory variables (the body site in this case) it may be interesting to split the dataset and further analyze it by pieces. Let’s create a new dataset only with the skin samples.
otutab.rr.skin<-otutab.rr[auxiltab.rr$Bodysite=="SK",] # select skin samples in the otu table
bc.rr.skin<-as.matrix(bc.rr) # convert distance object to matrix
bc.rr.skin<-bc.rr.skin[auxiltab.rr$Bodysite=="SK",auxiltab.rr$Bodysite=="SK"] # select skin samples in the Bray-Curtis matrix
bc.rr.skin<-as.dist(bc.rr.skin) # convert to distance again
auxiltab.rr.skin<-droplevels(auxiltab.rr[auxiltab.rr$Bodysite=="SK",]) # select skin samples in the auxiliary data table
Now we can perform and NMDS and statistically test differences only for the skin samples.
nmds<-monoMDS(bc.rr.skin,pc = T,k=3)
toplot.df<-data.frame(nmds$points,auxiltab.rr.skin)
ggplot(data=toplot.df,aes(x=MDS1,y=MDS2,col=Bodysubsite.simpl)) +
geom_point()
adonis(bc.rr.skin~auxiltab.rr.skin$Bodysubsite.simpl)
- Task 4:
Identify which of the body sites (OR, SK and VG) have a greater differentiation between body subsites?
Identify which of the body sites (OR, SK and ST) have a greater differentiation between males and females?
Could you show that graphically and give some statistical support?
Suggestion: Re-use the last piece of code to create subsets of the dataset for the 4 body site (OR, SK, VG and ST). Use these to complete the task with graphical (NMDS) and statistical (Permutational MANOVA) evidence.
If you manage, you should be able to complete the following summary table with the R2 and P-value value from the test. For each value in the table you should be able to produce an NMDS plot.
Body site |
Difference between body sub-sites (R2, P-value) |
Diference between sexs (R2, P-value) |
---|---|---|
SK |
R2 = 4.78%, P-value < 0.05 |
|
SK |
(Not computable) |
|
SK |
||
SK |
(Not computable) |