First we need a function to calculate the percentage identity between two sequences. For this easy example we assume that the sequences all have the same length:
perc_id <- function(seq1, seq2){
n_equal = 0
for(i in c(1:nchar(seq1))){
chr1 = substr(seq1, i, i)
chr2 = substr(seq2, i, i)
if(chr1 == chr2){
n_equal = n_equal + 1
}
}
perc = (n_equal/nchar(seq1)) * 100
return(perc)
}
Try to run the code yourself with two sequences.
Let’s try to run the code on some test sequences.
First using the same sequence
perc_id("ATCGTAGCGATCGATTGC","ATCGTAGCGATCGATTGC")
## [1] 100
Which result in 100% identity, as expected as they are exactly the same sequence.
Second, we can try to change one character out of ten:
perc_id("AAAAAAAAAA","AAAAAAAAAT")
## [1] 90
Which result in 90% identity.
What happen if the sequences have different length?
If we try to run the code with seq1
shorter than
seq2
we have:
perc_id("AAAAAA","AAAAAAAAAAT")
## [1] 100
Resulting in 100% identity. The function checks only the first 6 characters (the length of seq1) and ignores the others positions.
If we try to run the code with seq2 shorter than seq1 we have:
perc_id("AAAAAAAAAA","AAAAAA")
## [1] 60
Resulting in 60% idenitity. The function checks 10 characters (the length of seq1), but since seq2 has a length of only 6, the remaining 4 positions that are evaluated are considered as different. Hence 6 out of 10 are the same, resulting in 60% identity.
One solution could be to check that the two sequences have the same length:
if (nchar(seq1) != nchar(seq2)){
return(0)
}
and return zero (or an error) if they do not have the same length.
A better solution would be to be able to compare also sequences of different length, for example calculating the edit distance between the two strings.
Now we can create the code to cluster sequences into OTUs, we will implement it as a function, where the inputs are:
Here is the code:
create_otus <- function(reads,threshold_percentage = 97) {
# 1) Unique high quality reads are sorted by counts (high to low)-----------
# check the number of occourences of all unique sequences
counts = table(reads)
# sort the counts
counts = sort(counts,decreasing = T)
# 2) Read with highest count is centroid of a new OTU (N=1) ----------------
# we create a variable where to save the OTUs selected
OTUs = c()
# we add the read with the highest count
highest_count_read = names(counts)[1]
OTUs = c(OTUs,highest_count_read)
# 3) Next read is compared to all OTU centroids ----------------------------
# go from the top of counts and create OTUs
for(seq in names(counts)){
# first we check if there is already an OTU with a sequence
# that is > 0.97 identical
found = ""
for(otu in OTUs){
if (perc_id(otu,seq) > threshold_percentage){
found = otu
}
}
# b) Centroid sequence and new read are < 97% identical ------------------
# read becomes centroid of a new OTU
if (found == ""){
OTUs = c(OTUs,seq)
}
}
return(OTUs)
}
Let’s try to run the code:
test_sequences = c(
"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAT",
"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAT",
"CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC"
)
create_otus(test_sequences)
## [1] "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
## [2] "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC"
Which creates two OTUs. In fact,
"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
and
"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAT"
are more than 97%
identical:
perc_id("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAT")
## [1] 97.2973
And as a consequence are grouped into one OTU.
Try to run the code on a test fastq file
We can now load a small test fastq file, directly in R:
# we load the file directly in R
test_fastq_file =read.csv("https://sunagawalab.ethz.ch/share/teaching_milanese/test_amplicon.fastq",header = F)
head(test_fastq_file[,1])
## [1] "@read1"
## [2] "ACGTATTCGAGGCGGGGCATTCAGCTATTCGATCTACGTACGATTCACGAGCTACGTACATTGCATCGGGCGTATTACGATTCTATCGATCGACTTATTGCATTCGATCGATGCTAGCATCGTTTACCGATGCTATCGTACGCATGCAT"
## [3] "+"
## [4] "IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIICCCICICIICIICII)III8)IIIIIIIICICCIICCCCICICIIIICIICICII))))))))CICIIC89BAAABIIIC)CBDDDCABBBCACAGGCAAABBB)A"
## [5] "@read2"
## [6] "ACGTATTCGAGGCGGGGCATTCAGCTATTCGATCTACGTACGATTCACGAGCTACGTACATTGCATCGGGCGTATTACGATTCTATCGATCGACTTATTGCATTCGATCGATGCTAGCATCGTTTACCGATGCTATCATACGCATGCAT"
A fastq file contains 4 lines for each read, with the following information:
@
and the read id+
and sometimes the same
information as in line 1And we need only the sequences in position 2:
# we select only one line every four
only_seq = test_fastq_file[seq(2,400,4),1]
Running the OTU clustering:
create_otus(only_seq)
## [1] "ACCGGTACGGACTTAGCGAGATTCGGATCGATTACGATCACGAAACACGAGCTACGTACATAACATCAGGCGTGTTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATGCTATCGTACCTATGCAA"
## [2] "ACGTATTCGAGGCGGGGCATTCAGCTATTCGATCTACGTACGATTCACGAGCTACGTACATTGCATCGGGCGTATTACGATTCTATCGATCGACTTATTGCATTCGATCGATGCTAGCATCGTTTACCGATGCTATCGTACGCATGCAT"
## [3] "ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGATTCACGAGCTACGTACATTTCATCGGGCGTATTACGATTCTATTTATCGCAGTATTGCATTCGATCGATGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT"
## [4] "ACGTATTTGAGGCGCCGCATTCAGCTATTCGATCTACGTACGATTCACGAGCTACGTACATTTCATCGGGCGTATTACGATTCTATTTATCGACTTATTGCATTCGATCGATGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT"
## [5] "ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCGTATTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT"
## [6] "ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGATTCACGAGCTACGTACATTTCATCGGGCGTATTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT"
## [7] "ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCGTGTTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATGCTATCGTACCTATGCAA"
## [8] "ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCAGCTAGCATCGATCGATGCATCATGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATGCTATCGTACCTATGCAA"
## [9] "ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCGTGTTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATCGATCTAGGCATTTAGCA"
Creates 9 OTUs.
Test the code with different thresholds
Running the OTU clustering with 90% identity:
create_otus(only_seq,90)
## [1] "ACCGGTACGGACTTAGCGAGATTCGGATCGATTACGATCACGAAACACGAGCTACGTACATAACATCAGGCGTGTTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATGCTATCGTACCTATGCAA"
## [2] "ACGTATTCGAGGCGGGGCATTCAGCTATTCGATCTACGTACGATTCACGAGCTACGTACATTGCATCGGGCGTATTACGATTCTATCGATCGACTTATTGCATTCGATCGATGCTAGCATCGTTTACCGATGCTATCGTACGCATGCAT"
## [3] "ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGATTCACGAGCTACGTACATTTCATCGGGCGTATTACGATTCTATTTATCGCAGTATTGCATTCGATCGATGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT"
## [4] "ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCGTATTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT"
## [5] "ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCAGCTAGCATCGATCGATGCATCATGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATGCTATCGTACCTATGCAA"
## [6] "ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCGTGTTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATCGATCTAGGCATTTAGCA"
Creates only 6 OTUs. As we decrease the threshold, we cluster more sequences together and as a consequence we decrease the number of OTUs.
Running the OTU clustering with 98% identity:
length(create_otus(only_seq,98))
## [1] 10
Results in 10 clusters (one more than running with 97%
identity)
Running the OTU clustering with 100% identity:
length(create_otus(only_seq,100))
## [1] 27
Result in 27 clusters.
Implement the code to count how many reads map to the different OTUs
We previously created the function to create OTUs, now we can create a function to count the reads that map to the defined OTUs.
Input:
Here is the code to create the function:
count_otus <- function(reads,OTUs,threshold_percentage = 97) {
# we create a list where to save the result
result = rep(0,length(OTUs))
names(result) = OTUs
# we check each read to which OTUs it maps to
for (read in reads){
best_sel_OTU = ""
for (OTU in OTUs){
if(perc_id(read,OTU) > threshold_percentage){
best_sel_OTU = OTU
}
}
# and update the count for the OTU
result[best_sel_OTU] = result[best_sel_OTU] + 1
}
# return the result
return(result)
}
And we can test it on the same fastq sample:
OTUs = create_otus(only_seq)
OTUs_count = count_otus(only_seq,OTUs)
OTUs_count
## ACCGGTACGGACTTAGCGAGATTCGGATCGATTACGATCACGAAACACGAGCTACGTACATAACATCAGGCGTGTTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATGCTATCGTACCTATGCAA
## 18
## ACGTATTCGAGGCGGGGCATTCAGCTATTCGATCTACGTACGATTCACGAGCTACGTACATTGCATCGGGCGTATTACGATTCTATCGATCGACTTATTGCATTCGATCGATGCTAGCATCGTTTACCGATGCTATCGTACGCATGCAT
## 27
## ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGATTCACGAGCTACGTACATTTCATCGGGCGTATTACGATTCTATTTATCGCAGTATTGCATTCGATCGATGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT
## 15
## ACGTATTTGAGGCGCCGCATTCAGCTATTCGATCTACGTACGATTCACGAGCTACGTACATTTCATCGGGCGTATTACGATTCTATTTATCGACTTATTGCATTCGATCGATGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT
## 14
## ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCGTATTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT
## 9
## ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGATTCACGAGCTACGTACATTTCATCGGGCGTATTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT
## 7
## ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCGTGTTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATGCTATCGTACCTATGCAA
## 8
## ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCAGCTAGCATCGATCGATGCATCATGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATGCTATCGTACCTATGCAA
## 1
## ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCGTGTTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATCGATCTAGGCATTTAGCA
## 1
If we change the percentage identity, we obtain a different result:
OTUs = create_otus(only_seq,90)
OTUs_count = count_otus(only_seq,OTUs,90)
head(OTUs_count)
## ACCGGTACGGACTTAGCGAGATTCGGATCGATTACGATCACGAAACACGAGCTACGTACATAACATCAGGCGTGTTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATGCTATCGTACCTATGCAA
## 18
## ACGTATTCGAGGCGGGGCATTCAGCTATTCGATCTACGTACGATTCACGAGCTACGTACATTGCATCGGGCGTATTACGATTCTATCGATCGACTTATTGCATTCGATCGATGCTAGCATCGTTTACCGATGCTATCGTACGCATGCAT
## 27
## ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGATTCACGAGCTACGTACATTTCATCGGGCGTATTACGATTCTATTTATCGCAGTATTGCATTCGATCGATGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT
## 29
## ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCGTATTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT
## 19
## ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCAGCTAGCATCGATCGATGCATCATGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATGCTATCGTACCTATGCAA
## 1
## ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCGTGTTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATCGATCTAGGCATTTAGCA
## 6
But note that the sum of the read counts is the same between the two.
Load the data:
load(url("https://zenodo.org/record/7188406/files/human_microbiome_dataset.Rdata"))
which has a tax_profile
matrix:
tax_profile[1:3,1:3]
## 700002_T0 700002_T1 700002_T2
## UNASSIGNED 387 1863 1622
## Bacteroides 4018 1705 1660
## Agathobacter 80 1580 126
Can you find out at which taxonomic rank the table is summarized?
If we go to NCBI taxonomy website (https://www.ncbi.nlm.nih.gov/taxonomy/) we can search some of the clades.
If we search for Bacteroides
, we can see that it is a
Genus rank annotation, for which we can find the full lineage:
cellular organisms; Bacteria; FCB group; Bacteroidetes/Chlorobi group; Bacteroidota; Bacteroidia; Bacteroidales; Bacteroidaceae
Metadata
We have a metadata table, with one row per sample:
head(metadata)
## Subject Timepoint Sex
## 700002_T0 700002 0 M
## 700002_T1 700002 12 M
## 700002_T2 700002 24 M
## 700002_T3 700002 50 M
## 700004_T0 700004 0 F
## 700004_T1 700004 12 F
The first thing we may look at is the number of reads of each sample
(also called sequencing depth). As our samples corresponds to
the columns of the table, we just need to get the sum for each column.
We do that with the colSums()
function. Can you plot a
histogram of the sequencing depth of all samples?
seq_depth = colSums(tax_profile)
hist(seq_depth,50,xlab = "Sequencing depth")
If you want to compare different samples, is it a problem that there are different read counts? Try to divide each value within a sample by the sum of the reads in that sample to normalise the data (also called relative abundance).
rel_ab = prop.table(tax_profile,2)
rel_ab[1:3,1:3]
## 700002_T0 700002_T1 700002_T2
## UNASSIGNED 0.024052206 0.1190187 0.078482605
## Bacteroides 0.249720323 0.1089248 0.080321285
## Agathobacter 0.004972032 0.1009391 0.006096676
If you want to compare different samples, is it a problem that there are different read counts?
It is important to notice that this is not related to the number of prokaryotic cells in the sample. Instead, it is a random process that happen when sequencing short reads. As a consequence, when comparing two samples we need to take this into consideration.
If we have a case as this:
sample_A sample_B
OTU_1 2 4
OTU_2 12 24
OTU_3 8 16
and we look at the OTUs individually, we would conclude that
OTU_2
is double as abundant in sample_B
compared to sample_A
. But if we do the same evaluation with
relative abundance transformed data, we see that they are the same.
Which genus is the most and least prevalent?
We can calculate prevalence as:
prevalence = rowSums(tax_profile > 0)
The prevalence measure in how many samples each clade was profiled.
Three most abundant:
sort(prevalence,decreasing = T)[1:3]
## UNASSIGNED Bacteroides Blautia
## 480 480 480
So Bacteroides
was profiled in 480 samples (out of 480
samples), hence it is present in all samples and it is a highly
prevalent species.
Three least abundant:
sort(prevalence)[1:3]
## CAG-873 possible_genus_Sk018 Cellulosilyticum
## 0 0 0
It can happen that some clades are not measured at all, and hence have a prevalence of zero (there were no reads mapping to any sample).
If we check more:
sort(prevalence)[1:15]
## CAG-873 possible_genus_Sk018 Cellulosilyticum
## 0 0 0
## A2 Coriobacteriaceae_UCG-002 Candidatus_Stoquefichus
## 0 0 0
## Faecalibaculum Acetatifactor Tahibacter
## 0 0 0
## Sphingomonas Rosenbergiella 28-4
## 0 1 1
## Gallicola Sarcina Harryflintia
## 1 1 1
We can see that the genus Harryflintia
is measured only
in one sample.
How many zeros there are per sample and per genus?
We can check the number of zeros per sample:
n_zeros_per_sample = colSums(tax_profile == 0)
hist(n_zeros_per_sample,50,xlab = "Number of zeros in the samples")
We can check the number of zeros per genus:
n_zeros_per_genus = rowSums(tax_profile == 0)
hist(n_zeros_per_genus,50,xlab = "Number of zeros in the genera")
While the number of zeros per sample is normally distributed, the number of zeros per genus is not. In fact, when we cheked the prevalence, we saw that there are few species with high prevalence (the tail on the right) and many species with lower prevalence.
We load the vegan package and rarefy the taxonomic profile table:
# load the library
library("vegan")
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-2
# rarefy taxonomic profiles
set.seed(100)
tax_profile_rarefied = t(rrarefy(t(tax_profile),5000))
Now the colSums(tax_profile_rarefied)
will be 5,000 for
all samples.
Let’s evaluate the richness with the vegan package
richness = estimateR(t(tax_profile_rarefied))
richness = t(richness)
head(richness)
## S.obs S.chao1 se.chao1 S.ACE se.ACE
## 700002_T0 36 36.25000 0.73503387 36.73398 2.990198
## 700002_T1 64 64.25000 0.73583205 64.85848 3.639135
## 700002_T2 59 62.00000 4.17123406 60.73455 3.812244
## 700002_T3 54 54.00000 0.00000000 54.00000 3.291403
## 700004_T0 55 55.33333 0.92436727 55.75885 3.502650
## 700004_T1 55 55.00000 0.09908674 55.33298 3.520647
We are interested in the observed richness, which is the first column. Note that it is the same as doing:
richness_manual = colSums(tax_profile_rarefied > 0)
head(richness_manual)
## 700002_T0 700002_T1 700002_T2 700002_T3 700004_T0 700004_T1
## 36 64 59 54 55 55
Which samples have the highest and lowest richness?
We can plot an histogram for the richness:
hist(richness_manual,50)
The lowest is around 25 and the highest is around 90.
Is there are a difference between males and females?
We create a data frame with the information:
df_richness = data.frame(richness = richness[rownames(metadata),1])
df_richness$Sex = metadata$Sex
df_richness$Timepoint = metadata$Timepoint
And we can plot it:
library(ggplot2)
ggplot(df_richness, aes(Sex,richness)) + geom_boxplot()
It seems there is no difference between Males and Females.
Which we can also test with a t-test:
t.test(
df_richness[df_richness$Sex == "M","richness"],
df_richness[df_richness$Sex == "F","richness"])
##
## Welch Two Sample t-test
##
## data: df_richness[df_richness$Sex == "M", "richness"] and df_richness[df_richness$Sex == "F", "richness"]
## t = 0.86417, df = 477.86, p-value = 0.3879
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.069216 2.747996
## sample estimates:
## mean of x mean of y
## 56.15079 55.31140
The p-value is greater than 0.05, so we reject the alternative hypothesis. The two distributions are not statistically different, hence there is no difference between males and females.
Evaluate evenness
Functions:
SimpE<-function(x,zeros=F){
if (zeros==F) x <- as.vector(x[x>0])
S <- length(x)
x = as.data.frame(x)
D <- diversity(x, "inv")
E <- (D)/S
E}
PielouE<-function(x,zeros=F){
if (zeros==F) x <- as.vector(x[x>0])
H <- diversity(x)
S <- length(x)
J <- H/log(S)
J
}
evenness<-function(x){
res<-rbind(apply(x,1,SimpE),apply(x,1,PielouE))
colnames(res)<-rownames(x)
rownames(res)<-c("SimpE","PielouE")
res
}
We can calculate the evenness as:
even.metrics <- t(evenness(as.matrix(t(tax_profile_rarefied))))
head(even.metrics)
## SimpE PielouE
## 700002_T0 0.07788573 0.4369012
## 700002_T1 0.23907493 0.7642262
## 700002_T2 0.20338606 0.7191480
## 700002_T3 0.19412747 0.7355137
## 700004_T0 0.26743694 0.7641009
## 700004_T1 0.25191673 0.7536305
Let’s find the sample with the lowest Pielou evenness:
min_v = min(even.metrics[,2])
which(even.metrics[,2] == min_v)
## 700002_T0
## 1
And highest Pielou evenness:
max_v = max(even.metrics[,2])
which(even.metrics[,2] == max_v)
## 700018_T2
## 47
And try to plot a histogram of the distributions of the relative abundances.
First for the sample with low evenness
hist(log10(tax_profile_rarefied[,"700002_T0"]),xlim = c(0,3.5))
Note that we used log relative abundances.
And second for the sample with high evenness:
hist(log10(tax_profile_rarefied[,"700018_T2"]),xlim = c(0,3.5))
As you can observe from the histograms, the sample with low evenness as some low abundant species (on the left) and few abundance species on the right; while the sample with high evenness is more equally distributed.
Evaluate diversity
Finally, we can calculate the diversity of a microbial community, which is a combined measure of richness and evenness. There exist many indices of diversity and their development has been a very active field in ecology for decades. Here we will use the Shannon’s Diversity (or Shannon’s entropy)
We can calculate the Shannon index as:
shannon_indx = diversity(t(tax_profile_rarefied),"shannon")
And we can add it to the data frame df_richness
:
df_richness$shannon = shannon_indx[rownames(df_richness)]
head(df_richness)
## richness Sex Timepoint shannon
## 700002_T0 36 M 0 1.565644
## 700002_T1 64 M 12 3.178327
## 700002_T2 59 M 24 2.932353
## 700002_T3 54 M 50 2.933952
## 700004_T0 55 F 0 3.062007
## 700004_T1 55 F 12 3.020048
We can plot the observed richness and the Shannon index:
plot(df_richness$richness, df_richness$shannon, xlab = "Observed richness", ylab = "Shannon index")
It seems there is a high correlation between the two (which makes sense). We can also evaluate the correlation between the two:
cor(df_richness$richness,df_richness$shannon)
## [1] 0.7752869