Exercise 1: Clustering of sequences into OTUs

First we need a function to calculate the percentage identity between two sequences. For this easy example we assume that the sequences all have the same length:

perc_id <- function(seq1, seq2){
  n_equal = 0
  for(i in c(1:nchar(seq1))){
    chr1 = substr(seq1, i, i)
    chr2 = substr(seq2, i, i)
    if(chr1 == chr2){
      n_equal = n_equal + 1
    }
  }

  perc = (n_equal/nchar(seq1)) * 100
  return(perc)
}

Exercise Task 1

Try to run the code yourself with two sequences.

Let’s try to run the code on some test sequences.

First using the same sequence

perc_id("ATCGTAGCGATCGATTGC","ATCGTAGCGATCGATTGC")

## [1] 100

Which result in 100% identity, as expected as they are exactly the same sequence.

Second, we can try to change one character out of ten:

perc_id("AAAAAAAAAA","AAAAAAAAAT")

## [1] 90

Which result in 90% identity.

What happen if the sequences have different length?

If we try to run the code with seq1 shorter than seq2 we have:

perc_id("AAAAAA","AAAAAAAAAAT")

## [1] 100

Resulting in 100% identity. The function checks only the first 6 characters (the length of seq1) and ignores the others positions.

If we try to run the code with seq2 shorter than seq1 we have:

perc_id("AAAAAAAAAA","AAAAAA")

## [1] 60

Resulting in 60% idenitity. The function checks 10 characters (the length of seq1), but since seq2 has a length of only 6, the remaining 4 positions that are evaluated are considered as different. Hence 6 out of 10 are the same, resulting in 60% identity.

One solution could be to check that the two sequences have the same length:

if (nchar(seq1) != nchar(seq2)){
    return(0)
}

and return zero (or an error) if they do not have the same length.

A better solution would be to be able to compare also sequences of different length, for example calculating the edit distance between the two strings.

Code to create OTUs

Now we can create the code to cluster sequences into OTUs, we will implement it as a function, where the inputs are:

a list of sequences
the threshold percentage identity

Here is the code:

create_otus <- function(reads,threshold_percentage = 97) {
  # 1) Unique high quality reads are sorted by counts (high to low)-----------
  # check the number of occourences of all unique sequences
  counts = table(reads)
  # sort the counts
  counts = sort(counts,decreasing = T)

  # 2) Read with highest count is centroid of a new OTU (N=1) ----------------
  # we create a variable where to save the OTUs selected
  OTUs = c()
  # we add the read with the highest count
  highest_count_read = names(counts)[1]
  OTUs = c(OTUs,highest_count_read)

  # 3) Next read is compared to all OTU centroids ----------------------------
  # go from the top of counts and create OTUs
  for(seq in names(counts)){
    # first we check if there is already an OTU with a sequence
    # that is > 0.97 identical
    found = ""
    for(otu in OTUs){
      if (perc_id(otu,seq) > threshold_percentage){
        found = otu
      }
    }
    # b) Centroid sequence and new read are < 97% identical ------------------
    #    read becomes centroid of a new OTU
    if (found == ""){
      OTUs = c(OTUs,seq)
    }
  }

  return(OTUs)
}

Exercise Task 2

Let’s try to run the code:

test_sequences = c(
  "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
  "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
  "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
  "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAT",
  "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAT",
  "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC"
)
create_otus(test_sequences)

## [1] "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
## [2] "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC"

Which creates two OTUs. In fact, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" and "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAT" are more than 97% identical:

perc_id("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
        "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAT")

## [1] 97.2973

And as a consequence are grouped into one OTU.

Try to run the code on a test fastq file

We can now load a small test fastq file, directly in R:

# we load the file directly in R
test_fastq_file =read.csv("https://sunagawalab.ethz.ch/share/teaching_milanese/test_amplicon.fastq",header = F)
head(test_fastq_file[,1])

## [1] "@read1"                                                                                                                                               
## [2] "ACGTATTCGAGGCGGGGCATTCAGCTATTCGATCTACGTACGATTCACGAGCTACGTACATTGCATCGGGCGTATTACGATTCTATCGATCGACTTATTGCATTCGATCGATGCTAGCATCGTTTACCGATGCTATCGTACGCATGCAT"
## [3] "+"                                                                                                                                                    
## [4] "IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIICCCICICIICIICII)III8)IIIIIIIICICCIICCCCICICIIIICIICICII))))))))CICIIC89BAAABIIIC)CBDDDCABBBCACAGGCAAABBB)A"
## [5] "@read2"                                                                                                                                               
## [6] "ACGTATTCGAGGCGGGGCATTCAGCTATTCGATCTACGTACGATTCACGAGCTACGTACATTGCATCGGGCGTATTACGATTCTATCGATCGACTTATTGCATTCGATCGATGCTAGCATCGTTTACCGATGCTATCATACGCATGCAT"

A fastq file contains 4 lines for each read, with the following information:

A line starting with @ and the read id
The DNA sequence
A line starting with + and sometimes the same information as in line 1
A string of characters that represents the quality score (same number of characters as in line 2)

And we need only the sequences in position 2:

 # we select only one line every four
only_seq = test_fastq_file[seq(2,400,4),1]

Running the OTU clustering:

create_otus(only_seq)

## [1] "ACCGGTACGGACTTAGCGAGATTCGGATCGATTACGATCACGAAACACGAGCTACGTACATAACATCAGGCGTGTTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATGCTATCGTACCTATGCAA"
## [2] "ACGTATTCGAGGCGGGGCATTCAGCTATTCGATCTACGTACGATTCACGAGCTACGTACATTGCATCGGGCGTATTACGATTCTATCGATCGACTTATTGCATTCGATCGATGCTAGCATCGTTTACCGATGCTATCGTACGCATGCAT"
## [3] "ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGATTCACGAGCTACGTACATTTCATCGGGCGTATTACGATTCTATTTATCGCAGTATTGCATTCGATCGATGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT"
## [4] "ACGTATTTGAGGCGCCGCATTCAGCTATTCGATCTACGTACGATTCACGAGCTACGTACATTTCATCGGGCGTATTACGATTCTATTTATCGACTTATTGCATTCGATCGATGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT"
## [5] "ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCGTATTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT"
## [6] "ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGATTCACGAGCTACGTACATTTCATCGGGCGTATTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT"
## [7] "ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCGTGTTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATGCTATCGTACCTATGCAA"
## [8] "ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCAGCTAGCATCGATCGATGCATCATGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATGCTATCGTACCTATGCAA"
## [9] "ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCGTGTTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATCGATCTAGGCATTTAGCA"

Creates 9 OTUs.

Test the code with different thresholds

Running the OTU clustering with 90% identity:

create_otus(only_seq,90)

## [1] "ACCGGTACGGACTTAGCGAGATTCGGATCGATTACGATCACGAAACACGAGCTACGTACATAACATCAGGCGTGTTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATGCTATCGTACCTATGCAA"
## [2] "ACGTATTCGAGGCGGGGCATTCAGCTATTCGATCTACGTACGATTCACGAGCTACGTACATTGCATCGGGCGTATTACGATTCTATCGATCGACTTATTGCATTCGATCGATGCTAGCATCGTTTACCGATGCTATCGTACGCATGCAT"
## [3] "ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGATTCACGAGCTACGTACATTTCATCGGGCGTATTACGATTCTATTTATCGCAGTATTGCATTCGATCGATGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT"
## [4] "ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCGTATTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT"
## [5] "ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCAGCTAGCATCGATCGATGCATCATGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATGCTATCGTACCTATGCAA"
## [6] "ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCGTGTTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATCGATCTAGGCATTTAGCA"

Creates only 6 OTUs. As we decrease the threshold, we cluster more sequences together and as a consequence we decrease the number of OTUs.

Running the OTU clustering with 98% identity:

length(create_otus(only_seq,98))

## [1] 10

Results in 10 clusters (one more than running with 97% identity)

Running the OTU clustering with 100% identity:

length(create_otus(only_seq,100))

## [1] 27

Result in 27 clusters.

Implement the code to count how many reads map to the different OTUs

We previously created the function to create OTUs, now we can create a function to count the reads that map to the defined OTUs.

Input:

reads to be mapped
OTUs that we map to
threshold percentage identity

Here is the code to create the function:

count_otus <- function(reads,OTUs,threshold_percentage = 97) {
  # we create a list where to save the result
  result = rep(0,length(OTUs))
  names(result) = OTUs
  # we check each read to which OTUs it maps to
  for (read in reads){
    best_sel_OTU = ""
    for (OTU in OTUs){
      if(perc_id(read,OTU) > threshold_percentage){
        best_sel_OTU = OTU
      }
    }
    # and update the count for the OTU
    result[best_sel_OTU] = result[best_sel_OTU] + 1
  }
  # return the result
  return(result)
}

And we can test it on the same fastq sample:

OTUs = create_otus(only_seq)

OTUs_count = count_otus(only_seq,OTUs)
OTUs_count

## ACCGGTACGGACTTAGCGAGATTCGGATCGATTACGATCACGAAACACGAGCTACGTACATAACATCAGGCGTGTTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATGCTATCGTACCTATGCAA 
##                                                                                                                                                    18 
## ACGTATTCGAGGCGGGGCATTCAGCTATTCGATCTACGTACGATTCACGAGCTACGTACATTGCATCGGGCGTATTACGATTCTATCGATCGACTTATTGCATTCGATCGATGCTAGCATCGTTTACCGATGCTATCGTACGCATGCAT 
##                                                                                                                                                    27 
## ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGATTCACGAGCTACGTACATTTCATCGGGCGTATTACGATTCTATTTATCGCAGTATTGCATTCGATCGATGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT 
##                                                                                                                                                    15 
## ACGTATTTGAGGCGCCGCATTCAGCTATTCGATCTACGTACGATTCACGAGCTACGTACATTTCATCGGGCGTATTACGATTCTATTTATCGACTTATTGCATTCGATCGATGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT 
##                                                                                                                                                    14 
## ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCGTATTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT 
##                                                                                                                                                     9 
## ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGATTCACGAGCTACGTACATTTCATCGGGCGTATTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT 
##                                                                                                                                                     7 
## ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCGTGTTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATGCTATCGTACCTATGCAA 
##                                                                                                                                                     8 
## ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCAGCTAGCATCGATCGATGCATCATGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATGCTATCGTACCTATGCAA 
##                                                                                                                                                     1 
## ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCGTGTTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATCGATCTAGGCATTTAGCA 
##                                                                                                                                                     1

Exercise Task 3

If we change the percentage identity, we obtain a different result:

OTUs = create_otus(only_seq,90)

OTUs_count = count_otus(only_seq,OTUs,90)
head(OTUs_count)

## ACCGGTACGGACTTAGCGAGATTCGGATCGATTACGATCACGAAACACGAGCTACGTACATAACATCAGGCGTGTTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATGCTATCGTACCTATGCAA 
##                                                                                                                                                    18 
## ACGTATTCGAGGCGGGGCATTCAGCTATTCGATCTACGTACGATTCACGAGCTACGTACATTGCATCGGGCGTATTACGATTCTATCGATCGACTTATTGCATTCGATCGATGCTAGCATCGTTTACCGATGCTATCGTACGCATGCAT 
##                                                                                                                                                    27 
## ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGATTCACGAGCTACGTACATTTCATCGGGCGTATTACGATTCTATTTATCGCAGTATTGCATTCGATCGATGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT 
##                                                                                                                                                    29 
## ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCGTATTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGTTTACCGATGCTATCGTACCTATGCAT 
##                                                                                                                                                    19 
## ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCAGCTAGCATCGATCGATGCATCATGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATGCTATCGTACCTATGCAA 
##                                                                                                                                                     1 
## ACGTATTTGAGGCGCCGCATTCATTCGATGGCACTACGTACGAAACACGAGCTACGTACATAACATCAGGCGTGTTACGATTCTATTTAGGCATTGCGATTTCGGATTTTCAGAATGCATCGCGGATCGATCGATCTAGGCATTTAGCA 
##                                                                                                                                                     6

But note that the sum of the read counts is the same between the two.

Exercise 2: Describing microbial community structure

Load the data:

load(url("https://zenodo.org/record/7188406/files/human_microbiome_dataset.Rdata"))

which has a tax_profile matrix:

tax_profile[1:3,1:3]

##              700002_T0 700002_T1 700002_T2
## UNASSIGNED         387      1863      1622
## Bacteroides       4018      1705      1660
## Agathobacter        80      1580       126

Exercise Task 4

Can you find out at which taxonomic rank the table is summarized?

If we go to NCBI taxonomy website (https://www.ncbi.nlm.nih.gov/taxonomy/) we can search some of the clades.

If we search for Bacteroides, we can see that it is a Genus rank annotation, for which we can find the full lineage:

cellular organisms; Bacteria; FCB group; Bacteroidetes/Chlorobi group; Bacteroidota; Bacteroidia; Bacteroidales; Bacteroidaceae

Metadata

We have a metadata table, with one row per sample:

head(metadata)

##           Subject Timepoint Sex
## 700002_T0  700002         0   M
## 700002_T1  700002        12   M
## 700002_T2  700002        24   M
## 700002_T3  700002        50   M
## 700004_T0  700004         0   F
## 700004_T1  700004        12   F

Exercise Task 5

The first thing we may look at is the number of reads of each sample (also called sequencing depth). As our samples corresponds to the columns of the table, we just need to get the sum for each column. We do that with the colSums() function. Can you plot a histogram of the sequencing depth of all samples?

seq_depth = colSums(tax_profile)
hist(seq_depth,50,xlab = "Sequencing depth")

If you want to compare different samples, is it a problem that there are different read counts? Try to divide each value within a sample by the sum of the reads in that sample to normalise the data (also called relative abundance).

rel_ab = prop.table(tax_profile,2)
rel_ab[1:3,1:3]

##                700002_T0 700002_T1   700002_T2
## UNASSIGNED   0.024052206 0.1190187 0.078482605
## Bacteroides  0.249720323 0.1089248 0.080321285
## Agathobacter 0.004972032 0.1009391 0.006096676

If you want to compare different samples, is it a problem that there are different read counts?

It is important to notice that this is not related to the number of prokaryotic cells in the sample. Instead, it is a random process that happen when sequencing short reads. As a consequence, when comparing two samples we need to take this into consideration.

If we have a case as this:

       sample_A  sample_B
OTU_1         2         4
OTU_2        12        24
OTU_3         8        16

and we look at the OTUs individually, we would conclude that OTU_2 is double as abundant in sample_B compared to sample_A. But if we do the same evaluation with relative abundance transformed data, we see that they are the same.

Which genus is the most and least prevalent?

We can calculate prevalence as:

prevalence = rowSums(tax_profile > 0)

The prevalence measure in how many samples each clade was profiled.

Three most abundant:

sort(prevalence,decreasing = T)[1:3]

##  UNASSIGNED Bacteroides     Blautia 
##         480         480         480

So Bacteroides was profiled in 480 samples (out of 480 samples), hence it is present in all samples and it is a highly prevalent species.

Three least abundant:

sort(prevalence)[1:3]

##              CAG-873 possible_genus_Sk018     Cellulosilyticum 
##                    0                    0                    0

It can happen that some clades are not measured at all, and hence have a prevalence of zero (there were no reads mapping to any sample).

If we check more:

sort(prevalence)[1:15]

##                   CAG-873      possible_genus_Sk018          Cellulosilyticum 
##                         0                         0                         0 
##                        A2 Coriobacteriaceae_UCG-002   Candidatus_Stoquefichus 
##                         0                         0                         0 
##            Faecalibaculum             Acetatifactor                Tahibacter 
##                         0                         0                         0 
##              Sphingomonas            Rosenbergiella                      28-4 
##                         0                         1                         1 
##                 Gallicola                   Sarcina              Harryflintia 
##                         1                         1                         1

We can see that the genus Harryflintia is measured only in one sample.

How many zeros there are per sample and per genus?

We can check the number of zeros per sample:

n_zeros_per_sample = colSums(tax_profile == 0)
hist(n_zeros_per_sample,50,xlab = "Number of zeros in the samples")

We can check the number of zeros per genus:

n_zeros_per_genus = rowSums(tax_profile == 0)
hist(n_zeros_per_genus,50,xlab = "Number of zeros in the genera")

While the number of zeros per sample is normally distributed, the number of zeros per genus is not. In fact, when we cheked the prevalence, we saw that there are few species with high prevalence (the tail on the right) and many species with lower prevalence.

Alpha-diversity

We load the vegan package and rarefy the taxonomic profile table:

# load the library
library("vegan")

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-2

# rarefy taxonomic profiles
set.seed(100)
tax_profile_rarefied = t(rrarefy(t(tax_profile),5000))

Now the colSums(tax_profile_rarefied) will be 5,000 for all samples.

Let’s evaluate the richness with the vegan package

richness = estimateR(t(tax_profile_rarefied))
richness = t(richness)
head(richness)

##           S.obs  S.chao1   se.chao1    S.ACE   se.ACE
## 700002_T0    36 36.25000 0.73503387 36.73398 2.990198
## 700002_T1    64 64.25000 0.73583205 64.85848 3.639135
## 700002_T2    59 62.00000 4.17123406 60.73455 3.812244
## 700002_T3    54 54.00000 0.00000000 54.00000 3.291403
## 700004_T0    55 55.33333 0.92436727 55.75885 3.502650
## 700004_T1    55 55.00000 0.09908674 55.33298 3.520647

We are interested in the observed richness, which is the first column. Note that it is the same as doing:

richness_manual = colSums(tax_profile_rarefied > 0)
head(richness_manual)

## 700002_T0 700002_T1 700002_T2 700002_T3 700004_T0 700004_T1 
##        36        64        59        54        55        55

Exercise Task 6

Which samples have the highest and lowest richness?

We can plot an histogram for the richness:

hist(richness_manual,50)

The lowest is around 25 and the highest is around 90.

Is there are a difference between males and females?

We create a data frame with the information:

df_richness = data.frame(richness = richness[rownames(metadata),1])
df_richness$Sex = metadata$Sex
df_richness$Timepoint = metadata$Timepoint

And we can plot it:

library(ggplot2)
ggplot(df_richness, aes(Sex,richness)) + geom_boxplot()

It seems there is no difference between Males and Females.

Which we can also test with a t-test:

t.test(
  df_richness[df_richness$Sex == "M","richness"],
  df_richness[df_richness$Sex == "F","richness"])

## 
##  Welch Two Sample t-test
## 
## data:  df_richness[df_richness$Sex == "M", "richness"] and df_richness[df_richness$Sex == "F", "richness"]
## t = 0.86417, df = 477.86, p-value = 0.3879
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.069216  2.747996
## sample estimates:
## mean of x mean of y 
##  56.15079  55.31140

The p-value is greater than 0.05, so we reject the alternative hypothesis. The two distributions are not statistically different, hence there is no difference between males and females.

Evaluate evenness

Functions:

SimpE<-function(x,zeros=F){
  if (zeros==F) x <- as.vector(x[x>0])
  S <- length(x)
  x = as.data.frame(x)
  D <- diversity(x, "inv")
  E <- (D)/S
  E}
PielouE<-function(x,zeros=F){
  if (zeros==F) x <- as.vector(x[x>0])
  H <- diversity(x)
  S <- length(x)
  J <- H/log(S) 
  J
}
evenness<-function(x){
  res<-rbind(apply(x,1,SimpE),apply(x,1,PielouE))
  colnames(res)<-rownames(x)
  rownames(res)<-c("SimpE","PielouE")
  res
}

We can calculate the evenness as:

even.metrics <- t(evenness(as.matrix(t(tax_profile_rarefied))))
head(even.metrics)

##                SimpE   PielouE
## 700002_T0 0.07788573 0.4369012
## 700002_T1 0.23907493 0.7642262
## 700002_T2 0.20338606 0.7191480
## 700002_T3 0.19412747 0.7355137
## 700004_T0 0.26743694 0.7641009
## 700004_T1 0.25191673 0.7536305

Exercise Task 7

Let’s find the sample with the lowest Pielou evenness:

min_v = min(even.metrics[,2])
which(even.metrics[,2] == min_v)

## 700002_T0 
##         1

And highest Pielou evenness:

max_v = max(even.metrics[,2])
which(even.metrics[,2] == max_v)

## 700018_T2 
##        47

And try to plot a histogram of the distributions of the relative abundances.

First for the sample with low evenness

hist(log10(tax_profile_rarefied[,"700002_T0"]),xlim = c(0,3.5))

Note that we used log relative abundances.

And second for the sample with high evenness:

hist(log10(tax_profile_rarefied[,"700018_T2"]),xlim = c(0,3.5))

As you can observe from the histograms, the sample with low evenness as some low abundant species (on the left) and few abundance species on the right; while the sample with high evenness is more equally distributed.

Evaluate diversity

Finally, we can calculate the diversity of a microbial community, which is a combined measure of richness and evenness. There exist many indices of diversity and their development has been a very active field in ecology for decades. Here we will use the Shannon’s Diversity (or Shannon’s entropy)

Exercise Task 8

We can calculate the Shannon index as:

shannon_indx = diversity(t(tax_profile_rarefied),"shannon")

And we can add it to the data frame df_richness:

df_richness$shannon = shannon_indx[rownames(df_richness)]
head(df_richness)

##           richness Sex Timepoint  shannon
## 700002_T0       36   M         0 1.565644
## 700002_T1       64   M        12 3.178327
## 700002_T2       59   M        24 2.932353
## 700002_T3       54   M        50 2.933952
## 700004_T0       55   F         0 3.062007
## 700004_T1       55   F        12 3.020048

We can plot the observed richness and the Shannon index:

plot(df_richness$richness, df_richness$shannon, xlab = "Observed richness", ylab = "Shannon index")

It seems there is a high correlation between the two (which makes sense). We can also evaluate the correlation between the two:

cor(df_richness$richness,df_richness$shannon)

## [1] 0.7752869

Solution practical Metagenomics 1

Alessio Milanese

2022-10-17