# Tutorial 3: Exploring a mOTUs profile table in R

## Packages required

```r
library(vegan)  # for computing diversity measures

## Warning: package 'permute' was built under R version 4.0.5

library(ggplot2) # for plotting
```


## Load taxonomic profile and metadata

We will load a mOTUs profile consisting of 150 metagenomic samples from
different human body sites (oral cavity, skin and stool) as well as
metadata about those samples. The data is from NIH’s Human Microbiome
Project <https://www.hmpdacc.org/hmp/>.
```r
motus_profile = read.csv("/path/to/file/motus_profile.tsv",header = TRUE, skip = 2, sep = "\t", check.names = F, stringsAsFactors = F,row.names = 1)

metadata <- read.csv("/path/to/file/meta.tsv",header = TRUE, sep = '\t', stringsAsFactors = F)
```

For this tutorial, you can download the data from:

- [profiles](https://sunagawalab.ethz.ch/share/MOTUS_WEBSITE_DATA/tutorial_3/motus_profile.tsv)
- [metadata](https://sunagawalab.ethz.ch/share/MOTUS_WEBSITE_DATA/tutorial_3/meta.tsv)


## First look at the tables

Let us first have a look at the metadata to find out more about our
samples.
```r
head(metadata)

##                                  sample_SAMN_ID subject_id body_site
## LLOY17-1_SAMN00031101-S001_METAG   SAMN00031101  160502038        OR
## LLOY17-1_SAMN00031107-S001_METAG   SAMN00031107  160502038        OR
## LLOY17-1_SAMN00031519-S001_METAG   SAMN00031519  823052294        OR
## LLOY17-1_SAMN00031551-S001_METAG   SAMN00031551  823052294        OR
## LLOY17-1_SAMN00031566-S001_METAG   SAMN00031566  765337473        OR
## LLOY17-1_SAMN00031572-S001_METAG   SAMN00031572  765337473        OR
##                                         body_subsite visit age gender BMI
## LLOY17-1_SAMN00031101-S001_METAG        tonguedorsum     3  26 female  24
## LLOY17-1_SAMN00031107-S001_METAG supragingivalplaque     3  26 female  24
## LLOY17-1_SAMN00031519-S001_METAG        tonguedorsum     2  27   male  22
## LLOY17-1_SAMN00031551-S001_METAG supragingivalplaque     2  27   male  22
## LLOY17-1_SAMN00031566-S001_METAG        tonguedorsum     2  23 female  NA
## LLOY17-1_SAMN00031572-S001_METAG supragingivalplaque     2  23 female  NA
```
The column `body_site` tells us where the samples come from:

`OR`, oral cavity

`SK`, skin

`ST`, stool


## Some simple explorations of the taxonomic profile and metadata

How many samples do we have in each category?
```r
table(metadata$body_site)

## 
## OR SK ST 
## 50 50 50
```
Apart from the location of sampling, we also have information on the
age, gender and body mass index (BMI) of the subject.

Let’s also have a look at our taxonomic profile
```r
head(motus_profile[,1:3])

##                                                            LLOY17-1_SAMN00031101-S001_METAG
## Leptospira alexanderi [ref_mOTU_v3_00001]                                                 0
## Leptospira weilii [ref_mOTU_v3_00002]                                                     0
## Chryseobacterium sp. [ref_mOTU_v3_00004]                                                  0
## Chryseobacterium gallinarum [ref_mOTU_v3_00005]                                           0
## Chryseobacterium indologenes [ref_mOTU_v3_00006]                                          0
## Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007]                                0
##                                                            LLOY17-1_SAMN00031107-S001_METAG
## Leptospira alexanderi [ref_mOTU_v3_00001]                                                 0
## Leptospira weilii [ref_mOTU_v3_00002]                                                     0
## Chryseobacterium sp. [ref_mOTU_v3_00004]                                                  0
## Chryseobacterium gallinarum [ref_mOTU_v3_00005]                                           0
## Chryseobacterium indologenes [ref_mOTU_v3_00006]                                          0
## Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007]                                0
##                                                            LLOY17-1_SAMN00031519-S001_METAG
## Leptospira alexanderi [ref_mOTU_v3_00001]                                                 0
## Leptospira weilii [ref_mOTU_v3_00002]                                                     0
## Chryseobacterium sp. [ref_mOTU_v3_00004]                                                  0
## Chryseobacterium gallinarum [ref_mOTU_v3_00005]                                           0
## Chryseobacterium indologenes [ref_mOTU_v3_00006]                                          0
## Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007]                                0
```
Rows are the taxa and columns are the samples.

What are the dimensions of the taxonomic profile table?
```r
dim(motus_profile)

## [1] 33571   150
```
We have measured 33571 mOTUs (species level taxa) for 150 samples.

Do we have counts or relative abundances? This is important to know for
downstream analysis.
```r
head(colSums(motus_profile))

## LLOY17-1_SAMN00031101-S001_METAG LLOY17-1_SAMN00031107-S001_METAG 
##                            55653                            32510 
## LLOY17-1_SAMN00031519-S001_METAG LLOY17-1_SAMN00031551-S001_METAG 
##                            15719                            39854 
## LLOY17-1_SAMN00031566-S001_METAG LLOY17-1_SAMN00031572-S001_METAG 
##                            22106                            11986
```
In our case, we have counts. Note that if we had relative abundances
instead we would the columns would have summed up to 1.

The numbers shown above (total number of counts in each sample)
represent the sequencing depth of the different samples. We can plot
this as a histogram to better understand the variation in sequencing
depth across the samples.
```r
hist(colSums(motus_profile),50,main = "Histogram of sequencing depth",xlab = "Sequencing depth (reads)")
```
![](../images/seq_depth_hist.png)

How many species (out of 33571) were measured in at least one sample?
```r
sum_per_species <- rowSums(motus_profile)

length(which(sum_per_species > 0))

## [1] 2730
```
There are 2730 species that were profiled. Which of these species are
the most prevalent? Prevalence of a certain species refers to the
fraction of samples that have measured this species.
```r
prevalence = rowSums(motus_profile > 0) / ncol(motus_profile)
# we sort it
prevalence = sort(prevalence,decreasing = T)
# and check the first 10
head(prevalence)

##                                      unassigned 
##                                       1.0000000 
##  Haemophilus parainfluenzae [ref_mOTU_v3_01068] 
##                                       0.8466667 
##           Streptococcus sp. [ref_mOTU_v3_00283] 
##                                       0.8400000 
## Streptococcus parasanguinis [ref_mOTU_v3_00312] 
##                                       0.7466667 
##         Veillonella parvula [ref_mOTU_v3_01938] 
##                                       0.7333333 
##    Streptococcus salivarius [ref_mOTU_v3_01350] 
##                                       0.7066667
```
The term `unassigned` refers to the abundance of reads that did not
match to any known mOTUs. It seems like all our 150 samples have such
reads considering the prevalence of `unassigned` is 1.

Note: It is important to consider whether the unassigned fraction is
relevant for your statistic analysis (like a test of associations) and
remove this fraction if needed. However note that it might be
advantageous to compute relative abundances before removing the
`unassigned` so that the abundances still account for the presence of
unknown species.

The species *Haemophilus parainfluenzae* seems to be present in around
85% of the samples.


## Preprocessing of taxonomic table

Before we can compute various diversity metrics or PCoA plots, we first
need to perform some preprocessing on the taxonomic table.

We first filter samples with low sequencing depth (in this particular
case we require a minimum of 5000 reads).
```r
samples_more_5000 <- which(colSums(motus_profile) > 5000)
motus_profile2 <- motus_profile[,samples_more_5000]
dim(motus_profile2)

## [1] 33571   134
```
16 samples had a sequencing depth of &lt;5000 reads.

Before we can compute diversity metrics or perform PCoA, we need to
normalize the samples to be able to compare them.

We perform rarefaction on the taxonomic profile for that purpose using
the `vegan` package.
```r
set.seed(0)
motus_profile_rarefied = t(rrarefy(t(motus_profile2),5000))
```
We also remove species that were not measured in any sample.
```r
sel_species = which(rowSums(motus_profile_rarefied) > 0)
motus_profile_rarefied = motus_profile_rarefied[sel_species,]

dim(motus_profile_rarefied)

## [1] 2343  134
```
The new normalized taxonomic profile has 2343 species and 134 samples.

We will also keep only the relevant metadata for 134 samples.
```r
metadata = as.data.frame(metadata[colnames(motus_profile_rarefied),])
```


## Explore alpha diversity of the samples.

The alpha-diversity measures are calculated per sample. We can calculate
these indices using the vegan package.
```r
shannon = diversity(t(motus_profile_rarefied), index="shannon")
richness = specnumber(t(motus_profile_rarefied))
pielou_evenness = shannon/log(richness)
```
We can add these measures to the metadata to make it easier to plot and
compare.
```r
metadata2 = metadata
metadata2$shannon = shannon
metadata2$richness = richness
metadata2$pielou_evenness = pielou_evenness
head(metadata2)

##                                  sample_SAMN_ID subject_id body_site
## LLOY17-1_SAMN00031101-S001_METAG   SAMN00031101  160502038        OR
## LLOY17-1_SAMN00031107-S001_METAG   SAMN00031107  160502038        OR
## LLOY17-1_SAMN00031519-S001_METAG   SAMN00031519  823052294        OR
## LLOY17-1_SAMN00031551-S001_METAG   SAMN00031551  823052294        OR
## LLOY17-1_SAMN00031566-S001_METAG   SAMN00031566  765337473        OR
## LLOY17-1_SAMN00031572-S001_METAG   SAMN00031572  765337473        OR
##                                         body_subsite visit age gender BMI
## LLOY17-1_SAMN00031101-S001_METAG        tonguedorsum     3  26 female  24
## LLOY17-1_SAMN00031107-S001_METAG supragingivalplaque     3  26 female  24
## LLOY17-1_SAMN00031519-S001_METAG        tonguedorsum     2  27   male  22
## LLOY17-1_SAMN00031551-S001_METAG supragingivalplaque     2  27   male  22
## LLOY17-1_SAMN00031566-S001_METAG        tonguedorsum     2  23 female  NA
## LLOY17-1_SAMN00031572-S001_METAG supragingivalplaque     2  23 female  NA
##                                   shannon richness pielou_evenness
## LLOY17-1_SAMN00031101-S001_METAG 3.819543      223       0.7063846
## LLOY17-1_SAMN00031107-S001_METAG 3.775816      239       0.6894625
## LLOY17-1_SAMN00031519-S001_METAG 4.007359      279       0.7116335
## LLOY17-1_SAMN00031551-S001_METAG 3.896047      266       0.6977791
## LLOY17-1_SAMN00031566-S001_METAG 3.491617      188       0.6667919
## LLOY17-1_SAMN00031572-S001_METAG 3.532731      181       0.6795677
```
Is there a difference in alpha diversity between body sites? We will use
the package `ggplot2` for plotting the comparisons.
```r
#plotting Shannon-Index across different body sites
ggplot(metadata2, aes(x = body_site,y = shannon)) + geom_boxplot() + ylab("Shannon Index")
```
![](../images/boxplot_alpha_diversity.png)

Samples from skin have a much lower Shannon-Index.

You can similarly plot the other diversity measures too.


## PCoA plot

Let’s visualize the samples in an ordination plot using the Bray-Curtis
dissimilarity. We will perform PCoA as our ordination method.

First we need to calculate beta diversity (distances between the
samples).

```r
all_vs_all_beta = as.matrix(vegdist(t(motus_profile_rarefied),method = "bray"))
```
Using the dissimilarity matrix, we compute the principal coordinates.

```r
data_PCoA = cmdscale(all_vs_all_beta)
```
We can also add this information to the metadata to make it easier to
plot.

```r
metadata2$PCoA1 = data_PCoA[,1]
metadata2$PCoA2 = data_PCoA[,2]
```
Making the PCoA plot:

```r
ggplot(metadata2,aes(x = PCoA1,y = PCoA2,col = body_site))+
    geom_point()+
    xlab(paste0("PCoA1")) +
    ylab(paste0("PCoA2"))+
    theme_bw()
```
![](../images/pcoa.png)

We can see that the samples separate quite well by body site. However,
there seems to be one outlier skin sample among the oral cavity samples.
This might be a mislabeled sample considering all the other skin samples
are separated.