# Tutorial 3: Exploring a mOTUs profile table in R ## Packages required ```r library(vegan) # for computing diversity measures ## Warning: package 'permute' was built under R version 4.0.5 library(ggplot2) # for plotting ``` ## Load taxonomic profile and metadata We will load a mOTUs profile consisting of 150 metagenomic samples from different human body sites (oral cavity, skin and stool) as well as metadata about those samples. The data is from NIH’s Human Microbiome Project . ```r motus_profile = read.csv("/path/to/file/motus_profile.tsv",header = TRUE, skip = 2, sep = "\t", check.names = F, stringsAsFactors = F,row.names = 1) metadata <- read.csv("/path/to/file/meta.tsv",header = TRUE, sep = '\t', stringsAsFactors = F) ``` For this tutorial, you can download the data from: - [profiles](https://sunagawalab.ethz.ch/share/MOTUS_WEBSITE_DATA/tutorial_3/motus_profile.tsv) - [metadata](https://sunagawalab.ethz.ch/share/MOTUS_WEBSITE_DATA/tutorial_3/meta.tsv) ## First look at the tables Let us first have a look at the metadata to find out more about our samples. ```r head(metadata) ## sample_SAMN_ID subject_id body_site ## LLOY17-1_SAMN00031101-S001_METAG SAMN00031101 160502038 OR ## LLOY17-1_SAMN00031107-S001_METAG SAMN00031107 160502038 OR ## LLOY17-1_SAMN00031519-S001_METAG SAMN00031519 823052294 OR ## LLOY17-1_SAMN00031551-S001_METAG SAMN00031551 823052294 OR ## LLOY17-1_SAMN00031566-S001_METAG SAMN00031566 765337473 OR ## LLOY17-1_SAMN00031572-S001_METAG SAMN00031572 765337473 OR ## body_subsite visit age gender BMI ## LLOY17-1_SAMN00031101-S001_METAG tonguedorsum 3 26 female 24 ## LLOY17-1_SAMN00031107-S001_METAG supragingivalplaque 3 26 female 24 ## LLOY17-1_SAMN00031519-S001_METAG tonguedorsum 2 27 male 22 ## LLOY17-1_SAMN00031551-S001_METAG supragingivalplaque 2 27 male 22 ## LLOY17-1_SAMN00031566-S001_METAG tonguedorsum 2 23 female NA ## LLOY17-1_SAMN00031572-S001_METAG supragingivalplaque 2 23 female NA ``` The column `body_site` tells us where the samples come from: `OR`, oral cavity `SK`, skin `ST`, stool ## Some simple explorations of the taxonomic profile and metadata How many samples do we have in each category? ```r table(metadata$body_site) ## ## OR SK ST ## 50 50 50 ``` Apart from the location of sampling, we also have information on the age, gender and body mass index (BMI) of the subject. Let’s also have a look at our taxonomic profile ```r head(motus_profile[,1:3]) ## LLOY17-1_SAMN00031101-S001_METAG ## Leptospira alexanderi [ref_mOTU_v3_00001] 0 ## Leptospira weilii [ref_mOTU_v3_00002] 0 ## Chryseobacterium sp. [ref_mOTU_v3_00004] 0 ## Chryseobacterium gallinarum [ref_mOTU_v3_00005] 0 ## Chryseobacterium indologenes [ref_mOTU_v3_00006] 0 ## Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007] 0 ## LLOY17-1_SAMN00031107-S001_METAG ## Leptospira alexanderi [ref_mOTU_v3_00001] 0 ## Leptospira weilii [ref_mOTU_v3_00002] 0 ## Chryseobacterium sp. [ref_mOTU_v3_00004] 0 ## Chryseobacterium gallinarum [ref_mOTU_v3_00005] 0 ## Chryseobacterium indologenes [ref_mOTU_v3_00006] 0 ## Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007] 0 ## LLOY17-1_SAMN00031519-S001_METAG ## Leptospira alexanderi [ref_mOTU_v3_00001] 0 ## Leptospira weilii [ref_mOTU_v3_00002] 0 ## Chryseobacterium sp. [ref_mOTU_v3_00004] 0 ## Chryseobacterium gallinarum [ref_mOTU_v3_00005] 0 ## Chryseobacterium indologenes [ref_mOTU_v3_00006] 0 ## Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007] 0 ``` Rows are the taxa and columns are the samples. What are the dimensions of the taxonomic profile table? ```r dim(motus_profile) ## [1] 33571 150 ``` We have measured 33571 mOTUs (species level taxa) for 150 samples. Do we have counts or relative abundances? This is important to know for downstream analysis. ```r head(colSums(motus_profile)) ## LLOY17-1_SAMN00031101-S001_METAG LLOY17-1_SAMN00031107-S001_METAG ## 55653 32510 ## LLOY17-1_SAMN00031519-S001_METAG LLOY17-1_SAMN00031551-S001_METAG ## 15719 39854 ## LLOY17-1_SAMN00031566-S001_METAG LLOY17-1_SAMN00031572-S001_METAG ## 22106 11986 ``` In our case, we have counts. Note that if we had relative abundances instead we would the columns would have summed up to 1. The numbers shown above (total number of counts in each sample) represent the sequencing depth of the different samples. We can plot this as a histogram to better understand the variation in sequencing depth across the samples. ```r hist(colSums(motus_profile),50,main = "Histogram of sequencing depth",xlab = "Sequencing depth (reads)") ``` ![](../images/seq_depth_hist.png) How many species (out of 33571) were measured in at least one sample? ```r sum_per_species <- rowSums(motus_profile) length(which(sum_per_species > 0)) ## [1] 2730 ``` There are 2730 species that were profiled. Which of these species are the most prevalent? Prevalence of a certain species refers to the fraction of samples that have measured this species. ```r prevalence = rowSums(motus_profile > 0) / ncol(motus_profile) # we sort it prevalence = sort(prevalence,decreasing = T) # and check the first 10 head(prevalence) ## unassigned ## 1.0000000 ## Haemophilus parainfluenzae [ref_mOTU_v3_01068] ## 0.8466667 ## Streptococcus sp. [ref_mOTU_v3_00283] ## 0.8400000 ## Streptococcus parasanguinis [ref_mOTU_v3_00312] ## 0.7466667 ## Veillonella parvula [ref_mOTU_v3_01938] ## 0.7333333 ## Streptococcus salivarius [ref_mOTU_v3_01350] ## 0.7066667 ``` The term `unassigned` refers to the abundance of reads that did not match to any known mOTUs. It seems like all our 150 samples have such reads considering the prevalence of `unassigned` is 1. Note: It is important to consider whether the unassigned fraction is relevant for your statistic analysis (like a test of associations) and remove this fraction if needed. However note that it might be advantageous to compute relative abundances before removing the `unassigned` so that the abundances still account for the presence of unknown species. The species *Haemophilus parainfluenzae* seems to be present in around 85% of the samples. ## Preprocessing of taxonomic table Before we can compute various diversity metrics or PCoA plots, we first need to perform some preprocessing on the taxonomic table. We first filter samples with low sequencing depth (in this particular case we require a minimum of 5000 reads). ```r samples_more_5000 <- which(colSums(motus_profile) > 5000) motus_profile2 <- motus_profile[,samples_more_5000] dim(motus_profile2) ## [1] 33571 134 ``` 16 samples had a sequencing depth of <5000 reads. Before we can compute diversity metrics or perform PCoA, we need to normalize the samples to be able to compare them. We perform rarefaction on the taxonomic profile for that purpose using the `vegan` package. ```r set.seed(0) motus_profile_rarefied = t(rrarefy(t(motus_profile2),5000)) ``` We also remove species that were not measured in any sample. ```r sel_species = which(rowSums(motus_profile_rarefied) > 0) motus_profile_rarefied = motus_profile_rarefied[sel_species,] dim(motus_profile_rarefied) ## [1] 2343 134 ``` The new normalized taxonomic profile has 2343 species and 134 samples. We will also keep only the relevant metadata for 134 samples. ```r metadata = as.data.frame(metadata[colnames(motus_profile_rarefied),]) ``` ## Explore alpha diversity of the samples. The alpha-diversity measures are calculated per sample. We can calculate these indices using the vegan package. ```r shannon = diversity(t(motus_profile_rarefied), index="shannon") richness = specnumber(t(motus_profile_rarefied)) pielou_evenness = shannon/log(richness) ``` We can add these measures to the metadata to make it easier to plot and compare. ```r metadata2 = metadata metadata2$shannon = shannon metadata2$richness = richness metadata2$pielou_evenness = pielou_evenness head(metadata2) ## sample_SAMN_ID subject_id body_site ## LLOY17-1_SAMN00031101-S001_METAG SAMN00031101 160502038 OR ## LLOY17-1_SAMN00031107-S001_METAG SAMN00031107 160502038 OR ## LLOY17-1_SAMN00031519-S001_METAG SAMN00031519 823052294 OR ## LLOY17-1_SAMN00031551-S001_METAG SAMN00031551 823052294 OR ## LLOY17-1_SAMN00031566-S001_METAG SAMN00031566 765337473 OR ## LLOY17-1_SAMN00031572-S001_METAG SAMN00031572 765337473 OR ## body_subsite visit age gender BMI ## LLOY17-1_SAMN00031101-S001_METAG tonguedorsum 3 26 female 24 ## LLOY17-1_SAMN00031107-S001_METAG supragingivalplaque 3 26 female 24 ## LLOY17-1_SAMN00031519-S001_METAG tonguedorsum 2 27 male 22 ## LLOY17-1_SAMN00031551-S001_METAG supragingivalplaque 2 27 male 22 ## LLOY17-1_SAMN00031566-S001_METAG tonguedorsum 2 23 female NA ## LLOY17-1_SAMN00031572-S001_METAG supragingivalplaque 2 23 female NA ## shannon richness pielou_evenness ## LLOY17-1_SAMN00031101-S001_METAG 3.819543 223 0.7063846 ## LLOY17-1_SAMN00031107-S001_METAG 3.775816 239 0.6894625 ## LLOY17-1_SAMN00031519-S001_METAG 4.007359 279 0.7116335 ## LLOY17-1_SAMN00031551-S001_METAG 3.896047 266 0.6977791 ## LLOY17-1_SAMN00031566-S001_METAG 3.491617 188 0.6667919 ## LLOY17-1_SAMN00031572-S001_METAG 3.532731 181 0.6795677 ``` Is there a difference in alpha diversity between body sites? We will use the package `ggplot2` for plotting the comparisons. ```r #plotting Shannon-Index across different body sites ggplot(metadata2, aes(x = body_site,y = shannon)) + geom_boxplot() + ylab("Shannon Index") ``` ![](../images/boxplot_alpha_diversity.png) Samples from skin have a much lower Shannon-Index. You can similarly plot the other diversity measures too. ## PCoA plot Let’s visualize the samples in an ordination plot using the Bray-Curtis dissimilarity. We will perform PCoA as our ordination method. First we need to calculate beta diversity (distances between the samples). ```r all_vs_all_beta = as.matrix(vegdist(t(motus_profile_rarefied),method = "bray")) ``` Using the dissimilarity matrix, we compute the principal coordinates. ```r data_PCoA = cmdscale(all_vs_all_beta) ``` We can also add this information to the metadata to make it easier to plot. ```r metadata2$PCoA1 = data_PCoA[,1] metadata2$PCoA2 = data_PCoA[,2] ``` Making the PCoA plot: ```r ggplot(metadata2,aes(x = PCoA1,y = PCoA2,col = body_site))+ geom_point()+ xlab(paste0("PCoA1")) + ylab(paste0("PCoA2"))+ theme_bw() ``` ![](../images/pcoa.png) We can see that the samples separate quite well by body site. However, there seems to be one outlier skin sample among the oral cavity samples. This might be a mislabeled sample considering all the other skin samples are separated.