Data preparation

You can find the paper in pubmed using the keywords “Human Microbiome Project Consortium”, “Nature” and “2012”: https://pubmed.ncbi.nlm.nih.gov/?term=Human+Microbiome+Project+Consortium+Nature+2012

and the first entry is: https://pubmed.ncbi.nlm.nih.gov/22699609/

with pubmed ID 22699609. The NIH provide the paper for free at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3564958/

Normally there would be a “Data Availability Statement” where you can find the information about the sequencing data. Due to the large size of samples available in the HMP project, there is a dedicated website and resources to download samples:

https://www.hmpdacc.org/hmp/resources/download.php

Data processing

Normally, you would download the samples that you identified from the previous step. Here we will work with three mock samples, that were prepared for this project:

wget https://zenodo.org/record/6517497/files/sampleA_1.fastq
wget https://zenodo.org/record/6517497/files/sampleA_2.fastq

wget https://zenodo.org/record/6517497/files/sampleB_1.fastq
wget https://zenodo.org/record/6517497/files/sampleB_2.fastq

wget https://zenodo.org/record/6517497/files/sampleC_1.fastq
wget https://zenodo.org/record/6517497/files/sampleC_2.fastq

Note that we refer to “mock” samples, because they are not real samples. Real samples have few millions reads, and here we work with samples that have less than 100,000 reads to make the computation more manageable.

Note that the sequencing data often comes in pairs (_1 and _2), representing forward and reverse reads.

One important step in data processing is to check the quality of the dataset and remove low quality samples, here we provide you already with high-quality samples.

Taxonomic profiling

We can run mOTUs on the three samples that we downloaded before with the commands:

motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA.motus
motus profile -f sampleB_1.fastq -r sampleB_2.fastq -n sampleB -o sampleB.motus
motus profile -f sampleC_1.fastq -r sampleC_2.fastq -n sampleC -o sampleC.motus

Note that here we are using mock samples (only few Mb files), but normally the fastq samples generated for a microbiome study are few Gb. In order to do this analysis, you would normally use a high performance computing cluster.

The result of running mOTUs is (here printing only the first 10 lines):

 $ head sampleA.motus
# git tag version 3.0.3 |  motus version 3.0.3 | map_tax 3. [...]
# call: python /nfs/nas22/fs2202/biol_micro_teaching/softwa [...]
#consensus_taxonomy     sampleA
Leptospira alexanderi [ref_mOTU_v3_00001]       0.0000000000
Leptospira weilii [ref_mOTU_v3_00002]   0.0000000000
Chryseobacterium sp. [ref_mOTU_v3_00004]        0.0000000000
Chryseobacterium gallinarum [ref_mOTU_v3_00005] 0.0000000000
Chryseobacterium indologenes [ref_mOTU_v3_00006]        0.0000000000
Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007]      0.0000000000
Chryseobacterium jejuense [ref_mOTU_v3_00008]   0.0000000000

The first two lines contain information about the version of the tool that was used and the options that were specified.

The third line contains the header.

From the fourth line there are the species and their relative abundance.

Data gathering

In this step we will pull together all the samples we generated before. For this we can use motus:

motus merge -i sampleA.motus,sampleB.motus,sampleC.motus -o all_samples.motus

And we can look at the generated table (head all_samples.motus):

# motus version 3.0.3 | merge 3.0.3 | info merged profiles:  [...]
# call: python /nfs/nas22/fs2202/biol_micro_teaching/softwar [...]
#consensus_taxonomy     sampleA sampleB sampleC
Leptospira alexanderi [ref_mOTU_v3_00001]       0.0000000000    0.0000000000    0.0000000000
Leptospira weilii [ref_mOTU_v3_00002]   0.0000000000    0.0000000000    0.0000000000
Chryseobacterium sp. [ref_mOTU_v3_00004]        0.0000000000    0.0000000000    0.0000000000
Chryseobacterium gallinarum [ref_mOTU_v3_00005] 0.0000000000    0.0000000000    0.0000000000
Chryseobacterium indologenes [ref_mOTU_v3_00006]        0.0000000000    0.0000000000    0.0000000000
Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007]      0.0000000000    0.0000000000    0.0000000000
Chryseobacterium jejuense [ref_mOTU_v3_00008]   0.0000000000    0.0000000000    0.0000000000

We have a structure similar to the profile we looked at before. Two headers with the information about the parameters that were used for the mOTUs tool and then the header (third row).

There are four columns, the first one contains the names of the species and the other three contain the measure relative abundance.

We can load it into R with the command:

motus_profiles = read.csv("/nfs/course/course_home/alessiom/all_samples.motus",header = T,sep = "\t",skip = 2, row.names = 1)

Note that:

We can check what we imported with the command motus_profiles[1:3,], which results in:

                                          sampleA sampleB sampleC
Leptospira alexanderi [ref_mOTU_v3_00001]       0       0       0
Leptospira weilii [ref_mOTU_v3_00002]           0       0       0
Chryseobacterium sp. [ref_mOTU_v3_00004]        0       0       0

Now that you learned how to create a taxonomic profiling table, and how to load it into R, we will provide you with a real table with profiles from many samples and the relative metadata for the samples.

Note that normally, it would take few hours to process one sample.

For the rest of the project we will not use anymore sampleA, sampleB and sampleC.

You can load the taxonomic table and the metadata of the new profiles directly in R with the command:

load(url("https://sunagawalab.ethz.ch/share/teaching_milanese/HMP.Rdata"))

Let’s check the metadata:

head(hmp_metadata)

The column body_site contains the information about the samples. They belong to one of four categories: OR, oral cavity, SK skin, ST stool and VG vagina.

How many we have for each category:

table(hmp_metadata$body_site)
## 
##  OR  SK  ST  VG 
## 672 186 468 112

We can see that we have few samples from skin (SK) and vagina (VG).

We can also have a look at the taxonomic profiling table:

hmp_profiles[1:4,1:4]

Where as expected we have species as rows and samples as columns.

We can check also the size of the table:

dim(hmp_profiles)
## [1] 33571  1438

There are 33,571 species (mOTUs) and 1,438 samples.

First we can check if these are relative abundances or counts. We can look at the sum of the columns:

head(colSums(hmp_profiles))
## LLOY17-1_SAMN00031101-S001_METAG LLOY17-1_SAMN00031103-S001_METAG 
##                            55653                             2695 
## LLOY17-1_SAMN00031107-S001_METAG LLOY17-1_SAMN00031109-S001_METAG 
##                            32510                            10396 
## LLOY17-1_SAMN00031110-S001_METAG LLOY17-1_SAMN00031113-S001_METAG 
##                             4118                             2083

We have counts. And these counts represents the sequencing depth of the different samples.

Let’s plot it as an histogram:

hist(colSums(hmp_profiles),50,main = "Histogram of sequencing depth")

mOTUs reports the measures for all species, hence many species will be zero across all samples.

Let’s see how many species have at least one read in any of the samples:

sum_per_species = rowSums(hmp_profiles)

length(which(sum_per_species > 0))
## [1] 3651

There are 3,651 species that have been profiled.

Let’s check which species are the most prevalent:

prevalence = rowSums(hmp_profiles > 0) / ncol(hmp_profiles)
# we sort it
prevalence = sort(prevalence,decreasing = T)
# and check the first 10
head(prevalence)
##                                      unassigned 
##                                       0.9881780 
##  Haemophilus parainfluenzae [ref_mOTU_v3_01068] 
##                                       0.8052851 
##           Streptococcus sp. [ref_mOTU_v3_00283] 
##                                       0.7732962 
## Streptococcus parasanguinis [ref_mOTU_v3_00312] 
##                                       0.7239221 
##    Streptococcus salivarius [ref_mOTU_v3_01350] 
##                                       0.6947149 
##         Veillonella parvula [ref_mOTU_v3_01938] 
##                                       0.6835883

The mOTU Haemophilus parainfluenzae is present in 80% of the samples.

The term unassigned is present in almost all samples (98.8%), and it represent the abundance of reads that did not match to any known mOTUs.

Filter samples

As we saw before, there are few samples with low sequencing depth. We filter out samples with less than 5,000 reads:

samples_more_5000 = which(colSums(hmp_profiles) > 5000)
motus_study2 = hmp_profiles[,samples_more_5000]

We can check the size of the new table:

dim(motus_study2)
## [1] 33571   973

We went from 1,438 samples to 973.

We will also rarefy the table to be able to compare between different samples:

library("vegan")
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
set.seed(0)
motus_study_rarefied = t(rrarefy(t(motus_study2),5000))

We can also remove all the species that were not measured:

sel_species = which(rowSums(motus_study_rarefied) > 0)
motus_study_rarefied = motus_study_rarefied[sel_species,]

And the new taxonomic table has:

dim(motus_study_rarefied)
## [1] 3233  973

has 3,233 mOTUs and 973 samples.

In order to make it easier, we will also remove these samples from the metadata:

metadata = as.data.frame(hmp_metadata[colnames(motus_study_rarefied),])

Data analysis: alpha-diversity

The alpha-diversity measures are calculated per sample. We can calculate these indices using the vegan package.

shannon = diversity(t(motus_study_rarefied), index="shannon")
richness = specnumber(t(motus_study_rarefied))
pielou_evenness = shannon/log(richness)

We can add these measure to the metadata, to make it easier to compare:

metadata2 = metadata
metadata2$shannon = shannon
metadata2$richness = richness
metadata2$pielou_evenness = pielou_evenness
head(metadata2)

Let’s see if there is a difference in alpha-indices between body sites

library(ggplot2)
ggplot(metadata2, aes(body_site,shannon)) + geom_boxplot()

ggplot(metadata2, aes(body_site,richness)) + geom_boxplot()

ggplot(metadata2, aes(body_site,pielou_evenness)) + geom_boxplot()

We see that the different body sites are quite different in term of alpha-diversity. Richness and Shannon index are highly correlated, and as we can see skin and vagina have a much lower richness and Shannon index. This means that on average there are many more species present in the oral cavity and the human gut, compared to skin and vagina, which have a lower biodiversity.

We can check few other metadata parameters. We will use only Shannon here:

ggplot(metadata2, aes(gender,richness)) + geom_boxplot()

It seems that males have a slightly higher Shannon index on average, but it doesn’t seems there are significant differences.

Also, importantly, vagina samples will be available only for females, and since they have a low Shannon index, they will lower the median and the overall box-plot. We can try to do the same plot removing the vagina samples:

ggplot(metadata2[metadata2$body_site != "VG",], aes(gender,richness)) + geom_boxplot()

And we can see that there is a smaller difference between males and females.

We can check the correlation to continuous values:

cor.test(metadata2$age,metadata2$shannon)
## 
##  Pearson's product-moment correlation
## 
## data:  metadata2$age and metadata2$shannon
## t = 1.9336, df = 727, p-value = 0.05355
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.001090211  0.143396781
## sample estimates:
##        cor 
## 0.07152851

There is no correlation to age.

cor.test(metadata2$shannon,metadata2$BMI,use="complete.obs")
## 
##  Pearson's product-moment correlation
## 
## data:  metadata2$shannon and metadata2$BMI
## t = 1.7303, df = 699, p-value = 0.08402
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.008786466  0.138685228
## sample estimates:
##        cor 
## 0.06530596

There is no correlation to BMI.

Data analysis: beta-diversity and ordination

We can evaluate the properties of the beta diversity:

all_vs_all_beta = as.matrix(vegdist(t(motus_study_rarefied),method = "jaccard"))

# prepare data to save the results
m_ST_VG = c()
m_ST_SK = c()
m_ST_OR = c()
m_SK_VG = c()
m_SK_OR = c()
m_OR_VG = c()
m_ST = c()
m_SK = c()
m_OR = c()
m_VG = c()

# we can go through all the elements in the upper diagonal of the matrix using
for(i in c(1:(ncol(all_vs_all_beta)-1) )){
    for(j in c((i+1) : ncol(all_vs_all_beta) )){
        # sample 1
        s1 = rownames(all_vs_all_beta)[i]
        # sample 2
        s2 = rownames(all_vs_all_beta)[j]
        # this distance
        this_distance = all_vs_all_beta[i,j]
        # save the data or do comparisons...
        if(metadata[s1,]$body_site == "ST" & metadata[s2,]$body_site == "ST"){
          m_ST = c(m_ST,this_distance)
        }
        # ... and now do for all
    }
}

Note that it can take a while to calculate the all vs all distances, as there are almost 1,000 samples.

We can also do it in a different way:

# first we find all the samples from the different body sites
ST_sample = rownames(metadata[metadata$body_site == "ST",])
OR_sample = rownames(metadata[metadata$body_site == "OR",])
VG_sample = rownames(metadata[metadata$body_site == "VG",])
SK_sample = rownames(metadata[metadata$body_site == "SK",])

# and then we select them directly from the beta-diversity matrix
# to make it easier, we select random 4,000 distances.
# Otherwise there are too many points
set.seed(16)
df_beta = data.frame(
  m_ST_VG = sample(all_vs_all_beta[ST_sample,VG_sample],4000),
  m_ST_SK = sample(all_vs_all_beta[ST_sample,SK_sample],4000),
  m_ST_OR = sample(all_vs_all_beta[ST_sample,OR_sample],4000),
  m_SK_OR = sample(all_vs_all_beta[SK_sample,OR_sample],4000),
  m_OR_VG = sample(all_vs_all_beta[OR_sample,VG_sample],4000),
  m_ST = sample(all_vs_all_beta[ST_sample,ST_sample],4000),
  m_SK = sample(all_vs_all_beta[ST_sample,ST_sample],4000),
  m_OR = sample(all_vs_all_beta[ST_sample,ST_sample],4000),
  m_VG = sample(all_vs_all_beta[ST_sample,ST_sample],4000)
)

We can check the median for each of the evaluations, and sort it:

medians = sort(apply(df_beta,2,median))
medians
##      m_ST      m_SK      m_OR      m_VG   m_ST_OR   m_SK_OR   m_OR_VG   m_ST_SK 
## 0.8579260 0.8590987 0.8602690 0.8623436 0.9824990 0.9941662 0.9944696 0.9954797 
##   m_ST_VG 
## 0.9959839

From this we can see that the samples from skin are the most similar between each others. And stool and vagina are the most different between each other.

Note that the vegan package calculates dissimilarities by default, hence here we measured the Jaccard dissimilarity (and not the Jaccard index as explained in the lectures).

We can plot it:

df = data.frame(
  measures = c(df_beta$m_ST_VG,df_beta$m_ST_SK,df_beta$m_ST_OR,df_beta$m_SK_OR,df_beta$m_OR_VG,df_beta$m_ST,df_beta$m_SK,df_beta$m_OR,df_beta$m_VG),
  type = factor(c(rep("m_ST_VG",4000),rep("m_ST_SK",4000),rep("m_ST_OR",4000),rep("m_SK_OR",4000),rep("m_OR_VG",4000),rep("m_ST",4000),rep("m_SK",4000),rep("m_OR",4000),rep("m_VG",4000)),levels = names(medians))
)
ggplot(df,aes(type,measures))+
  geom_boxplot() +
  ylab("Beta-diversity") +
  ylim(c(0.3,1))+
  ggtitle("Jaccard dissimilarity")
## Warning: Removed 49 rows containing non-finite values (stat_boxplot).

Overall, we can see that the within body site distances are lower than the between body sites distances. Which means that stool samples are much more similar to other stool samples, compared to any other body site. The same is true for the skin to skin samples, vagina to vagina samples and oral to oral samples. When comparing between different body sites, we can see that stool and oral are the most similar to each other.

Plot the samples in a PCoA plot

Let’s visualize the samples in an ordination plot using the Jaccard index (a PCoA plot).

First we calculate the PCoA:

all_vs_all_beta = as.matrix(vegdist(t(motus_study_rarefied),method = "jaccard"))
data_PCoA = cmdscale(all_vs_all_beta)

And we add the information to the metadata, to make it easier to visualise

metadata2 = metadata
metadata2$PCoA1 = data_PCoA[,1]
metadata2$PCoA2 = data_PCoA[,2]

We can now plot the PCoA using the different metadata information:

ggplot(metadata2,aes(PCoA1,PCoA2,col = body_site))+
  geom_point()+
  xlab(paste0("PCoA1")) +
  ylab(paste0("PCoA2"))+
  theme_bw()

We can use a different measure for the beta-diversity:

all_vs_all_beta = as.matrix(vegdist(t(motus_study_rarefied),method = "canberra"))
data_PCoA = cmdscale(all_vs_all_beta)

And we add the information to the metadata, to make it easier to visualise

metadata2 = metadata
metadata2$PCoA1 = data_PCoA[,1]
metadata2$PCoA2 = data_PCoA[,2]

and plot:

ggplot(metadata2,aes(PCoA1,PCoA2,col = body_site))+
  geom_point()+
  xlab(paste0("PCoA1")) +
  ylab(paste0("PCoA2"))+
  theme_bw()

We see that we can separate between the different body sites.

We can also check other metadata measure:

ggplot(metadata2,aes(PCoA1,PCoA2,col = gender))+
  geom_point()+
  xlab(paste0("PCoA1")) +
  ylab(paste0("PCoA2"))+
  theme_bw()

There is not a good separation for gender.

We can zoom into the Oral samples and check if there is a division by body sub-site.

ggplot(metadata2[metadata2$body_site == "OR",],aes(PCoA1,PCoA2,col = body_subsite))+
  geom_point()+
  xlab(paste0("PCoA1")) +
  ylab(paste0("PCoA2"))+
  theme_bw()

We can see that even within one body site, we have separation between sub-body sites.

Statistical analysis

We compare first stool to oral, because they are the body sites with the highest amount of samples.

site1 = "ST"
site2 = "OR"
metadata_st_analysis = metadata[metadata$body_site == site1 | metadata$body_site == site2,]
motus_st_analysis = motus_study_rarefied[,rownames(metadata_st_analysis)]

# calculate prevalence, as a percentage
prevalence = (rowSums(motus_st_analysis > 0) / ncol(motus_st_analysis)) * 100
# select only the species with at least a prevalence of 10%
sel_species = names(which(prevalence > 10))
length(sel_species)
## [1] 667

We have 667 species to check.

For each species, we will do a wilcoxon test to compare stool to oral samples, and save the result in t_test_results:

t_test_results = rep(NA,length(sel_species))
names(t_test_results) = sel_species
for(species in sel_species){
  site1_measure = motus_st_analysis[species,which(metadata_st_analysis$body_site == site1)]
  site2_measure = motus_st_analysis[species,which(metadata_st_analysis$body_site == site2)]
  t_test_results[species] = wilcox.test(site1_measure,site2_measure)$p.value
}

# we sort the 
t_test_results = sort(t_test_results)
head(t_test_results)
##                Bacteroides dorei/vulgatus [ref_mOTU_v3_02367] 
##                                                 5.338539e-163 
##                     [Eubacterium] rectale [ref_mOTU_v3_03657] 
##                                                 3.959227e-160 
##                           Bacteroides sp. [ref_mOTU_v3_03475] 
##                                                 4.418403e-159 
## Flavobacteriaceae species incertae sedis [meta_mOTU_v3_12274] 
##                                                 4.672520e-153 
##                     Bacteroides uniformis [ref_mOTU_v3_00855] 
##                                                 6.545883e-150 
##              Faecalibacterium prausnitzii [ref_mOTU_v3_06108] 
##                                                 2.435261e-147

We can see that Bacteroides dorei/vulgatus is the one with the lowest value, and it is enriched in stool samples.

On the other hand, Flavobacteriaceae species incertae sedis [meta_mOTU_v3_12274] is the species with the lowest p-value that is enriched in oral samples.

Let’s check these two species:

species1 = names(t_test_results[1])
species2 = names(t_test_results[4])
df_1 = data.frame(
  site = c(metadata_st_analysis$body_site,metadata_st_analysis$body_site),
  species = c(rep(species1,nrow(metadata_st_analysis)), rep(species2,nrow(metadata_st_analysis))),
  measures = c(motus_st_analysis[species1,],motus_st_analysis[species2,])
)

library(ggplot2)
ggplot(df_1,aes(site,log10(measures))) +
  geom_boxplot() +
  geom_jitter()+
  facet_grid(. ~ species)
## Warning: Removed 918 rows containing non-finite values (stat_boxplot).
## Warning: Removed 918 rows containing missing values (geom_point).

We can see that, as expected, there is a different distribution of these two species in the two body sites.

Since we did a test for each species, we need to correct for multiple testing correction:

adjusted_pvals = sort(p.adjust(t_test_results,method = "fdr"))
head(adjusted_pvals)
##                Bacteroides dorei/vulgatus [ref_mOTU_v3_02367] 
##                                                 3.560805e-160 
##                     [Eubacterium] rectale [ref_mOTU_v3_03657] 
##                                                 1.320402e-157 
##                           Bacteroides sp. [ref_mOTU_v3_03475] 
##                                                 9.823582e-157 
## Flavobacteriaceae species incertae sedis [meta_mOTU_v3_12274] 
##                                                 7.791427e-151 
##                     Bacteroides uniformis [ref_mOTU_v3_00855] 
##                                                 8.732208e-148 
##              Faecalibacterium prausnitzii [ref_mOTU_v3_06108] 
##                                                 2.355780e-145

How many are significantly associated?

length(which(adjusted_pvals<0.05))
## [1] 667

There are 667 species that are differentially abundant between stool and oral samples, which are all the species that we selected after the prevalence filter. This is in agreement with what we saw from the beta-diversity results. In fact, the the distances between samples from different body-sites were higher than the within-samples.

Moreover, the distances between stool and oral samples were on average close to 1 (which is the highest possible value for the Jaccard index). And this result means that there is no great overlap between the species from the two samples. And the test that we did confirm this.

Statistical modeling

We load the libraries:

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(ROCR)

And we can build a random forest model:

labels = as.factor(metadata$body_site)
features = t(log10(motus_study_rarefied+0.5))
RF<-randomForest(x=features,y=labels,importance=TRUE, proximity=TRUE)
RF
## 
## Call:
##  randomForest(x = features, y = labels, importance = TRUE, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 56
## 
##         OOB estimate of  error rate: 0.72%
## Confusion matrix:
##     OR SK  ST VG class.error
## OR 512  0   0  0 0.000000000
## SK   1 36   1  0 0.052631579
## ST   1  0 412  0 0.002421308
## VG   2  1   1  6 0.400000000

We can check the importance of the features:

importance<-as.data.frame(RF$importance)
head(importance[order(importance$MeanDecreaseAccuracy,decreasing = T),],n=10)

As we can see there a high overlap with the results from the statistical analysis. The species lavobacteriaceae species incertae sedis [meta_mOTU_v3_12274], Bacteroides dorei/vulgatus [ref_mOTU_v3_02367] and [Eubacterium] rectale [ref_mOTU_v3_03657] were identified also in the statistical analysis as the top differentially abundant species.

We can load the useful function we saw in the exercises:

getROC<-function(RF,newdat,fct){
  prediction_for_roc_curve <- predict(RF,newdat,type="prob")
true.values<-as.character(fct)
pred.values<-as.character(apply(prediction_for_roc_curve,1,function(x){names(x)[which.max(x)]}))
tmp<-prediction_for_roc_curve
colnames(tmp)<-paste("prob.",colnames(tmp),sep="")
pred.mat<-data.frame(true=true.values,predicted=pred.values,tmp)
conf.mat<-table(true.values,pred.values)

fpr<-NULL
tpr<-NULL
auc<-NULL
fct_lev<-NULL
for (i in levels(fct)){
  true_values <- ifelse(fct==i,1,0)
  pred <- prediction(prediction_for_roc_curve[,i],true_values)
  perf <- performance(pred, "tpr", "fpr")
  fpr<-c(fpr,unlist(perf@x.values))
  tpr<-c(tpr,unlist(perf@y.values))
  auc<-c(auc,unlist(performance(pred, measure = "auc")@y.values))
  fct_lev<-c(fct_lev,rep(i,length(unlist(perf@x.values))))
}
roc.df<-data.frame(fpr,tpr,fct_lev)
names(auc)<-levels(fct)
list(conf.mat=conf.mat,pred.mat=pred.mat,ROC=roc.df,AUC=auc)
}

And try to predict exactly on the same samples that we use for training the random forest:

res<-getROC(RF,features,labels)
res$conf.mat
##            pred.values
## true.values  OR  SK  ST  VG
##          OR 512   0   0   0
##          SK   0  38   0   0
##          ST   0   0 413   0
##          VG   0   0   0  10
res$AUC
## OR SK ST VG 
##  1  1  1  1

Plot:

ggplot(data=res$ROC,aes(x=fpr,y=tpr,col=fct_lev)) +
  geom_abline(linetype=2) +
  geom_line() +
  scale_color_discrete(label=paste(names(res$AUC)," (AUC: ",round(res$AUC,3),")",sep="")) +
  theme(legend.title=element_blank())

As expected the AUC is 1 and all the prediction are perfect. As you can see also from the confusion matrix, all samples are perfectly classified to the correct body site.

In order to do a real comparison, we need to remove some samples, do the training and then test if we correctly predict the removed samples.

For this exercise we will remove 10% of the samples from each body site, and check if we can correctly predict it:

# choose random 
sel_test = sample(ST_sample,floor(length(ST_sample) * 0.1),replace = F)
sel_test = c(sel_test,sample(OR_sample,floor(length(OR_sample) * 0.1),replace = F))
sel_test = c(sel_test,sample(SK_sample,floor(length(SK_sample) * 0.1),replace = F))
sel_test = c(sel_test,sample(VG_sample,floor(length(VG_sample) * 0.1),replace = F))

sel_train = colnames(motus_study_rarefied)[! colnames(motus_study_rarefied) %in% sel_test]

# select the labels and features
labels_test = as.factor(metadata[sel_test,]$body_site)
features_test = t(log10(motus_study_rarefied[,sel_test]+0.5))

labels_train = as.factor(metadata[sel_train,]$body_site)
features_train = t(log10(motus_study_rarefied[,sel_train]+0.5))

# train
RF<-randomForest(x=features_train,y=labels_train,importance=TRUE, proximity=TRUE)
# predict
res<-getROC(RF,features_test,labels_test)
res$conf.mat
##            pred.values
## true.values OR SK ST
##          OR 51  0  0
##          SK  0  3  0
##          ST  0  0 41
##          VG  1  0  0

As we can see, we can still perfectly predict all samples. We remove 51 samples from oral and we all correctly classify them to belong to oral cavity. Similarly for skin (n = 3 samples), stool (n = 41 samples) and vagina (n = 1 sample).

res$AUC
##        OR        SK        ST        VG 
## 0.9932462 1.0000000 1.0000000 0.9000000

And the AUC is 1 as expected, which means it is a perfect classifier.

From this results, we can say that it is trivial to distinguish between different body sites. This is in agreement with the beta-diversity results that we saw before (there is a high difference between body sites), and the results from the statistical analysis (all species are differentially abundant).

In order to do a proper evaluation of the model, we would have to do a cross validation, where we repeat the process of removing some samples (test set) multiple times.

Study the functional properties of the species associated to cancer

Comparing Stool (ST) to Oral cavity (OR) samples, we identified Bacteroides dorei/vulgatus [ref_mOTU_v3_02367] as one of the most differentially abundant species, as it is present almost only in stool samples. On the other hand, the species that is differentially the most abundant in oral cavity samples is Flavobacteriaceae species incertae sedis [meta_mOTU_v3_12274].

For this project, we collected the available reference genomes for “Bacteroides dorei/vulgatus”, but for “meta_mOTU_v3_12274” there are no available reference genomes. Additionally, we built metagenome assembled genomes (MAGs) directly from the metagenomic samples. Finally, we predicted and annotated the genes for each genome. You can download the annotations directly in R using the following command:

We load the data:

load(url("https://sunagawalab.ethz.ch/share/teaching_milanese/motus_02367_genomes.Rdata"))
load(url("https://sunagawalab.ethz.ch/share/teaching_milanese/motus_12274_genomes.Rdata"))

We can have a look at the genomes_information:

motus_02367_genomes[1:4,1:5]

and:

motus_12274_genomes[1:4,1:5]

Where the rownames are the genome IDs. The column genome_type is either “MAG” or “Reference_genome”, which describes the type of assembly for the genomes (from a metagenomic sample or an isolate species respectively). The second column describes the length of the genome, and from the third to the last there are the names of the genes that were annotated. For the genes we have 1 when the gene is present in the genome and 0 when the gene is absent.

We select the data for the COX gene from the Bacteroides dorei/vulgatus:

data_stool_bacteroides = motus_02367_genomes[,c("genome_type","genome_length","COX")]
head(data_stool_bacteroides)

How many genomes there are?

nrow(data_stool_bacteroides)
## [1] 114

How many genomes have the COX gene?

sum(data_stool_bacteroides$COX)
## [1] 1

1 out of 114 genomes have the COX gene. Since in the human gut there is no oxygen, it makes sense that the most abundant species is not using oxygen. We can investigate more the genome for which the COX gene is present.

We sort the genomes by length:

tail(data_stool_bacteroides[order(data_stool_bacteroides$genome_length),])

We can see that the only MAG that contain the COX gene is significantly longer than the other genomes (6.2Mb compared to 5.2Mb). Hence, it can also be that the reconstructed MAG GUT_GENOME001627 contains some contamination (i.e. DNA sequences from another species).

We can check the meta-mOTUs species identified in the oral samples:

data_oral_flavobacteriaceae = motus_12274_genomes[,c("genome_type","genome_length","COX")]
head(data_oral_flavobacteriaceae)

How many genomes there are?

nrow(data_oral_flavobacteriaceae)
## [1] 44

How many genomes have the COX gene?

sum(data_oral_flavobacteriaceae$COX)
## [1] 30

30 out of 44 genomes have the COX gene. At this point, it seems that COX is not present in all genomes. But we can investigate this more.

In general, MAGs have a lower quality compared to reference genomes. From the table we have also the information about the length of the genomes, we can try to plot length of the genomes and presence absence of the FadA gene:

ggplot(data_oral_flavobacteriaceae,aes(factor(COX),genome_length)) + 
  geom_boxplot() +
  geom_jitter() +
  xlab("Is COX present") +
  ylab("Genome length")

We can see that genomes that are missing the COX gene are shorter. And genomes where the COX gene is present are on average longer.

From this, we understand that it is highly possible that the genomes that are missing the COX gene, are just incomplete genomes. If we would have a complete genome, we would probably have the COX gene.

One solution, is to use long reads to reconstruct complete MAGs.