Data preparation

You can find the paper in pubmed using the keywords “Wirbel”, “nature medicine” and “2019”: https://pubmed.ncbi.nlm.nih.gov/?term=wirbel+nature+medicine+2019

and the first entry is: https://pubmed.ncbi.nlm.nih.gov/30936547/

with pubmed ID 30936547. The NIH provide the paper for free at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7984229/

Under the “Data Availability Statement” we can find the information about the sequencing data:

“The raw sequencing data for the samples in the DE study that had not been published before (see Methods), are made available in the European Nucleotide Archive (ENA) under the study identifier PRJEB27928. Metadata for these samples are available as Supplementary Table S6.”

From the study identifier, we can go into ENA and search for the project ID: https://www.ebi.ac.uk/ena/browser/text-search?query=PRJEB27928

From there we could select one Experiment: https://www.ebi.ac.uk/ena/browser/view/ERX2739789?show=reads

And at the bottom of the page there are the links to ftp to download the fastq files.

Data processing

Normally, you would download the samples that you identified from the previous step. From the sample selected before, it would be:

wget http://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/007/ERR2726417/ERR2726417_1.fastq.gz
wget http://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/007/ERR2726417/ERR2726417_2.fastq.gz

Here we will work with three mock samples, that were prepared for this project:

wget https://zenodo.org/record/6517497/files/sampleA_1.fastq
wget https://zenodo.org/record/6517497/files/sampleA_2.fastq

wget https://zenodo.org/record/6517497/files/sampleB_1.fastq
wget https://zenodo.org/record/6517497/files/sampleB_2.fastq

wget https://zenodo.org/record/6517497/files/sampleC_1.fastq
wget https://zenodo.org/record/6517497/files/sampleC_2.fastq

Note that we refer to “mock” samples, because they are not real samples. Real samples have few millions reads, and here we work with samples that have less than 100,000 reads to make the computation more manageable.

Note that the sequencing data often comes in pairs (_1 and _2), representing forward and reverse reads.

One important step in data processing is to check the quality of the dataset and remove low quality samples, here we provide you already with high-quality samples.

Taxonomic profiling

We can run mOTUs on the three samples that we downloaded before with the commands:

motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA.motus
motus profile -f sampleB_1.fastq -r sampleB_2.fastq -n sampleB -o sampleB.motus
motus profile -f sampleC_1.fastq -r sampleC_2.fastq -n sampleC -o sampleC.motus

Note that here we are using mock samples (only few Mb files), but normally the fastq samples generated for a microbiome study are few Gb. In order to do this analysis, you would normally use a high performance computing cluster.

The result of running mOTUs is (here printing only the first 10 lines):

 $ head sampleA.motus
# git tag version 3.0.3 |  motus version 3.0.3 | map_tax 3. [...]
# call: python /nfs/nas22/fs2202/biol_micro_teaching/softwa [...]
#consensus_taxonomy     sampleA
Leptospira alexanderi [ref_mOTU_v3_00001]       0.0000000000
Leptospira weilii [ref_mOTU_v3_00002]   0.0000000000
Chryseobacterium sp. [ref_mOTU_v3_00004]        0.0000000000
Chryseobacterium gallinarum [ref_mOTU_v3_00005] 0.0000000000
Chryseobacterium indologenes [ref_mOTU_v3_00006]        0.0000000000
Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007]      0.0000000000
Chryseobacterium jejuense [ref_mOTU_v3_00008]   0.0000000000

The first two lines contain information about the version of the tool that was used and the options that were specified.

The third line contains the header.

From the fourth line there are the species and their relative abundance.

Data gathering

In this step we will pull together all the samples we generated before. For this we can use motus:

motus merge -i sampleA.motus,sampleB.motus,sampleC.motus -o all_samples.motus

And we can look at the generated table (head all_samples.motus):

# motus version 3.0.3 | merge 3.0.3 | info merged profiles:  [...]
# call: python /nfs/nas22/fs2202/biol_micro_teaching/softwar [...]
#consensus_taxonomy     sampleA sampleB sampleC
Leptospira alexanderi [ref_mOTU_v3_00001]       0.0000000000    0.0000000000    0.0000000000
Leptospira weilii [ref_mOTU_v3_00002]   0.0000000000    0.0000000000    0.0000000000
Chryseobacterium sp. [ref_mOTU_v3_00004]        0.0000000000    0.0000000000    0.0000000000
Chryseobacterium gallinarum [ref_mOTU_v3_00005] 0.0000000000    0.0000000000    0.0000000000
Chryseobacterium indologenes [ref_mOTU_v3_00006]        0.0000000000    0.0000000000    0.0000000000
Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007]      0.0000000000    0.0000000000    0.0000000000
Chryseobacterium jejuense [ref_mOTU_v3_00008]   0.0000000000    0.0000000000    0.0000000000

We have a structure similar to the profile we looked at before. Two headers with the information about the parameters that were used for the mOTUs tool and then the header (third row).

There are four columns, the first one contains the names of the species and the other three contain the measure relative abundance.

We can load it into R with the command:

motus_profiles = read.csv("/nfs/course/course_home/alessiom/all_samples.motus",header = T,sep = "\t",skip = 2, row.names = 1)

Note that:

We can check what we imported with the command motus_profiles[1:3,], which results in:

                                          sampleA sampleB sampleC
Leptospira alexanderi [ref_mOTU_v3_00001]       0       0       0
Leptospira weilii [ref_mOTU_v3_00002]           0       0       0
Chryseobacterium sp. [ref_mOTU_v3_00004]        0       0       0

Now that you learned how to create a taxonomic profiling table, and how to load it into R, we will provide you with a real table with profiles from many samples and the relative metadata for the samples.

Note that normally, it would take few hours to process one sample.

For the rest of the project we will not use anymore sampleA, sampleB and sampleC.

You can load the taxonomic table and the metadata of the new profiles directly in R with the command:

load(url("https://zenodo.org/record/6524317/files/motus_profiles_study1.Rdata"))

Let’s check the metadata:

head(meta_study1)

The column Group contains the information about the samples. If they are colorectal cancer (CRC) or healthy people (controls as CTR).

How many we have for each category:

table(meta_study1$Group)
## 
## CRC CTR 
##  46  60

We can also have a look at the taxonomic profiling table:

motus_study1[1:4,1:4]
##                                                 sample_1 sample_2 sample_3
## Leptospira alexanderi [ref_mOTU_v3_00001]              0        0        0
## Leptospira weilii [ref_mOTU_v3_00002]                  0        0        0
## Chryseobacterium sp. [ref_mOTU_v3_00004]               0        0        0
## Chryseobacterium gallinarum [ref_mOTU_v3_00005]        0        0        0
##                                                 sample_4
## Leptospira alexanderi [ref_mOTU_v3_00001]              0
## Leptospira weilii [ref_mOTU_v3_00002]                  0
## Chryseobacterium sp. [ref_mOTU_v3_00004]               0
## Chryseobacterium gallinarum [ref_mOTU_v3_00005]        0

We can check also the size of the table:

dim(motus_study1)
## [1] 33571   106

There are 33,571 species (mOTUs) and 106 samples.

First we can check if these are relative abundances or counts. We can look at the sum of the columns:

head(colSums(motus_study1))
## sample_1 sample_2 sample_3 sample_4 sample_5 sample_6 
##    23740     5081    31832    22608    13650    23555

We have counts. And these counts represents the sequencing depth of the different samples.

Let’s plot it as an histogram:

hist(colSums(motus_study1),50,main = "Histogram of sequencing depth")

mOTUs reports the measures for all species, hence many species will be zero across all samples.

Let’s see how many species have at least one read in any of the samples:

sum_per_species = rowSums(motus_study1)

length(which(sum_per_species > 0))
## [1] 3340

There are 3,340 species that have been profiled.

Let’s check which species are the most prevalent:

prevalence = rowSums(motus_study1 > 0) / ncol(motus_study1)
# we sort it
prevalence = sort(prevalence,decreasing = T)
# and check the first 10
head(prevalence)
##      Bacteroides dorei/vulgatus [ref_mOTU_v3_02367] 
##                                           1.0000000 
##                                          unassigned 
##                                           1.0000000 
##          Blautia obeum/wexlerae [ref_mOTU_v3_02154] 
##                                           0.9811321 
##      Parabacteroides distasonis [ref_mOTU_v3_03640] 
##                                           0.9811321 
## Ruthenibacterium lactatiformans [ref_mOTU_v3_04716] 
##                                           0.9811321 
##             Anaerostipes hadrus [ref_mOTU_v3_00857] 
##                                           0.9716981

The mOTU Bacteroides dorei/vulgatus is present in all samples.

The term unassigned is also present in all samples, and it represent the abundance of reads that did not match to any known mOTUs.

The mOTU Blautia obeum/wexlerae is present in 98% of the samples.

Filter samples

As we saw before, there are few samples with low sequencing depth. We filter out samples with less than 5,000 reads:

samples_more_5000 = which(colSums(motus_study1) > 5000)
motus_study2 = motus_study1[,samples_more_5000]

We can check the size of the new table:

dim(motus_study2)
## [1] 33571   102

We went from 106 samples to 102.

We will also rarefy the table to be able to compare between different samples:

library("vegan")
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
set.seed(0)
motus_study_rarefied = t(rrarefy(t(motus_study2),5000))

We can also remove all the species that were not measured:

sel_species = which(rowSums(motus_study_rarefied) > 0)
motus_study_rarefied = motus_study_rarefied[sel_species,]

And the new taxonomic table has:

dim(motus_study_rarefied)
## [1] 2888  102

has 2,888 mOTUs and 102 samples.

In order to make it easier, we will also remove these samples from the metadata:

metadata = as.data.frame(meta_study1[colnames(motus_study_rarefied),])

Data analysis: alpha-diversity

The alpha-diversity measures are calculated per sample. We can calculate these indices using the vegan package.

shannon = diversity(t(motus_study_rarefied), index="shannon")
richness = specnumber(t(motus_study_rarefied))
pielou_evenness = shannon/log(richness)

We can add these measure to the metadata, to make it easier to compare:

metadata2 = metadata
metadata2$shannon = shannon
metadata2$richness = richness
metadata2$pielou_evenness = pielou_evenness
head(metadata2)

Let’s see if there is a difference in alpha-indices between healthy and controls

library(ggplot2)
ggplot(metadata2, aes(Group,shannon)) + geom_boxplot()

ggplot(metadata2, aes(Group,richness)) + geom_boxplot()

ggplot(metadata2, aes(Group,pielou_evenness)) + geom_boxplot()

Overall, we don’t see a big difference. We can also test it:

wilcox.test(metadata2[metadata2$Group == "CTR",]$shannon,
            metadata2[metadata2$Group == "CRC",]$shannon)$p.value
## [1] 0.6467465
wilcox.test(metadata2[metadata2$Group == "CTR",]$richness,
            metadata2[metadata2$Group == "CRC",]$richness)$p.value
## [1] 0.1787333
wilcox.test(metadata2[metadata2$Group == "CTR",]$pielou_evenness,
            metadata2[metadata2$Group == "CRC",]$pielou_evenness)$p.value
## [1] 0.8345069

None is significant.

We can check few other metadata parameters. We will use only Shannon here:

ggplot(metadata2, aes(Gender,richness)) + geom_boxplot()

It seems that Males have a slightly higher Shannon index on average, but it doesn’t seems there are significant differences.

We can check the correlation to continuous values:

cor.test(metadata2$Age,metadata2$shannon)
## 
##  Pearson's product-moment correlation
## 
## data:  metadata2$Age and metadata2$shannon
## t = -0.11406, df = 100, p-value = 0.9094
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2054241  0.1834770
## sample estimates:
##        cor 
## -0.0114048

There is no correlation to age.

cor.test(metadata2$shannon,metadata2$BMI,use="complete.obs")
## 
##  Pearson's product-moment correlation
## 
## data:  metadata2$shannon and metadata2$BMI
## t = -1.1521, df = 98, p-value = 0.2521
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.30508640  0.08270056
## sample estimates:
##        cor 
## -0.1155953

There is no correlation to BMI.

So overall there are no big differences between the different subjects, in terms of alpha-diversity. Which means that the human gut microibiome from colorectal cancer patients has the same diversity (and richness and evenness) as healthy individuals.

Data analysis: beta-diversity and ordination

We can evaluate the properties of the beta diversity:

all_vs_all_beta = as.matrix(vegdist(t(motus_study_rarefied),method = "jaccard"))

# prepare data to save the results
within_CRC = c()
within_CTR = c()
between_CRC_and_CTR = c()

# we can go through all the elements in the upper diagonal of the matrix using
for(i in c(1:(ncol(all_vs_all_beta)-1) )){
    for(j in c((i+1) : ncol(all_vs_all_beta) )){
        # sample 1
        s1 = rownames(all_vs_all_beta)[i]
        # sample 2
        s2 = rownames(all_vs_all_beta)[j]
        # this distance
        this_distance = all_vs_all_beta[i,j]
        # save the data or do comparisons...
        if(metadata[s1,]$Group == "CRC" & metadata[s2,]$Group == "CRC"){
          within_CRC = c(within_CRC,this_distance)
        }
        if(metadata[s1,]$Group == "CTR" & metadata[s2,]$Group == "CTR"){
          within_CTR = c(within_CTR,this_distance)
        }
        if(metadata[s1,]$Group == "CTR" & metadata[s2,]$Group == "CRC"){
          between_CRC_and_CTR = c(between_CRC_and_CTR,this_distance)
        }
    }
}

We can check the median:

median(within_CRC)
## [1] 0.8944168
median(within_CTR)
## [1] 0.8726043
median(between_CRC_and_CTR)
## [1] 0.8865382

Note that the vegan package calculates dissimilarities by default, hence here we measured the Jaccard dissimilarity (and not the Jaccard index as explained in the lectures).

We can plot it:

df = data.frame(
  measures = c(within_CRC,within_CTR,between_CRC_and_CTR),
  type = c(rep("within_CRC",length(within_CRC)),rep("within_CTR",length(within_CTR)),rep("between_CRC_and_CTR",length(between_CRC_and_CTR)))
)
ggplot(df,aes(type,measures))+
  geom_boxplot() +
  ylab("Beta-diversity")+
  ggtitle("Jaccard dissimilarity")

Overall, there are no big differences. This has a similar message to what we saw for the alpha-diversity. The samples are homogeneous and there is not a big difference between colorectal cancer samples and controls.

Plot the samples in a PCoA plot

Let’s visualize the samples in an ordination plot using the Jaccard index (a PCoA plot).

First we calculate the PCoA:

all_vs_all_beta = as.matrix(vegdist(t(motus_study_rarefied),method = "jaccard"))
data_PCoA = cmdscale(all_vs_all_beta)

And we add the information to the metadata, to make it easier to visualise

metadata2 = metadata
metadata2$PCoA1 = data_PCoA[,1]
metadata2$PCoA2 = data_PCoA[,2]

We can now plot the PCoA using the different metadata information:

ggplot(metadata2,aes(PCoA1,PCoA2,col = Group))+
  geom_point()+
  xlab(paste0("PCoA1")) +
  ylab(paste0("PCoA2"))+
  theme_bw()

Overall, it seems that there is not a big separation between controls and colorectal cancer samples. Which is in agreement with the beta-diversity results.

We can also check other metadata measure:

ggplot(metadata2,aes(PCoA1,PCoA2,col = Gender))+
  geom_point()+
  xlab(paste0("PCoA1")) +
  ylab(paste0("PCoA2"))+
  theme_bw()

There is not a good separation also for gender.

Similarly to the alpha diversity, from the analysis of the beta-diversity we can see that there are no extreme differences between the microbiome of CRC patients and controls.

Statistical analysis

Here we want to identify which species are enriched in carcer or healthy individuals.

We first find which species have a prevalence above 10%:

# calculate prevalence, as a percentage
prevalence = (rowSums(motus_study_rarefied > 0) / ncol(motus_study_rarefied)) * 100
# select only the species with at least a prevalence of 10%
sel_species = names(which(prevalence > 10))
length(sel_species)
## [1] 906

There are 906 species with a prevalence of at least 10%, which we will use for testing.

Here we will test which species are associated to colorectal cancer. We will do a wilcoxon test for each species:

t_test_results = rep(NA,length(sel_species))
names(t_test_results) = sel_species
for(species in sel_species){
  crc = motus_study_rarefied[species,which(metadata$Group == "CRC")]
  ctr = motus_study_rarefied[species,which(metadata$Group == "CTR")]
  t_test_results[species] = wilcox.test(crc,ctr)$p.value
}

Let’s check which one have the lowest p-value:

head(sort(t_test_results))
##                  Dialister pneumosintes [ref_mOTU_v3_03630] 
##                                                6.806358e-05 
## Fusobacterium nucleatum subsp. animalis [ref_mOTU_v3_01001] 
##                                                8.816910e-05 
##   Clostridiales species incertae sedis [meta_mOTU_v3_12371] 
##                                                1.312165e-04 
##                    Hungatella hathewayi [ref_mOTU_v3_03435] 
##                                                2.492116e-04 
##                    [Eubacterium] hallii [ref_mOTU_v3_03632] 
##                                                3.116905e-04 
##    Clostridiales species incertae sedis [ext_mOTU_v3_16398] 
##                                                3.309982e-04

Since we did a test for each species, we need to correct for multiple testing correction:

adjusted_pvals = sort(p.adjust(t_test_results,method = "fdr"))
head(adjusted_pvals)
## Fusobacterium nucleatum subsp. animalis [ref_mOTU_v3_01001] 
##                                                  0.03962740 
##                  Dialister pneumosintes [ref_mOTU_v3_03630] 
##                                                  0.03962740 
##   Clostridiales species incertae sedis [meta_mOTU_v3_12371] 
##                                                  0.03962740 
##                    Hungatella hathewayi [ref_mOTU_v3_03435] 
##                                                  0.04998072 
##                    [Eubacterium] hallii [ref_mOTU_v3_03632] 
##                                                  0.04998072 
##    Clostridiales species incertae sedis [ext_mOTU_v3_16398] 
##                                                  0.04998072

How many are significantly associated?

length(which(adjusted_pvals<0.05))
## [1] 6

There are 6 species that are significantly associated after multiple testing correction.

And one of the top associated is Fusobacterium nucleatum subsp. animalis [ref_mOTU_v3_01001], which we will use in the last part of this project.

The fact that there are only 6 species that are significantly associated, out of 906, is in agreement with what we saw for the alpha- and beta-diversity. Since there are no big differences between the two groups, we also expect that the majority of the species have a similar relative abundance.

Statistical modeling

Here we want to build a model that can classify helthy and disease patients based on the microbial species composition.

We load the libraries:

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(ROCR)

And we can build a random forest model:

labels = as.factor(metadata$Group)
features = t(log10(motus_study_rarefied+0.5))
RF<-randomForest(x=features,y=labels,importance=TRUE, proximity=TRUE)
RF
## 
## Call:
##  randomForest(x = features, y = labels, importance = TRUE, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 53
## 
##         OOB estimate of  error rate: 22.55%
## Confusion matrix:
##     CRC CTR class.error
## CRC  26  19  0.42222222
## CTR   4  53  0.07017544

We can check the importance of the features:

importance<-as.data.frame(RF$importance)
head(importance[order(importance$MeanDecreaseAccuracy,decreasing = T),],n=10)

As we can see there a high overlap with the results from the statistical analysis.

We can load the useful function we saw in the exercises:

getROC<-function(RF,newdat,fct){
  prediction_for_roc_curve <- predict(RF,newdat,type="prob")
true.values<-as.character(fct)
pred.values<-as.character(apply(prediction_for_roc_curve,1,function(x){names(x)[which.max(x)]}))
tmp<-prediction_for_roc_curve
colnames(tmp)<-paste("prob.",colnames(tmp),sep="")
pred.mat<-data.frame(true=true.values,predicted=pred.values,tmp)
conf.mat<-table(true.values,pred.values)

fpr<-NULL
tpr<-NULL
auc<-NULL
fct_lev<-NULL
for (i in levels(fct)){
  true_values <- ifelse(fct==i,1,0)
  pred <- prediction(prediction_for_roc_curve[,i],true_values)
  perf <- performance(pred, "tpr", "fpr")
  fpr<-c(fpr,unlist(perf@x.values))
  tpr<-c(tpr,unlist(perf@y.values))
  auc<-c(auc,unlist(performance(pred, measure = "auc")@y.values))
  fct_lev<-c(fct_lev,rep(i,length(unlist(perf@x.values))))
}
roc.df<-data.frame(fpr,tpr,fct_lev)
names(auc)<-levels(fct)
list(conf.mat=conf.mat,pred.mat=pred.mat,ROC=roc.df,AUC=auc)
}

And try to predict on the training data:

res<-getROC(RF,features,labels)
res$conf.mat
##            pred.values
## true.values CRC CTR
##         CRC  45   0
##         CTR   0  57

As we can see from the confusion matrix, we perfectly predict all sample. There are 45 CRC samples and we classify all of them to be CRC. Similarly, for the controls (CTR) we correctly classify all of them.

res$AUC
## CRC CTR 
##   1   1

Plot:

ggplot(data=res$ROC,aes(x=fpr,y=tpr,col=fct_lev)) +
  geom_abline(linetype=2) +
  geom_line() +
  scale_color_discrete(label=paste(names(res$AUC)," (AUC: ",round(res$AUC,3),")",sep="")) +
  theme(legend.title=element_blank())

As expected the AUC is 1 and all the prediction are perfect.

In order to do a real comparison, we need to remove some samples, do the training and then test if we correctly predict the removed samples.

For this exercise we will remove 10 samples, and check if we can correctly predict it:

# choose random 
sel_test = sample(colnames(motus_study_rarefied),10,replace = F)
sel_train = colnames(motus_study_rarefied)[! colnames(motus_study_rarefied) %in% sel_test]

# select the labels and features
labels_test = as.factor(metadata[sel_test,]$Group)
features_test = t(log10(motus_study_rarefied[,sel_test]+0.5))

labels_train = as.factor(metadata[sel_train,]$Group)
features_train = t(log10(motus_study_rarefied[,sel_train]+0.5))

# train
RF<-randomForest(x=features_train,y=labels_train,importance=TRUE, proximity=TRUE)
# predict
res<-getROC(RF,features_test,labels_test)
res$conf.mat
##            pred.values
## true.values CRC CTR
##         CRC   2   1
##         CTR   1   6
res$AUC
##       CRC       CTR 
## 0.8571429 0.8571429

We can plot the AUCs:

ggplot(data=res$ROC,aes(x=fpr,y=tpr,col=fct_lev)) +
  geom_abline(linetype=2) +
  geom_line() +
  scale_color_discrete(label=paste(names(res$AUC)," (AUC: ",round(res$AUC,3),")",sep="")) +
  theme(legend.title=element_blank())

From this results, we can say that there is a microbiome signature in colorectal cancer. In fact, the AUC is well above 0.5, which would means that the classifier is not a random classifier.

We saw from the alpha- and beta-diversity analysis that there is not a big difference between the samples. And, from the statistical testing, we saw that there are only few species that are differentially abundant. As a consequence, we also expect that the classification will not be easy. Nevertheless, there is a microbiome signal, which allowed us to correctly predict 8 out of the 10 test samples.

In order to do a proper evaluation of the model, we would have to do a cross validation, where we repeat the process of removing some samples (test set) multiple times.

Study the functional properties of the species associated to cancer

Previously, we identified Fusobacterium nucleatum subsp. animalis [ref_mOTU_v3_01001] as one of the species that showed the highest association to colorectal cancer.

For this project, we collected the available reference genomes for “Fusobacterium nucleatum subsp. animalis”. Additionally, we built metagenome-assembled genomes (MAGs) directly from the metagenomic samples. Finally, we predicted and annotated the genes for each genome.

We load the data:

load(url("https://sunagawalab.ethz.ch/share/teaching_milanese/CRC_genomes_data_functions.Rdata"))

We can have a look at the genomes_information:

genomes_information[1:4,1:5]

Where the rownames are the genome IDs. The column genome_type is either “MAG” or “Reference_genome”, which describes the type of assembly for the genomes (from a metagenomic sample or an isolate species respectively). The second column describes the length of the genome, and from the third to the last there are the names of the genes that were annotated. For the genes we have 1 when the gene is present in the genome and 0 when the gene is absent.

We select the data for the FadA gene:

data_fadA = genomes_information[,c("genome_type","genome_length","FadA")]
head(data_fadA)

How many genomes there are?

nrow(data_fadA)
## [1] 70

How many genomes have FadA?

sum(data_fadA$FadA)
## [1] 45

45 out of 70 genomes have the FadA gene. At this point, it seems that FadA is not present in all genomes. But we can investigate this more.

Let’s see how many Reference genomes and MAGs there are:

table(data_fadA$genome_type)
## 
##              MAG Reference_genome 
##               51               19

We can check if there is a difference between Reference genomes and MAGs:

# how many Reference genomes have the FadA?
sum(data_fadA[data_fadA$genome_type == "Reference_genome",]$FadA)
## [1] 17

17 out of 19 Reference genomes

# how many MAGs have the FadA?
sum(data_fadA[data_fadA$genome_type == "MAG",]$FadA)
## [1] 28

28 out of 51 MAGs have FadA.

The majority of the Reference genomes have the FadA gene, while only half of the MAGs have the FadA gene.

In general, MAGs have a lower quality compared to reference genomes. Maybe there is something else going on here. From the table we have also the information about the length of the genomes, we can try to plot length of the genomes and presence absence of the FadA gene:

ggplot(data_fadA,aes(factor(FadA),genome_length)) + 
  geom_boxplot() +
  geom_jitter() +
  xlab("Is FadA present") +
  ylab("Genome length")

We can see that genomes that are missing the FadA gene are shorter. And genomes where the FadA gene is present are on average longer.

From this, we understand that it is highly possible that the genomes that are missing the FadA gene, are just incomplete genomes. If we would have a complete genome, we would probably have the FadA gene.

One solution, is to use long reads to reconstruct complete MAGs.