## GTDB taxonomy for mOTUs

We annotated all genomes used for mOTUs with GTDB-Tk. We then merged the genome annotations at the mOTU level, where each mOTU represents a cluster of genomes (see below for details).

Here is the taxonomy:

| mOTUs version | File | Annotation tool |
| :--------------------: |:-------------:| :---------------:|
| mOTUs 3.0.0 - 3.0.3 | [mOTUs_3.0.0_GTDB_tax.tsv](https://sunagawalab.ethz.ch/share/MOTU_GTDB/mOTUs_3.0.0_GTDB_tax.tsv) | GTDB-Tk version 2.1 on database release [207](https://data.ace.uq.edu.au/public/gtdb/data/releases/release207/) (4.4 MB) |

## Annotation of mOTUs

Each mOTU cluster is composed of one or more genomes, and for each genome we have a GTDB annotation that looks like:
```
GUT_GENOME002602 d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides fragilis
GUT_GENOME002402 d__Bacteria;p__Firmicutes;c__Bacilli;o__RF39;f__UBA660;g__;s__
```

Note that each taxonomic level has either an annotation (example: `s__Bacteroides fragilis`) or a missing annotation (example: `s__`). For the evaluation described here, we treat an empty `s__` (or `g__`, etc.) as `NA`.

Each taxonomic level in a mOTU receives one of three annotation types:

### `Agreeing`
If at least 80% of the genomes agree on one annotation, then that annotation is selected. Note that only annotations that are not `NA` are counted. For example, for a mOTU with 20 genomes, each of the following cases yields an "agreeing" annotation:
```
Species:        s__Bacteroides fragilis
# of genomes:   20
```
Annotated as `s__Bacteroides fragilis`, as 100% of the genomes (20/20) agree at species level.

```
Species:        s__Bacteroides fragilis   NA
# of genomes:   11                        9
```
Annotated as `s__Bacteroides fragilis`, as 100% of the annotated genomes (11/11) agree at species level.

```
Species:        s__Bacteroides fragilis   s__Bacteroides vulgatus   NA
# of genomes:   11                        1                         8
```
Annotated as `s__Bacteroides fragilis`, as 91.7% of the annotated genomes (11/12) agree at species level.
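The agreement rule can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not the script used to build the taxonomy file; the function name `agreeing_annotation` and the use of the string `"NA"` for missing annotations are assumptions made for this example.

```python
from collections import Counter

def agreeing_annotation(labels, threshold=0.8):
    """Return (label, fraction) if the most common non-NA label reaches
    the threshold among the annotated genomes, otherwise None.

    `labels` holds one annotation per genome for a single taxonomic
    level; missing annotations are given as the string "NA".
    (Hypothetical helper, written only to illustrate the 80% rule.)
    """
    annotated = [label for label in labels if label != "NA"]
    if not annotated:
        return None  # no genome is annotated at this level
    label, count = Counter(annotated).most_common(1)[0]
    fraction = count / len(annotated)
    return (label, fraction) if fraction >= threshold else None

# Third example above: 11 fragilis, 1 vulgatus, 8 NA -> 11/12 agree
labels = (["s__Bacteroides fragilis"] * 11
          + ["s__Bacteroides vulgatus"]
          + ["NA"] * 8)
print(agreeing_annotation(labels))
```

The `NA` entries are dropped before the fraction is computed, which is why 11 out of 12 (and not 11 out of 20) is compared against the 80% threshold.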
### `Not annotated`
If none of the genomes has an annotation at that taxonomic level. Example:
```
Species:        NA
# of genomes:   20
```

Note that in the mOTUs taxonomy we report it as `Not_annotated []`; in the example below, the brackets hold the lowest level that still has an annotation. For example, if a mOTU is composed of 3 genomes with the annotations:
```
GUT_GENOME002402 d__Bacteria;p__Firmicutes;c__Bacilli;o__RF39; ;g__;s__
GUT_GENOME002403 d__Bacteria;p__Firmicutes;c__Bacilli;o__RF39;f__UBA660;g__;s__
GUT_GENOME002404 d__Bacteria;p__Firmicutes;c__Bacilli;o__RF39;f__UBA660;g__;s__
```
The mOTU annotation will be:
```
ref_mOTU_v3_00002 d__Bacteria;p__Firmicutes;c__Bacilli;o__RF39;f__UBA660;Not_annotated [f__UBA660];Not_annotated [f__UBA660]
```

### `Incongruent`
If the genomes do not agree (<80% agreement) at a given taxonomic level. Example:
```
Species:        s__Bacteroides fragilis   s__Bacteroides vulgatus   NA
# of genomes:   11                        7                         2
```
Here the annotation with the highest agreement is `s__Bacteroides fragilis`, but it is supported by only 11 out of 18 annotated genomes (11+7; the `NA` genomes are not counted), which is 61% and therefore below 80%. Hence this level will be annotated as `Incongruent []`.

Here is an example of a mOTU with 5 genomes:
```
GUT_GENOME002602 d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides fragilis
GUT_GENOME002603 d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides fragilis
GUT_GENOME002604 d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides vulgatus
GUT_GENOME002605 d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides vulgatus
GUT_GENOME002606 d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides vulgatus
```
At species level there are 3 `s__Bacteroides vulgatus` and 2 `s__Bacteroides fragilis` genomes, so the best agreement is 3/5 = 60%, which is below 80%.
The mOTU annotation is:
```
d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;Incongruent [g__Bacteroides]
```

Note: when a level is incongruent, all levels underneath it are also set to incongruent. Without this rule, a mOTU could be incongruent from phylum to genus level and yet appear to agree at species level (if some annotations are `NA` at species level). For example, say we have a mOTU with two genomes:
```
{'d__Bacteria': 2}
{'p__Bacteroidota': 1, 'p__Riflebacteria': 1}
{'c__Bacteroidia': 1, 'c__Ozemobacteria': 1}
{'o__Bacteroidales': 1, 'o__Ozemobacterales': 1}
{'f__Bacteroidaceae': 1, 'f__Ozemobacteraceae': 1}
{'g__Prevotella': 1, 'g__RUG334': 1}
{'NA': 1, 's__RUG334': 1}
```
One of the two genomes is not annotated at species level, so `s__RUG334` would have 100% agreement and the species level would not be "Incongruent", unlike the genus level. We prevent this, hence the mOTU annotation is:
```
ext_mOTU_v3_22969 d__Bacteria Incongruent [d__Bacteria] Incongruent [d__Bacteria] Incongruent [d__Bacteria] Incongruent [d__Bacteria] Incongruent [d__Bacteria] Incongruent [d__Bacteria]
```
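To make the three cases and the propagation rule concrete, here is a minimal Python sketch that merges the GTDB lineages of the genomes in one mOTU. It is an illustration based on the rules described above, not the code used to produce the taxonomy file. The function names `parse_lineage` and `merge_lineages` are hypothetical, and the pairing of the two lineages in the usage example is an assumption (the per-level counts above do not say which genome lacks the species annotation).

```python
from collections import Counter

def parse_lineage(lineage):
    """Split a GTDB lineage into per-level labels.
    Empty levels such as 'g__' or 's__' are mapped to 'NA'."""
    labels = []
    for field in lineage.split(";"):
        field = field.strip()
        # A bare prefix like 's__' (3 characters) carries no annotation.
        labels.append(field if len(field) > 3 else "NA")
    return labels

def merge_lineages(lineages, threshold=0.8):
    """Return one merged label per taxonomic level for a mOTU."""
    per_genome = [parse_lineage(lineage) for lineage in lineages]
    merged = []
    last_agreeing = ""   # lowest level with an agreeing annotation so far
    incongruent = False  # once set, all lower levels are incongruent too
    for level_labels in zip(*per_genome):
        if incongruent:
            merged.append(f"Incongruent [{last_agreeing}]")
            continue
        annotated = [label for label in level_labels if label != "NA"]
        if not annotated:
            merged.append(f"Not_annotated [{last_agreeing}]")
            continue
        label, count = Counter(annotated).most_common(1)[0]
        if count / len(annotated) >= threshold:
            merged.append(label)
            last_agreeing = label
        else:
            incongruent = True
            merged.append(f"Incongruent [{last_agreeing}]")
    return merged

# Two-genome example; which genome is unannotated at species level
# is assumed here for illustration.
two_genomes = [
    "d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;"
    "f__Bacteroidaceae;g__Prevotella;s__",
    "d__Bacteria;p__Riflebacteria;c__Ozemobacteria;o__Ozemobacterales;"
    "f__Ozemobacteraceae;g__RUG334;s__RUG334",
]
print(merge_lineages(two_genomes))
```

Because the `incongruent` flag is checked before the `NA` filtering, the species level is reported as `Incongruent [d__Bacteria]` even though `s__RUG334` would reach 100% agreement on its own.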