Taxonomic profiling using mOTUs

Tutorials

Generate mOTU.v1.padded database

This tutorial describes how to generate the mOTU.v1.padded database. The first steps are to download and install the required databases and software, which are bundled in a single tar.gz archive. The package contains the following:

Software:

FetchMG

cdbyank, cdbfasta

HMMer3

Databases:

3496 Reference genomes

263 Metagenomes

All commands listed below (from step 3 onwards) are executed from within the folder in which the mOTU.tar.gz file was extracted.

Download & Installation

1. Download the Tutorial Package

2. Extract the archive:

> tar -zxvf mOTU.tar.gz

3. Run the installation script to install and setup the software:

> ./setup.sh

Extract MGs from Genomes

4. To extract marker genes (MGs) from the provided 3496 genomes:

> software/fetchMG/FetchMGs.v1.pl -m extraction -o fetched_MGs_from_ReferenceGenomes -t 1 -x software/fetchMG/bin databases/ReferenceGenomes.3496.faa

The '-t' option sets the number of CPU cores to use and can be raised from 1. We recommend using more CPU cores, if available.
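A minimal sketch of choosing the '-t' value automatically, assuming a system where `getconf _NPROCESSORS_ONLN` reports the number of online cores (Linux and macOS generally do). The helper name `detect_threads` is ours and not part of fetchMG. The command is only printed, so it can be inspected before it is run.

```shell
# Print the number of online CPU cores, or 1 if it cannot be determined.
# detect_threads is a hypothetical helper, not shipped with the tutorial package.
detect_threads() {
    local n
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    # Fall back to 1 if the value is empty, non-numeric, or zero.
    case "$n" in
        ''|*[!0-9]*|0) n=1 ;;
    esac
    echo "$n"
}

threads=$(detect_threads)

# Print the step 4 command with the detected core count; remove "echo" to run it.
echo software/fetchMG/FetchMGs.v1.pl -m extraction \
    -o fetched_MGs_from_ReferenceGenomes -t "$threads" \
    -x software/fetchMG/bin databases/ReferenceGenomes.3496.faa
```

On a shared machine it may be preferable to set the value by hand rather than take every available core.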

Extract MGs from Metagenomes

5. To extract marker genes from the provided 263 metagenomes:

> software/fetchMG/FetchMGs.v1.pl -m extraction -o fetched_MGs_from_MetaGenomes -t 1 -x software/fetchMG/bin databases/MetaGenomes.complete.263.faa

As in step 4, the '-t' option can be changed from 1 to the desired number of CPU cores. We recommend using more CPU cores, if available.

Cluster Sequences

6. Download and install USEARCH. We used version 4.1.93 (32-bit). Because USEARCH requires a licence key, it could not be shipped with the tutorial or fetchMG packages.

7. Cluster the genes at 100% identity to remove redundant genes (see the example USEARCH command below).

8. The sequences for each marker gene are then clustered using USEARCH, with the MG-specific cutoffs specified in the Supplementary Tables. Example commands:

> usearch --cluster <input> --id <MG specific cutoff> --uc <UCLUST output file> --usersort --seedsout <seed output file>
> usearch --uc2clstr <UCLUST output file> --output <output file>
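Because the two commands above are repeated for every marker gene, they can be generated from a small table. The sketch below only prints the commands. The table layout (one "marker gene, cutoff" pair per line), the input naming `<marker gene>.fna` and the output file names are all assumptions made for illustration, and the cutoffs must be taken from the Supplementary Tables.

```shell
# Print the two usearch commands from step 8 for every marker gene in a table.
# $1: whitespace-separated table, "<marker gene> <identity cutoff>" per line
#     (hypothetical layout; lines starting with '#' are skipped).
# $2: directory with one <marker gene>.fna input file per gene (assumed naming).
print_cluster_commands() {
    local table=$1 dir=$2 mg cutoff
    while read -r mg cutoff || [ -n "$mg" ]; do
        case "$mg" in
            ''|'#'*) continue ;;
        esac
        echo "usearch --cluster $dir/$mg.fna --id $cutoff --uc $mg.uc --usersort --seedsout $mg.seeds.fna"
        echo "usearch --uc2clstr $mg.uc --output $mg.clstr"
    done < "$table"
}

# Usage (cutoffs.tsv is a hypothetical file holding the published cutoffs):
#   print_cluster_commands cutoffs.tsv input_dir > cluster_commands.sh
```

Reviewing the generated file before executing it keeps a typo in the table from silently clustering a gene at the wrong cutoff. Under the same assumptions, a table with a cutoff of 1.0 for every gene would produce the 100% identity commands of step 7.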

Add ±100 bp

9. The resulting seed sequences are then extended (where possible) by 100 base pairs at each end, using nucleotides extracted from the original assembled sequences of the 263 metagenomes and the 3496 genomes. We call these sequences padded mOTU sequences.
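The coordinate arithmetic behind the padding can be sketched as follows. This assumes 1-based inclusive gene coordinates and reads "when possible" as clipping the padded range at the ends of the source sequence; the tutorial does not state which convention was actually used. `padded_coords` is a hypothetical helper.

```shell
# Print "<start> <end>" for a gene padded on both sides and clipped to the
# bounds of its source sequence. Coordinates are 1-based and inclusive.
# $1: gene start, $2: gene end, $3: length of the source contig or genome
# sequence, $4: padding in bases (optional, defaults to 100).
padded_coords() {
    local pad=${4:-100}
    local start=$(( $1 - pad ))
    local end=$(( $2 + pad ))
    # Clip at the sequence boundaries (assumed meaning of "when possible").
    if [ "$start" -lt 1 ]; then start=1; fi
    if [ "$end" -gt "$3" ]; then end=$3; fi
    echo "$start $end"
}

# Usage:
#   padded_coords 250 1150 5000   # prints "150 1250"
#   padded_coords 40 900 950      # prints "1 950"
```

In the second call the gene lies within 100 bases of both ends of the sequence, so the padded range is clipped to the full sequence.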

10. In the final step, the padded mOTU sequences of 10 of the 40 marker genes are merged into the 'mOTU.nr.padded' file.
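The merge itself amounts to concatenating the per-gene FASTA files. A minimal sketch, assuming one padded FASTA file per marker gene (the input file names are hypothetical), with an added check of our own that no sequence identifier occurs twice in the merged file:

```shell
# Concatenate per-marker-gene FASTA files into a single database file.
# $1: output file; remaining arguments: input FASTA files (hypothetical names).
# Returns non-zero if any FASTA header line occurs more than once.
merge_padded() {
    local out=$1 dups
    shift
    cat "$@" > "$out" || return 1
    # List header lines that appear more than once in the merged file.
    dups=$(grep '^>' "$out" | sort | uniq -d)
    if [ -n "$dups" ]; then
        echo "duplicate sequence identifiers:" >&2
        echo "$dups" >&2
        return 1
    fi
}

# Usage (input names are placeholders for the 10 per-gene padded files):
#   merge_padded mOTU.nr.padded MG_A.padded.fna MG_B.padded.fna
```

The duplicate check is not part of the described procedure; it is included because repeated identifiers tend to cause trouble for tools that index a FASTA file by sequence name.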