mOTUs-db - Downloads

All data generated to build the mOTUs-db is, or will gradually be made, publicly available on standardized data repositories such as the European Nucleotide Archive. This page describes the content of the downloadable files and how best to access the resource beyond the website.

Metadata

Information on source data used to build the mOTUs-db can be found at Zenodo and includes:

Map between genomes and their source sample

A map between each of the genomes in mOTUs-db (3’747’151), the associated study and, in the case of MAGs, the associated metagenomic sample.

Columns:

GENOME             → Unique mOTUs-db genome identifier
STUDY              → Unique mOTUs-db study identifier
IS_MAG             → True if genome is a MAG, otherwise False
METAGENOMIC_SAMPLE → Unique mOTUs-db metagenomic sample identifier or NA in case of non-MAG genomes

Example:

GENOME                                     STUDY        IS_MAG    METAGENOMIC_SAMPLE
---------------------------------------------------------------------------------------------
ACIN21-1_SAMN05421555_MAG_00000001         ACIN21-1     True      ACIN21-1_SAMN05421555_METAG
RSGB23-1_GCA-006096615-V1_GENO_10000001    RSGB23-1     False     NA
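
As a minimal sketch, the genome map could be loaded with Python's csv module. The tab delimiter and the single header row are assumptions about the downloaded file, and the function name read_genome_map is hypothetical; the sample text reuses the two example rows above.

```python
import csv
import io

# Assumption: the file is tab-separated with one header row.
EXAMPLE = (
    "GENOME\tSTUDY\tIS_MAG\tMETAGENOMIC_SAMPLE\n"
    "ACIN21-1_SAMN05421555_MAG_00000001\tACIN21-1\tTrue\tACIN21-1_SAMN05421555_METAG\n"
    "RSGB23-1_GCA-006096615-V1_GENO_10000001\tRSGB23-1\tFalse\tNA\n"
)

def read_genome_map(handle):
    """Yield one dict per genome, converting IS_MAG to bool and NA to None."""
    for row in csv.DictReader(handle, delimiter="\t"):
        row["IS_MAG"] = row["IS_MAG"] == "True"
        if row["METAGENOMIC_SAMPLE"] == "NA":
            row["METAGENOMIC_SAMPLE"] = None
        yield row

rows = list(read_genome_map(io.StringIO(EXAMPLE)))
# Keep only the identifiers of metagenome-assembled genomes.
mags = [row["GENOME"] for row in rows if row["IS_MAG"]]
print(mags)
```

For the real file, the io.StringIO object would be replaced by an open file handle.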

Map between all non-MAGs and their source link

A map between all non-MAG genomes (919’090) and their source link (e.g. RefSeq or JGI).

Columns:

GENOME             → Unique mOTUs-db genome identifier
SOURCE_SAMPLE_LINK → Link to the original location of this genome

Example:

#GENOME                                     SOURCE_SAMPLE_LINK
--------------------------------------------------------------------------------------------------------
JGIG23-1_GA0055041_GENO_10000001            https://gold.jgi.doe.gov/analysis_project?id=Ga0055041
RSGB23-1_GCA-006717865-V1_GENO_10000001     https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_006717865.1
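
The source links can be summarised by host to see where the non-MAG genomes come from. This is a sketch under the assumption that the file is tab-separated and that its header starts with "#" as in the example; count_hosts is a hypothetical helper name.

```python
import csv
import io
from collections import Counter
from urllib.parse import urlparse

# Assumption: tab-separated file whose header line starts with '#'.
EXAMPLE = (
    "#GENOME\tSOURCE_SAMPLE_LINK\n"
    "JGIG23-1_GA0055041_GENO_10000001\t"
    "https://gold.jgi.doe.gov/analysis_project?id=Ga0055041\n"
    "RSGB23-1_GCA-006717865-V1_GENO_10000001\t"
    "https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_006717865.1\n"
)

def count_hosts(handle):
    """Count genomes per host name of their source link."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    header[0] = header[0].lstrip("#")  # '#GENOME' becomes 'GENOME'
    link_col = header.index("SOURCE_SAMPLE_LINK")
    return Counter(urlparse(row[link_col]).netloc for row in reader if row)

hosts = count_hosts(io.StringIO(EXAMPLE))
print(hosts)
```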

Overview of metagenomic studies used

A list of all metagenomic studies processed for the mOTUs-db, their number of samples, the number of reconstructed MAGs, and the associated publication.

Columns:

STUDY       → Unique mOTUs-db study identifier
BIOPROJECT  → Public identifier of metagenomic sequencing project
SAMPLES     → Number of metagenomic samples
MAGs        → Number of reconstructed MAGs
PUBLICATION → Link to publication

Example:

STUDY        BIOPROJECT    SAMPLES    MAGs     PUBLICATION
-------------------------------------------------------------------------------------------------
ACIN21-1     PRJEB44456    58         1,110    https://www.nature.com/articles/s42003-021-02112-2
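
Counts in this table can contain a thousands separator (1,110 in the example), so they need cleaning before any arithmetic. The sketch below strips separators and computes MAGs per sample for the example row; parse_count is a hypothetical helper, and handling the apostrophe separator is an assumption based on the number style used elsewhere on this page.

```python
import re

def parse_count(text):
    """Convert a count such as '1,110', '3’747’151' or '58' to an int."""
    return int(re.sub(r"[,'’]", "", text.strip()))

# Example row from the table above, already split into its five columns.
study, bioproject, samples, mags, publication = (
    "ACIN21-1",
    "PRJEB44456",
    "58",
    "1,110",
    "https://www.nature.com/articles/s42003-021-02112-2",
)

mags_per_sample = parse_count(mags) / parse_count(samples)
print(f"{study}: {mags_per_sample:.1f} MAGs per sample")
```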

Environment information of metagenomic samples

Mapping between the mOTUs-db sample identifier, the associated biosample, and the environment.

Columns:

SAMPLE             → Unique mOTUs-db sample identifier
BIOSAMPLE          → Public identifier of metagenomic sample
STUDY              → Unique mOTUs-db study identifier
ENVIRONMENT        → Environment of metagenomic sample
SOURCE_SAMPLE_LINK → Link to the original location of this sample

Example:

#SAMPLE                      BIOSAMPLE     STUDY    ENVIRONMENT                             SOURCE_SAMPLE_LINK
------------------------------------------------------------------------------------------------------------------------------------------------
ACIN21-1_SAMN05421555_METAG  SAMN05421555  ACIN21-1 marine metagenome, seawater metagenome  https://www.ncbi.nlm.nih.gov/biosample/SAMN05421555/
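
In the example row the ENVIRONMENT column holds two terms separated by a comma. Assuming that multiple environments are always listed this way (an inference from the single example, not a documented rule), the field could be split as follows; split_environments is a hypothetical helper.

```python
def split_environments(field):
    """Split an ENVIRONMENT value into its individual terms."""
    return [term.strip() for term in field.split(",") if term.strip()]

# ENVIRONMENT value taken from the example row above.
envs = split_environments("marine metagenome, seawater metagenome")
print(envs)
```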

NCBI taxonomy information of environments

A list of environments covered in the mOTUs-db mapped to the respective NCBI taxonomic identifier (if available).

Columns:

TERM             → Unique environment name
NCBI TAXONOMY ID → Link to the NCBI taxonomic identifier

Example:

TERM                          NCBI TAXONOMY ID
----------------------------------------------
activated sludge metagenome   NCBI:txid942017
air metagenome                NCBI:txid655179
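
To use the identifiers programmatically, the numeric part can be extracted from the NCBI:txid prefix. This is a sketch; parse_taxid is a hypothetical helper, and returning None for any value that does not match is an assumption, since the page does not say how unavailable identifiers are written.

```python
import re

TXID = re.compile(r"NCBI:txid(\d+)")

def parse_taxid(value):
    """Return the numeric taxonomy ID, or None if the value has another form."""
    match = TXID.fullmatch(value.strip())
    return int(match.group(1)) if match else None

# Terms and identifiers taken from the example rows above.
taxids = {
    "activated sludge metagenome": parse_taxid("NCBI:txid942017"),
    "air metagenome": parse_taxid("NCBI:txid655179"),
}
print(taxids)
```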

Genomes

Bulk download of genome sequences via the browser is limited to a maximum of 200 genomes. Larger batches of genome sequences can be downloaded in one of the following ways:

1) All genomes within mOTUs-db can be downloaded as a tar file (warning: this file is 2.7 TB).

2) By using the mOTUs tool for more fine-grained access. Tool installation is described on the official page.
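
Before starting the full download from option 1, it may be worth checking that the target filesystem has enough free space. The sketch below uses the standard library only; it assumes decimal terabytes, has_room is a hypothetical helper, and unpacking the archive would need additional space on top of the archive itself.

```python
import shutil

# 2.7 TB as stated above, assuming decimal terabytes (10**12 bytes).
ARCHIVE_BYTES = 2_700_000_000_000

def has_room(path, required_bytes=ARCHIVE_BYTES):
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes

print(has_room("."))
```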

How to download genomes that correspond to a specific mOTU or specific taxonomic clade.

1) Navigate to the folder in which the motus.py executable is located (or provide the full path to the executable instead of motus.py).

2) Run the following command:

python motus.py download -w [insert keyword] -s [insert file name] -o [insert folder name]

Here, -w is followed by the keyword used to select genomes, -s by the output file for the metadata of the downloaded genomes, and -o by the folder in which the downloaded FASTA files will be stored. The keyword can refer to a specific mOTU or a specific taxonomic clade.

For example, running the command:

python motus.py download -s Angelakisella.genomes -w Angelakisella -o Angelakisella_genomes_folder/

should yield the following progress report:

mOTU tool starting
Loading database ...
Initialising the mOTUs search database.
Finished initialising the mOTUs search database. Found 124288 mOTUs, 3747151 genomes and 82452 taxonomy search words.
Searching for keyword: Angelakisella.
Found: 5485 hits.
Found 5485 genomes. Writing genome information to Angelakisella.genomes
Finished writing genome information to Angelakisella.genomes
Downloading genomes to Angelakisella_genomes_folder
Downloading genome (1 / 5485) ANDE20-1_SAMEA4688840_MAG_00000115 to Angelakisella_genomes_folder/ANDE20-1_SAMEA4688840_MAG_00000115.fa.gz
...
Downloading genome (5484 / 5485) ZHUJ18-1_SAMN08993540_MAG_00000012 to Angelakisella_genomes_folder/ZHUJ18-1_SAMN08993540_MAG_00000012.fa.gz
Downloading genome (5485 / 5485) ZHUJ18-1_SAMN08993547_MAG_00000060 to Angelakisella_genomes_folder/ZHUJ18-1_SAMN08993547_MAG_00000060.fa.gz
Finished downloading genomes
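
If the progress report is captured to a file, the download lines can be parsed to track which genomes were fetched. The sketch below assumes the line format shown in the report above, which may differ between tool versions; parse_progress is a hypothetical helper.

```python
import re

# Pattern derived from the example progress report; an assumption, not a spec.
LINE = re.compile(r"Downloading genome \((\d+) / (\d+)\) (\S+) to (\S+)")

def parse_progress(line):
    """Return (index, total, genome, path) for a download line, else None."""
    match = LINE.match(line)
    if match is None:
        return None
    index, total, genome, path = match.groups()
    return int(index), int(total), genome, path

example = (
    "Downloading genome (5485 / 5485) ZHUJ18-1_SAMN08993547_MAG_00000060 "
    "to Angelakisella_genomes_folder/ZHUJ18-1_SAMN08993547_MAG_00000060.fa.gz"
)
result = parse_progress(example)
print(result)
```

A download could be treated as complete when the last parsed index equals the total.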

In order to list the genomes without downloading the corresponding sequences, add -l to the command, for example:

python motus.py download -s Angelakisella.genomes -w Angelakisella -l

In order to download only representative genomes, add -r to the command, for example:

python motus.py download -s Angelakisella.representative.genomes -w Angelakisella -o Angelakisella_representative_genomes_folder/ -r
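
When many keywords need to be processed, the commands above can be assembled in a script. The sketch below only builds the argument list, using the flags described on this page (-w, -s, -o, -l, -r); build_download_command is a hypothetical helper, and whether -l and -r can be combined is not stated here.

```python
def build_download_command(keyword, metadata_file, out_dir=None,
                           list_only=False, representative=False,
                           motus_script="motus.py"):
    """Build the argument list for 'motus.py download'."""
    cmd = ["python", motus_script, "download",
           "-w", keyword, "-s", metadata_file]
    if list_only:
        cmd.append("-l")  # list genomes without downloading sequences
    elif out_dir is None:
        raise ValueError("out_dir is required unless list_only is set")
    else:
        cmd += ["-o", out_dir]  # folder for the downloaded FASTA files
    if representative:
        cmd.append("-r")  # restrict to representative genomes
    return cmd

cmd = build_download_command("Angelakisella", "Angelakisella.genomes",
                             list_only=True)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where the mOTUs tool is installed.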

Assemblies

Assemblies (and MAGs) are gradually uploaded to ENA and made public after they have been accessioned. For a resource of this size (>500 bioprojects, >100k samples, ~3M biosamples, ~10 TB of data), this process can take time.

A changelog with the current upload status of assemblies and MAGs can be found below:

Sept 2024

The first set of 9'970 assemblies has been uploaded and will become public once accessioned by ENA. A summary table with bioprojects and sample identifiers can be downloaded here.