All data generated to build the mOTUs-db is, or will gradually be made, publicly available on standardized data repositories such as the European Nucleotide Archive. This page describes the content of the downloadable files and how best to access the resource beyond the website.
Information on source data used to build the mOTUs-db can be found at Zenodo and includes:
A map between each of the genomes in mOTUs-db (3’747’151), the associated study and, in the case of MAGs, the associated metagenomic sample.
Columns:
GENOME → Unique mOTUs-db genome identifier
STUDY → Unique mOTUs-db study identifier
IS_MAG → True if genome is a MAG, otherwise False
METAGENOMIC_SAMPLE → Unique mOTUs-db metagenomic sample identifier or NA in case of non-MAG genomes
Example:
GENOME STUDY IS_MAG METAGENOMIC_SAMPLE
---------------------------------------------------------------------------------------------
ACIN21-1_SAMN05421555_MAG_00000001 ACIN21-1 True ACIN21-1_SAMN05421555_METAG
RSGB23-1_GCA-006096615-V1_GENO_10000001 RSGB23-1 False NA
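The genome map can be loaded with a few lines of Python. This is a minimal sketch, assuming the file is tab-separated with a single header line and no dashed separator row (the dashes above are taken to be page layout only). The function name read_genome_map is hypothetical, and the inline rows are the two example rows shown above.

```python
import csv
import io
from collections import Counter


def read_genome_map(handle):
    """Yield one dict per genome from a (assumed) tab-separated genome map."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # IS_MAG is stored as the text "True" or "False"; convert to a boolean.
        row["IS_MAG"] = row["IS_MAG"] == "True"
        # Non-MAG genomes carry NA instead of a metagenomic sample identifier.
        if row["METAGENOMIC_SAMPLE"] == "NA":
            row["METAGENOMIC_SAMPLE"] = None
        yield row


# The two example rows from the table above, joined with tabs (assumption).
example = (
    "GENOME\tSTUDY\tIS_MAG\tMETAGENOMIC_SAMPLE\n"
    "ACIN21-1_SAMN05421555_MAG_00000001\tACIN21-1\tTrue\t"
    "ACIN21-1_SAMN05421555_METAG\n"
    "RSGB23-1_GCA-006096615-V1_GENO_10000001\tRSGB23-1\tFalse\tNA\n"
)

rows = list(read_genome_map(io.StringIO(example)))

# Count MAGs per study, ignoring non-MAG genomes.
mags_per_study = Counter(row["STUDY"] for row in rows if row["IS_MAG"])
print(mags_per_study)
```

Because the full file holds several million rows, iterating over the generator instead of building a list keeps memory use low.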
A map between all non-MAG genomes (919’090) and their source link (e.g. Refseq or JGI).
Columns:
GENOME → Unique mOTUs-db genome identifier
SOURCE_SAMPLE_LINK → Link to the original location of this genome
Example:
#GENOME SOURCE_SAMPLE_LINK
--------------------------------------------------------------------------------------------------------
JGIG23-1_GA0055041_GENO_10000001 https://gold.jgi.doe.gov/analysis_project?id=Ga0055041
RSGB23-1_GCA-006717865-V1_GENO_10000001 https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_006717865.1
A list of all metagenomic studies processed for the mOTUs-db, their number of samples, the number of reconstructed MAGs, and the associated publication.
Columns:
STUDY → Unique mOTUs-db study identifier
BIOPROJECT → Public identifier of metagenomic sequencing project
SAMPLES → Number of metagenomic samples
MAGs → Number of reconstructed MAGs
PUBLICATION → Link to publication
Example:
STUDY BIOPROJECT SAMPLES MAGs PUBLICATION
-------------------------------------------------------------------------------------------------
ACIN21-1 PRJEB44456 58 1,110 https://www.nature.com/articles/s42003-021-02112-2
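Note that the MAGs value in the example row is written with a thousands separator (1,110), so a plain integer conversion would fail. A small helper such as the following can be used. It is a sketch only: the name parse_count is hypothetical, and whether the downloadable file itself contains separators is an assumption based on the example above.

```python
def parse_count(text):
    """Convert a count such as '1,110' or '58' into an integer."""
    cleaned = text.strip()
    # Remove comma and apostrophe thousands separators before converting.
    for separator in (",", "'", "\u2019"):
        cleaned = cleaned.replace(separator, "")
    return int(cleaned)


samples = parse_count("58")
mags = parse_count("1,110")
print(samples, mags)
```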
Mapping between the mOTUs-db sample identifier, the associated biosample, and the environment.
Columns:
SAMPLE → Unique mOTUs-db sample identifier
BIOSAMPLE → Public identifier of metagenomic sample
STUDY → Unique mOTUs-db study identifier
ENVIRONMENT → Environment of metagenomic sample
SOURCE_SAMPLE_LINK → Link to the original location of this sample
Example:
#SAMPLE BIOSAMPLE STUDY ENVIRONMENT SOURCE_SAMPLE_LINK
------------------------------------------------------------------------------------------------------------------------------------------------
ACIN21-1_SAMN05421555_METAG SAMN05421555 ACIN21-1 marine metagenome, seawater metagenome https://www.ncbi.nlm.nih.gov/biosample/SAMN05421555/
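The SAMPLE column uses the same identifiers as the METAGENOMIC_SAMPLE column of the genome map, so the two tables can be joined to look up the environment of a MAG. The sketch below assumes a tab-separated file whose header may start with a # character, and it assumes that multiple environment terms are separated by a comma and a space, as the example row suggests. The helper name read_table is hypothetical.

```python
import csv
import io


def read_table(handle):
    """Read a (assumed) tab-separated table, dropping a leading '#' in the header."""
    reader = csv.reader(handle, delimiter="\t")
    header = [name.lstrip("#") for name in next(reader)]
    return [dict(zip(header, fields)) for fields in reader]


# The example row from the table above, joined with tabs (assumption).
example = (
    "#SAMPLE\tBIOSAMPLE\tSTUDY\tENVIRONMENT\tSOURCE_SAMPLE_LINK\n"
    "ACIN21-1_SAMN05421555_METAG\tSAMN05421555\tACIN21-1\t"
    "marine metagenome, seawater metagenome\t"
    "https://www.ncbi.nlm.nih.gov/biosample/SAMN05421555/\n"
)

samples = read_table(io.StringIO(example))

# Map each sample identifier to its list of environment terms.
environment_by_sample = {
    row["SAMPLE"]: row["ENVIRONMENT"].split(", ") for row in samples
}
print(environment_by_sample["ACIN21-1_SAMN05421555_METAG"])
```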
A list of environments covered in the mOTUs-db mapped to the respective NCBI taxonomic identifier (if available).
Columns:
TERM → Unique environment name
NCBI TAXONOMY ID → Link to the NCBI taxonomic identifier
Example:
TERM NCBI TAXONOMY ID
----------------------------------------------
activated sludge metagenome NCBI:txid942017
air metagenome NCBI:txid655179
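For programmatic use, the numeric identifier can be extracted from values such as NCBI:txid942017. This is a minimal sketch. The function name parse_taxid is hypothetical, and because the page does not state how an unavailable identifier is written, any value that does not match the pattern is returned as None.

```python
import re

# Matches values of the form shown in the example, e.g. "NCBI:txid942017".
TAXID_PATTERN = re.compile(r"^NCBI:txid(\d+)$")


def parse_taxid(value):
    """Return the numeric NCBI taxonomy ID, or None if the value does not match."""
    match = TAXID_PATTERN.match(value.strip())
    return int(match.group(1)) if match else None


print(parse_taxid("NCBI:txid942017"))
print(parse_taxid("NCBI:txid655179"))
```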
Bulk download of genome sequences via the browser is limited to a maximum of 200 genomes. Larger batches of genome sequences can be downloaded in one of the following ways:
1) All genomes within mOTUs-db can be downloaded as a tar file (warning: this file is 2.7 TB).
2) By using the mOTUs tool for more fine-grained access. Tool installation is described on the official page.
1) Navigate to the folder in which the motus.py executable is located (or provide the full path to the executable instead of motus.py).
2) Run the following command:
python motus.py download -w [insert keyword] -s [insert file name] -o [insert folder name]
Where -w is followed by a keyword that is used to select genomes, -s is followed by the output file for downloaded genome metadata, and -o is followed by the folder in which the downloaded FASTA files will be stored. The keyword for selecting genomes can be:
A mOTU identifier (e.g. mOTUv4.0_000000)
A genome identifier (e.g. ARTA20-1_SAMN17006400_MAG_00000010)
A taxonomic name (e.g. Faecalibacterium)
For example, running the command:
python motus.py download -s Angelakisella.genomes -w Angelakisella -o Angelakisella_genomes_folder/
should yield the following progress report:
mOTU tool starting
Loading database ...
Initialising the mOTUs search database.
Finished initialising the mOTUs search database. Found 124288 mOTUs, 3747151 genomes and 82452 taxonomy search words.
Searching for keyword: Angelakisella.
Found: 5485 hits.
Found 5485 genomes. Writing genome information to Angelakisella.genomes
Finished writing genome information to Angelakisella.genomes
Downloading genomes to Angelakisella_genomes_folder
Downloading genome (1 / 5485) ANDE20-1_SAMEA4688840_MAG_00000115 to Angelakisella_genomes_folder/ANDE20-1_SAMEA4688840_MAG_00000115.fa.gz
...
Downloading genome (5484 / 5485) ZHUJ18-1_SAMN08993540_MAG_00000012 to Angelakisella_genomes_folder/ZHUJ18-1_SAMN08993540_MAG_00000012.fa.gz
Downloading genome (5485 / 5485) ZHUJ18-1_SAMN08993547_MAG_00000060 to Angelakisella_genomes_folder/ZHUJ18-1_SAMN08993547_MAG_00000060.fa.gz
Finished downloading genomes
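After a long download it can be useful to confirm that the number of files in the output folder matches the number of hits reported. The following is a sketch only, assuming one .fa.gz file per genome is written directly into the output folder, as the progress report suggests. The function names are hypothetical.

```python
from pathlib import Path


def count_downloaded_genomes(folder):
    """Return the number of *.fa.gz files directly inside the folder."""
    return sum(1 for _ in Path(folder).glob("*.fa.gz"))


def download_is_complete(folder, expected):
    """Return True if the folder holds exactly the expected number of genomes."""
    return count_downloaded_genomes(folder) == expected


# Usage with the values from the progress report above:
# download_is_complete("Angelakisella_genomes_folder", 5485)
```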
To list the genomes without downloading the corresponding sequences, add -l to the command, for example:
python motus.py download -s Angelakisella.genomes -w Angelakisella -l
To download only representative genomes, add -r to the command, for example:
python motus.py download -s Angelakisella.representative.genomes -w Angelakisella -o Angelakisella_representative_genomes_folder/ -r
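When many keywords need to be processed, the commands above can be assembled from a script. The sketch below only builds the argument list from the flags described on this page (-w, -s, -o, -l, -r). The function build_download_command and its parameter names are hypothetical, and omitting -o when -l is given is an assumption based on the listing example above.

```python
def build_download_command(keyword, metadata_file, output_folder=None,
                           list_only=False, representatives_only=False,
                           executable="motus.py"):
    """Build the argument list for a 'motus.py download' call."""
    command = ["python", executable, "download",
               "-w", keyword, "-s", metadata_file]
    if list_only:
        # -l lists the genomes without downloading sequences.
        command.append("-l")
    else:
        if output_folder is None:
            raise ValueError("output_folder is required unless list_only is set")
        command += ["-o", output_folder]
    if representatives_only:
        # -r restricts the selection to representative genomes.
        command.append("-r")
    return command


command = build_download_command(
    "Angelakisella",
    "Angelakisella.representative.genomes",
    output_folder="Angelakisella_representative_genomes_folder/",
    representatives_only=True,
)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) from the standard library to execute the download.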
Assemblies (and MAGs) are gradually uploaded to ENA and made public after they have been accessioned. For resources of this size (>500 bioprojects, >100k samples, ~3m biosamples, ~10TB data), this process can take time.
A changelog with the current upload status of assemblies and MAGs can be found below:
The first set of 9'970 assemblies has been uploaded and will become public once accessioned by ENA. A summary table with bioprojects and sample identifiers can be downloaded here.