The graphical user interface allows for easy and fast inspection of individual genomes, associated annotations, studies and samples. Access to multiple datasets is also possible by using the FTP data backend of OMDB.
Alternatively, data can be downloaded from the command line using the OMDB links file. After downloading the file (8 MB, MD5=c1b5f14c9b7899f7300ccf41e62f8681), users have access to links to all genome and genome annotation files on OMDB.
#download the file. Either click on the link above or download with curl
$ curl -O https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/catalogs/OMDBv2.0_data.tsv.gz
$ gunzip OMDBv2.0_data.tsv.gz
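The download can be checked against the MD5 checksum quoted above. The following is a minimal Python sketch, to be run before the gunzip step. It assumes the checksum refers to the compressed file OMDBv2.0_data.tsv.gz as downloaded, which this page does not state explicitly. The function names are illustrative.

```python
import hashlib

# Checksum quoted on this page for the links file.
EXPECTED_MD5 = "c1b5f14c9b7899f7300ccf41e62f8681"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Compare the file's digest with the expected one, ignoring case."""
    return md5_of_file(path) == expected.lower()
```

Calling matches_expected("OMDBv2.0_data.tsv.gz") returns True when the digests agree. A mismatch may indicate an incomplete download, or that the checksum was computed on the decompressed file instead.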
The file has one line per genome and contains public links to the OMDB data:
Example:
GENOME: GARB21-1_SAMN12799101_MAG_00000001
SAMPLE: GARB21-1_SAMN12799101_METAG
STUDY: GARB21-1
GENOME_FILE: https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001.fa.gz
GENES_NT_FILE: https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001.genes.fna.gz
GENES_AA_FILE: https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001.genes.faa.gz
GENES_GFF_FILE: https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001.genes.gff.gz
ANTISMASH_FILE: https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001-antismash.tar.gz
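The record above can also be read programmatically. The sketch below loads the decompressed links file into a dictionary keyed by genome name. It assumes that the file is tab-separated with the eight fields in the order shown in the example, and it skips a header row if one is present (whether the file has a header is not stated here). The names COLUMNS and read_links are illustrative.

```python
import csv

# Field order as shown in the example record; assumed to hold for every row.
COLUMNS = [
    "GENOME", "SAMPLE", "STUDY", "GENOME_FILE", "GENES_NT_FILE",
    "GENES_AA_FILE", "GENES_GFF_FILE", "ANTISMASH_FILE",
]


def read_links(path):
    """Map each genome name to a dict holding the fields of its row."""
    records = {}
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Skip blank lines and a header row, if the file has one.
            if not row or row[0] == "GENOME":
                continue
            record = dict(zip(COLUMNS, row))
            records[record["GENOME"]] = record
    return records
```

For example, read_links("OMDBv2.0_data.tsv")["GARB21-1_SAMN12799101_MAG_00000001"]["ANTISMASH_FILE"] would return the antiSMASH link of that genome under these assumptions.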
These links can be downloaded with curl or wget.
For example, to download the antiSMASH file of the genome GARB21-1_SAMN12799101_MAG_00000001:
#look up the link in column 8 (ANTISMASH_FILE) and pass it to curl
$ curl -O "$(grep "GARB21-1_SAMN12799101_MAG_00000001" OMDBv2.0_data.tsv | cut -f 8)"
#the command substitution expands to the link, so this is equivalent to
$ curl -O https://sunagawalab.ethz.ch/share/microbiomics/ocean/db/2.0/data/genomes/genomes/GARB21-1/GARB21-1_SAMN12799101_METAG/GARB21-1_SAMN12799101_MAG_00000001/GARB21-1_SAMN12799101_MAG_00000001-antismash.tar.gz
Alternatively, use download.file in R or the requests module in Python to download data in a more systematic way.
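As one possible way to do this in Python, the sketch below collects the links of one column for all genomes of a study and downloads them one by one. It uses urllib.request from the standard library instead of requests, purely to avoid a third-party dependency. It assumes the tab-separated field order of the example record (STUDY in the third field, GENOME_FILE in the fourth). The function names are illustrative.

```python
import csv
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def links_for_study(tsv_path, study, column_index=3):
    """Collect the links in one column (0-based index) for a given study.

    Index 2 is assumed to hold STUDY and index 3 GENOME_FILE, following
    the field order of the example record.
    """
    links = []
    with open(tsv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) > column_index and row[2] == study:
                links.append(row[column_index])
    return links


def local_name(url):
    """Derive the local file name from the last path component of a URL."""
    return os.path.basename(urlparse(url).path)


def download(url, out_dir="."):
    """Download one URL into out_dir, skipping files that already exist."""
    os.makedirs(out_dir, exist_ok=True)
    target = os.path.join(out_dir, local_name(url))
    if not os.path.exists(target):
        partial = target + ".part"  # write to a temporary name first
        with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
            shutil.copyfileobj(response, out)
        os.replace(partial, target)  # rename only after a complete transfer
    return target
```

A loop such as `for url in links_for_study("OMDBv2.0_data.tsv", "GARB21-1"): download(url, "genomes")` would then fetch all genome files of the study GARB21-1. Because complete files are skipped, an interrupted run can simply be restarted.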
OMDB genomes and derived genes have been compiled into several catalogs and are released on this page:
Complete genes of all OMDB genomes were called, aggregated and clustered in nucleotide space at different levels.
Catalog | Genes | Clustering Threshold | Singletons | Sequences | Clusters |
---|---|---|---|---|---|
OMDBv2.0_NT_G_R | 508,832,278 | No clustering | 100% | Sequences - 128GB | Clusters - 5GB |
OMDBv2.0_NT_G_NR100 | 325,384,975 | 100% | 85% | Sequences - 88GB | Clusters - 4GB |
OMDBv2.0_NT_G_NR95 | 103,044,829 | 95% | 57% | Sequences - 27GB | Clusters - 3GB |
Complete genes of all OMDB genomes were called, aggregated and clustered in amino acid space at different levels.
Catalog | Genes | Clustering Threshold | Singletons | Sequences | Clusters |
---|---|---|---|---|---|
OMDBv2.0_AA_G_R | 508,832,278 | No clustering | 100% | Sequences - 88GB | Clusters - 5GB |
OMDBv2.0_AA_G_NR100 | 249,518,434 | 100% | 79% | Sequences - 46GB | Clusters - 4GB |
OMDBv2.0_AA_G_NR50 | 28,862,112 | 50% | 53% | Sequences - 4GB | Clusters - 4GB |
OMDBv2.0_AA_G_NR30 | 18,342,415 | 30% | 53% | Sequences - 2GB | Clusters - 4GB |
All OMDB genomes were compiled into a single file and dereplicated at 100%.
Catalog | Genomes | Clustering Threshold | Singletons | Sequences | Clusters |
---|---|---|---|---|---|
OMDBv2.0_SC_G_R | 69,280,421 | No clustering | 100% | Sequences - 150GB | Clusters - 1GB |
OMDBv2.0_SC_G_NR100 | 68,726,394 | 100% | 99% | Sequences - 145GB | Clusters - 1GB |
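All catalogs are named with the same structure, OMDBv2.0_XX_Y_Z, where, in the catalogs listed above, XX is NT (genes in nucleotide space), AA (genes in amino acid space) or SC (the genome catalog), Y is G, and Z is R for a redundant catalog (no clustering) or NR followed by the clustering threshold in percent (for example NR95).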
Redundant catalogs and the catalogs dereplicated at 100% were generated with custom scripts.
The OMDBv2.0_NT_G_NR95 catalog was clustered using MMseqs2 with the following parameters:
mmseqs createdb OMDBv2.0_NT_G_R.fna OMDBv2.0_NT_G_NR95.mmseqs.db --dbtype 2 --shuffle 0
mmseqs cluster OMDBv2.0_NT_G_NR95.mmseqs.db OMDBv2.0_NT_G_NR95.mmseqs.db.9590.cluster mmseqs_tmp --kmer-per-seq-scale 0 --kmer-per-seq 1000 -s 4 --max-seq-len 80000 --remove-tmp-files 0 --cluster-mode 2 --min-seq-id 0.95 --threads 96 --cov-mode 1 -c 0.9 --spaced-kmer-mode 0 --alignment-mode 3 --cluster-reassign 1
mmseqs createtsv OMDBv2.0_NT_G_NR95.mmseqs.db OMDBv2.0_NT_G_NR95.mmseqs.db OMDBv2.0_NT_G_NR95.mmseqs.db.9590.cluster OMDBv2.0_NT_G_NR95.mmseqs.9590.cluster.tsv
The OMDBv2.0_AA_G_NR50 catalog was clustered using MMseqs2 with the following parameters:
mmseqs easy-cluster OMDBv2.0_AA_G_R.faa mmseqs_dir mmseqs_tmp --min-seq-id 0.5 -c 0.9 --cov-mode 1 --threads 96
The OMDBv2.0_AA_G_NR30 catalog was clustered using MMseqs2 with the following parameters:
mmseqs easy-cluster OMDBv2.0_AA_G_R.faa mmseqs_dir mmseqs_tmp --min-seq-id 0.3 -c 0.9 --cov-mode 1 --threads 96
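Cluster tables written by mmseqs createtsv, and the cluster TSV produced by mmseqs easy-cluster, have two tab-separated columns: the cluster representative and one member per line. The sketch below summarizes such a table. Whether the downloadable Clusters files use exactly this layout is an assumption. So is the definition of the singleton share used here (clusters with a single member divided by all clusters), which may differ from the one behind the Singletons column in the tables above.

```python
from collections import Counter


def cluster_sizes(path):
    """Count members per representative in a two-column cluster TSV."""
    sizes = Counter()
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:  # ignore blank or malformed lines
                sizes[fields[0]] += 1
    return sizes


def summarize(sizes):
    """Return sequence count, cluster count and the fraction of singletons."""
    n_clusters = len(sizes)
    n_singletons = sum(1 for size in sizes.values() if size == 1)
    return {
        "sequences": sum(sizes.values()),
        "clusters": n_clusters,
        "singleton_fraction": n_singletons / n_clusters if n_clusters else 0.0,
    }
```

The counter keeps one entry per cluster in memory, so for catalogs with hundreds of millions of clusters a machine with ample RAM, or a streaming approach on a sorted file, would be needed.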