Taxonomic profiling using mOTUs

Tutorials

Generate taxonomic profiles using MGs with MOCAT

Quick guide

MOCAT v1.3 ships with a simple "runMOCAT.sh" executer, which makes it easy to generate taxonomic and mOTU profiles. This means there are now three ways of generating the profiles using MOCAT: 1. Use the runMOCAT.sh script. 2. Execute each individual MOCAT command manually. 3. Run the generic bash script described under Method 3. A detailed tutorial follows the quick guide below, and an even more detailed tutorial is also available.

METHOD 1 - runMOCAT.sh

After installing MOCAT, create a project folder and copy the MOCAT.cfg file into this folder. For each sample, place the .fq(.gz) files in a subfolder, and create a sample file containing the names of the samples you wish to process. Here is an example using the simulated metagenome with 101 species (provided with MOCAT):

> runMOCAT.sh

##############################################################################
# WELCOME TO THE MOCAT EXECUTER v1.3 #
##############################################################################

This shell script is used to execute a number of MOCAT commands in a row.
Typically this is used to process raw reads up to final taxonomic or mOTU
profiles. Of course you can process each step individually using MOCAT.pl
but we have created this software for your ease to execute these commands
with ease without prior knowledge of how to run MOCAT. Enjoy!

Usage: runMOCAT.sh [-sf SAMPLE_FILE -cfg CONFIG_FILE]

SAMPLE_FILE not specified with option -sf SAMPLE_FILE
Looking for valid sample files in the current folder:
Getting files...
Getting folders...
Processing files................

SELECT A SAMPLE FILE:
- sample

ENTER SAMPLE FILE:

--- Type 'sample' and press enter ---

Then choose a script to run. Here we choose number 3.

AVAILABLE SCRIPTS:
1: assemble_revise_predict_genes_no_hg19_screen
process raw reads, assemble, revise assembly and predict genes
2: assemble_revise_predict_genes_with_hg19_screen
process raw reads, remove human contaminants, assemble, revise assembly and predict genes
3: taxonomic_and_motu_profiles_no_hg19_screen
First process raw reads and then generate taxonomic and mOTU profiles
4: taxonomic_and_motu_profiles_with_hg19_screen
First process raw reads, remove humans reads and generate taxonomic and mOTU profiles

STEP TO EXECUTE (enter number):
--- Type '3' and press enter ---

MOCAT then starts and, if your mail client is set up correctly, you will get an email once it is done. Once the job has been submitted, you can also check its status using the listed overview and log files.
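Before launching, it can help to confirm that the project folder has the layout described for Method 1. The following is a minimal sketch and not part of MOCAT: the function names `has_reads` and `check_mocat_project` are hypothetical, and the sketch assumes that the sample file lists one sample name per line and that each sample's reads sit in a subfolder of the same name. The `-sf` and `-cfg` options in the commented example are the ones shown in the usage line of the runMOCAT.sh output above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not shipped with MOCAT).
# Assumptions: the sample file lists one sample name per line, and each
# sample has a subfolder of the same name holding .fq or .fq.gz files.

# Succeeds if the folder contains at least one .fq or .fq.gz file.
has_reads() {
    local f
    for f in "$1"/*.fq "$1"/*.fq.gz; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# Reports every problem found and returns non-zero if there was any.
check_mocat_project() {
    local project_dir=$1 sample_file=$2 status=0 sample
    if [ ! -f "$project_dir/MOCAT.cfg" ]; then
        echo "missing: MOCAT.cfg"
        status=1
    fi
    if [ ! -f "$project_dir/$sample_file" ]; then
        echo "missing: $sample_file"
        return 1
    fi
    while IFS= read -r sample || [ -n "$sample" ]; do
        [ -n "$sample" ] || continue      # skip empty lines
        if ! has_reads "$project_dir/$sample"; then
            echo "no .fq/.fq.gz files for sample: $sample"
            status=1
        fi
    done < "$project_dir/$sample_file"
    return "$status"
}

# Example, run from inside the project folder:
# check_mocat_project . sample && runMOCAT.sh -sf sample -cfg MOCAT.cfg
```

The check only prints what is missing and leaves all files untouched, so it is safe to run repeatedly while setting up a project.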

METHOD 2 - using MOCAT.pl

Executing these MOCAT commands will generate the mOTU and taxonomic profiles and save them in the RESULTS folder:

# Initial sample processing #
> MOCAT.pl -sf samples -rtf

# Generate mOTU profiles #
> MOCAT.pl -sf samples -s mOTU.v1.padded -identity 97
> MOCAT.pl -sf samples -f mOTU.v1.padded -identity 97
> MOCAT.pl -sf samples -p mOTU.v1.padded -identity 97 -mode mOTU -o RESULTS

# Generate taxonomic profiles #
> MOCAT.pl -sf samples -s RefMG.v1.padded -r mOTU.v1.padded -e -identity 97
> MOCAT.pl -sf samples -f RefMG.v1.padded -r mOTU.v1.padded -e -identity 97
> MOCAT.pl -sf samples -p RefMG.v1.padded -r mOTU.v1.padded -e -identity 97 -mode RefMG -previous_db_calc_tax_stats_file -o RESULTS

METHOD 3 - generic bash script

Here you can download a generic bash script that runs these commands for a project.
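The downloadable script itself is not reproduced here. As a rough idea of what such a wrapper might look like, the following minimal sketch chains the Method 2 commands for one sample file. It is an illustration under stated assumptions, not the script distributed with MOCAT: the names `run`, `generate_profiles`, `SAMPLE_FILE`, `IDENTITY`, `DRY_RUN`, `MOTU_DB` and `REFMG_DB` are hypothetical, and only flags that appear in the Method 2 commands are used. By default it only prints the commands; setting `DRY_RUN=0` executes them, which requires MOCAT.pl to be on the PATH.

```shell
#!/usr/bin/env bash
# Sketch of a wrapper around the Method 2 commands. This is NOT the script
# distributed with MOCAT; all function and variable names are hypothetical.
SAMPLE_FILE=${SAMPLE_FILE:-samples}   # sample file passed to -sf
IDENTITY=${IDENTITY:-97}              # identity cutoff used in every step
DRY_RUN=${DRY_RUN:-1}                 # 1 = only print commands, 0 = execute

MOTU_DB=mOTU.v1.padded
REFMG_DB=RefMG.v1.padded

# Print a command; execute it as well unless DRY_RUN is 1.
run() {
    echo "> $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

# Order: read trimming, then screen/filter/profile for the mOTU database,
# then screen/filter/profile for the RefMG database.
# The steps are chained with && so that a failing step stops the chain.
generate_profiles() {
    run MOCAT.pl -sf "$SAMPLE_FILE" -rtf &&
    run MOCAT.pl -sf "$SAMPLE_FILE" -s "$MOTU_DB" -identity "$IDENTITY" &&
    run MOCAT.pl -sf "$SAMPLE_FILE" -f "$MOTU_DB" -identity "$IDENTITY" &&
    run MOCAT.pl -sf "$SAMPLE_FILE" -p "$MOTU_DB" -identity "$IDENTITY" \
        -mode mOTU -o RESULTS &&
    run MOCAT.pl -sf "$SAMPLE_FILE" -s "$REFMG_DB" -r "$MOTU_DB" -e -identity "$IDENTITY" &&
    run MOCAT.pl -sf "$SAMPLE_FILE" -f "$REFMG_DB" -r "$MOTU_DB" -e -identity "$IDENTITY" &&
    run MOCAT.pl -sf "$SAMPLE_FILE" -p "$REFMG_DB" -r "$MOTU_DB" -e -identity "$IDENTITY" \
        -mode RefMG -previous_db_calc_tax_stats_file -o RESULTS
}

generate_profiles
```

Printing by default makes it possible to review the exact command lines before committing compute time to them.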

Introduction

Taxonomic profiling with mOTUs is most easily done using MOCAT (or the stand-alone version). It is also possible to download the databases manually, map metagenomic reads to them, and use your own customized scripts to summarize the abundances. However, we recommend using MOCAT, which can generate two types of taxonomic profiles: one using the mOTU.v1.padded database, and another using the RefMG.v1.padded database.

What does 'padded' mean in the database name? 'padded' means that the marker gene sequences in the database have been extended by up to 100 bp at each end. This ensures that reads close to the ends of the gene sequences also map to the genes. The '.coord' file records where each gene starts and ends, and this file is parsed when calculating the exact coverages.
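To make the role of the coordinates concrete, here is a small sketch of how the gene length can be recovered from a padded sequence. The layout of the '.coord' file is not described in this tutorial, so the three-column form used here (sequence name, gene start, gene end, tab-separated, 1-based and inclusive) is an assumption for illustration only. The two example records are invented, and `gene_lengths` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# ASSUMPTION: each record is "name<TAB>gene_start<TAB>gene_end", with 1-based
# inclusive positions inside the padded sequence. The real .coord layout may
# differ, and the two records below are invented for illustration.

# Print "name<TAB>gene_length" for each record read from files or stdin.
gene_lengths() {
    awk -F '\t' '{ printf "%s\t%d\n", $1, $3 - $2 + 1 }' "$@"
}

# geneA has the full 100 bp of padding in front (gene starts at 101);
# geneB has only 50 bp (gene starts at 51), since padding is "up to" 100 bp.
printf 'geneA\t101\t1000\ngeneB\t51\t650\n' | gene_lengths
```

Under these assumptions, a read that aligns partly inside the padding still maps to the sequence, while coverage is computed only over the positions between the gene start and the gene end.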

mOTU.v1.padded database profiles: These profiles are abundances of mOTUs. The mOTUs were extracted from both reference genomes and metagenomes, and roughly represent species clusters. However, they carry no taxonomic information in the sense of common species names; the clusters are named Cluster1, Cluster2, etc. Using these clusters gives a more accurate and reliable estimate of the species in your metagenomic samples than, for example, mapping reads to a set of reference genomes and generating profiles from those mappings.

RefMG.v1.padded database profiles: This database is a subset of the mOTU.v1.padded database and contains only MGs from NCBI reference genomes. By mapping metagenomic reads to this database, it is possible to estimate the abundance of the taxa (and closely related taxa) in the database, and to summarize these abundances at NCBI taxonomic levels (species up to kingdom). This is still more accurate than mapping reads to complete reference genomes; however, species without a reference in the database cannot be detected.

Here we describe how to generate mOTU and taxonomic profiles for the HMP mock community, which is shipped with MOCAT.

Initial setup

1. Download MOCAT v1.3 and follow the installation instructions. Make sure you select 'yes' for the following options (you need the databases to generate the profiles):
- download the mOTU.v1.padded database
- download the RefMG.v1.padded database

For the HMP mock community, the following steps can all be executed by running the script below (from the MOCAT/article_datasets/mock_community folder). However, this tutorial aims to explain what each step does, so that you can reproduce it on any metagenome.

> GENERATE_mOTU_and_taxonomic_profiles.sh

2. Once MOCAT has been installed, set up the samples and sample folders: store the lanes of each sample in a separate folder, and list the names of the samples you wish to process in the sample file. The MOCAT.cfg file is also required. All of this has already been set up correctly for the mock community. It looks like this:

> ls
even_sample/SRR172902.fq.gz : folder with the one lane
MOCAT.cfg : config file
sample : sample file

There are some additional files in the folder, but they do not interest us for this tutorial. The sample file in this case only contains the name of the single sample we wish to process.
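For projects with many samples, the sample file can be generated from the folder layout rather than typed by hand. The following is a minimal sketch, assuming one sample name per line in the sample file and one subfolder per sample; the function name `write_sample_file` is hypothetical and not part of MOCAT.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the name of every subfolder of a project folder
# that holds at least one .fq or .fq.gz file, one folder name per line.
# Assumption: the sample file format is one sample name per line.
write_sample_file() {
    local project_dir=$1 out_file=$2 dir f
    : > "$out_file"                       # create or truncate the sample file
    for dir in "$project_dir"/*/; do
        [ -d "$dir" ] || continue         # pattern matched no subfolder
        for f in "$dir"*.fq "$dir"*.fq.gz; do
            if [ -e "$f" ]; then
                basename "$dir" >> "$out_file"
                break                     # one match is enough for this folder
            fi
        done
    done
}

# Example, run from inside the project folder:
# write_sample_file . sample
```

Review the generated file before use, since any folder containing read files is listed, including samples you may not wish to process.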

Generate mOTU profiles

3. Trim and filter the raw reads. This is the initial step when starting a new project in MOCAT. The reads are quality filtered, and reads that are too short are removed.

> MOCAT.pl -sf sample -rtf

4. To generate the mOTU profiles, we first need to map the high quality reads against the mOTU.v1.padded database. The first time this is done, the database is indexed, and this indexing will only be performed once. The database was installed into the MOCAT/data folder during the installation. Reads are mapped, filtered and profiled at an identity of 97%. The reads are mapped using this command:

> MOCAT.pl -sf sample -s mOTU.v1.padded -identity 97

5. After the reads have been mapped (screened) against the database, they have to be filtered. This holds true for any mapping that is done using MOCAT. In this case, the filtering step doesn't have any specific effect, but in other cases the filtering is important (see the MOCAT manual). Filtering is done by running this:

> MOCAT.pl -sf sample -f mOTU.v1.padded -identity 97

6. The final mOTU profiles are generated by running the 'p|profile' command. The filtered reads are summarized at mOTU level by running:

> MOCAT.pl -sf sample -p mOTU.v1.padded -identity 97 -mode mOTU -o RESULTS

The mOTU profiles are saved in the following folder:

motu.profiles/mOTU.v1.padded/

Inside the motu.profiles folder, the file ending with '.insert.mm.dist.among.unique.scaled.mOTU.gz' contains the number of inserts matching the different COGs. This file is parsed by an R script to generate the 4 .tab files. These files contain the number of inserts mapping to annotated species clusters and to species clusters. The 'fractions' files contain these abundances as fractions of the total sum. The fractions files are also available (as symbolic links) inside the RESULTS folder.

In addition, easy-to-use links are saved in the RESULTS folder:

RESULTS/annotated.mOTU.abundances.gz
RESULTS/mOTU.abundances.gz
RESULTS/annotated.mOTU.counts.gz
RESULTS/mOTU.counts.gz
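To take a quick look at one of these files without unpacking it, something like the following can be used. This is a minimal sketch; `peek_profile` is a hypothetical helper name, and nothing is assumed about the column layout beyond the file being gzip-compressed text.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first N lines (default 10) of a
# gzip-compressed profile without writing an uncompressed copy to disk.
peek_profile() {
    local file=$1 n=${2:-10}
    if [ ! -e "$file" ]; then
        echo "not found: $file" >&2
        return 1
    fi
    # Reading via stdin works the same for regular files and symbolic links.
    gzip -dc < "$file" | head -n "$n"
}

# Example, run from the project folder after step 6:
# peek_profile RESULTS/mOTU.abundances.gz 20
```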

Generate taxonomic (RefMG) profiles

7. To generate the taxonomic profiles (summarized at NCBI taxa levels), we first need to map the reads that matched the mOTU.v1.padded database against the RefMG.v1.padded database. We could use the HQ reads, but since the RefMG.v1.padded database is a subset of the mOTU.v1.padded database, using the reads that matched that database is faster and also makes it possible to calculate a -1 fraction. The first time this is done, the database is indexed; this indexing is performed only once. The database was installed into the MOCAT/data folder during the installation. Reads are mapped, filtered and profiled at an identity of 97%. Note the -e flag, which is specified because we want to use the extracted reads from the previous database screen. Extracted reads are those that matched the database; screened reads are those that did not. The reads are mapped using this command:

> MOCAT.pl -sf sample -s RefMG.v1.padded -r mOTU.v1.padded -e -identity 97

8. As for the mOTU database above, filtering is done by running:

> MOCAT.pl -sf sample -f RefMG.v1.padded -r mOTU.v1.padded -e -identity 97

9. The final taxonomic profiles are generated by running the 'p|profile' command. The filtered reads are summarized at species up to kingdom level by running:

> MOCAT.pl -sf sample -p RefMG.v1.padded -r mOTU.v1.padded -e -identity 97 -mode RefMG -previous_db_calc_tax_stats_file -o RESULTS

The taxonomic profiles are saved in the following folder:

taxonomic.profiles/RefMG.v1.padded/

In addition, easy-to-use links are saved in the RESULTS folder:

RESULTS/NCBI.species.abundances.gz
RESULTS/NCBI.species.counts.gz