Processing Metagenomes using MOCAT

Table of Contents

1 Preface
2 Initial comments and questions
3 Part I - Summary of commands
4 Part II - Generating profiles
5 Part III - useful tips

1 Preface

This document is up to date as of 2013-10-22 Tue 11:15.

This guide describes how metagenomes can be processed using MOCAT to calculate mOTU and taxonomic profiles.

2 Initial comments and questions

What does 'padded' mean in the database name? 'padded' refers to the fact that the marker gene sequences in the database have been extended with up to 100 bp at each end. This is done to ensure that reads close to the ends of gene sequences also map to the genes. The '.coord' file contains information on where the genes start and end, and this file is parsed when calculating the exact coverages.
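
For illustration, a minimal sketch of what a '.coord' file might look like, assuming a simple tab-separated layout (the gene identifiers below are hypothetical, and the exact columns may differ between database versions):

    # gene_id         start   end   (positions within the padded sequence)
    COG0012.gene01    101     1303
    COG0016.gene02    101     998

With exactly 100 bp of padding, a gene would start at position 101 of the padded sequence; these coordinates let MOCAT count only the reads that fall within the gene itself.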

3 Part I - Summary of commands

# basic processing

MOCAT.pl -sf FILE -rtf
MOCAT.pl -sf FILE -sff adapter -r reads.processed
MOCAT.pl -sf FILE -s hg19 -r adapter -memory 21G -identity 90


# assembly and gene prediction

MOCAT.pl -sf FILE -a -r screened.adapter.on.hg19 -memory 250G
MOCAT.pl -sf FILE -ar -r screened.adapter.on.hg19 -memory 250G
MOCAT.pl -sf FILE -gp assembly.revised -r screened.adapter.on.hg19


# mOTU profiling

MOCAT.pl -sf FILE -s mOTU.nr.padded -r screened.adapter.on.hg19 -cpus 4 -identity 97 -no_screened_files -config screen_soap_max_mm=5
MOCAT.pl -sf FILE -f mOTU.nr.padded -r screened.adapter.on.hg19 -identity 97 -config filter_psort_buffer=2G -memory 5G
MOCAT.pl -sf FILE -p mOTU.nr.padded -r screened.adapter.on.hg19 -mode mOTU -identity 97 -memory 3G -o RESULTS


# taxonomic profiling

MOCAT.pl -sf FILE -s Ref10MGv9.padded -r screened.screened.adapter.on.hg19.on.mOTU.nr.padded -e -only_map -cpus 4 -identity 97 -config screen_soap_max_mm=5
MOCAT.pl -sf FILE -f Ref10MGv9.padded -r screened.screened.adapter.on.hg19.on.mOTU.nr.padded -e -identity 97 -config filter_psort_buffer=2G -memory 5G
MOCAT.pl -sf FILE -p Ref10MGv9.padded -r screened.screened.adapter.on.hg19.on.mOTU.nr.padded -e -identity 97 -mode RefMG -previous_db_calc_tax_stats_file -memory 3G -o RESULTS
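
In all commands above, FILE is the sample file given to -sf. A minimal sketch of such a file, assuming the usual MOCAT convention of one sample folder name per line (the sample names here are hypothetical):

    sample1
    sample2
    sample3

Each name is expected to match a folder in the project directory that holds that sample's raw fq.gz files.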

4 Part II - Generating profiles

4.1 Main end products of all processing steps

The 3 main end products of all these processing steps in MOCAT are:

  • predicted genes on revised assemblies for each sample
  • mOTU profiles based on marker gene abundances for each project
  • taxonomic profiles based on marker gene abundances for each project

4.2 Main processing steps

Creating the final output can be divided into 3 major processing steps:

  • basic (3 steps)
  • assembly and gene prediction (3 steps)
  • taxonomic profiling (taxonomic and mOTU profiles) (6 steps)

4.3 Basic

4.3.1 Read trim filter

First the samples are processed by the read trim filter, which removes low-quality reads. (In the example commands that follow, 'mc' is used as a shorthand, e.g. a shell alias, for 'MOCAT.pl'.)

  • Command
    mc -sf FILE -rtf -config MOCAT_LSF_memory_limit=3000
    
  • Memory requirements and other recommendations
    Less than 1GB. This step uses 1 CPU.
  • Parameters
    None.
  • Input
    fq.gz files in the sample folder, example:
    lane1.1.fq.gz
    lane1.2.fq.gz
    lane2.1.fq.gz
    lane2.2.fq.gz
    lane8.1.fq.gz
    lane8.2.fq.gz
    
  • Output

    Pair 1, 2 and single fq.gz files in

    reads.processed.solexaqa
    

4.3.2 Screen reads against adapters

Using USearch, reads are mapped against a fasta file with adapter sequences. Any read matching any of these sequences is removed.

  • Standard Illumina adapter sequences

    For our examples we have pasted these sequences into a single file called 'adapter' and placed it in the data folder.

    >fwd
    AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
    >fwd_rc
    CGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
    >rev
    AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
    >rev_rc
    ACACTCTTTCCCTACACGACGCTCTTCCGATCT
    
  • Command
    mc -sf FILE -sff adapter -r reads.processed
    
  • Memory requirements and other recommendations
    <1GB. This step uses 3 CPUs.
  • Parameters
    • -sff adapter: This means we map reads against the 'adapter' file.
    • -r reads.processed: this could actually be omitted; it specifies that the processed reads are the ones mapped against the DB.
  • Input

    Processed reads in:

    reads.processed.solexaqa
    
  • Output

    Note that each screen step produces three output folders: mapped, screened and extracted.

    The files with aligned reads are here:

    reads.mapped.adapter.solexaqa
    

    Then, from the processed reads, reads NOT matching the database are stored in the screened folder:

    reads.screened.adapter.solexaqa
    

    And reads MATCHING the database are stored in the extracted folder:

    reads.extracted.adapter.solexaqa
    

    IMPORTANT COMMENT: YES, this does produce redundancy, in that the same information is stored over and over again: we have first the raw reads, then the processed reads, then the screened reads. This is solved by REMOVING the reads.processed.solexaqa folders (and, later, other folders).

    After this step is complete, we can remove the processed reads (this has to be done manually, for example as sketched below):

    reads.processed.solexaqa
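
    A minimal sketch of such a cleanup, assuming FILE lists one sample folder name per line as elsewhere in this guide:

    # remove the now-redundant processed reads for every sample in FILE
    while read -r sample; do
        rm -r "$sample/reads.processed.solexaqa"
    done < FILE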
    

4.3.3 Remove eukaryotic reads

Using SOAPaligner we map reads to the custom-made hg19 database at 90% identity. Any read matching the database is removed.

  • Command
    mc -sf FILE -s hg19 -r adapter -memory 21G -identity 90
    
  • Memory requirements and other recommendations
    This depends on the sample. The actual mapping requires relatively little memory (<10G), but a hash is built inside Perl to separate the reads that matched from those that didn't. So if a sample has many eukaryotic reads, this hash will be large and the sample will require more memory. This step uses 8 CPUs.

    As an example, with the settings presented here, a sample with many eukaryotic reads will probably crash and need to be restarted with more memory.

  • Parameters
    • -s hg19: we're screening against this database.
    • -r adapter: we want to use the reads that had been screened against the adapter DB.
    • -memory 21G: This could be set higher if needed

    Note that there are some hidden parameters in the config file that play an important role:

    • [screen_soap_max_mm : 10] : This is the max number of mismatches allowed per read. Since we want to map at 90% identity, this has to be set to at least 10, because in theory reads can be up to 100 bp long (see the worked numbers at the end of this section).
  • Input

    The adapter screened reads

    reads.screened.adapter.solexaqa
    
  • Output

    The output is again three folders with the mapping files (SOAP format this time), and the screened and extracted reads.

    Reads NOT matching the DB (these are the ones we want to use):

    reads.screened.screened.adapter.on.hg19.solexaqa
    

    Reads matching the DB:

    reads.extracted.screened.adapter.on.hg19.solexaqa
    

    SOAP output files:

    reads.mapped.screened.adapter.on.hg19.solexaqa
    

    These SOAP files are NOT filtered at 90% identity, but include all reads that matched at the set mismatch level. However, after the reads are mapped and the extracted and screened folders are generated, the reads that don't reach at least 90% identity are filtered out.
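
    To spell out the mismatch-to-identity arithmetic used here and in later steps (a simple sketch; 100 bp is the assumed maximum read length):

    # minimum identity = (read_length - max_mismatches) / read_length
    # 100 bp read, screen_soap_max_mm=10  ->  (100 - 10) / 100 = 90%
    # 100 bp read, screen_soap_max_mm=5   ->  (100 -  5) / 100 = 95%
    # shorter reads reach a given identity with fewer mismatches,
    # so the longest reads are the worst case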

4.4 Assembly and gene prediction

4.4.1 Assembly

We assemble the adapter and human screened reads.

  • Command
    mc -sf FILE -a screened.adapter.on.hg19 -memory 250G
    
  • Memory requirements and other recommendations
    Memory usage varies a lot between samples. Some samples can be assembled using 60GB RAM; some require more than 600GB. A setting of 250GB will assemble MOST samples. This step uses 8 CPUs.
  • Parameters
    • -a screened.adapter.on.hg19: this defines the reads that did NOT match the adapters and also did NOT match the hg19 DB.
    • -memory 250G: This gives us 250G of RAM
  • Input

    The reads not matching any of the two DBs.

    reads.screened.screened.adapter.on.hg19.solexaqa
    
  • Output

    The output is assembled contigs, scaffolds and scaftigs in the assembly folder. We always use the scaftigs.

    assembly.screened.adapter.on.hg19.solexaqa.K35/
    

    Note that the number 35 here will vary for each sample.
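
    Since the chosen k-mer differs per sample, a shell glob is a convenient way to locate the folder (a sketch, run from inside a sample folder):

    ls -d assembly.screened.adapter.on.hg19.solexaqa.K*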

4.4.2 Assembly revision

The assembled scaftigs are revised.

  • Command
    mc -sf FILE -ar screened.adapter.on.hg19 -memory 250G
    
  • Memory requirements and other recommendations
    Same as for assembly. Some samples require more memory than others. This step uses 8 CPUs.
  • Parameters
    • -ar screened.adapter.on.hg19: this tells MOCAT to revise the assembly based on these reads. The scaftigs are used by default (the contigs cannot be used).
  • Input
    Files located in:
    assembly.screened.adapter.on.hg19.solexaqa.K35
    
  • Output

    Revised scaftigs, located:

    assembly.revised.screened.adapter.on.hg19.solexaqa.K35
    

4.4.3 Gene prediction

We then predict genes on the revised scaftigs. This is done using MetaGeneMark (this can be changed in the config file).

  • Command
    mc -sf FILE -gp assembly.revised -r screened.adapter.on.hg19
    
  • Memory requirements and other recommendations
    Doesn't use much memory. 1 CPU is used.
  • Parameters
    • -gp assembly.revised: we want to use the revised assemblies
    • -r screened.adapter.on.hg19: we also need to specify that these revised assemblies were based on these reads.
  • Input

    revised scaftigs in:

    assembly.revised.screened.adapter.on.hg19.solexaqa.K35/
    
  • Output

    Predicted genes in fasta (nucleotide and protein) format in:

    gene.prediction.assembly.revised.screened.adapter.on.hg19.solexaqa.K35.MetaGeneMark.500
    

    Note that this is the 'end' of this fork, i.e. there are no further functions in MOCAT that process these gene predictions.
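
    As a quick sanity check, the number of predicted genes can be counted from the gzipped nucleotide fasta (a sketch; the exact file names inside the folder, such as the .fna.gz extension assumed here, may differ):

    zcat gene.prediction.assembly.revised.screened.adapter.on.hg19.solexaqa.K35.MetaGeneMark.500/*.fna.gz | grep -c '^>'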

4.5 Taxonomic profiles

The mOTU profiles have to be created before the taxonomic profiles, because the taxonomic profiles require files generated in the previous steps (creating the mOTU profiles).

4.5.1 Map reads against mOTU database

The first step is to map the high quality reads against the mOTU database. This database varies, but for the cancer project it is the mOTU.nr.padded database.

  • Command
    mc -sf FILE -s mOTU.nr.padded -r screened.adapter.on.hg19 -cpus 4 -identity 97 -no_screened_files -config screen_soap_max_mm=5
    
  • Memory requirements and other recommendations
    These databases are relatively small because they only contain marker genes, so this step can be run on most computers. Memory requirement is usually <4GB. Normally a screen uses 8 CPUs, but we change this to only 4, because we can then run more samples at the same time.
  • Parameters
    • -s mOTU.nr.padded: this is the name of the database we want to map against
    • -r screened.adapter.on.hg19: we're using the high quality reads
    • -cpus 4: we want to use 4 CPUs instead of 8
    • -identity 97: in the next steps we want to use reads that have at least 97% identity to the genes in the DB
    • -no_screened_files: normally both the extracted and screened files are generated. When mapping against this DB, because most reads won't match the DB and because we don't use those reads, we don't want to produce the screened folder.
    • -config screen_soap_max_mm=5: by setting this to 5, instead of 10 as when screening against hg19, we speed up the mapping a bit. A read is at most 100 bp and 5 of those can mismatch, so the lowest possible identity when mapping is now 95%. Had we set it to 10 mm, it would be 90%, but those reads would be filtered out anyway. Thus setting it to 5 speeds things up a little.
  • Input

    The HQ processed, adapter and hg19 screened reads:

    reads.screened.screened.adapter.on.hg19.solexaqa
    
  • Output

    We only get the mapped data files and the extracted reads. The extracted reads are the ones that DID match the database:

    reads.extracted.screened.screened.adapter.on.hg19.on.mOTU.nr.padded.solexaqa
    reads.mapped.screened.screened.adapter.on.hg19.on.mOTU.nr.padded.solexaqa
    

4.5.2 Filter reads mapped against mOTU.nr.padded

We want to produce mOTU profiles based on the reads matching this database.

The next step in MOCAT is a filtering step. This is a design feature of MOCAT; in theory this step could be excluded, but it has to be run because sometimes we want to combine multiple databases (not in this case, but the step is still required).

This step filters reads from the SOAP mapping files into a filtered file.

  • Command
    mc -sf FILE -f mOTU.nr.padded -r screened.adapter.on.hg19 -identity 97 -config filter_psort_buffer=2G -memory 5G
    
  • Memory requirements and other recommendations
    Because the DB is small, this step can be run using <5GB. 3 CPUs are used.
  • Parameters
    • -f mOTU.nr.padded: this is the name of the database we mapped against
    • -r screened.adapter.on.hg19: we're using the high quality reads
    • -identity 97: in the next steps we want to use reads that have at least 97% identity to the genes in the DB. We could filter at a higher % than we originally mapped at, but because we later use some of these reads to map against a second DB, it is important for our purposes to map (previous step) and filter (this step) at the same %.
    • -config filter_psort_buffer=2G: psort is used to sort reads in a specific order. Because the result files are usually small, we specify only a little memory here. For other applications we would set a higher value. Specifying 'config' first means we want to override this setting from the config file.
    • -memory 5G: because we set 2GB for psort, we still need another few GB for other processes.
  • Input

    The SOAP mapping file from the previous step:

    reads.mapped.screened.screened.adapter.on.hg19.on.mOTU.nr.padded.solexaqa
    
  • Output

    The output is a custom sorted BAM file in this folder:

    reads.filtered.mOTU.nr.padded.solexaqa
    

4.5.3 Generate mOTU profiles

The filtered reads are passed to a custom script that calculates the insert and base coverages of each gene in the database, and also summarizes these into mOTU profiles.

  • Command
    mc -sf FILE -p mOTU.nr.padded -r screened.adapter.on.hg19 -identity 97 -mode mOTU -memory 3G -o RESULTS
    
  • Memory requirements and other recommendations
    This step can be run on most computers. 1 CPU is used.
  • Parameters
    • -p mOTU.nr.padded: this is the name of the database we mapped against
    • -r screened.adapter.on.hg19: we're using the high quality reads
    • -identity 97: We need to specify the same % cutoff as above
    • -mode mOTU: setting this to mOTU will generate the mOTU profiles in addition to the gene coverages
  • Input

    The BAM file in:

    reads.filtered.mOTU.nr.padded.solexaqa
    
  • Output

    Running -p without -mode mOTU will generate these folders with gene abundances:

    base.coverage.mOTU.nr.padded.solexaqa
    insert.coverage.mOTU.nr.padded.solexaqa/
    

    By setting the mode to mOTU, we also generate these mOTU abundances:

    motu.profiles.mOTU.nr.padded.solexaqa
    

    The MAIN output consists of combined tables that are stored in the PROJECT FOLDER, not in the individual sample folders:

    ../motu.profiles
    ../abundance.tables
    

    This said, the most used files, the actual profiles, are stored in the RESULTS folder:

    ../RESULTS/annotated.mOTU.abundances.gz
    ../RESULTS/mOTU.abundances.gz
    ../RESULTS/annotated.mOTU.counts.gz
    ../RESULTS/mOTU.counts.gz
    ../RESULTS/mOTU.v1.map.txt
    ../RESULTS/mOTU-LG.v1.annotations.txt
    

    Inside the motu.profiles folder, the file ending with '.insert.mm.dist.among.unique.scaled.mOTU.gz' contains the number of inserts matching the different COGs. This file is parsed by an R script to generate the 4 .tab files. These files have the number of inserts mapping to annotated species clusters, and to species clusters. The 'fractions' files have these abundances as fractions of the total sum.
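
    To get a quick look at a finished profile, standard shell tools are enough (a sketch; the exact column layout depends on the MOCAT version):

    zcat ../RESULTS/annotated.mOTU.abundances.gz | head -5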

4.5.4 Map reads from mOTU database against Ref10MGv9 DB

The mOTU profiles are taxonomic profiles, but they are not based on standard NCBI taxonomic levels. Because we want to generate taxonomic profiles summarized at the species and genus levels, we have to map the reads that matched the mOTU database against a database containing only taxonomically annotated marker genes.

  • Command
    mc -sf FILE -s Ref10MGv9.padded -r screened.screened.adapter.on.hg19.on.mOTU.nr.padded -e -only_map -cpus 4 -identity 97 -config screen_soap_max_mm=5
    
  • Memory requirements and other recommendations
    This mapping step requires less memory than the previous one and can be run on any server. We also only use 4 CPUs.
  • Parameters
    • -s Ref10MGv9.padded: this is the name of the database with marker genes from only taxonomically annotated sequences
    • -r screened.screened.adapter.on.hg19.on.mOTU.nr.padded: we want to use not the "normal" high quality reads, but actually the reads that mapped to the previous database.

    This database is actually only a subset of the previous database, so any read not matching the previous database won't match this database. So using ALL original HQ reads isn't necessary.

    • -e: this option IS VERY IMPORTANT: it specifies that we want the extracted, and not the screened, reads from the previous DB mapping. When we mapped reads against hg19 earlier, we wanted the reads that did NOT match the database, but now we want the reads that DO match, so we specify -e to use these reads instead.
    • -only_map: specifying this means the extracted and screened reads are not generated during the mapping, because we will not use those files. The next step, filtering, uses the mapping file and not the actual reads. But we could not specify this in the previous screen against the mOTU.nr.padded DB, because we needed the extracted reads from that DB to map against this DB. :)
    • -cpus 4: Again, only 4 CPUs
    • -identity 97: We also map at 97% here
    • -config screen_soap_max_mm=5: because we map at 97% we can, again, speed things up a little by overriding this setting and set it to 5 instead of 10
  • Input

    The input is the reads that matched the mOTU.nr.padded DB:

    reads.extracted.screened.screened.adapter.on.hg19.on.mOTU.nr.padded.solexaqa
    
  • Output

    Output is only the mapped file, because we don't generate the screened or extracted folders:

    reads.mapped.extracted.screened.screened.adapter.on.hg19.on.mOTU.nr.padded.on.Ref10MGv9.padded.solexaqa
    

4.5.5 Filter the mapped reads to Ref10MGv9.padded

The next step is to filter these reads. Again, this is technically unnecessary for this application, because we already mapped at 97%, but the SOAP file has to be converted into a sorted BAM file.

  • Command
    mc -sf FILE -f Ref10MGv9.padded -r screened.screened.adapter.on.hg19.on.mOTU.nr.padded -e -identity 97 -config filter_psort_buffer=2G -memory 5G
    
  • Memory requirements and other recommendations
    This filtering step requires less than 5GB RAM and uses 3 CPUs. It can be run on epsilon.
  • Parameters
    By now you should know them.
  • Input

    The mapped reads:

    reads.mapped.extracted.screened.screened.adapter.on.hg19.on.mOTU.nr.padded.on.Ref10MGv9.padded.solexaqa
    
  • Output

    Sorted BAM file in:

    reads.filtered.Ref10MGv9.padded.solexaqa
    

4.5.6 Create taxonomic (RefMG) profiles

The filtered BAM file is processed to generate gene abundances and also the taxonomic profiles. Again, this is done in two stages: first per sample with -no_paste, and then, once a larger number of samples has been processed like this, the results are summarized together.

  • Command
    mc -sf FILE -p Ref10MGv9.padded -r screened.screened.adapter.on.hg19.on.mOTU.nr.padded -e -identity 97 -mode RefMG -previous_db_calc_tax_stats_file -memory 3G -o RESULTS
    
  • Memory requirements and other recommendations
    The required memory here is less than 3GB. 1 CPU is used.
  • Parameters
    • -mode RefMG: by specifying this mode we generate the taxonomic profiles. Note that we could not run mode mOTU here, because specific files are attached to each of these two databases, and the databases are formatted in specific ways so that each mode functions only with its corresponding database.
    • -previous_db_calc_tax_stats_file: this is an important technical detail. By setting this we change the total number of mapped reads from the total that matched the previous database to the total that mapped within the gene coordinates of the previous database. If this is omitted, the difference will not be very big.
  • Input

    The BAM file in:

    reads.filtered.Ref10MGv9.padded.solexaqa
    
  • Output

    We produce both the gene abundance tables and the taxonomic profiles in this step. The gene abundance tables are never used directly, but they are used implicitly by the algorithm that creates the taxonomic abundance tables.

    taxonomic.profiles.Ref10MGv9.padded.solexaqa
    base.coverage.Ref10MGv9.padded.solexaqa
    insert.coverage.Ref10MGv9.padded.solexaqa
    

    In the PROJECT FOLDER we get the MAIN output: summarized taxonomic profiles, as well as gene abundance tables (which we're not using):

    ../taxonomic.profiles
    ../abundance.tables
    

    This said, the most used files, the actual profiles, are stored in the RESULTS folder:

    ../RESULTS/NCBI.species.abundances.tab.gz
    

5 Part III - useful tips

5.1 Log files

Everything MOCAT does is stored in log files. The good thing is that it's fairly easy to trace an error. The bad thing is that there'll be a lot of files. I personally use the following log files:

5.1.1 sample-specific output

<PROJECT>/logs/<STEP>/samples/<SAMPLE LOG>.log

These files describe the output for each sample individually. This is usually where errors are reported.

5.1.2 job-specific output

<PROJECT>/logs/<STEP>/commands/<JOB ID>.log

These files contain information about the specific run: when and where it was started, and, most importantly, whether it failed or succeeded. If it failed, the failed samples are usually reported.

5.1.3 resource output

<PROJECT>/logs/resources

These files record the resources that were required for a specific job. Very useful when we want to determine where a specific job can be run, to optimize the pipeline.
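
A quick way to scan for problems across the sample-specific logs (a sketch; substitute the actual project path and step name):

    grep -il 'error' <PROJECT>/logs/<STEP>/samples/*.log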

5.2 MOCAT features

Some of the MOCAT features I use most often are:

  • -noqueue: adding this option runs the job outside of the queue, without changing the config file
  • -x: setting this will only create the job files, but not actually submit the jobs to the queue
  • -cpus: changes the number of CPUs for the job. Setting this to 1 for e.g. assembly will obviously slow down that step
  • -tc: sets the maximum number of jobs that are being run in the queue at the same time (only works on SGE)
  • -config XXX=aaa: overrides config setting XXX and sets it to aaa
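
For example, to generate (but not submit) assembly job files with the CPU count made explicit, reusing the assembly command from Part II:

    mc -sf FILE -a screened.adapter.on.hg19 -memory 250G -cpus 8 -x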

Author: Jens Roat Kultima <kultima@jens.embl.de>

Date: 2013-10-22 11:28:43 CEST
