Processing Metagenomes using MOCAT

Table of Contents

1 Preface
2 Initial comments and questions
3 Part I - Summary of commands
4 Part II - Generating profiles
5 Part III - useful tips

1 Preface

This document is up to date as of 2013-10-22 Tue 11:15.

This guide describes how metagenomes can be processed using MOCAT to calculate mOTU and taxonomic profiles.

2 Initial comments and questions

What does 'padded' mean in the database name? 'padded' refers to the fact that the marker gene sequences in the database have been extended with up to 100 bp at each end. This is done to ensure that reads close to the ends of gene sequences also map to the genes. The '.coord' file contains information on where the genes start and end, and this file is parsed when calculating the exact coverages.
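
For illustration, a minimal sketch of what a '.coord' file might look like, assuming a simple tab-separated layout (the gene identifiers below are hypothetical, and the exact columns may differ between database versions):

    # gene_id         start   end   (positions within the padded sequence)
    COG0012.gene01    101     1303
    COG0016.gene02    101     998

With exactly 100 bp of padding, a gene would start at position 101 of the padded sequence; these coordinates let MOCAT count only the reads that fall within the gene itself.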

3 Part I - Summary of commands

# basic processing

MOCAT.pl -sf FILE -rtf
MOCAT.pl -sf FILE -sff adapter -r reads.processed
MOCAT.pl -sf FILE -s hg19 -r adapter -memory 21G -identity 90


# assembly and gene prediction

MOCAT.pl -sf FILE -a -r screened.adapter.on.hg19 -memory 250G
MOCAT.pl -sf FILE -ar -r screened.adapter.on.hg19 -memory 250G
MOCAT.pl -sf FILE -gp assembly.revised -r screened.adapter.on.hg19


# mOTU profiling

MOCAT.pl -sf FILE -s mOTU.nr.padded -r screened.adapter.on.hg19 -cpus 4 -identity 97 -no_screened_files -config screen_soap_max_mm=5
MOCAT.pl -sf FILE -f mOTU.nr.padded -r screened.adapter.on.hg19 -identity 97 -config filter_psort_buffer=2G -memory 5G
MOCAT.pl -sf FILE -p mOTU.nr.padded -r screened.adapter.on.hg19 -mode mOTU -identity 97 -memory 3G -o RESULTS


# taxonomic profiling

MOCAT.pl -sf FILE -s Ref10MGv9.padded -r screened.screened.adapter.on.hg19.on.mOTU.nr.padded -e -only_map -cpus 4 -identity 97 -config screen_soap_max_mm=5
MOCAT.pl -sf FILE -f Ref10MGv9.padded -r screened.screened.adapter.on.hg19.on.mOTU.nr.padded -e -identity 97 -config filter_psort_buffer=2G -memory 5G
MOCAT.pl -sf FILE -p Ref10MGv9.padded -r screened.screened.adapter.on.hg19.on.mOTU.nr.padded -e -identity 97 -mode RefMG -previous_db_calc_tax_stats_file -memory 3G -o RESULTS
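
In all commands above, FILE is the sample file given to -sf. A minimal sketch of such a file, assuming the usual MOCAT convention of one sample folder name per line (the sample names here are hypothetical):

    sample1
    sample2
    sample3

Each name is expected to match a folder in the project directory that holds that sample's raw fq.gz files.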

4 Part II - Generating profiles

4.1 Main end products of all processing steps

The 3 main end products of all these processing steps in MOCAT are:

  • predicted genes on revised assemblies for each sample
  • mOTU profiles based on marker gene abundances for each project
  • taxonomic profiles based on marker gene abundances for each project

4.2 Main processing steps

Creating the final output can be divided into 3 major processing steps:

  • basic (3 steps)
  • assembly and gene prediction (3 steps)
  • taxonomic profiling (taxonomic and mOTU profiles) (6 steps)

4.3 Basic

4.3.1 Read trim filter

First the samples are processed by the read trim filter, which removes low-quality reads. (In the example commands that follow, 'mc' is used as a shorthand, e.g. a shell alias, for 'MOCAT.pl'.)

  • Command
    mc -sf FILE -rtf -config MOCAT_LSF_memory_limit=3000
    
  • Memory requirements and other recommendations
    Less than 1GB. This step uses 1 CPU.
  • Parameters
    None.
  • Input
    fq.gz files in the sample folder, example:
    lane1.1.fq.gz
    lane1.2.fq.gz
    lane2.1.fq.gz
    lane2.2.fq.gz
    lane8.1.fq.gz
    lane8.2.fq.gz
    
  • Output

    Pair 1, 2 and single fq.gz files in

    reads.processed.solexaqa
    

4.3.2 Screen reads against adapters

Using USearch, reads are mapped against a fasta file with adapter sequences. Any read matching any of these sequences is removed.

  • Standard Illumina adapter sequences

    For our examples we have pasted these sequences into a single file called 'adapter' and placed it in the data folder.

    >fwd
    AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
    >fwd_rc
    CGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
    >rev
    AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
    >rev_rc
    ACACTCTTTCCCTACACGACGCTCTTCCGATCT
    
  • Command
    mc -sf FILE -sff adapter -r reads.processed
    
  • Memory requirements and other recommendations
    <1GB. This step uses 3 CPUs.
  • Parameters
    • -sff adapter: This means we map reads against the 'adapter' file.
    • -r reads.processed: this could actually be omitted; it specifies that the processed reads are the ones mapped against the DB.
  • Input

    Processed reads in:

    reads.processed.solexaqa
    
  • Output

    Note that each screen step produces three output folders: mapped, screened and extracted.

    The files with aligned reads are here:

    reads.mapped.adapter.solexaqa
    

    Then, from the processed reads, reads NOT matching the database are stored in the screened folder:

    reads.screened.adapter.solexaqa
    

    And reads MATCHING the database are stored in the extracted folder:

    reads.extracted.adapter.solexaqa
    

    IMPORTANT COMMENT: YES, this does produce redundancy, in that the same information is stored over and over again: we have first the raw reads, then the processed reads, then the screened reads. This is solved by REMOVING the reads.processed.solexaqa folders (and, later, other folders).

    After this step is complete, we can remove the processed reads (this has to be done manually, for example as sketched below):

    reads.processed.solexaqa
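
    A minimal sketch of such a cleanup, assuming FILE lists one sample folder name per line as elsewhere in this guide:

    # remove the now-redundant processed reads for every sample in FILE
    while read -r sample; do
        rm -r "$sample/reads.processed.solexaqa"
    done < FILE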
    

4.3.3 Remove eukaryotic reads

Using SOAPaligner we map reads to the custom-made hg19 database at 90% identity. Any read matching the database is removed.

  • Command
    mc -sf FILE -s hg19 -r adapter -memory 21G -identity 90
    
  • Memory requirements and other recommendations
    This depends on the sample. The actual mapping requires relatively little memory (<10G), but a hash is built inside Perl to separate the reads that matched from those that didn't. So if a sample has many eukaryotic reads, this hash will be large and the sample will require more memory. This step uses 8 CPUs.

    As an example, with the settings presented here, a sample with many eukaryotic reads will probably crash and need to be restarted with more memory.

  • Parameters
    • -s hg19: we're screening against this database.
    • -r adapter: we want to use the reads that had been screened against the adapter DB.
    • -memory 21G: This could be set higher if needed

    Note that there are some hidden parameters in the config file that play an important role:

    • [screen_soap_max_mm : 10] : This is the max number of mismatches allowed per read. Since we want to map at 90% identity, this has to be set to at least 10, because in theory reads can be up to 100 bp long (see the worked numbers at the end of this section).
  • Input

    The adapter screened reads

    reads.screened.adapter.solexaqa
    
  • Output

    The output is again three folders with the mapping files (SOAP format this time), and the screened and extracted reads.

    Reads NOT matching the DB (these are the ones we want to use):

    reads.screened.screened.adapter.on.hg19.solexaqa
    

    Reads matching the DB:

    reads.extracted.screened.adapter.on.hg19.solexaqa
    

    SOAP output files:

    reads.mapped.screened.adapter.on.hg19.solexaqa
    

    These SOAP files are NOT filtered at 90% identity, but include all reads that matched at the set mismatch level. However, after the reads are mapped and the extracted and screened folders are generated, the reads that don't reach at least 90% identity are filtered out.
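
    To spell out the mismatch-to-identity arithmetic used here and in later steps (a simple sketch; 100 bp is the assumed maximum read length):

    # minimum identity = (read_length - max_mismatches) / read_length
    # 100 bp read, screen_soap_max_mm=10  ->  (100 - 10) / 100 = 90%
    # 100 bp read, screen_soap_max_mm=5   ->  (100 -  5) / 100 = 95%
    # shorter reads reach a given identity with fewer mismatches,
    # so the longest reads are the worst case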

4.4 Assembly and gene prediction

4.4.1 Assembly

We assemble the adapter and human screened reads.

  • Command
    mc -sf FILE -a screened.adapter.on.hg19 -memory 250G
    
  • Memory requirements and other recommendations
    Memory usage varies a lot between samples. Some samples can be assembled using 60GB RAM; some require more than 600GB. A setting of 250GB will assemble MOST samples. This step uses 8 CPUs.
  • Parameters
    • -a screened.adapter.on.hg19: this defines the reads that did NOT match the adapters and also did NOT match the hg19 DB.
    • -memory 250G: This gives us 250G of RAM
  • Input

    The reads not matching any of the two DBs.

    reads.screened.screened.adapter.on.hg19.solexaqa
    
  • Output

    The output is assembled contigs, scaffolds and scaftigs in the assembly folder. We always use the scaftigs.

    assembly.screened.adapter.on.hg19.solexaqa.K35/
    

    Note that the number 35 here will vary for each sample.
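
    Since the chosen k-mer differs per sample, a shell glob is a convenient way to locate the folder (a sketch, run from inside a sample folder):

    ls -d assembly.screened.adapter.on.hg19.solexaqa.K*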

4.4.2 Assembly revision

The assembled scaftigs are revised.

  • Command
    mc -sf FILE -ar screened.adapter.on.hg19 -memory 250G
    
  • Memory requirements and other recommendations
    Same as for assembly. Some samples require more memory than others. This step uses 8 CPUs.
  • Parameters
    • -ar screened.adapter.on.hg19: this tells MOCAT to revise the assembly based on these reads. The scaftigs are used by default (the contigs cannot be used).
  • Input
    Files located in:
    assembly.screened.adapter.on.hg19.solexaqa.K35
    
  • Output

    Revised scaftigs, located:

    assembly.revised.screened.adapter.on.hg19.solexaqa.K35
    

4.4.3 Gene prediction

We then predict genes on the revised scaftigs. This is done using MetaGeneMark (this can be changed in the config file).

  • Command
    mc -sf FILE -gp assembly.revised -r screened.adapter.on.hg19
    
  • Memory requirements and other recommendations
    Doesn't use much memory. 1 CPU is used.
  • Parameters
    • -gp assembly.revised: we want to use the revised assemblies
    • -r screened.adapter.on.hg19: we also need to specify that these revised assemblies were based on these reads.
  • Input

    revised scaftigs in:

    assembly.revised.screened.adapter.on.hg19.solexaqa.K35/
    
  • Output

    Predicted genes in fasta (nucleotide and protein) format in:

    gene.prediction.assembly.revised.screened.adapter.on.hg19.solexaqa.K35.MetaGeneMark.500
    

    Note that this is the 'end' of this fork, i.e. there are no further functions in MOCAT that process these gene predictions.
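
    As a quick sanity check, the number of predicted genes can be counted from the gzipped nucleotide fasta (a sketch; the exact file names inside the folder, such as the .fna.gz extension assumed here, may differ):

    zcat gene.prediction.assembly.revised.screened.adapter.on.hg19.solexaqa.K35.MetaGeneMark.500/*.fna.gz | grep -c '^>'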

4.5 Taxonomic profiles

The mOTU profiles have to be created before the taxonomic profiles, because the taxonomic profiles require files generated in the previous steps (creating the mOTU profiles).

4.5.1 Map reads against mOTU database

The first step is to map the high quality reads against the mOTU database. This database varies, but for the cancer project it is the mOTU.nr.padded database.

  • Command
    mc -sf FILE -s mOTU.nr.padded -r screened.adapter.on.hg19 -cpus 4 -identity 97 -no_screened_files -config screen_soap_max_mm=5
    
  • Memory requirements and other recommendations
    These databases are relatively small because they only contain marker genes, so this step can be run on most computers. Memory requirement is usually <4GB. Normally a screen uses 8 CPUs, but we change this to only 4, because we can then run more samples at the same time.
  • Parameters
    • -s mOTU.nr.padded: this is the name of the database we want to map against
    • -r screened.adapter.on.hg19: we're using the high quality reads
    • -cpus 4: we want to use 4 CPUs instead of 8
    • -identity 97: in the next steps we want to use reads that have at least 97% identity to the genes in the DB
    • -no_screened_files: normally both the extracted and screened files are generated. When mapping against this DB, because most reads won't match the DB and because we don't use those reads, we don't want to produce the screened folder.
    • -config screen_soap_max_mm=5: by setting this to 5, instead of 10 as when screening against hg19, we speed up the mapping a bit. A read is at most 100 bp and 5 of those can mismatch, so the lowest possible identity when mapping is now 95%. Had we set it to 10 mm, it would be 90%, but those reads would be filtered out anyway. Thus setting it to 5 speeds things up a little.
  • Input

    The HQ processed, adapter and hg19 screened reads:

    reads.screened.screened.adapter.on.hg19.solexaqa
    
  • Output

    We only get the mapped data files and the extracted reads. The extracted reads are the ones that DID match the database:

    reads.extracted.screened.screened.adapter.on.hg19.on.mOTU.nr.padded.solexaqa
    reads.mapped.screened.screened.adapter.on.hg19.on.mOTU.nr.padded.solexaqa
    

4.5.2 Filter reads mapped against mOTU.nr.padded

We want to produce mOTU profiles based on the reads matching this database.

The next step in MOCAT is a filtering step. This is a design feature of MOCAT; in theory this step could be excluded, but it has to be run because sometimes we want to combine multiple databases (not in this case, but the step is still required).

This step filters reads from the SOAP mapping files into a filtered file.

  • Command
    mc -sf FILE -f mOTU.nr.padded -r screened.adapter.on.hg19 -identity 97 -config filter_psort_buffer=2G -memory 5G
    
  • Memory requirements and other recommendations
    Because the DB is small, this step can be run using <5GB. 3 CPUs are used.
  • Parameters
    • -f mOTU.nr.padded: this is the name of the database we mapped against
    • -r screened.adapter.on.hg19: we're using the high quality reads
    • -identity 97: in the next steps we want to use reads that have at least 97% identity to the genes in the DB. We could filter at a higher % than we originally mapped at, but because we later use some of these reads to map against a second DB, it is important for our purposes to map (previous step) and filter (this step) at the same %.
    • -config filter_psort_buffer=2G: psort is used to sort reads in a specific order. Because the result files are usually small, we specify only a little memory here. For other applications we would set a higher value. Specifying 'config' first means we want to override this setting from the config file.
    • -memory 5G: because we set 2GB for psort, we still need another few GB for other processes.
  • Input

    The SOAP mapping file from the previous step:

    reads.mapped.screened.screened.adapter.on.hg19.on.mOTU.nr.padded.solexaqa
    
  • Output

    The output is a custom sorted BAM file in this folder:

    reads.filtered.mOTU.nr.padded.solexaqa
    

4.5.3 Generate mOTU profiles

The filtered reads are passed to a custom script that calculates the insert and base coverages of each gene in the database, and also summarizes these into mOTU profiles.

  • Command
    mc -sf FILE -p mOTU.nr.padded -r screened.adapter.on.hg19 -identity 97 -mode mOTU -memory 3G -o RESULTS
    
  • Memory requirements and other recommendations
    This step can be run on most computers. 1 CPU is used.
  • Parameters
    • -p mOTU.nr.padded: this is the name of the database we mapped against
    • -r screened.adapter.on.hg19: we're using the high quality reads
    • -identity 97: We need to specify the same % cutoff as above
    • -mode mOTU: setting this to mOTU will generate the mOTU profiles in addition to the gene coverages
  • Input

    The BAM file in:

    reads.filtered.mOTU.nr.padded.solexaqa
    
  • Output

    Running -p without -mode mOTU will generate these folders with gene abundances:

    base.coverage.mOTU.nr.padded.solexaqa
    insert.coverage.mOTU.nr.padded.solexaqa/
    

    By setting the mode to mOTU, we also generate these mOTU abundances:

    motu.profiles.mOTU.nr.padded.solexaqa
    

    The MAIN output consists of combined tables that are stored in the PROJECT FOLDER, not in the individual sample folders:

    ../motu.profiles
    ../abundance.tables
    

    This said, the most used files, the actual profiles, are stored in the RESULTS folder:

    ../RESULTS/annotated.mOTU.abundances.gz
    ../RESULTS/mOTU.abundances.gz
    ../RESULTS/annotated.mOTU.counts.gz
    ../RESULTS/mOTU.counts.gz
    ../RESULTS/mOTU.v1.map.txt
    ../RESULTS/mOTU-LG.v1.annotations.txt
    

    Inside the motu.profiles folder, the file ending with '.insert.mm.dist.among.unique.scaled.mOTU.gz' contains the number of inserts matching the different COGs. This file is parsed by an R script to generate the 4 .tab files. These files have the number of inserts mapping to annotated species clusters, and to species clusters. The 'fractions' files have these abundances as fractions of the total sum.
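
    To get a quick look at a finished profile, standard shell tools are enough (a sketch; the exact column layout depends on the MOCAT version):

    zcat ../RESULTS/annotated.mOTU.abundances.gz | head -5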

4.5.4 Map reads from mOTU database against Ref10MGv9 DB

The mOTU profiles are taxonomic profiles, but they are not based on standard NCBI taxonomic levels. Because we want to generate taxonomic profiles summarized at the species and genus levels, we have to map the reads that matched the mOTU database against a database containing only taxonomically annotated marker genes.

  • Command
    mc -sf FILE -s Ref10MGv9.padded -r screened.screened.adapter.on.hg19.on.mOTU.nr.padded -e -only_map -cpus 4 -identity 97 -config screen_soap_max_mm=5
    
  • Memory requirements and other recommendations
    This mapping step requires less memory than the previous one and can be run on any server. We also only use 4 CPUs.
  • Parameters
    • -s Ref10MGv9.padded: this is the name of the database with marker genes from only taxonomically annotated sequences
    • -r screened.screened.adapter.on.hg19.on.mOTU.nr.padded: we want to use not the "normal" high quality reads, but actually the reads that mapped to the previous database.

    This database is actually only a subset of the previous database, so any read not matching the previous database won't match this database. So using ALL original HQ reads isn't necessary.

    • -e: this option IS VERY IMPORTANT: it specifies that we want the extracted, and not the screened, reads from the previous DB mapping. When we mapped reads against hg19 earlier, we wanted the reads that did NOT match the database, but now we want the reads that DO match, so we specify -e to use these reads instead.
    • -only_map: specifying this means the extracted and screened reads are not generated during the mapping, because we will not use those files. The next step, filtering, uses the mapping file and not the actual reads. But we could not specify this in the previous screen against the mOTU.nr.padded DB, because we needed the extracted reads from that DB to map against this DB. :)
    • -cpus 4: Again, only 4 CPUs
    • -identity 97: We also map at 97% here
    • -config screen_soap_max_mm=5: because we map at 97% we can, again, speed things up a little by overriding this setting and set it to 5 instead of 10
  • Input

    The input is the reads that matched the mOTU.nr.padded DB:

    reads.extracted.screened.screened.adapter.on.hg19.on.mOTU.nr.padded.solexaqa
    
  • Output

    Output is only the mapped file, because we don't generate the screened or extracted folders:

    reads.mapped.extracted.screened.screened.adapter.on.hg19.on.mOTU.nr.padded.on.Ref10MGv9.padded.solexaqa
    

4.5.5 Filter the mapped reads to Ref10MGv9.padded

The next step is to filter these reads. Again, this is technically unnecessary for this application, because we already mapped at 97%, but the SOAP file has to be converted into a sorted BAM file.

  • Command
    mc -sf FILE -f Ref10MGv9.padded -r screened.screened.adapter.on.hg19.on.mOTU.nr.padded -e -identity 97 -config filter_psort_buffer=2G -memory 5G
    
  • Memory requirements and other recommendations
    This filtering step requires less than 5GB RAM and uses 3 CPUs. It can be run on epsilon.
  • Parameters
    By now you should know them.
  • Input

    The mapped reads:

    reads.mapped.extracted.screened.screened.adapter.on.hg19.on.mOTU.nr.padded.on.Ref10MGv9.padded.solexaqa
    
  • Output

    Sorted BAM file in:

    reads.filtered.Ref10MGv9.padded.solexaqa
    

4.5.6 Create taxonomic (RefMG) profiles

The filtered BAM file is processed to generate gene abundances and also the taxonomic profiles. Again, this is done in two stages: first per sample with -no_paste, and then, once a larger number of samples has been processed like this, the results are summarized together.

  • Command
    mc -sf FILE -p Ref10MGv9.padded -r screened.screened.adapter.on.hg19.on.mOTU.nr.padded -e -identity 97 -mode RefMG -previous_db_calc_tax_stats_file -memory 3G -o RESULTS
    
  • Memory requirements and other recommendations
    The required memory here is less than 3GB. 1 CPU is used.
  • Parameters
    • -mode RefMG: by specifying this mode we generate the taxonomic profiles. Note that we could not run mode mOTU here, because specific files are attached to each of these two databases, and the databases are formatted in specific ways so that each mode functions only with its corresponding database.
    • -previous_db_calc_tax_stats_file: this is an important technical detail. By setting this we change the total number of mapped reads from the total that matched the previous database to the total that mapped within the gene coordinates of the previous database. If this is omitted, the difference will not be very big.
  • Input

    The BAM file in:

    reads.filtered.Ref10MGv9.padded.solexaqa
    
  • Output

    We produce both the gene abundance tables and the taxonomic profiles in this step. The gene abundance tables are never used directly, but they are used implicitly by the algorithm that creates the taxonomic abundance tables.

    taxonomic.profiles.Ref10MGv9.padded.solexaqa
    base.coverage.Ref10MGv9.padded.solexaqa
    insert.coverage.Ref10MGv9.padded.solexaqa
    

    In the PROJECT FOLDER we get the MAIN output: summarized taxonomic profiles, as well as gene abundance tables (which we're not using):

    ../taxonomic.profiles
    ../abundance.tables
    

    This said, the most used files, the actual profiles, are stored in the RESULTS folder:

    ../RESULTS/NCBI.species.abundances.tab.gz
    

5 Part III - useful tips

5.1 Log files

Everything MOCAT does is stored in log files. The good thing is that it's fairly easy to trace an error. The bad thing is that there'll be a lot of files. I personally use the following log files:

5.1.1 sample-specific output

<PROJECT>/logs/<STEP>/samples/<SAMPLE LOG>.log

These files describe the output for each sample individually. This is usually where errors are reported.

5.1.2 job-specific output

<PROJECT>/logs/<STEP>/commands/<JOB ID>.log

These files contain information about the specific run: when and where it was started, and, most importantly, whether it failed or succeeded. If it failed, the failed samples are usually reported.

5.1.3 resource output

<PROJECT>/logs/resources

These files record the resources that were required for a specific job. Very useful when we want to determine where a specific job can be run, to optimize the pipeline.
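
A quick way to scan for problems across the sample-specific logs (a sketch; substitute the actual project path and step name):

    grep -il 'error' <PROJECT>/logs/<STEP>/samples/*.log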

5.2 MOCAT features

Some of the MOCAT features I use most often are:

  • -noqueue: adding this option runs the job outside of the queue, without changing the config file
  • -x: setting this will only create the job files, but not actually submit the jobs to the queue
  • -cpus: changes the number of CPUs for the job. Setting this to 1 for e.g. assembly will obviously slow down that step
  • -tc: sets the maximum number of jobs that are being run in the queue at the same time (only works on SGE)
  • -config XXX=aaa: overrides config setting XXX and sets it to aaa
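
For example, to generate (but not submit) assembly job files with the CPU count made explicit, reusing the assembly command from Part II:

    mc -sf FILE -a screened.adapter.on.hg19 -memory 250G -cpus 8 -x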

Author: Jens Roat Kultima <kultima@jens.embl.de>

Date: 2013-10-22 11:28:43 CEST
