Processing Metagenomes using MOCAT
Table of Contents
1 Preface
This document is up to date as of .
This guide describes how metagenomes can be processed using MOCAT to calculate mOTU and taxonomic profiles.
2 Initial comments and questions
What does 'padded' mean in the database name? 'padded' means that the marker gene sequences in the database have been extended with up to 100 bp at each end. This is done to ensure that reads close to the ends of gene sequences also map to the genes. The '.coord' file contains information on where the genes start and end, and this file is parsed when calculating the exact coverages.
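The idea of using the '.coord' file to ignore hits in the padding can be illustrated with a short sketch. The file formats and names here (gene_id/start/end, read_id/gene_id/position) are invented for the example and are not MOCAT's actual formats:

```shell
# Toy '.coord' file: gene id, unpadded start, unpadded end (assumed layout).
printf 'geneA\t101\t400\n' > genes.coord
# Toy hits file: read id, gene id, alignment position (assumed layout).
printf 'read1\tgeneA\t50\nread2\tgeneA\t250\n' > hits.tsv
# Keep only hits whose position lies inside the unpadded gene coordinates,
# i.e. discard hits that fall entirely in the padding.
awk 'NR==FNR { start[$1]=$2; end[$1]=$3; next }
     $3 >= start[$2] && $3 <= end[$2]' genes.coord hits.tsv
```

Here only read2 survives, since read1's position (50) falls in the padding before the gene start at 101.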
3 Part I - Summary of commands
# basic processing
MOCAT.pl -sf FILE -rtf
MOCAT.pl -sf FILE -sff adapter -r reads.processed
MOCAT.pl -sf FILE -s hg19 -r adapter -memory 21G -identity 90

# assembly and gene prediction
MOCAT.pl -sf FILE -a -r screened.adapter.on.hg19 -memory 250G
MOCAT.pl -sf FILE -ar -r screened.adapter.on.hg19 -memory 250G
MOCAT.pl -sf FILE -gp assembly.revised -r screened.adapter.on.hg19

# mOTU profiling
MOCAT.pl -sf FILE -s mOTU.nr.padded -r screened.adapter.on.hg19 -cpus 4 -identity 97 -no_screened_files -config screen_soap_max_mm=5
MOCAT.pl -sf FILE -f mOTU.nr.padded -r screened.adapter.on.hg19 -identity 97 -config filter_psort_buffer=2G -memory 5G
MOCAT.pl -sf FILE -p mOTU.nr.padded -r screened.adapter.on.hg19 -mode mOTU -identity 97 -memory 3G -o RESULTS

# taxonomic profiling
MOCAT.pl -sf FILE -s Ref10MGv9.padded -r screened.adapter.on.hg19.on.mOTU.nr.padded -e -only_map -cpus 4 -identity 97 -config screen_soap_max_mm=5
MOCAT.pl -sf FILE -f Ref10MGv9.padded -r screened.adapter.on.hg19.on.mOTU.nr.padded -e -identity 97 -config filter_psort_buffer=2G -memory 5G
MOCAT.pl -sf FILE -p Ref10MGv9.padded -r screened.adapter.on.hg19.on.mOTU.nr.padded -e -identity 97 -mode RefMG -previous_db_calc_tax_stats_file -memory 3G -o RESULTS
4 Part II - Generating profiles
4.1 Main end products of all processing steps
The 3 main end products of all these processing steps in MOCAT are:
- predicted genes on revised assemblies for each sample
- mOTU profiles based on marker gene abundances for each project
- taxonomic profiles based on marker gene abundances for each project
4.2 Main processing steps
Creating the final output can be divided into 3 major processing steps:
- basic (3 steps)
- assembly and gene prediction (3 steps)
- taxonomic profiling (taxonomic and mOTU profiles) (7 steps)
4.3 Basic
4.3.1 Read trim filter
First, the samples are processed by the read trim filter step, which removes low-quality reads.
- Command
mc -sf FILE -rtf -config MOCAT_LSF_memory_limit=3000
- Memory requirements and other recommendations
Less than 1GB. This step uses 1 CPU.
- Parameters
None.
- Input
fq.gz files in the sample folder, for example:
lane1.1.fq.gz lane1.2.fq.gz
lane2.1.fq.gz lane2.2.fq.gz
lane8.1.fq.gz lane8.2.fq.gz
- Output
Pair 1, 2 and single fq.gz files in
reads.processed.solexaqa
4.3.2 Screen reads against adapters
Using USearch, reads are mapped against a fasta file with adapter sequences. Any read matching any of these sequences is removed.
- Standard Illumina adapter sequences
For our examples we have pasted these sequences into a single file called 'adapter' and placed it in the data folder.
>fwd
AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
>fwd_rc
CGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
>rev
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
>rev_rc
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
- Command
mc -sf FILE -sff adapter -r reads.processed
- Memory requirements and other recommendations
<1GB. This step uses 3 CPUs.
- Parameters
- -sff adapter: This means we map reads against the 'adapter' file.
- -r reads.processed: this could actually be omitted, but it means we use the processed reads to map against the DB.
- Input
Processed reads in:
reads.processed.solexaqa
- Output
Note that each screen step produces three output folders: mapped, screened and extracted folders.
The files with aligned reads are here:
reads.mapped.adapter.solexaqa
Then, from the processed reads, reads NOT matching the database are stored in the screened folder:
reads.screened.adapter.solexaqa
And reads MATCHING the database are stored in the extracted folder:
reads.extracted.adapter.solexaqa
IMPORTANT COMMENT: YES, this does produce redundancy, in that the same information is stored over and over again: we first have the raw reads, then the processed reads, and then the screened reads. This is solved by REMOVING the reads.processed.solexaqa folders (and also other folders) once they are no longer needed.
After this step is complete, we can remove the processed reads (has to be done manually):
reads.processed.solexaqa
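The manual cleanup can be sketched as a loop over the sample file. The sample names, folder layout, and the assumption that the sample file (here called FILE) lists one sample folder per line are illustrative:

```shell
# Toy setup: two sample folders with the redundant processed-reads folder.
mkdir -p SAMPLE1/reads.processed.solexaqa SAMPLE2/reads.processed.solexaqa
printf 'SAMPLE1\nSAMPLE2\n' > FILE
# Remove reads.processed.solexaqa for every sample listed in FILE.
while read -r sample; do
  rm -rf "$sample/reads.processed.solexaqa"
done < FILE
```

Only run something like this after verifying that the screened folders are complete, since the removal is irreversible.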
4.3.3 Remove eukaryotic reads
Using SOAPaligner we map reads to the custom made hg19 database at 90% identity. Any read matching the database is removed.
- Command
mc -sf FILE -s hg19 -r adapter -memory 21G -identity 90
- Memory requirements and other recommendations
This depends on the sample. The actual mapping does not require much memory (<10G), but a hash is built inside Perl to extract the reads that did and did not match. So if a sample has many eukaryotic reads, this hash will be large and the sample will require more memory. This step uses 8 CPUs. As an example, with the current setting presented here, a sample with many eukaryotic reads will probably crash and need to be restarted with more memory.
- Parameters
- -s hg19: we're screening against this database.
- -r adapter: we want to use the reads that had been screened against the adapter DB.
- -memory 21G: This could be set higher if needed
Note that there are some hidden parameters in the config file that play an important role
- [screen_soap_max_mm : 10]: This is the maximum number of mismatches allowed for a read. Since we want to map at 90% identity, this has to be set to at least 10, because in theory we could have reads up to 100 bp.
- Input
The adapter screened reads
reads.screened.adapter.solexaqa
- Output
The output is again three folders with the mapping files (SOAP format this time), and the screened and extracted reads.
Reads NOT matching the DB (these are the ones we want to use):
reads.screened.screened.adapter.on.hg19.solexaqa
Reads matching the DB:
reads.extracted.screened.adapter.on.hg19.solexaqa
SOAP output files:
reads.mapped.screened.adapter.on.hg19.solexaqa
These SOAP files are NOT filtered at 90% identity, but include all reads that matched at the set mismatch level. However, after the reads are mapped and the extracted and screened folders are generated, the reads that do not match at at least 90% identity are filtered out.
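The relationship between the mismatch cap and the identity cutoff can be sketched with a toy example: for reads of up to 100 bp with screen_soap_max_mm=10, the lowest identity the mapper can report is 100*(100-10)/100 = 90%, and the subsequent filter keeps only alignments at >=90% identity. The column layout below (read id, read length, mismatches) is invented for the sketch and is not the real SOAP output format:

```shell
# Toy alignment table: read id, read length, number of mismatches.
printf 'read1\t100\t3\nread2\t100\t12\n' > aln.tsv
# Keep alignments with identity = 100*(len-mm)/len of at least 90%.
awk '100 * ($2 - $3) / $2 >= 90' aln.tsv
```

Here read1 (97% identity) is kept and read2 (88% identity) is discarded.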
4.4 Assembly and gene prediction
4.4.1 Assembly
We assemble the adapter and human screened reads.
- Command
mc -sf FILE -a screened.adapter.on.hg19 -memory 250G
- Memory requirements and other recommendations
Memory use varies a lot between samples. Some samples can be assembled using 60GB RAM; some require more than 600GB RAM. A setting of 250GB will assemble MOST samples. This step uses 8 cores.
- Parameters
- -a screened.adapter.on.hg19: this defines the reads that did NOT match the adapters and also did NOT match the hg19 DB.
- -memory 250G: This gives us 250G of RAM
- Input
The reads not matching any of the two DBs.
reads.screened.screened.adapter.on.hg19.solexaqa
- Output
The output is assembled contigs, scaffolds and scaftigs in the assembly folder. We always use the scaftigs.
assembly.screened.adapter.on.hg19.solexaqa.K35/
Note that the number 35 here will vary for each sample.
4.4.2 Assembly revision
The assembled scaftigs are revised.
- Command
mc -sf FILE -ar screened.adapter.on.hg19 -memory 250G
- Memory requirements and other recommendations
Same as for assembly. Some samples require more memory than others. This step uses 8 CPUs.
- Parameters
- -ar screened.adapter.on.hg19: this specifies that we want to use the assembly based on these reads. The scaftigs are used by default (the contigs cannot be used).
- Input
Files located in:
assembly.screened.adapter.on.hg19.solexaqa.K35
- Output
Revised scaftigs, located:
assembly.revised.screened.adapter.on.hg19.solexaqa.K35
4.4.3 Gene prediction
We then predict genes on the revised scaftigs. This is done using MetaGeneMark (can be changed in config file).
- Command
mc -sf FILE -gp assembly.revised -r screened.adapter.on.hg19
- Memory requirements and other recommendations
Doesn't use much memory. 1 CPU is used.
- Parameters
- -gp assembly.revised: we want to use the revised assemblies
- -r screened.adapter.on.hg19: we also need to specify that these revised assemblies were based on these reads.
- Input
revised scaftigs in:
assembly.revised.screened.adapter.on.hg19.solexaqa.K35/
- Output
Predicted genes in fasta (nucleotide and protein) format in:
gene.prediction.assembly.revised.screened.adapter.on.hg19.solexaqa.K35.MetaGeneMark.500
Note that this is the 'end' of this fork, i.e. there are no further functions in MOCAT that process these gene predictions.
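A quick sanity check on the gene predictions is to count fasta header lines, which gives the number of predicted genes. The file name below is illustrative; the real files sit in the gene.prediction.* folder:

```shell
# Toy fasta file with two predicted genes (illustrative name and content).
printf '>gene1\nATGAAA\n>gene2\nTTGCCC\n' > predicted.genes.fna
# Count predicted genes by counting fasta header lines.
grep -c '^>' predicted.genes.fna
```

For the real gzipped output files, `zgrep -c '^>'` does the same job.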
4.5 Taxonomic profiles
The mOTU profiles have to be created before the taxonomic profiles, because the taxonomic profiles require files generated in the previous steps (creating the mOTU profiles).
4.5.1 Map reads against mOTU database
The first step is to map the high quality reads against the mOTU database. This database varies, but for the cancer project it is the mOTU.nr.padded database.
- Command
mc -sf FILE -s mOTU.nr.padded -r screened.adapter.on.hg19 -cpus 4 -identity 97 -no_screened_files -config screen_soap_max_mm=5
- Memory requirements and other recommendations
These databases are relatively small because they only contain marker genes, so this step can be run on most computers. Memory requirement is usually <4GB. Normally a screen uses 8 CPUs, but we change this to only 4, because we can then run more samples at the same time.
- Parameters
- -s mOTU.nr.padded: this is the name of the database we want to map against
- -r screened.adapter.on.hg19: we're using the high quality reads
- -cpus 4: we want to use 4 CPUs instead of 8
- -identity 97: in the next steps we want to use reads that have at least 97% identity to the genes in the DB
- -no_screened_files: normally both the extracted and screened files are generated. When mapping against this DB, most reads will not match it, and because we don't use the non-matching reads, we don't want to produce the screened folder.
- -config screen_soap_max_mm=5: by setting this to 5, instead of 10 as previously when screening against hg19, we speed up the mapping a bit. A read is at most 100 bp and 5 of those can mismatch, meaning the lowest possible identity when mapping is now 95%. Had we set it to 10 mm, it would be 90%, but those extra reads would be filtered out anyway. Thus setting it to 5 speeds things up a little.
- Input
The HQ processed, adapter and hg19 screened reads:
reads.screened.screened.adapter.on.hg19.solexaqa
- Output
We only get the mapped data files and also the extracted reads. The extracted reads are the ones that DID match the database:
reads.extracted.screened.screened.adapter.on.hg19.on.mOTU.nr.padded.solexaqa
reads.mapped.screened.screened.adapter.on.hg19.on.mOTU.nr.padded.solexaqa
4.5.2 Filter reads mapped against mOTU.nr.padded
We want to produce mOTU profiles based on the reads matching this database.
The next step in MOCAT is a filtering step. This is a design feature of MOCAT; in theory this step could be excluded, but it exists because sometimes we want to combine multiple databases (not in this case, but we still have to run it).
This step filters reads from the SOAP mapping files into a filtered file.
- Command
mc -sf FILE -f mOTU.nr.padded -r screened.adapter.on.hg19 -identity 97 -config filter_psort_buffer=2G -memory 5G
- Memory requirements and other recommendations
Because the DB is small, this step can be run using <5GB. 3 CPUs are used.
- Parameters
- -f mOTU.nr.padded: this is the name of the database we mapped against
- -r screened.adapter.on.hg19: we're using the high quality reads
- -identity 97: in the next steps we want to use reads that have at least 97% identity to the genes in the DB. Here we could filter at a higher % than we originally mapped at, but because we later use some of these reads to map against a second DB, it's important for our purposes that we map (previous step) and filter (this step) at the same %.
- -config filter_psort_buffer=2G: psort is used to sort reads in a specific order. Because the result files are usually small, we specify only a little memory here. For other applications we would set a higher value. Specifying '-config' means we want to override this setting from the config file.
- -memory 5G: because we set 2GB for psort, we still need another few GB for other processes.
- Input
The SOAP mapping file from the previous step:
reads.mapped.screened.screened.adapter.on.hg19.on.mOTU.nr.padded.solexaqa
- Output
The output is a custom sorted BAM file in this folder:
reads.filtered.mOTU.nr.padded.solexaqa
4.5.3 Generate mOTU profiles
The filtered reads are passed to a custom script that calculates the insert and base coverages of each gene in the database, and also summarizes these into mOTU profiles.
- Command
mc -sf FILE -p mOTU.nr.padded -r screened.adapter.on.hg19 -identity 97 -mode mOTU -memory 3G -o RESULTS
- Memory requirements and other recommendations
This step can be run on most computers. 1 CPU is used.
- Parameters
- -p mOTU.nr.padded: this is the name of the database we mapped against
- -r screened.adapter.on.hg19: we're using the high quality reads
- -identity 97: We need to specify the same % cutoff as above
- -mode mOTU: setting this to mOTU will generate the mOTU profiles in addition to only gene coverages
- Input
The BAM file in:
reads.filtered.mOTU.nr.padded.solexaqa
- Output
Running -p without -mode mOTU will generate these folders with gene abundances:
base.coverage.mOTU.nr.padded.solexaqa
insert.coverage.mOTU.nr.padded.solexaqa
By setting the mode to mOTU, we also generate these mOTU abundances:
motu.profiles.mOTU.nr.padded.solexaqa
The MAIN output are combined tables that are stored in the PROJECT FOLDER, not in the individual sample folders:
../motu.profiles
../abundance.tables
This said, the most used files, the actual profiles, are stored in the RESULTS folder:
../RESULTS/annotated.mOTU.abundances.gz
../RESULTS/mOTU.abundances.gz
../RESULTS/annotated.mOTU.counts.gz
../RESULTS/mOTU.counts.gz
../RESULTS/mOTU.v1.map.txt
../RESULTS/mOTU-LG.v1.annotations.txt
Inside the motu.profiles folder, the file ending with '.insert.mm.dist.among.unique.scaled.mOTU.gz' contains the number of inserts matching the different COGs. This file is parsed by an R script to generate the 4 .tab files. These files have the number of inserts mapping to annotated species clusters and to all species clusters. The 'fractions' files have these abundances as fractions of the total sum.
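The 'fractions' transformation can be sketched as dividing each count by the column total. The two-column toy table below is an assumption for the sketch; the real .tab files have one column per sample:

```shell
# Toy count table: species cluster, insert count (assumed layout).
printf 'clusterA\t30\nclusterB\t70\n' > counts.tsv
# First pass over the file sums the counts, second pass prints each
# count as a fraction of that total.
awk 'NR==FNR { sum += $2; next } { printf "%s\t%.2f\n", $1, $2 / sum }' counts.tsv counts.tsv
```

This yields 0.30 for clusterA and 0.70 for clusterB, i.e. each abundance as a fraction of the total sum.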
4.5.4 Map reads from mOTU database against Ref10MGv9 DB
The mOTU profiles are taxonomic profiles, but are not based on standard NCBI taxonomic levels. Because we want to generate taxonomic profiles summarized at species and genus level, we have to map the reads that matched the mOTU database against a database with only taxonomically annotated marker genes.
- Command
mc -sf FILE -s Ref10MGv9.padded -r screened.screened.adapter.on.hg19.on.mOTU.nr.padded -e -only_map -cpus 4 -identity 97 -config screen_soap_max_mm=5
- Memory requirements and other recommendations
This mapping step requires less memory than the previous one and can be run on any server. We also only use 4 CPUs.
- Parameters
- -s Ref10MGv9.padded: this is the name of the database with marker genes from only taxonomically annotated sequences
- -r screened.screened.adapter.on.hg19.on.mOTU.nr.padded: we want to use not the "normal" high quality reads, but actually the reads that mapped to the previous database.
This database is actually only a subset of the previous database, so any read not matching the previous database won't match this database. So using ALL original HQ reads isn't necessary.
- -e: this option IS VERY IMPORTANT. Specifying it means we want the extracted, and not the screened, reads from the previous DB mapping. When we mapped reads against hg19 earlier we wanted the reads that did NOT match the database, but now we want the reads that DO match, so we specify -e to use these reads instead.
- -only_map: specifying this means the extracted and screened reads are not generated during the mapping, because we will not use those files. The next step, filtering, uses the mapping file, and not the actual reads. But we couldn't specify this in the previous screen against the mOTU.nr.padded DB, because we needed the extracted reads from that DB to map against this DB. :)
- -cpus 4: Again, only 4 CPUs
- -identity 97: We also map at 97% here
- -config screen_soap_max_mm=5: because we map at 97% we can, again, speed things up a little by overriding this setting and set it to 5 instead of 10
- Input
Input are the reads that matched the mOTU.nr.padded DB
reads.extracted.screened.screened.adapter.on.hg19.on.mOTU.nr.padded.solexaqa
- Output
Output is only the mapped file, because we don't generate the screened or extracted folders:
reads.mapped.extracted.screened.screened.adapter.on.hg19.on.mOTU.nr.padded.on.Ref10MGv9.padded.solexaqa
4.5.5 Filter the mapped reads to Ref10MGv9.padded
Next is to filter these reads. Again, this is technically unnecessary for this application, because we already mapped at 97%, but the SOAP file has to be converted into a sorted BAM file.
- Command
mc -sf FILE -f Ref10MGv9.padded -r screened.screened.adapter.on.hg19.on.mOTU.nr.padded -e -identity 97 -config filter_psort_buffer=2G -memory 5G
- Memory requirements and other recommendations
This filtering step requires less than 5GB RAM and uses 3 CPUs. It can be run on epsilon.
- Parameters
By now you should know them.
- Input
The mapped reads:
reads.mapped.extracted.screened.screened.adapter.on.hg19.on.mOTU.nr.padded.on.Ref10MGv9.padded.solexaqa
- Output
Sorted BAM file in:
reads.filtered.Ref10MGv9.padded.solexaqa
4.5.6 Create taxonomic (RefMG) profiles
The filtered BAM file is processed to generate gene abundances and also the taxonomic profiles. Again, this is done in two steps, first with -no_paste, and then when we have processed a larger number of samples like this, we can summarize them (done in the next step).
- Command
mc -sf FILE -p Ref10MGv9.padded -r screened.screened.adapter.on.hg19.on.mOTU.nr.padded -e -identity 97 -mode RefMG -previous_db_calc_tax_stats_file -memory 3G -o RESULTS
- Memory requirements and other recommendations
The required memory here is less than 3GB. 1 CPU is used.
- Parameters
- -mode RefMG: by specifying this mode we generate the taxonomic profiles. Note that we could not run mode mOTU here, because there are specific files attached to each of these two databases, and the databases are formatted in specific ways so that the different modes work with different databases.
- -previous_db_calc_tax_stats_file: this is an important technical detail. By setting this we change the total number of mapped reads from the total that matched the previous database to the total that mapped within the coordinates of the genes of the previous database. The difference is not very big if this is omitted.
- Input
The BAM file in:
reads.filtered.Ref10MGv9.padded.solexaqa
- Output
We produce both gene abundance tables and taxonomic profiles in this step. The gene abundance tables are never used directly, but they are used internally by the algorithm to create the taxonomic abundance tables.
taxonomic.profiles.Ref10MGv9.padded.solexaqa
base.coverage.Ref10MGv9.padded.solexaqa
insert.coverage.Ref10MGv9.padded.solexaqa
In the PROJECT FOLDER we get the MAIN output; summarized taxonomic profiles, and also gene abundance tables (that we're not using):
../taxonomic.profiles
../abundance.tables
This said, the most used files, the actual profiles, are stored in the RESULTS folder:
../RESULTS/NCBI.species.abundances.tab.gz
5 Part III - useful tips
5.1 Log files
Everything MOCAT does is stored in log files. The good thing is that it's fairly easy to trace an error; the bad thing is that there will be a lot of files. I personally use the following log files:
5.1.1 Sample-specific output
<PROJECT>/logs/<STEP>/samples/<SAMPLE LOG>.log
These files describe the output for each sample individually. This is usually where errors are reported.
5.1.2 Job-specific output
<PROJECT>/logs/<STEP>/commands/<JOB ID>.log
These files contain information about the specific run: when and where it was started and, most importantly, whether it failed or succeeded. If it failed, the failed samples are usually reported.
5.1.3 Resource output
<PROJECT>/logs/resources
These files list the resources that were required for a specific job. Very useful when deciding where a specific job can be run, to optimize the pipeline.
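A quick way to find failed samples is to grep the sample logs for error messages. The layout below mimics the log structure described above, with toy log files standing in for real ones:

```shell
# Toy log layout: one step folder with two sample logs (illustrative).
mkdir -p logs/screen/samples
printf 'all good\n' > logs/screen/samples/SAMPLEA.log
printf 'ERROR: soap crashed\n' > logs/screen/samples/SAMPLEB.log
# List all logs that mention an error, case-insensitively.
grep -ril 'error' logs/
```

The same pattern works on `<PROJECT>/logs/<STEP>/samples/` to spot failed samples across a step.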
5.2 MOCAT features
Some of the MOCAT features I use most often are:
- -noqueue: adding this option runs the job outside of the queue, without changing the config file
- -x: setting this will only create the job files, but not actually submit the jobs to the queue
- -cpus: changes the number of CPUs for the job. Setting this to 1 for e.g. the assembly will obviously slow that step down.
- -tc: sets the maximum number of jobs that are being run in the queue at the same time (only works on SGE)
- -config XXX=aaa: overrides config setting XXX and sets it to aaa
Date: 2013-10-22 11:28:43 CEST
HTML generated by org-mode 6.33x in emacs 23