# Tutorial 2: Advanced mOTUs usage

## Length filtering (`-l`)

This is the minimum alignment length between a read and a marker gene sequence required for the read to be considered in the subsequent counting step. The default minimum length cutoff is 75 bases.

> Adjusting `-l` is crucial when a substantial fraction of your sequencing reads are themselves shorter than the default of 75 bases, as is often the case with older sequencing technologies. For example, if your average read length is 45 bases, `-l` has to be smaller than 45.

The `-l` option also affects the mapping rate of reads to the marker gene database. You can increase the fraction of mapped reads by lowering `-l` (for example, using a cutoff of 50 bases for reads with an average length of 100). This might help in detecting more taxa at the cost of retaining misalignments. Raising the minimum length cutoff (such as setting `-l 90` when the average read length is 100 bp) will increase the stringency with which taxa are detected (since fewer reads will pass the filter) at the cost of missing taxa.

Let us explore how changing `-l` affects the number of detected species:

```bash
# l = 50; significantly shorter than the average read length of 98
motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA_l50.motus -l 50
# number of species detected; this command also counts unassigned so subtract 1 from the result
grep -c -v '0.0000000000\|#' sampleA_l50.motus
# 100 species were detected
```

```bash
# l = 90; note that the average read length is 98
motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA_l90.motus -l 90
grep -c -v '0.0000000000\|#' sampleA_l90.motus
# 92 species were detected
```

```bash
# l = 200; this is much longer than the average length of reads
motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA_l200.motus -l 200
grep -c -v '0.0000000000\|#' sampleA_l200.motus
# 0 species were detected; this is because not a single read passes the length filter.
```

## Presence/absence of marker gene clusters (`-g`)

Every mOTU is composed of a minimum of 6 and at most 10 marker gene clusters. The read count of the mOTU is calculated as the median of the non-zero read counts of the marker gene clusters. The `-g` option (possible values are integers between 1 and 10) controls the minimum number of detected marker gene clusters required to detect a mOTU. The default value is 3, meaning that at least 3 marker gene clusters need to have a non-zero read count for the mOTU to be considered present in the sample.

You can adjust the trade-off between precision and recall by fine-tuning `-g`. Higher values will increase precision (meaning fewer false positives) at the cost of recall (failing to detect true taxa). Lower values will identify more taxa but with lower precision (more false positives).

> Proceed with caution when using values above 6, as some mOTUs may become undetectable if they contain fewer marker gene clusters than the set value.

Let us explore how changing `-g` affects the number of detected species:

```bash
# g = 1
motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA_g1.motus -g 1
# number of species detected; this command also counts unassigned so subtract 1 from the result
grep -c -v '0.0000000000\|#' sampleA_g1.motus
# 239 species were detected
```

```bash
# g = 8
motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA_g8.motus -g 8
# number of species detected; this command also counts unassigned so subtract 1 from the result
grep -c -v '0.0000000000\|#' sampleA_g8.motus
# 37 species were detected
```

> **Precision**
>
> The fraction of taxa truly present in the sample out of all the detected taxa. Increasing precision will decrease the fraction of false positives among the detected taxa at the cost of recall (failing to detect true taxa).
>
> **Recall**
>
> The fraction of taxa detected out of all the taxa present in the sample.
> Increasing recall will allow us to detect low-abundance microbes at the cost of precision (more false positives).
>
> The combined use of options `-l` and `-g` in `motus profile` can fine-tune precision and recall. For example, if `-l` is low, you can compensate for misalignments by setting a higher value for `-g`.

## Quantification of mOTUs (`-y`)

There are three modes to quantify the abundance of marker gene clusters in mOTUs, which you can select with the `-y` option:

* `insert.scaled_counts` (default): normalizes the number of mapped inserts by marker gene length and then rescales them to the same range as the initial counts. The scaling factor is the sum of all mapped inserts divided by the sum of gene-length-normalized insert counts.
* `insert.raw_counts`: counts the number of inserts mapping to each gene within a marker gene cluster. These counts are affected by gene-length differences, as longer genes will recruit more reads.
* `base.coverage`: quantifies the average base coverage of the gene as the total alignment length of all reads mapping against the marker genes divided by the length of the associated marker genes. This accounts for both gene-length differences and varying read lengths.

Let us explore how the mOTUs profile changes if we use these different quantification modes:

```bash
# Using `insert.scaled_counts`
# this is also the default read counting algorithm
motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA_inssca.motus
# species with highest relative abundances (based on read counts scaled by gene length)
sort -t$'\t' -k2,2 -rn sampleA_inssca.motus | head -n 10
# result
# Akkermansia species incertae sedis [meta_mOTU_v3_12805]	0.2908309155
# Citrobacter sp. [ref_mOTU_v3_00096]	0.1313391619
# Escherichia coli [ref_mOTU_v3_00095]	0.1036889168
# Flavonifractor plautii [ref_mOTU_v3_02971]	0.0647546569
# Ruthenibacterium lactatiformans [ref_mOTU_v3_04716]	0.0525975924
# Oscillibacter species incertae sedis [ext_mOTU_v3_16336]	0.0404556273
# Oscillibacter sp. [ref_mOTU_v3_04664]	0.0384171044
# Clostridium sp. CAG:217 [meta_mOTU_v3_12270]	0.0374014634
# unassigned	0.0336275544
# Flavonifractor plautii [ref_mOTU_v3_05238]	0.0309776043
```

```bash
# Using `insert.raw_counts`
motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA_insraw.motus -y insert.raw_counts
# species with highest relative abundances (based on raw read counts)
sort -t$'\t' -k2,2 -rn sampleA_insraw.motus | head -n 10
# result
# Akkermansia species incertae sedis [meta_mOTU_v3_12805]	0.2841867197
# Citrobacter sp. [ref_mOTU_v3_00096]	0.1511182084
# Escherichia coli [ref_mOTU_v3_00095]	0.1102183629
# Flavonifractor plautii [ref_mOTU_v3_02971]	0.0683584272
# Ruthenibacterium lactatiformans [ref_mOTU_v3_04716]	0.0455631411
# Oscillibacter species incertae sedis [ext_mOTU_v3_16336]	0.0401828848
# Clostridium sp. CAG:217 [meta_mOTU_v3_12270]	0.0366754483
# Oscillibacter sp. [ref_mOTU_v3_04664]	0.0360993941
# Flavonifractor plautii [ref_mOTU_v3_05238]	0.0281884539
# unassigned	0.0260374774
```

```bash
# Using `base.coverage`
motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA_basecov.motus -y base.coverage
# species with highest base coverage
sort -t$'\t' -k2,2 -rn sampleA_basecov.motus | head -n 10
# result
# Akkermansia species incertae sedis [meta_mOTU_v3_12805]	0.2259902579
# Citrobacter sp. [ref_mOTU_v3_00096]	0.1871926478
# Escherichia coli [ref_mOTU_v3_00095]	0.0576114815
# Oscillibacter species incertae sedis [ext_mOTU_v3_16336]	0.0549728846
# Clostridium sp. CAG:217 [meta_mOTU_v3_12270]	0.0538317335
# Flavonifractor plautii [ref_mOTU_v3_02971]	0.0537337415
# unassigned	0.0418818211
# Oscillibacter sp. [ref_mOTU_v3_04664]	0.0408552295
# Ruthenibacterium lactatiformans [ref_mOTU_v3_04716]	0.0359480354
# Lachnospiraceae species incertae sedis [ext_mOTU_v3_16428]	0.0320283941
```

> You can also perform single-nucleotide variant (SNV) profiling and long read profiling with the mOTUs tool. Check out [SNV profiling](SNV_profiling.md) and [Long read profiling](long_reads.md) for more information.
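
The species counts in the `-l` and `-g` examples rely on `grep` plus a manual subtraction for the unassigned row. A minimal sketch of a helper that does the subtraction for you is shown here. It assumes the profile is tab-separated, that header lines start with `#`, that the second column holds the abundance, and that the unclassified fraction sits in a row named `unassigned`, as the outputs in this tutorial suggest. The function name `count_detected_motus` is hypothetical and not part of the mOTUs tool.

```bash
# Count detected mOTUs in a single-sample profile written by `motus profile`.
# Assumptions: tab-separated file, header lines start with '#',
# column 2 is the abundance, and the unclassified fraction is reported
# in a row named "unassigned". The helper name is hypothetical.
count_detected_motus() {
    awk -F'\t' '
        /^#/               { next }          # skip header lines
        $1 == "unassigned" { next }          # skip the unassigned row
        ($2 + 0) > 0       { n++ }           # non-zero abundance: count it
        END                { print n + 0 }   # print 0 when nothing was counted
    ' "$1"
}
```

Under these assumptions, `count_detected_motus sampleA_g1.motus` should print the number of detected species directly, so comparing runs with different `-l` or `-g` values no longer needs the "subtract 1" step.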