# Tutorial 2: Advanced mOTUs usage


## Length filtering (`-l`)

This is the minimum alignment length between a read and a marker gene sequence required for the read to be considered in the subsequent counting step. The default minimum length cutoff is 75 bases.

> Adjusting `-l` is crucial when a substantial fraction of your sequencing reads are shorter than the default of 75 bases, as is often the case with older sequencing technologies. For example, if your average read length is 45 bases, `-l` has to be smaller than 45.

The `-l` option also affects the mapping rate of reads to the marker gene database. You can increase the fraction of mapped reads by lowering `-l` (for example using a cutoff of 50 bases for reads with an average length of 100). This might help in detecting more taxa at the cost of retaining misalignments. Raising the minimum length cutoff (such as having `-l 90` when the average read length is 100 bp) will increase the stringency with which taxa are detected (since fewer reads will pass the filter) at the cost of missing taxa. 
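
Before choosing a value for `-l`, it helps to know how long your reads actually are. The following is a minimal sketch, not part of mOTUs, that reports the mean read length and the fraction of reads shorter than a given cutoff. It assumes an uncompressed FASTQ file with exactly four lines per record; `read_length_stats` is a hypothetical helper name.

```bash
# Hypothetical helper, not part of mOTUs: summarize read lengths in a FASTQ file.
# Assumes an uncompressed FASTQ with exactly four lines per record.
# Usage: read_length_stats <fastq> [cutoff]   (cutoff defaults to 75)
read_length_stats() {
    local fastq=$1 cutoff=${2:-75}
    awk -v cutoff="$cutoff" '
        NR % 4 == 2 {                          # sequence line of each record
            n++
            total += length($0)
            if (length($0) < cutoff) short++
        }
        END {
            if (n == 0) { print "no reads found"; exit 1 }
            printf "reads=%d mean_length=%.1f shorter_than_%d=%.1f%%\n",
                   n, total / n, cutoff, 100 * short / n
        }
    ' "$fastq"
}

# Example (commented out so the block runs without the tutorial data):
# read_length_stats sampleA_1.fastq 75
```

If a large share of reads falls below the cutoff you plan to use, those reads cannot pass the length filter regardless of how well they align.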

Let us explore how changing `-l` affects the number of detected species:

```bash
# l = 50; significantly shorter than the average read length of 98
motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA_l50.motus -l 50

# number of species detected; this command also counts unassigned so subtract 1 from the result
grep -c -v '0.0000000000\|#' sampleA_l50.motus

# 100 species were detected
```

```bash
# l = 90; note that the average read length is 98 

motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA_l90.motus -l 90

grep -c -v '0.0000000000\|#' sampleA_l90.motus
# 92 species were detected
```
```bash
# l = 200; this is much longer than the average length of reads
motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA_l200.motus -l 200

grep -c -v '0.0000000000\|#' sampleA_l200.motus
# 0 species were detected; this is because not a single read passes the length filter.
```
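
The counting step repeated above can be wrapped in a small shell function. This is a sketch that assumes the two-column, tab-separated profile layout shown later in this tutorial (mOTU name, then abundance) with header lines starting with `#`; `count_detected` is a hypothetical helper name. Unlike the `grep` one-liner, it skips the `unassigned` row, so nothing needs to be subtracted.

```bash
# Hypothetical helper: count mOTUs with a non-zero abundance in a profile.
# Assumes tab-separated columns (name, abundance) and '#' header lines.
count_detected() {
    awk -F'\t' '
        /^#/ { next }                   # skip header lines
        $1 == "unassigned" { next }     # unassigned is not a species
        $2 + 0 > 0 { n++ }              # non-zero abundance means detected
        END { print n + 0 }
    ' "$1"
}

# Example sweep over several cutoffs (commented out; needs motus and the tutorial data):
# for l in 50 75 90; do
#     motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o "sampleA_l${l}.motus" -l "$l"
#     echo "l=${l}: $(count_detected "sampleA_l${l}.motus") species"
# done
```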

## Presence/absence of marker gene clusters (`-g`)

Every mOTU is composed of a minimum of 6 and at most 10 marker gene clusters. The read count of the mOTU is calculated as the median of the non-zero read counts of the marker gene clusters.

The `-g` option (possible values are integers between 1 and 10) controls the minimum number of detected marker gene clusters required to detect a mOTU. The default value is 3, meaning that at least 3 marker gene clusters need to have a non-zero read count for the mOTU to be considered present in the sample. You can adjust the trade-off between precision and recall by fine-tuning `-g`. Higher values will increase precision (meaning fewer false positives) at the cost of recall (failing to detect true taxa). Lower values will identify more taxa but with lower precision (more false positives).
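
To make the detection rule concrete, here is a minimal sketch of the logic described above, applied to made-up read counts for the marker gene clusters of a single mOTU. It is an illustration only, not the mOTUs implementation, and `motu_call` is a hypothetical helper name. It counts the clusters with a non-zero read count, compares that number to the value of `-g`, and reports the median of the non-zero counts. Averaging the two middle values when the number of non-zero counts is even is an assumption.

```bash
# Illustration only, not the mOTUs implementation.
# Usage: motu_call <g> <count_1> <count_2> ...
# Prints "present <median>" or "absent" for one mOTU.
motu_call() {
    local g=$1
    shift
    printf '%s\n' "$@" | awk '$1 > 0' | sort -n | awk -v g="$g" '
        { v[NR] = $1 }                  # non-zero counts, sorted ascending
        END {
            if (NR < g) { print "absent"; exit }
            # median of the non-zero counts
            # (assumption: mean of the two middle values if their number is even)
            m = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
            print "present", m
        }
    '
}

motu_call 3 0 4 0 7 0 0 5 0 0 0   # three non-zero clusters, median 5
motu_call 4 0 4 0 7 0 0 5 0 0 0   # only three non-zero clusters, fewer than g = 4
```

The same ten counts lead to a detected mOTU with `-g 3` and to an undetected one with `-g 4`, which is why raising `-g` trades recall for precision.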

> Proceed with caution when using values above 6, as some mOTUs may become undetectable if they contain fewer marker gene clusters than the set value. 


Let us explore how changing `-g` affects the number of detected species:

```bash
# g = 1
motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA_g1.motus -g 1

# number of species detected; this command also counts unassigned so subtract 1 from the result
grep -c -v '0.0000000000\|#' sampleA_g1.motus

# 239 species were detected
```
```bash
# g = 8
motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA_g8.motus -g 8

# number of species detected; this command also counts unassigned so subtract 1 from the result
grep -c -v '0.0000000000\|#' sampleA_g8.motus

# 37 species were detected
```
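
To see which mOTUs drop out when `-g` is raised, you can compare the sets of detected mOTUs in two profiles. The sketch below assumes tab-separated profiles with the mOTU name in the first column, the abundance in the second, and header lines starting with `#`; `lost_motus` is a hypothetical helper name.

```bash
# Hypothetical helper: list mOTUs detected in the first profile but not in the second.
# Assumes tab-separated columns (name, abundance) and '#' header lines.
# Usage: lost_motus <profile_a> <profile_b>
lost_motus() {
    # awk reads profile_b first to remember what it detects, then filters profile_a.
    awk -F'\t' '
        /^#/ || $1 == "unassigned" { next }
        NR == FNR { if ($2 + 0 > 0) second[$1] = 1; next }
        $2 + 0 > 0 && !($1 in second) { print $1 }
    ' "$2" "$1"
}

# Example (commented out; needs the profiles generated above):
# lost_motus sampleA_g1.motus sampleA_g8.motus | head
```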

> **Precision** 
>
> The fraction of taxa truly present in the sample out of all the detected taxa. Increasing precision will decrease the fraction of false positives among the detected at the cost of recall (failing to detect true taxa).
>
> **Recall**
> 
> The fraction of taxa detected out of all the taxa present in the sample. Increasing recall will allow us to detect low-abundance microbes at the cost of precision (more false positives).
> 
> The combined use of options `-l` and `-g` in `motus profile` can fine-tune precision and recall. For example, if `-l` is low, you can compensate for misalignments by setting a higher value for `-g`.


## Quantification of mOTUs (`-y`)

There are three modes to quantify the abundance of marker gene clusters in mOTUs which you can select with the `-y` option:

* `insert.scaled_counts` (default): normalizes the number of mapped inserts by marker gene length and then rescales the result to the same range as the initial counts, i.e., multiplies it by the sum of all mapped inserts divided by the sum of the gene length-normalized insert counts.
* `insert.raw_counts`: counts the number of inserts mapping to each gene within a marker gene cluster. These counts are affected by gene-length differences, as longer genes will recruit more reads. 
* `base.coverage`: quantifies the average base coverage of a gene by dividing the total alignment length of all reads mapping to the marker gene by the length of that gene. This accounts for both gene length differences and varying read lengths.
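
The rescaling behind `insert.scaled_counts` is easier to follow with numbers. The sketch below applies the description above to made-up gene lengths and insert counts; it is an illustration, not the mOTUs implementation, and details of the real calculation may differ. `scale_counts` is a hypothetical helper name.

```bash
# Illustration only, with made-up numbers; not the mOTUs implementation.
# Input columns (tab-separated): gene name, gene length, raw insert count.
scale_counts() {
    awk -F'\t' '
        {
            name[NR] = $1
            norm[NR] = $3 / $2               # length-normalized insert count
            raw_total += $3
            norm_total += norm[NR]
        }
        END {
            if (norm_total == 0) exit        # nothing mapped
            factor = raw_total / norm_total  # rescale to the range of the raw counts
            for (i = 1; i <= NR; i++)
                printf "%s\t%.2f\n", name[i], norm[i] * factor
        }
    '
}

# Two genes with the same raw count; the shorter gene gets the higher scaled count.
printf 'geneA\t1000\t100\ngeneB\t500\t100\n' | scale_counts
```

With these inputs, the scaled counts are 66.67 for `geneA` and 133.33 for `geneB`. They still add up to the 200 inserts that were mapped, but the gene that is half as long now carries twice the weight.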

Let us explore how the mOTUs profile changes if we use these different quantification modes:


```bash
# Using `insert.scaled_counts`

# this is also the default read counting algorithm
motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA_inssca.motus

# species with highest relative abundances (based on read counts scaled by gene length)
sort -t$'\t' -k2 -rn sampleA_inssca.motus | head -n 10

#result
Akkermansia species incertae sedis [meta_mOTU_v3_12805]	0.2908309155
Citrobacter sp. [ref_mOTU_v3_00096]	0.1313391619
Escherichia coli [ref_mOTU_v3_00095]	0.1036889168
Flavonifractor plautii [ref_mOTU_v3_02971]	0.0647546569
Ruthenibacterium lactatiformans [ref_mOTU_v3_04716]	0.0525975924
Oscillibacter species incertae sedis [ext_mOTU_v3_16336]	0.0404556273
Oscillibacter sp. [ref_mOTU_v3_04664]	0.0384171044
Clostridium sp. CAG:217 [meta_mOTU_v3_12270]	0.0374014634
unassigned	0.0336275544
Flavonifractor plautii [ref_mOTU_v3_05238]	0.0309776043
```


```bash
# Using `insert.raw_counts`

motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA_insraw.motus -y insert.raw_counts

# species with highest relative abundances (based on raw read counts)
sort -t$'\t' -k2 -rn sampleA_insraw.motus | head -n 10

#result
Akkermansia species incertae sedis [meta_mOTU_v3_12805]	0.2841867197
Citrobacter sp. [ref_mOTU_v3_00096]	0.1511182084
Escherichia coli [ref_mOTU_v3_00095]	0.1102183629
Flavonifractor plautii [ref_mOTU_v3_02971]	0.0683584272
Ruthenibacterium lactatiformans [ref_mOTU_v3_04716]	0.0455631411
Oscillibacter species incertae sedis [ext_mOTU_v3_16336]	0.0401828848
Clostridium sp. CAG:217 [meta_mOTU_v3_12270]	0.0366754483
Oscillibacter sp. [ref_mOTU_v3_04664]	0.0360993941
Flavonifractor plautii [ref_mOTU_v3_05238]	0.0281884539
unassigned	0.0260374774
```


```bash
# Using `base.coverage`
motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA_basecov.motus -y base.coverage

#species with highest base coverage
sort -t$'\t' -k2 -rn sampleA_basecov.motus | head -n 10

#result
Akkermansia species incertae sedis [meta_mOTU_v3_12805]	0.2259902579
Citrobacter sp. [ref_mOTU_v3_00096]	0.1871926478
Escherichia coli [ref_mOTU_v3_00095]	0.0576114815
Oscillibacter species incertae sedis [ext_mOTU_v3_16336]	0.0549728846
Clostridium sp. CAG:217 [meta_mOTU_v3_12270]	0.0538317335
Flavonifractor plautii [ref_mOTU_v3_02971]	0.0537337415
unassigned	0.0418818211
Oscillibacter sp. [ref_mOTU_v3_04664]	0.0408552295
Ruthenibacterium lactatiformans [ref_mOTU_v3_04716]	0.0359480354
Lachnospiraceae species incertae sedis [ext_mOTU_v3_16428]	0.0320283941
```
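
To compare the three quantification modes directly, it can be convenient to place the profiles side by side. The following sketch assumes tab-separated profiles with the mOTU name in the first column, the abundance in the second, and header lines starting with `#`; `compare_profiles` is a hypothetical helper name, and empty input files are not handled.

```bash
# Hypothetical helper: print the abundance of each mOTU in several profiles side by side.
# Assumes tab-separated columns (name, abundance) and '#' header lines.
# Usage: compare_profiles <profile_1> <profile_2> ...
compare_profiles() {
    awk -F'\t' '
        FNR == 1 { nfiles++ }                # a new input file starts
        /^#/ { next }                        # skip header lines
        {
            if (!($1 in seen)) { seen[$1] = 1; order[++n] = $1 }
            value[$1, nfiles] = $2
        }
        END {
            for (i = 1; i <= n; i++) {
                line = order[i]
                for (f = 1; f <= nfiles; f++)
                    line = line "\t" (((order[i], f) in value) ? value[order[i], f] : "NA")
                print line
            }
        }
    ' "$@"
}

# Example (commented out; needs the three profiles generated above):
# compare_profiles sampleA_inssca.motus sampleA_insraw.motus sampleA_basecov.motus \
#     | sort -t$'\t' -k2,2 -rn | head -n 10
```

Each output row holds one mOTU followed by one column per profile, in the order the files were given; `NA` marks a mOTU that is missing from a profile.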

> You can also perform single-nucleotide variant (SNV) profiling and long read profiling with the mOTUs tool. Check out [SNV profiling](SNV_profiling.md) and [Long read profiling](long_reads.md) for more information.