# Tutorial 1: Generating and exploring taxonomic profiles

This tutorial requires the mOTU profiler to be correctly installed as described in the [Installation](install.md) section. 

This guide has been written assuming you use a Mac or Linux Command Line. 


Before we begin, we need to download some example paired-end short read sequencing data. Here we have forward (suffixed with "_1.fastq") and reverse reads (suffixed with "_2.fastq") from two metagenomic samples: A and B. 

```bash
#download sampleA read files
wget https://zenodo.org/record/6517497/files/sampleA_1.fastq
wget https://zenodo.org/record/6517497/files/sampleA_2.fastq

#download sampleB read files
wget https://zenodo.org/record/6517497/files/sampleB_1.fastq
wget https://zenodo.org/record/6517497/files/sampleB_2.fastq
```

If you are using macOS, you need to use `curl`:
```
curl <link> -o <file name>
```

Example:

```bash
curl https://zenodo.org/record/6517497/files/sampleA_1.fastq -o sampleA_1.fastq
```
## Using `motus profile` on a single metagenomic sample

We want to create a species-level taxonomic abundance profile for a sample.  
You can use the `motus profile` command to create a taxonomic profile from a metagenomic sample consisting of one or more FASTQ-formatted quality-controlled read files.  


```bash
#creating a profile of sample A
motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA.motus

#creating a profile of sample B
motus profile -f sampleB_1.fastq -r sampleB_2.fastq -n sampleB -o sampleB.motus
```

where the flags 

`-f` refers to the forward read file(s),

`-r` refers to the reverse read file(s), 

`-n` refers to the name you wish to assign to this profile (this is important for merging multiple taxonomic profiles),

`-o` refers to where the output of `motus profile` is written.


> **Input metagenomic data format**
>* Input sequencing read files must be FASTQ-formatted (compressed or uncompressed). BedTools can be used to translate SAM/BAM-formatted reads files to FASTQ formatted read files (**provide command**)
>*  In the case of paired-end reads, the number of reads in both files should be the same. 
>* The mOTU profiler can profile multiple paired-end runs from the same sample at a time if the matching paired-end files are provided in matching order for the respective optional arguments of the mOTUs command. 
>* Files with interleaved reads (forward and reverse reads in the same file) are not supported. However you can separate reads in the file with BBMap before using mOTUs (**provide command**).
>* Unpaired sequencing read file(s) can also be submitted with the `-s` option. Unpaired reads are usually generated in single-end sequencing mode, or were overlapping paired-end that were merged or are singletons that lost their mate during quality control. 


> **Quality control of sequencing read files**
> 
> While here we have used raw reads to generate a taxonomic profiles, a quality control of input sequencing data is recommended. This can involve removal of contamination, trimming of adapter or barcode sequences and removal of low-quality bases. 
> 
> As a guideline we recommend that all bases have a Phred quality score of at least 20 and that sequencing reads be at least 75 bases in length. However we have an option to adjust this length criteria if required (check out **Tutorial 2: Advanced usage** later in this page)

## Threading options

The runtime of `mOTUs profile` command is limited by the computation of alignments of input sequences against the marker gene database using BWA. 

You can assign multiple cores (`-t` flag) to accelerate the alignment process: 
 
```bash
motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA.motus -t 8
```

Parallelization on up to 16 threads improves runtime almost linearly; additional cores, however, result in sublinear speed-ups (check out image below).

![thread_graph](https://www.embl.de/download/zeller/milanese/pic_github/threads3.png)

## Exploring the taxonomic profile

The resulting profile reports the relative abundance for each mOTU (ref + meta + ext)

```bash
#output the first ten lines of the profile
head -n 10 sampleA.motus
```

```
# git tag version 3.0.1 |  motus version 3.0.1 | map_tax 3.0.1 | gene database: nr3.0.1 | calc_mgc 3.0.1 -y insert.scaled_counts -l 75 | calc_motu 3.0.1 -k mOTU -C no_CAMI -g 3 | taxonomy: ref_mOTU_3.0.1 meta_mOTU_3.0.1
# call: python motus-3.0.1//motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA.motus
#consensus_taxonomy	sampleA
Leptospira alexanderi [ref_mOTU_v3_00001]	0.0000000000
Leptospira weilii [ref_mOTU_v3_00002]	0.0000000000
Chryseobacterium sp. [ref_mOTU_v3_00004]	0.0000000000
Chryseobacterium gallinarum [ref_mOTU_v3_00005]	0.0000000000
Chryseobacterium indologenes [ref_mOTU_v3_00006]	0.0000000000
Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007]	0.0000000000
Chryseobacterium jejuense [ref_mOTU_v3_00008]	0.0000000000
```

The first line indicates which version of `motus` and the database we used (`# git tag version 3.0.1 |  motus version 3.0.1 | map_tax 3.0.1 | gene database: nr3.0.1 `), the parameters used for computing the mOTU abundances (`calc_mgc 3.0.1 -y insert.scaled_counts -l 75 |calc_motu 3.0.1 -k mOTU -C no_CAMI -g 3`), and the taxonomy version we used (`taxonomy: ref_mOTU_3.0.1 meta_mOTU_3.0.1`) . It is useful to check the the parameters used in order to reproduce the exact same profile. We will get into what the specific flags in the full command do in the next tutorial. 

The second line tells us what specific call we used to run `motus` from the command line with the information of the fastq files that were used.

The third line gives us the header of two columns that result in a taxonomic profile namely the taxa name/id and sample name as specified by the `-n` flag. 

The output of mOTUs is after these first three lines: a tab-separated table where the first column lists the taxonomic identifiers, dy default at the rank of mOTUs (i.e., species level), and the second column correspond to their relative abundances in the sample. 

You can easily remove the first two rows with:

```bash
#remove the first two rows of the output file
tail -n+3 sampleA.motus
```

If you would like to take a look at the ten most abundant mOTUs:

```bash
#output the ten most abundant motus
sort -t$'\t' -k2 -rn sampleA.motus | head -n 10
```
```
Akkermansia species incertae sedis [meta_mOTU_v3_12805]	0.2908309155
Citrobacter sp. [ref_mOTU_v3_00096]	0.1313391619
Escherichia coli [ref_mOTU_v3_00095]	0.1036889168
Flavonifractor plautii [ref_mOTU_v3_02971]	0.0647546569
Ruthenibacterium lactatiformans [ref_mOTU_v3_04716]	0.0525975924
Oscillibacter species incertae sedis [ext_mOTU_v3_16336]	0.0404556273
Oscillibacter sp. [ref_mOTU_v3_04664]	0.0384171044
Clostridium sp. CAG:217 [meta_mOTU_v3_12270]	0.0374014634
unassigned	0.0336275544
Flavonifractor plautii [ref_mOTU_v3_05238]	0.0309776043
```
The `unassigned` that you see refers to the fraction of unmapped reads which represents species that we know are present in a sample but are not able to quantify.


There are many taxa that have a relative abundance of 0, meaning they are not detected in the sample. 

You can quickly check how many species were actually detected with:

```bash
#number of species detected; this command also counts unassigned so subtract 1 from the result
grep -c -v '0.0000000000\|#' sampleA.motus
```
We see a total of `96` taxa are detected. 

You can check how many ref-mOTUs (or meta-mOTUs or ext-mOTUs) were detected using these command: 

```bash
#keeping only taxa that were dectected 
grep -v '0.0000000000\|#' sampleA.motus > sampleA_detected.motus
#counting the number of detected ref-mOTUs
grep -c 'ref_mOTU_v3_' sampleA_detected.motus
#counting the number of meta-mOTUs
grep -c 'meta_mOTU_v3_' sampleA_detected.motus
```
`38` ref-mOTUs sand `20` `meta-mOTUs` were detected. 


## Merging multiple taxonomic profiles

In metagenomic studies, you typically want to explore the taxonomic profile of multiple samples at once. This is usually in the form of a table where the rows represent taxa, columns represent samples, each cell is the abundance of the taxa in the sample. This kind of taxa abundance table is also usually the basis of subsequent downstream analysis. 

Using the `motus merge` command, you can easily merge the profiles of multiple samples


```bash
#merging profiles of sample A and sample B
motus merge -i sampleA.motus,sampleB.motus -o merged.motus
```

The resulting merged profile consists of three columns: taxa id and abundance profiles of the two samples
```bash
head -n 10 merged.motus
```

```
# motus version 3.0.1 | merge 3.0.1 | info merged profiles: # git tag version 3.0.1 |  motus version 3.0.1 | map_tax 3.0.1 | gene database: nr3.0.1 | calc_mgc 3.0.1 -y insert.scaled_counts -l 75 | calc_motu 3.0.1 -k mOTU -C no_CAMI -g 3 | taxonomy: ref_mOTU_3.0.1 meta_mOTU_3.0.1 
# call: python motus-3.0.1//motus merge -i sampleA.motus,sampleB.motus -o merged.motus
#consensus_taxonomy	sampleA	sampleB
Leptospira alexanderi [ref_mOTU_v3_00001]	0.0000000000	0.0000000000
Leptospira weilii [ref_mOTU_v3_00002]	0.0000000000	0.0000000000
Chryseobacterium sp. [ref_mOTU_v3_00004]	0.0000000000	0.0000000000
Chryseobacterium gallinarum [ref_mOTU_v3_00005]	0.0000000000	0.0000000000
Chryseobacterium indologenes [ref_mOTU_v3_00006]	0.0000000000	0.0000000000
Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007]	0.0000000000	0.0000000000
Chryseobacterium jejuense [ref_mOTU_v3_00008]	0.0000000000	0.0000000000
```

If you have many profiles that you would like to merge and they are all in the same folder DIR, you can do

```bash
#merging all taxonomic profiles within directory DIR
motus merge -d DIR -o merged.motus
```

> Note that the directory must only have mOTUs profiles. The command will fail if there are other types of files in the directory. 


> Of course you can always import the profile as a `tsv` and explore it using your favourite programming language.
> To import the profile in R: you can run the following command:
> `read.csv("sampleA.motus",header = T,sep = "\t",skip = 2, row.names = 1)`


## Some options to modify the resulting taxonomic profile

### Display read counts (`-c`)

 It is often that downstream analyses (like rarefaction or differential abundance testing) using the taxa abundance table require abundances as read counts. The `-c` flag changes the output from relative abundances to number of assigned reads. 

```bash
motus profile -c -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA.motus 
```
Again, we can have a look at the mOTUs with the largest read counts

```bash
sort -t$'\t' -k2 -rn sampleA.motus | head -n 10
```

```
Akkermansia species incertae sedis [meta_mOTU_v3_12805]	826
Citrobacter sp. [ref_mOTU_v3_00096]	373
Escherichia coli [ref_mOTU_v3_00095]	294
Flavonifractor plautii [ref_mOTU_v3_02971]	184
Ruthenibacterium lactatiformans [ref_mOTU_v3_04716]	149
Oscillibacter species incertae sedis [ext_mOTU_v3_16336]	115
Oscillibacter sp. [ref_mOTU_v3_04664]	109
Clostridium sp. CAG:217 [meta_mOTU_v3_12270]	106
unassigned	95
Flavonifractor plautii [ref_mOTU_v3_05238]	88
```

### Profile at a different taxonomic level 

You can change the taxonomic level at which to display abundances by using the `-k` flag:

For example, if you are interested in overall abundances of phyla in sampleA:

```bash
motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA.motus -k phylum
```
Let's have a look at the most abundant phyla:

```bash
sort -t$'\t' -k2 -rn sampleA_phylum.motus | head -n 10
```

```
Firmicutes	0.4223272722
Verrucomicrobia	0.2911835465
Proteobacteria	0.2351208666
unassigned	0.0336275544
Actinobacteria	0.0151475705
Bacteroidetes	0.0018493239
Synergistetes	0.0007438659
candidate division Zixibacteria	0.0000000000
candidate division WWE3	0.0000000000
candidate division WOR-3	0.0000000000
```

> There are more options that influence the quality of the alignment (minimum length, percent identity) or change the output format and reported data (NCBI taxID, full rank, summarizing at a specific taxonomic rank). These options can be simply typing `motus profile` in your terminal window. We have a more comprehensive description of all the flags and parameters in [Advanced options](advanced_options.md).