## FAQ

### 1. How many mOTUs read counts should I expect to have?
The number of mOTUs read counts (obtained with the `-c` option) is roughly proportional to the library size (the number of reads in the fastq files).
The two have a Pearson correlation of 0.88. The plot below shows what to expect for human fecal samples:

<img src="https://www.embl.de/download/zeller/milanese/pic_github/n_read_info.png" width="500">

Note that we counted all reads here, so each paired-end read pair (one read in the forward fastq file and one in the reverse fastq file) is counted as two separate reads.

Expected mOTUs read counts by library size:

| Total number of reads (million) | Median mOTUs count |
| ------------- | ------------- |
| 5  | 600  |
| 8  | 900  |
| 15  | 1,900  |
| 25  | 3,300  |
| 35  | 5,500  |
| 50  | 8,800  |
| 100  | 13,000  |
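
For library sizes that fall between the rows of the table, a rough estimate can be read off by interpolation. The sketch below is only an illustration: the function name `expected_motus_count` is hypothetical, linear interpolation between medians is an assumption rather than something mOTUs does, and the result is a ballpark figure for human fecal samples, not a prediction.

```python
from bisect import bisect_left

# (total reads in millions, median mOTUs count), copied from the table above.
EXPECTED = [(5, 600), (8, 900), (15, 1900), (25, 3300),
            (35, 5500), (50, 8800), (100, 13000)]


def expected_motus_count(reads_million):
    """Linearly interpolate the table (an assumption made for illustration).

    Returns None outside the tabulated range of 5 to 100 million reads.
    """
    sizes = [size for size, _ in EXPECTED]
    if reads_million < sizes[0] or reads_million > sizes[-1]:
        return None
    i = bisect_left(sizes, reads_million)
    if sizes[i] == reads_million:
        # Exact table row: return the tabulated median.
        return float(EXPECTED[i][1])
    (x0, y0), (x1, y1) = EXPECTED[i - 1], EXPECTED[i]
    return y0 + (y1 - y0) * (reads_million - x0) / (x1 - x0)


print(expected_motus_count(20))
```

A sample with far fewer counts than this estimate is a hint to look at the filtering options discussed in the next question.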

### 2. Why do only a few reads map in my profiles?

One possibility is that all the reads are filtered out. By default, the mOTUs profiler discards every read that aligns over fewer than 75 nucleotides (the default minimum alignment length is `-l 75`). This can happen with old metagenomic samples whose fastq reads are about 50 nucleotides long on average. Keep `-l` smaller than the average read length (in this case `-l 45`) so that more reads pass the filter.  
Note that the tool prints a warning to *stderr* if your average read length is shorter than `-l`, like so:
```
Warning: Average read length (50) is lower than the -l filter (75). We suggest to decrease the value of -l
```
You can also add `-g 1` to keep more reads (see heading `Adjusting precision and recall` in [Tutorial 2](documentation/tutorial.md) for more information).  
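
To check the average read length before profiling, something like the following sketch could be used. It is not part of mOTUs: the function names and the `margin` of 5 nucleotides are hypothetical (chosen so that an average of 50 gives the `-l 45` from the example above), and the parser assumes an uncompressed fastq with exactly four lines per record.

```python
def mean_read_length(fastq_lines):
    """Mean sequence length of a 4-line-per-record fastq (assumed layout)."""
    total = 0
    n_reads = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the second line of each record holds the sequence
            total += len(line.strip())
            n_reads += 1
    return total / n_reads if n_reads else 0.0


def suggest_min_alignment_length(mean_len, default=75, margin=5):
    """Suggest a value for -l that stays below the average read length.

    `margin` is a hypothetical safety gap, not a value defined by mOTUs.
    """
    if mean_len > default:
        return default
    return max(1, int(mean_len) - margin)


# Two made-up reads of 50 nucleotides each.
records = ["@r1", "A" * 50, "+", "I" * 50,
           "@r2", "C" * 50, "+", "I" * 50]
print(suggest_min_alignment_length(mean_read_length(records)))
```

For a real file, pass an open file handle instead of the list, for example `mean_read_length(open("sample.fastq"))`, where `sample.fastq` is a placeholder name.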

Another possibility is that your samples are not well represented by the reference genomes and MAGs used to build the mOTUs database. Although we used almost 700,000 genomes from more than 10,000 samples, your samples might come from a completely new environment with species that have never been seen before (see [here](https://doi.org/10.1101/2021.04.20.440600) for the environments that were used in mOTUs3). In this case, you could contribute your own genomes to extend the mOTUs database (and consequently the ability to profile new species) with the [mOTUs-extender tool](https://github.com/motu-tool/mOTUs-extender).

### 3. What is the meaning of the unassigned fraction?
The ```unassigned``` at the end of the profile file represents the fraction of unmapped reads. This represents species that we know to be present in the sample, but are not able to quantify individually. Hence we group them together into an ```unassigned``` fraction. For almost all subsequent analyses, it is better to remove this value, since it does not represent a single species/clade. The usefulness of the `unassigned` fraction is shown when we need to calculate relative abundances. See the following example:
```
 True rel. ab.      mOTUs read counts      mOTUs rel. ab.
species1   20%        species1    200     species1    20%
species2   10%        species3    300     species3    30%
species3   30%        species4    100     species4    10%
species4   10%        unassigned  400     unassigned  40%
species5   30%
```
In the example, the sample (True rel. ab.) contains 5 species, of which only 3 are represented in the mOTUs profiler. Despite this, the relative abundances of these 3 species are correct, because the unassigned fraction accounts for the unmapped reads. If you calculated the relative abundances without the unassigned fraction, you would overestimate the profiled species:
```
 True rel. ab.     mOTUs read counts       mOTUs rel. ab.
species1   20%        species1   200     species1   33.3%
species2   10%        species3   300     species3     50%
species3   30%        species4   100     species4   16.7%
species4   10%
species5   30%
```
For your analysis (for example comparing healthy controls to diseased samples), you will use `species1:20%; species3:30%; species4:10%` and remove the `unassigned` (after calculating the relative abundances).
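
The order of operations matters: normalise first, then drop `unassigned`. The sketch below reproduces the numbers from the example with a plain dictionary of read counts. The function name is hypothetical, and reading the counts from an actual profile file is left out.

```python
# Read counts from the example above, including the unassigned fraction.
counts = {"species1": 200, "species3": 300, "species4": 100, "unassigned": 400}


def relative_abundance(read_counts):
    """Divide every count by the total, keeping `unassigned` in the total."""
    total = sum(read_counts.values())
    return {name: count / total for name, count in read_counts.items()}


# Correct: normalise with `unassigned`, then remove it for the analysis.
rel = relative_abundance(counts)
rel_species = {name: ab for name, ab in rel.items() if name != "unassigned"}

# Incorrect: removing `unassigned` before normalising inflates every species.
no_unassigned = {n: c for n, c in counts.items() if n != "unassigned"}
inflated = relative_abundance(no_unassigned)

for name in rel_species:
    print(f"{name}: {rel_species[name]:.1%} (inflated: {inflated[name]:.1%})")
```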


### 4. Where can I find the taxonomy annotation for each mOTUs?

You can download the latest mOTUs database from [this link](https://zenodo.org/record/5140350). When you unzip the file, you can find the following files:
- `db_mOTU_taxonomy_meta-mOTUs.tsv`, the taxonomy for the meta-mOTUs, with a total of 8 columns: first column is the mOTUs ID and then the 7 levels (kingdom to species);
- `db_mOTU_taxonomy_ref-mOTUs.tsv`, the taxonomy of the ref-mOTUs, with a total of 9 columns: first column is the specI ID (as in http://progenomes.embl.de/), second column is the ref-mOTUs ID and then the 7 levels (kingdom to species);
- `db_mOTU_taxonomy_ref-mOTUs_short_names.tsv`, a file with 3 columns: ref-mOTUs ID, short name and full name. The full name corresponds to the last column of `db_mOTU_taxonomy_ref-mOTUs.tsv`. Since many of these names are very long, we created a shorter version (second column), which is the one printed by the profiler by default. With the `-u` option, mOTUs prints the full species name instead.

Note that if you have already installed mOTUs, the database and the files are already present in your system.
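
As an illustration of the 8-column layout described above, the meta-mOTUs taxonomy could be loaded into a dictionary along these lines. This is a sketch under assumptions: the function name is hypothetical, the five intermediate rank names are the standard ones between kingdom and species, and whether the file starts with a header line should be checked (hence the `has_header` switch). The rows in the demo are invented placeholders, not real database entries.

```python
import csv
import io

# Seven levels from kingdom to species; intermediate names are assumed.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def read_meta_motus_taxonomy(handle, has_header=True):
    """Map each mOTUs ID (column 1) to a dict of its 7 taxonomic levels."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header line if the file has one
    taxonomy = {}
    for row in reader:
        if len(row) != 8:
            continue  # ignore lines that do not match the 8-column layout
        taxonomy[row[0]] = dict(zip(RANKS, row[1:]))
    return taxonomy


# Invented placeholder content, only to show the call.
demo = io.StringIO(
    "id\tkingdom\tphylum\tclass\torder\tfamily\tgenus\tspecies\n"
    "placeholder_1\tk1\tp1\tc1\to1\tf1\tg1\ts1\n"
    "placeholder_2\tk1\tp2\tc2\to2\tf2\tg2\ts2\n"
)
taxonomy = read_meta_motus_taxonomy(demo)
print(taxonomy["placeholder_1"]["genus"])
```

With a real database, replace `demo` with an open handle to `db_mOTU_taxonomy_meta-mOTUs.tsv`.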


### 5. How can I contact you if I have questions?

Feature requests / bug reports should be submitted via the official [GitHub](https://github.com/motu-tool/mOTUs/issues) repository.