# Tutorial 1: Generating and exploring taxonomic profiles This tutorial requires the mOTU profiler to be correctly installed as described in the [Installation](install.md) section. This guide has been written assuming you use a Mac or Linux Command Line. Before we begin, we need to download some example paired-end short read sequencing data. Here we have forward (suffixed with "_1.fastq") and reverse reads (suffixed with "_2.fastq") from two metagenomic samples: A and B. ```bash #download sampleA read files wget https://zenodo.org/record/6517497/files/sampleA_1.fastq wget https://zenodo.org/record/6517497/files/sampleA_2.fastq #download sampleB read files wget https://zenodo.org/record/6517497/files/sampleB_1.fastq wget https://zenodo.org/record/6517497/files/sampleB_2.fastq ``` If you are using macOS, you need to use `curl`: ``` curl -o ``` Example: ```bash curl https://zenodo.org/record/6517497/files/sampleA_1.fastq -o sampleA_1.fastq ``` ## Using `motus profile` on a single metagenomic sample We want to create a species-level taxonomic abundance profile for a sample. You can use the `motus profile` command to create a taxonomic profile from a metagenomic sample consisting of one or more FASTQ-formatted quality-controlled read files. ```bash #creating a profile of sample A motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA.motus #creating a profile of sample B motus profile -f sampleB_1.fastq -r sampleB_2.fastq -n sampleB -o sampleB.motus ``` where the flags `-f` refers to the forward read file(s), `-r` refers to the reverse read file(s), `-n` refers to the name you wish to assign to this profile (this is important for merging multiple taxonomic profiles), `-o` refers to where the output of `motus profile` is written. > **Input metagenomic data format** >* Input sequencing read files must be FASTQ-formatted (compressed or uncompressed). BedTools can be used to translate SAM/BAM-formatted reads files to FASTQ formatted read files (**provide command**) >* In the case of paired-end reads, the number of reads in both files should be the same. >* The mOTU profiler can profile multiple paired-end runs from the same sample at a time if the matching paired-end files are provided in matching order for the respective optional arguments of the mOTUs command. >* Files with interleaved reads (forward and reverse reads in the same file) are not supported. However you can separate reads in the file with BBMap before using mOTUs (**provide command**). >* Unpaired sequencing read file(s) can also be submitted with the `-s` option. Unpaired reads are usually generated in single-end sequencing mode, or were overlapping paired-end that were merged or are singletons that lost their mate during quality control. > **Quality control of sequencing read files** > > While here we have used raw reads to generate a taxonomic profiles, a quality control of input sequencing data is recommended. This can involve removal of contamination, trimming of adapter or barcode sequences and removal of low-quality bases. > > As a guideline we recommend that all bases have a Phred quality score of at least 20 and that sequencing reads be at least 75 bases in length. However we have an option to adjust this length criteria if required (check out **Tutorial 2: Advanced usage** later in this page) ## Threading options The runtime of `mOTUs profile` command is limited by the computation of alignments of input sequences against the marker gene database using BWA. You can assign multiple cores (`-t` flag) to accelerate the alignment process: ```bash motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA.motus -t 8 ``` Parallelization on up to 16 threads improves runtime almost linearly; additional cores, however, result in sublinear speed-ups (check out image below). ![thread_graph](https://www.embl.de/download/zeller/milanese/pic_github/threads3.png) ## Exploring the taxonomic profile The resulting profile reports the relative abundance for each mOTU (ref + meta + ext) ```bash #output the first ten lines of the profile head -n 10 sampleA.motus ``` ``` # git tag version 3.0.1 | motus version 3.0.1 | map_tax 3.0.1 | gene database: nr3.0.1 | calc_mgc 3.0.1 -y insert.scaled_counts -l 75 | calc_motu 3.0.1 -k mOTU -C no_CAMI -g 3 | taxonomy: ref_mOTU_3.0.1 meta_mOTU_3.0.1 # call: python motus-3.0.1//motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA.motus #consensus_taxonomy sampleA Leptospira alexanderi [ref_mOTU_v3_00001] 0.0000000000 Leptospira weilii [ref_mOTU_v3_00002] 0.0000000000 Chryseobacterium sp. [ref_mOTU_v3_00004] 0.0000000000 Chryseobacterium gallinarum [ref_mOTU_v3_00005] 0.0000000000 Chryseobacterium indologenes [ref_mOTU_v3_00006] 0.0000000000 Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007] 0.0000000000 Chryseobacterium jejuense [ref_mOTU_v3_00008] 0.0000000000 ``` The first line indicates which version of `motus` and the database we used (`# git tag version 3.0.1 | motus version 3.0.1 | map_tax 3.0.1 | gene database: nr3.0.1 `), the parameters used for computing the mOTU abundances (`calc_mgc 3.0.1 -y insert.scaled_counts -l 75 |calc_motu 3.0.1 -k mOTU -C no_CAMI -g 3`), and the taxonomy version we used (`taxonomy: ref_mOTU_3.0.1 meta_mOTU_3.0.1`) . It is useful to check the the parameters used in order to reproduce the exact same profile. We will get into what the specific flags in the full command do in the next tutorial. The second line tells us what specific call we used to run `motus` from the command line with the information of the fastq files that were used. The third line gives us the header of two columns that result in a taxonomic profile namely the taxa name/id and sample name as specified by the `-n` flag. The output of mOTUs is after these first three lines: a tab-separated table where the first column lists the taxonomic identifiers, dy default at the rank of mOTUs (i.e., species level), and the second column correspond to their relative abundances in the sample. You can easily remove the first two rows with: ```bash #remove the first two rows of the output file tail -n+3 sampleA.motus ``` If you would like to take a look at the ten most abundant mOTUs: ```bash #output the ten most abundant motus sort -t$'\t' -k2 -rn sampleA.motus | head -n 10 ``` ``` Akkermansia species incertae sedis [meta_mOTU_v3_12805] 0.2908309155 Citrobacter sp. [ref_mOTU_v3_00096] 0.1313391619 Escherichia coli [ref_mOTU_v3_00095] 0.1036889168 Flavonifractor plautii [ref_mOTU_v3_02971] 0.0647546569 Ruthenibacterium lactatiformans [ref_mOTU_v3_04716] 0.0525975924 Oscillibacter species incertae sedis [ext_mOTU_v3_16336] 0.0404556273 Oscillibacter sp. [ref_mOTU_v3_04664] 0.0384171044 Clostridium sp. CAG:217 [meta_mOTU_v3_12270] 0.0374014634 unassigned 0.0336275544 Flavonifractor plautii [ref_mOTU_v3_05238] 0.0309776043 ``` The `unassigned` that you see refers to the fraction of unmapped reads which represents species that we know are present in a sample but are not able to quantify. There are many taxa that have a relative abundance of 0, meaning they are not detected in the sample. You can quickly check how many species were actually detected with: ```bash #number of species detected; this command also counts unassigned so subtract 1 from the result grep -c -v '0.0000000000\|#' sampleA.motus ``` We see a total of `96` taxa are detected. You can check how many ref-mOTUs (or meta-mOTUs or ext-mOTUs) were detected using these command: ```bash #keeping only taxa that were dectected grep -v '0.0000000000\|#' sampleA.motus > sampleA_detected.motus #counting the number of detected ref-mOTUs grep -c 'ref_mOTU_v3_' sampleA_detected.motus #counting the number of meta-mOTUs grep -c 'meta_mOTU_v3_' sampleA_detected.motus ``` `38` ref-mOTUs sand `20` `meta-mOTUs` were detected. ## Merging multiple taxonomic profiles In metagenomic studies, you typically want to explore the taxonomic profile of multiple samples at once. This is usually in the form of a table where the rows represent taxa, columns represent samples, each cell is the abundance of the taxa in the sample. This kind of taxa abundance table is also usually the basis of subsequent downstream analysis. Using the `motus merge` command, you can easily merge the profiles of multiple samples ```bash #merging profiles of sample A and sample B motus merge -i sampleA.motus,sampleB.motus -o merged.motus ``` The resulting merged profile consists of three columns: taxa id and abundance profiles of the two samples ```bash head -n 10 merged.motus ``` ``` # motus version 3.0.1 | merge 3.0.1 | info merged profiles: # git tag version 3.0.1 | motus version 3.0.1 | map_tax 3.0.1 | gene database: nr3.0.1 | calc_mgc 3.0.1 -y insert.scaled_counts -l 75 | calc_motu 3.0.1 -k mOTU -C no_CAMI -g 3 | taxonomy: ref_mOTU_3.0.1 meta_mOTU_3.0.1 # call: python motus-3.0.1//motus merge -i sampleA.motus,sampleB.motus -o merged.motus #consensus_taxonomy sampleA sampleB Leptospira alexanderi [ref_mOTU_v3_00001] 0.0000000000 0.0000000000 Leptospira weilii [ref_mOTU_v3_00002] 0.0000000000 0.0000000000 Chryseobacterium sp. [ref_mOTU_v3_00004] 0.0000000000 0.0000000000 Chryseobacterium gallinarum [ref_mOTU_v3_00005] 0.0000000000 0.0000000000 Chryseobacterium indologenes [ref_mOTU_v3_00006] 0.0000000000 0.0000000000 Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007] 0.0000000000 0.0000000000 Chryseobacterium jejuense [ref_mOTU_v3_00008] 0.0000000000 0.0000000000 ``` If you have many profiles that you would like to merge and they are all in the same folder DIR, you can do ```bash #merging all taxonomic profiles within directory DIR motus merge -d DIR -o merged.motus ``` > Note that the directory must only have mOTUs profiles. The command will fail if there are other types of files in the directory. > Of course you can always import the profile as a `tsv` and explore it using your favourite programming language. > To import the profile in R: you can run the following command: > `read.csv("sampleA.motus",header = T,sep = "\t",skip = 2, row.names = 1)` ## Some options to modify the resulting taxonomic profile ### Display read counts (`-c`) It is often that downstream analyses (like rarefaction or differential abundance testing) using the taxa abundance table require abundances as read counts. The `-c` flag changes the output from relative abundances to number of assigned reads. ```bash motus profile -c -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA.motus ``` Again, we can have a look at the mOTUs with the largest read counts ```bash sort -t$'\t' -k2 -rn sampleA.motus | head -n 10 ``` ``` Akkermansia species incertae sedis [meta_mOTU_v3_12805] 826 Citrobacter sp. [ref_mOTU_v3_00096] 373 Escherichia coli [ref_mOTU_v3_00095] 294 Flavonifractor plautii [ref_mOTU_v3_02971] 184 Ruthenibacterium lactatiformans [ref_mOTU_v3_04716] 149 Oscillibacter species incertae sedis [ext_mOTU_v3_16336] 115 Oscillibacter sp. [ref_mOTU_v3_04664] 109 Clostridium sp. CAG:217 [meta_mOTU_v3_12270] 106 unassigned 95 Flavonifractor plautii [ref_mOTU_v3_05238] 88 ``` ### Profile at a different taxonomic level You can change the taxonomic level at which to display abundances by using the `-k` flag: For example, if you are interested in overall abundances of phyla in sampleA: ```bash motus profile -f sampleA_1.fastq -r sampleA_2.fastq -n sampleA -o sampleA.motus -k phylum ``` Let's have a look at the most abundant phyla: ```bash sort -t$'\t' -k2 -rn sampleA_phylum.motus | head -n 10 ``` ``` Firmicutes 0.4223272722 Verrucomicrobia 0.2911835465 Proteobacteria 0.2351208666 unassigned 0.0336275544 Actinobacteria 0.0151475705 Bacteroidetes 0.0018493239 Synergistetes 0.0007438659 candidate division Zixibacteria 0.0000000000 candidate division WWE3 0.0000000000 candidate division WOR-3 0.0000000000 ``` > There are more options that influence the quality of the alignment (minimum length, percent identity) or change the output format and reported data (NCBI taxID, full rank, summarizing at a specific taxonomic rank). These options can be simply typing `motus profile` in your terminal window. We have a more comprehensive description of all the flags and parameters in [Advanced options](advanced_options.md).