Collection and formatting of sequence data

The prerequisite for generating multiple sequence alignments (MSAs) is a collection of DNA or protein sequences. It is the researcher's responsibility to collect the sequences of interest and to format them so that they can be used as input to MSA programmes.

Here, we will use hemoglobin protein sequences from human, mouse and chicken as input to NGPhylogeny. For illustration, we will run the “One-click workflow” with default settings.
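As a minimal sketch of the formatting step, the Python snippet below writes a few sequences in FASTA format, a plain-text format that alignment programmes commonly accept. The function name `to_fasta`, the identifiers and the short sequences are hypothetical placeholders, not real hemoglobin sequences; replace them with the sequences you collected.

```python
def to_fasta(records, width=60):
    """Format (identifier, sequence) pairs as FASTA text.

    Each record becomes a '>' header line followed by the sequence,
    wrapped to `width` characters per line.
    """
    lines = []
    for identifier, sequence in records:
        # Strip all whitespace and normalise to upper case.
        cleaned = "".join(sequence.split()).upper()
        if not cleaned:
            raise ValueError(f"empty sequence for {identifier!r}")
        lines.append(f">{identifier}")
        # Wrap the sequence into fixed-width chunks.
        lines.extend(cleaned[i:i + width] for i in range(0, len(cleaned), width))
    return "\n".join(lines) + "\n"


# Placeholder records (NOT real hemoglobin sequences).
records = [
    ("human_placeholder", "MVLSPADKT"),
    ("mouse_placeholder", "mvlsgedks"),
    ("chicken_placeholder", "MVLS AADKN"),
]

print(to_fasta(records), end="")
```

The resulting text can be saved to a file or pasted into an input form. Short, unique identifiers without spaces tend to cause the fewest problems downstream.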

Multiple sequence alignment (MSA)

As a first step, the sequences are aligned with MAFFT (see also Section Alignment). Alignments are usually displayed with sequences as rows and nucleotide (DNA) or amino acid (protein) positions as columns. Over generations, mutations change nucleotides; a nucleotide change also changes the amino acid when it causes a non-synonymous substitution in the affected codon. Insertions and deletions appear as gaps, denoted by hyphens, in one or more sequences of the alignment.

../../_images/phylo_msa.png
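The row-and-column view described above can be sketched in a few lines of Python. The toy alignment, its identifiers and the helper names `columns` and `describe_column` are assumptions made for illustration; they are not output from MAFFT or NGPhylogeny.

```python
# Toy alignment: every row is one sequence, '-' marks a gap.
alignment = {
    "seq_A": "MVLSPADKT",
    "seq_B": "MVLSGEDKS",
    "seq_C": "MV-SAADKN",
}


def columns(aligned):
    """Return the alignment columns as strings, one per position."""
    rows = list(aligned.values())
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    # zip(*rows) transposes rows (sequences) into columns (positions).
    return ["".join(column) for column in zip(*rows)]


def describe_column(column):
    """Classify a column as 'gap', 'conserved' or 'variable'."""
    if "-" in column:
        return "gap"
    return "conserved" if len(set(column)) == 1 else "variable"


for position, column in enumerate(columns(alignment), start=1):
    print(position, column, describe_column(column))
```

Reading the alignment column by column is what makes positions comparable across sequences: each column is treated as a set of residues that occupy the same position in the different sequences.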

MSA curation

The quality of an MSA is important for the accuracy of phylogenetic inference. As the number of sequences and their divergence increase (i.e., when they come from evolutionarily more distant organisms), alignments become more likely to contain errors. Manual curation can become challenging; furthermore, not every position in the alignment is necessarily phylogenetically informative (N.B.: can you think of reasons why?). Several bioinformatic tools are dedicated to the curation of MSAs. By default, NGPhylogeny uses BMGE.

../../_images/phylo_msa_curation.png
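To give a feel for what automated curation does, the sketch below removes alignment columns that consist mostly of gaps and lists columns in which all residues are identical. This is a deliberately simple rule and not the method BMGE implements, which scores columns with an entropy-based measure; the function names, the threshold and the toy alignment are all assumptions for illustration.

```python
def trim_gappy_columns(sequences, max_gap_fraction=0.5):
    """Drop columns whose fraction of gaps exceeds `max_gap_fraction`."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("aligned sequences must all have the same length")
    keep = [
        i for i in range(length)
        if sum(seq[i] == "-" for seq in sequences) / len(sequences) <= max_gap_fraction
    ]
    return ["".join(seq[i] for i in keep) for seq in sequences]


def constant_columns(sequences):
    """Return indices of columns whose non-gap residues are all identical."""
    return [
        i for i, column in enumerate(zip(*sequences))
        if len(set(column) - {"-"}) <= 1
    ]


# Toy alignment (placeholder data): the third column is mostly gaps.
toy_alignment = ["MV-SPADKT", "MV-SGEDKS", "MVLSAADKN"]

trimmed = trim_gappy_columns(toy_alignment, max_gap_fraction=0.5)
print(trimmed)
print(constant_columns(trimmed))
```

Constant columns carry no information for distinguishing the sequences from one another, which is one possible answer to the question raised above.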