Multiple sequence alignment (MSA)

As we will cover in later sections, there are situations in which you want to compare and align multiple sequences all at once. This is a much harder problem to solve than pairwise alignment. In fact, producing a truly optimal alignment is not feasible within a reasonable computational time, so MSA software relies on heuristics, and the approach taken depends on what is already known about the relationships between the sequences. We will look at two approaches that make few assumptions about the sequences to be aligned and that are used by a lot of MSA software.

Progressive alignment

This approach builds a final MSA by combining pairwise alignments, starting with the two closest sequences and working towards the most distantly related. The problem with this method is that a part of the alignment that was optimal when it was introduced early in the process might no longer be so good once other sequences join the MSA.

One popular implementation of this method is MAFFT, available here. We have also made the software available on our server and will show you the basics of how to use it here. At minimum, MAFFT requires an input file with multiple sequences in fasta format. By default it prints the alignment to the command line (standard output), so we must redirect it to a file.

# Run MAFFT
mafft my_sequences.fasta > my_msa.fasta

The output format is fasta by default but can be set to Clustal format (explained below) with --clustalout. Other options relate to the speed or accuracy of the aligner. You can read more in the MAFFT manual if interested.
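Since the aligners need an input file with multiple sequences, it can be worth counting the records in a fasta file before aligning it. The following is a minimal sketch using only standard shell tools; the helper name count_fasta_records and the demo sequences are made up for illustration and are not part of MAFFT.

```shell
# Hypothetical helper: count fasta records by counting header lines (those starting with ">")
count_fasta_records() {
    grep -c '^>' "$1"
}

# Made-up demo input with three short protein sequences
demo_fasta=$(mktemp)
cat > "$demo_fasta" <<'EOF'
>seq1
MKTAYIAKQR
>seq2
MKTLAYIAKQR
>seq3
MKAYIAKR
EOF

# Warn if there are too few sequences for a multiple alignment
n=$(count_fasta_records "$demo_fasta")
if [ "$n" -lt 2 ]; then
    echo "Only $n sequence(s) found; an MSA needs several" >&2
else
    echo "$n sequences found"
fi
```

On your own data you would replace the demo file with the path to your input, for example my_sequences.fasta.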

Another popular implementation of this method is Clustal, the current version of which is called Clustal Omega and is supported by the EMBL-EBI, hosted here. We have also made the software available on our server and will show you the basics of how to use it here. At minimum, Clustal Omega requires an input file containing multiple sequences, accepting both multi-fasta and existing alignment formats.

# Run Clustal Omega
clustalo -i my_sequences.fasta -o my_msa.fasta # -i means input, -o means output

The output is also in fasta format by default, but each sequence now has gaps inserted so that the nth position in every sequence is aligned. Once again, there are many command options available. Many of them won’t make sense to you at the moment, but some are immediately useful. For instance, --outfmt allows you to select a different output format. There is no dominant format for MSAs, and programs that take them as input may or may not support the format you choose. Clustal has its own format, which is useful for browsing a multiple alignment because it includes a line of characters indicating whether or not each column in the alignment is identical. The width of this format can be adjusted with --wrap.
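One consequence of the gaps is that every sequence in an aligned fasta file has the same length once gap characters are counted. This gives a quick sanity check on any MSA output. The sketch below assumes a hypothetical helper called aligned_lengths and a made-up three-sequence alignment; it uses only awk and standard shell tools.

```shell
# Hypothetical helper: print "name<TAB>length" for each record in a fasta alignment,
# summing over wrapped sequence lines and counting gap characters.
aligned_lengths() {
    awk '
        /^>/ { if (name != "") print name "\t" len; name = substr($0, 2); len = 0; next }
             { len += length($0) }
        END  { if (name != "") print name "\t" len }
    ' "$1"
}

# Made-up demo alignment, wrapped over two lines per sequence
demo_msa=$(mktemp)
cat > "$demo_msa" <<'EOF'
>seq1
MKT-AYIA
KQR
>seq2
MKTLAYIA
KQR
>seq3
MK--AYIA
K-R
EOF

aligned_lengths "$demo_msa"

# A valid alignment has exactly one distinct length
n_lengths=$(aligned_lengths "$demo_msa" | cut -f2 | sort -u | wc -l | tr -d ' ')
if [ "$n_lengths" -eq 1 ]; then
    echo "All sequences have the same aligned length"
else
    echo "Aligned lengths differ; this is not a valid MSA" >&2
fi
```

The same function could be pointed at my_msa.fasta from the commands above.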

Iterative alignment

Iterative methods differ from progressive alignment by going back to sequences previously introduced to the MSA and realigning them. Exactly how, and how often, these realignments are done varies between software packages. These methods cannot guarantee an optimal alignment either, and the trade-off versus progressive methods is that the realignments take additional computational time.

A popular iterative method is MUSCLE, available here. We have also made this software available on our server and will show you the basics of how to use it here. At minimum, MUSCLE requires only an input fasta file containing multiple sequences. Other input formats are not accepted.

# Run MUSCLE
muscle -in my_sequences.fasta -out my_msa.fasta # -in means input, -out means output

The output is also in fasta format by default, and only a few other formats are supported. Beyond that, the options determine how long the algorithm will run for. More iterations may improve the alignment but take longer, and each incremental improvement takes longer and longer to achieve.

Exercise 4.5

  • Make sure you are in your home folder before executing the commands. Create a new folder for the sequence alignment and change into it.

# Go to your home folder
cd

# Create a folder for your MSA
mkdir msa

# Move into your directory for the msa
cd msa
  • Perform a multiple alignment of the file /nfs/teaching/551-0132-00L/4_Alignment/gyra.faa with each program, obtaining Clustal formatted output, and compare the results by looking at the output files

# MAFFT
ml MAFFT
mafft --clustalout /nfs/teaching/551-0132-00L/4_Alignment/gyra.faa > mafft_aln.txt

# Clustal-Omega
ml Clustal-Omega
clustalo -i /nfs/teaching/551-0132-00L/4_Alignment/gyra.faa -o clustal_aln.txt --outfmt clu

# MUSCLE
ml MUSCLE
muscle -in /nfs/teaching/551-0132-00L/4_Alignment/gyra.faa -out muscle_aln.txt -clw

# See that there are some differences between methods
  • Inspect the output file from your MSA using MAFFT. What differences between the amino acid sequences do you observe at a first glance?

When inspecting the output from your MAFFT alignment you can see amino acid sequences for multiple bacterial genera. Obvious differences include substitutions between amino acids, as well as the presence and length of gaps within the sequences.

../../_images/maaft_output.png
  • Inspect the output file from your MSA using MAFFT. Can you identify organisms that are more similar to each other than others in the output file?

When inspecting the output from your MAFFT alignment and focussing on Escherichia, Pseudomonas, Shigella and Salmonella, it is clear that the amino acid sequence from one of these organisms, Pseudomonas, is the most dissimilar.

../../_images/maaft_differences.png
  • Inspect and compare the outputs of your MSA using MAFFT and MUSCLE. What are the differences between the outputs? Can you link these to the above described theory and explain why they are different?

You can see that the MUSCLE output lists the bacteria in a different order to MAFFT. This is due to the realignment performed within the iterative method used in the MUSCLE algorithm. Once again, focus on Escherichia, Pseudomonas, Shigella and Salmonella, and notice that MUSCLE has positioned Pseudomonas as an outlier from the others.

../../_images/muscle_output.png
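The difference in sequence order between two Clustal-formatted outputs can also be checked on the command line rather than by eye. The following is a minimal sketch: the helper clustal_order and the two tiny alignments are made up for illustration. It assumes that the first line of each file is the program header and that conservation lines start with whitespace.

```shell
# Hypothetical helper: list sequence names in the order they first appear
# in a Clustal-formatted alignment (skips the header, blank and conservation lines)
clustal_order() {
    awk 'NR > 1 && /^[^[:space:]]/ && !seen[$1]++ { print $1 }' "$1"
}

# Two made-up alignments of the same sequences, listed in a different order
demo_dir=$(mktemp -d)
cat > "$demo_dir/a.txt" <<'EOF'
CLUSTAL format alignment (made-up example)

Escherichia     MSDLAREIT
Shigella        MSDLAREIT
Pseudomonas     MGELAKEIL
                * :**:**

Escherichia     PVNIEEELK
Shigella        PVNIEEELK
Pseudomonas     PVNIEDELK
                *****:***
EOF
cat > "$demo_dir/b.txt" <<'EOF'
CLUSTAL format alignment (made-up example)

Pseudomonas     MGELAKEIL
Escherichia     MSDLAREIT
Shigella        MSDLAREIT
                * :**:**

Pseudomonas     PVNIEDELK
Escherichia     PVNIEEELK
Shigella        PVNIEEELK
                *****:***
EOF

# Write the name order of each file, then compare them
clustal_order "$demo_dir/a.txt" > "$demo_dir/a.order"
clustal_order "$demo_dir/b.txt" > "$demo_dir/b.order"
if diff "$demo_dir/a.order" "$demo_dir/b.order" > /dev/null; then
    order_result="same order"
else
    order_result="different order"
fi
echo "$order_result"
```

In the exercise you would pass mafft_aln.txt and muscle_aln.txt instead of the demo files.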