Phylogenetics#
General information#
Main objective#
Phylogenetic analyses have become central to understanding the evolutionary history, ecology and diversity of life on earth. In this lesson, you will learn about basic concepts of phylogenetic analyses based on DNA sequences. We will show you how to infer the evolutionary relationship of groups of organisms. To this end, we will collect phylogenetically informative sequences and teach you, step-by-step, how to reconstruct a phylogenetic tree via a web-based interface. Furthermore, you will learn how to assign a taxon of interest to an existing phylogenetic tree by command
Learning objectives#
Learning objective 1 - You can generate phylogenetic trees
You are able to collect DNA and protein sequences of interest and store them in an adequate format
You can explain the different steps and know examples of software required to generate phylogenetic trees
Using phylogenetic trees, you can explain the hypothesised evolutionary relationship between organisms and genes
Learning objective 2 - You can perform taxonomic classification
You understand 16S/18S rRNA genes as marker genes for investigating evolutionary relationships
You can extract 16S rRNA genes from genome sequences using barrnap
You are familiar with the database Microbe Atlas Project for taxonomic identification of microbes
Phylogenetic trees#
The objectives for reconstructing phylogenetic trees can be manifold. Generally speaking, a phylogenetic tree is a hypothesis of how biological species or other entities (e.g., genes) are related through evolution. It is a branching diagram showing the inferred evolutionary relationships among these entities based on similarities in their genetic and/or physical characteristics.
For the interpretation of phylogenetic trees, it is important to understand the concept of homology as similarity due to shared ancestry. For example, the forelimbs of vertebrates are homologous structures. Although they may vary in form and function in different animals (e.g., arms, forelegs, wings, front flippers), they have evolved from the same structure in the last common ancestor of tetrapods. In contrast, the wings of insects, bats and birds are analogous in their function, as flight has evolved independently in these widely divergent groups of animals.
By extension of the concept of homology to DNA and protein sequences, two sequences are homologous if they share ancestry. High similarity between two sequences provides strong evidence for their shared ancestry, but is by no means conclusive. Importantly, based on the definition of homology specified above, the similarity between sequences is merely an empirical observation. Whether or not these sequences are homologous requires interpretation, e.g. by reconstructing phylogenetic trees. As with wings, sequence similarity may occur as a result of convergent evolution or, with short sequences, by chance.
Furthermore, homologous sequences can be orthologous or paralogous with respect to each other: “Where the homology is the result of gene duplication so that both copies have descended side by side during the history of an organism, (for example, alpha and beta hemoglobin) the genes should be called paralogous (para = in parallel). Where the homology is the result of speciation so that the history of the gene reflects the history of the species (for example alpha hemoglobin in man and mouse) the genes should be called orthologous (ortho = exact).” – W. Fitch. Homologous sequences that have been transferred between species are called xenologs.
One of the most important implications for phylogenetics is that only sets of orthologous sequences are expected to reflect the underlying evolution of species, whereas a set of homologous genes (including orthologs, paralogs and xenologs) can be informative about the evolutionary history of the genes themselves (gene duplication within or among species, and horizontal gene transfer). Orthologous genes, as compared to paralogs, are also more likely to share the same function.
Advanced reading: Phylogenies - lecture notes (by Casey Dunn): Phylogenetic Biology.
Note that inferring orthology, building a species tree from a set of orthologous genes and assuming functional conservation among orthologous genes are not as straightforward as they seem. For more information, see for example: Gabaldon and Koonin, 2013.
Generating a phylogenetic tree (via a web-server)#
In this section of the course, we will introduce you to NGPhylogeny, a web-based platform that performs a phylogenetic analysis in a user-friendly way, that is, without the need to install several software programmes or to re-format input and output files. The user can choose to run the analyses with a “one click” workflow using default tools and parameters, or create more advanced and customised workflows. The platform provides detailed information about the individual steps that are performed and the tools that are used to execute them:
Collection and formatting of sequence data
Multiple sequence alignment (MSA)
MSA curation
Tree inference
Tree visualisation
0. Collection and formatting of sequence data#
The prerequisite for generating multiple sequence alignments (MSAs) is a collection of DNA/protein sequences. The user/researcher is responsible for collecting the sequences of interest and formatting them so that they can be used as input to MSA programmes.
Here, we will use the protein sequences of hemoglobin genes from human, mouse and chicken as an input to NGPhylogeny. For illustration, we will run the “One-click workflow” with default settings.
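Before uploading a sequence collection, it can help to check that it is well formed. The following is a minimal sketch using standard Unix tools; the file name demo.faa and its toy sequences are made up for illustration and are not part of the course data.

```shell
# Create a tiny FASTA file for illustration (hypothetical names, toy sequences)
cat > demo.faa <<'EOF'
>seqA
MVLSPADKTN
>seqB
MVLSGEDKSN
>seqC
MGLSDGEWQL
EOF

# Count the sequences: every record starts with a '>' header line
n_seqs=$(grep -c '^>' demo.faa)
echo "Number of sequences: $n_seqs"

# List the identifiers (first word of each header, without the '>')
ids=$(grep '^>' demo.faa | sed 's/^>//' | cut -d ' ' -f 1)
echo "$ids"

# Count duplicated identifiers, which can cause problems in later steps
n_dup=$(echo "$ids" | sort | uniq -d | wc -l | tr -d ' ')
echo "Duplicated identifiers: $n_dup"
```

The same commands can be pointed at your own FASTA file by replacing demo.faa with its path.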
1. Multiple sequence alignment (MSA)#
As a first step, the sequences will be aligned by MAFFT (see also section 4). Alignments are usually visually depicted with sequences as rows and nucleotides (DNA) or amino acid residues (proteins) as columns. Mutation events over generations result in nucleotide changes and an amino acid change if a nucleotide change leads to a non-synonymous substitution of the affected codon. Insertion or deletion events are denoted as hyphens in one or more sequences in the alignment.
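To make the row/column picture concrete, here is a small sketch that counts the columns of an alignment and how many of them contain at least one gap. The file demo_aln.fasta and its sequences are hypothetical, and the sketch assumes an aligned FASTA file in which every sequence has the same length.

```shell
# Toy aligned FASTA (hypothetical sequences); '-' marks an insertion/deletion
cat > demo_aln.fasta <<'EOF'
>seqA
MVLSPADK
>seqB
MVLS-ADK
>seqC
MV-SPGDK
EOF

# For every column, record whether any sequence has a gap at that position
summary=$(awk '
    /^>/ { next }                        # skip header lines
    {
        len = length($0)                 # alignment length (same for all rows)
        for (i = 1; i <= len; i++)
            if (substr($0, i, 1) == "-") gap[i] = 1
    }
    END {
        n = 0
        for (i = 1; i <= len; i++) if (i in gap) n++
        print len, n                     # total columns, columns with a gap
    }' demo_aln.fasta)

echo "Columns and gap-containing columns: $summary"
```

This sketch treats each line as one full aligned sequence; alignments wrapped over several lines per record would first need to be joined.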
2. MSA curation#
The quality of an MSA is important for the accuracy of phylogenetic inference. With increasing numbers and higher divergence of sequences (i.e., from evolutionarily more distant organisms), there is a good chance that an alignment will contain errors. Manual curation can become challenging, and furthermore, not every position in the alignment may be phylogenetically informative (N.B.: can you think of reasons why?). There are several bioinformatic tools dedicated to the curation of MSAs. By default, NGPhylogeny uses BMGE.
3. Tree inference#
The curated MSA serves as an input to construct and refine a phylogenetic tree, which can be considered a hypothesis of the evolutionary relationships between divergent species or genes represented in the genomes of divergent species. Several computational approaches exist that can be grouped into distance-matrix, maximum parsimony, maximum likelihood and Bayesian inference methods. The methods differ in their assumptions, algorithms and types of models used. Distance-matrix methods are faster and computationally less expensive. However, the other methods are generally considered to produce more accurate results. By default, NGPhylogeny uses FastME, a distance-based programme, to infer phylogenetic trees.
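To illustrate what a distance-matrix method starts from, the sketch below computes the simplest pairwise distance, the proportion of differing sites (p-distance), for every pair of sequences in a toy alignment. This is only an illustration under stated assumptions: the file demo_dist.fasta is hypothetical, gap-containing positions are skipped, and real programmes typically offer model-corrected distances rather than raw proportions.

```shell
# Toy aligned FASTA (hypothetical sequences)
cat > demo_dist.fasta <<'EOF'
>seqA
MVLSPADK
>seqB
MVLSGADK
>seqC
MVGSPGDK
EOF

# Pairwise p-distance: differing sites / compared sites, ignoring gaps
distances=$(LC_ALL=C awk '
    /^>/ { name[++n] = substr($0, 2); next }   # store the sequence name
    { seq[n] = seq[n] $0 }                     # join wrapped sequence lines
    END {
        for (i = 1; i < n; i++)
            for (j = i + 1; j <= n; j++) {
                diff = 0; sites = 0
                for (k = 1; k <= length(seq[i]); k++) {
                    a = substr(seq[i], k, 1); b = substr(seq[j], k, 1)
                    if (a == "-" || b == "-") continue
                    sites++
                    if (a != b) diff++
                }
                printf "%s %s %.3f\n", name[i], name[j], (sites ? diff / sites : 0)
            }
    }' demo_dist.fasta)

echo "$distances"
```

Each output line lists two sequence names and their distance; a tree-building algorithm would then cluster the sequences based on such a matrix.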
4. Tree visualisation#
The Newick format is one of the most widely used formats to represent phylogenetic trees in computer-readable form. Several software packages exist to visualize and manipulate trees in different ways. For example, a cladogram displays the branching structure of a tree without branch length scaling, while in a phylogram, the branch lengths are proportional to the inferred evolutionary change. A tree can be unrooted, which makes no assumptions about ancestry. Although it is possible to root a tree on any of its branches, usually, it is rooted at the most recent common ancestor of all species/genes (leaves) in the tree. The layout of trees can be a rectangular or circular cladogram, for example.
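As a small illustration of the Newick format, the sketch below stores a toy tree in a shell variable and extracts its leaf names. The taxa and branch lengths are invented for this example, and the pattern assumes simple, unquoted leaf names that start with a letter.

```shell
# A toy Newick tree (hypothetical taxa and branch lengths):
# parentheses group descendants, name:length gives a branch length,
# and the semicolon ends the tree
newick='((human:0.1,mouse:0.2):0.05,chicken:0.3);'

# Leaf names directly follow '(' or ','; internal nodes follow ')'
leaves=$(printf '%s\n' "$newick" | grep -oE '[(,][A-Za-z_][^(),:;]*' | tr -d '(,')
echo "$leaves"

# Count the leaves
n_leaves=$(printf '%s\n' "$leaves" | wc -l | tr -d ' ')
echo "Number of leaves: $n_leaves"
```

Dedicated tree libraries parse Newick far more robustly; this is only meant to show how the nesting of parentheses encodes the branching structure.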
Exercise 6.1
Please visit the website https://ngphylogeny.fr, select “One click workflows” under “Phylogeny Analysis” and upload (or copy/paste) the file /nfs/teaching/551-0132-00L/6_Phylogenetics/hemoglobin_homologs.faa
, which contains homologous protein sequences of the globin gene family from vertebrates (human, mouse, chicken) and a non-vertebrate, the lancelet Branchiostoma floridae, as an outgroup. Once the workflow finishes, you can inspect the resulting tree directly in NGPhylogeny.
Save or copy the Newick-formatted tree data and upload it to iTOL (you can also export the Output Tree directly to iTOL), a powerful online tool for tree visualisation and annotation. Once the tree is displayed, click on any branch or leaf. A pop-up window will appear and under Editing/Tree structure, you can click on “Root the tree at midpoint”. The same can be achieved by clicking on the “Advanced” tab on the “Control panel” and clicking on “Midpoint root” under “Other functions” at the bottom. The tree is now displayed so that the last common ancestor of all sequences is represented as the root. Given this tree (HbA = hemoglobin alpha chain, HbB = hemoglobin beta chain, Mb = myoglobin, Gb = globin), answer the following questions:
Q1: For any combination of the genes in the tree, determine whether they are orthologs or paralogs (for example, Homo-sapiens-HbA1 and Gallus-gallus-HbA are [orthologs|paralogs]).
HbA to HbB, HbA to Mb and HbB to Mb, as well as HbA genes within the same species, are paralogs.
Mb genes from different species are orthologs.
HbB genes from different species are orthologs.
Without further analysis (e.g. testing gene neighborhood), it is not possible to determine if HbA1 or HbA2 genes in humans and mice are orthologous to the HbA gene in chicken.
Similarly, it is not possible to determine which pairs of HbA genes in humans and mice are orthologous to each other.
Q2: Importance of orthology: if you had only collected the sequences Homo-sapiens-HbA1, Gallus-gallus-HbA and Mus-musculus-HbB, what would you have inferred about the relationships between humans, mice and chickens (which organisms are more closely related to each other)?
Due to incomplete sampling/data, humans would appear to be more closely related to chickens than to mice.
Q3: Homo sapiens and Mus musculus have two isoforms of HbA genes (HbA1 and HbA2). The branch length between the isoforms is zero. Formulate a hypothesis about when this gene duplication occurred. What kind of additional data would you collect to test your hypothesis?
Sharing the exact same protein sequence suggests a duplication event that is recent on an evolutionary time scale, yet one that occurred some time before the last common ancestor of humans and mice existed. Testing for the copy number of the HbA gene in more distantly related organisms (e.g., mammals, tetrapods, vertebrates) could provide additional evidence about when the duplication of the HbA gene occurred.
Further reading:
Evolution of the globin gene superfamily in vertebrates (note Figure 1).
Evolutionary Innovations in Hemoglobin-Oxygen Transport (note Figures 1 and 3).
Generating a phylogenetic tree (on the command line)#
The steps performed by the webserver can also be carried out on the server via the command line.
1. Multiple sequence alignment (MSA)#
This was covered in the previous class, alignment.
2. MSA curation#
You can run exactly the same alignment curation program as the webserver, BMGE:
ml BMGE
bmge -Xmx2G -i <input file> -t <type> -o <output file>
# For help
bmge -Xmx2G -h
The -Xmx2G argument is passed to Java, which runs the program, and is always required. It sets the maximum memory available to the program (here 2 GB) and can be increased if needed.
Unfortunately, BMGE currently has a bug when using the -op argument to output a Phylip-compatible file, where it accidentally triples the length of the alignment in the header (this may only affect amino acid input; other input types have not been tested). You can fix this as follows:
bmge -Xmx2G -i <input file> -t AA -opaa /dev/stdout | tr -d \' | awk 'NR==3{print " "$1" "$2/3};NR>3{print}' > <output file>
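Whether a Phylip header matches the alignment can be checked with a few lines of awk. The sketch below does not run BMGE; it only compares the alignment length declared in the header with the length of the first sequence. The function name check_phylip_header and the files good.phy and bad.phy are hypothetical, and the check assumes a sequential Phylip file with the name and the full sequence on one line per record.

```shell
# Report the declared vs. actual alignment length of a Phylip file.
# Assumes sequential Phylip with "name sequence" on a single line per record.
check_phylip_header() {
    awk 'NR == 1 { declared = $2; next }      # header: number of sequences, columns
         NR == 2 { actual = length($2) }      # first record: name, then sequence
         END { print declared, actual, (declared == actual ? "OK" : "MISMATCH") }' "$1"
}

# Toy files (hypothetical): one correct header, one with a tripled length
printf ' 3 8\nseqA MVLSPADK\nseqB MVLSGADK\nseqC MVGSPGDK\n' > good.phy
printf ' 3 24\nseqA MVLSPADK\nseqB MVLSGADK\nseqC MVGSPGDK\n' > bad.phy

good=$(check_phylip_header good.phy)
bad=$(check_phylip_header bad.phy)
echo "good.phy: $good"
echo "bad.phy:  $bad"
```

Running the function on your own curated alignment before tree inference shows quickly whether the header needs correcting.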
3. Tree inference#
You can also run PhyML, one of the tree construction programs offered by the webserver:
ml PhyML
phyml -i <input file>
Unfortunately, you cannot specify output file names: they are set to the input file name followed by fixed suffixes. The output tree file will be in Newick format, which is a standard for phylogenetic trees.
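Because the output names are derived from the input name, they can be predicted and the tree copied to a name of your choice afterwards. A minimal sketch, assuming an input file called my_msa.aln and the tree suffix _phyml_tree.txt as used in the R example of this lesson; the target name hemoglobin.nwk is made up.

```shell
# Input alignment name (as in the R example of this lesson)
input="my_msa.aln"

# PhyML writes the tree to the input file name plus a fixed suffix
tree_file="${input}_phyml_tree.txt"
echo "Expected tree file: $tree_file"

# Copy the tree to a name of your choice (hypothetical), if it exists
if [ -f "$tree_file" ]; then
    cp "$tree_file" hemoglobin.nwk
fi
```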
4. Tree visualisation#
For visualisation you have several options. You can use a specific webserver such as iTOL, or a package in your favourite programming language that can read a Newick file and produce suitable graphics. Here we will use the ape package for R:
# Import a tree
library(ape)
tree <- read.tree("my_msa.aln_phyml_tree.txt")
tree
# Plot tree in a couple of different ways
plot(tree)
plot(tree, type="fan") # circular
plot(tree, type="unrooted") # unrooted
There are many more options available in the package, including labelling tree nodes and edges, adding colours and so forth. You can also access the tree structure directly in its phylo object (tree in the above code), including edge lengths and tip labels.
Exercise 6.2
Using the file /nfs/teaching/551-0132-00L/6_Phylogenetics/hemoglobin_homologs.faa
, work through each step of the process using only the tools available on cousteau:
Align the file with MAFFT
Curate the alignment with BMGE
Construct a tree with PhyML
Display the tree with R
Taxonomic classification#
Instead of annotating diverse features within a genome, we might be interested in using the information contained within a genome to identify the organism itself and its similarity to other organisms. Typically, people are interested in taxonomic classification, that is, assigning a name to an organism within an ordered system such as the phylogenetic tree describing the evolutionary relationships among various organisms.
You learned last week that we can look at sequence similarity by multiple sequence alignment. One problem with comparing whole genomes in this way is that different regions of a genome have different levels of variability. This is especially problematic in prokaryotes, which show high levels of variation and carry so-called accessory genes in addition to core genes, because we could misassign the closest relatives based on whole-genome information. By far the biggest problem with this approach, however, is that for most prokaryotes no whole genome is available, so we cannot directly compare whole genome sequences: we would not know which information is missing.
Marker genes#
Therefore, we try to make use of so-called marker genes, which have characteristics useful for taxonomic classification. Similarly to protein domains, genome sequences also have conserved regions and variable regions. For taxonomic classification, we want to use a marker gene that is conserved enough to be present in any organism of interest but variable enough to identify evolutionary relationships between organisms. One example of such a gene is the 16S/18S rRNA gene, which is part of the small ribosomal subunit in prokaryotes/eukaryotes, respectively, and is therefore essential for survival and ubiquitously present. In addition, the 16S/18S rRNA genes contain several hypervariable regions whose sequences differ between organisms and can therefore be used to infer evolutionary relationships.
16S/18S rRNA databases#
Since we want to classify an organism by comparing the marker gene sequence to other known sequences, we again need a database containing known 16S/18S rRNA sequences to align our sequence against. Many databases exist, but we will focus on a specific 16S database called the Microbe Atlas Project. This database covers microbiomes across diverse environments from global sampling efforts and therefore allows direct comparisons of taxa across many different studies and biogeographic locations. In addition, it contains a lot of metadata, that is, information about the sample origin and environmental parameters, thereby allowing studies of microbiomes in the context of ecology.
Exercise 6.3
Barrnap is a fast, though relatively simple, tool for identifying rRNA genes. Try it out by running barrnap on a genome to retrieve the 16S rRNA sequence. You can use bacterial genomes in
/nfs/teaching/551-0132-00L/1_Unix1/genomes/bacteria/
or one that you downloaded previously.
# Load barrnap
ml barrnap
# Let's run barrnap on one of the genomes
barrnap -k bac example_genomic.fna --outseq example_rrna_barrnap.fa
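The FASTA file written by --outseq contains all rRNA genes that were found, not only the 16S gene. The following sketch keeps only the 16S records. It runs on a toy stand-in file (demo_rrna.fa, with invented contig names and sequences) and assumes that the headers start with the rRNA name, for example >16S_rRNA; check the headers of your own output with grep '^>' first.

```shell
# Toy stand-in for a barrnap --outseq file (hypothetical contig and sequences)
cat > demo_rrna.fa <<'EOF'
>16S_rRNA::contig1:100-1640(+)
AGAGTTTGATCCTGGCTCAG
>23S_rRNA::contig1:2000-4900(+)
GGTTAAGCGACTAAGCGTAC
>5S_rRNA::contig1:5000-5110(+)
TGCCTGGCGGCCGTAGCGCG
EOF

# Keep a record only if its header starts with >16S_rRNA
awk '/^>/ { keep = ($0 ~ /^>16S_rRNA/) } keep' demo_rrna.fa > demo_16S.fa

# Count how many 16S sequences were retrieved
n_16s=$(grep -c '^>' demo_16S.fa)
echo "16S sequences found: $n_16s"
```

To apply this to your own data, replace demo_rrna.fa with the file you passed to --outseq.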
After retrieving the 16S sequence from your genome, you want to use a database such as the Microbe Atlas Project to perform taxonomic classification. How can you do that?
Go to the website, enter your 16S sequences and click on submit.
We learn from the results that each taxonomic classification is associated with some uncertainty.