Taxonomic classification

Instead of annotating individual features within a genome, we may want to use the information a genome contains to identify the organism itself and determine how similar it is to other organisms. Typically, the goal is taxonomic classification, that is, assigning a name to an organism within an ordered system such as a phylogenetic tree, which describes the evolutionary relationships among organisms.

You learned last week that we can assess sequence similarity with multiple sequence alignment. One problem with comparing whole genomes in this way is that different regions of a genome vary to different degrees. This is especially problematic in prokaryotes, which show high levels of variation and carry so-called accessory genes in addition to core genes, because we could misassign the closest relatives when we rely on the information in the whole genome. By far the biggest problem, however, is that whole genomes are not available for most prokaryotes. We therefore cannot compare whole-genome sequences directly, because we would not know which information is missing.

Marker genes

Therefore, we make use of so-called marker genes, which have properties that make them useful for taxonomic classification. Like proteins with their domains, genome sequences have conserved regions and variable regions. For taxonomic classification, we want a marker gene that is conserved enough to be present in every organism of interest but variable enough to resolve evolutionary relationships between organisms. One example is the 16S/18S rRNA gene, which encodes the rRNA of the small ribosomal subunit in prokaryotes (16S) and eukaryotes (18S), respectively. It is therefore essential for survival and ubiquitously present. In addition, the 16S/18S rRNA genes contain several hypervariable regions, whose sequences differ between taxa and can therefore be used to infer evolutionary relationships.
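
To make the distinction between conserved and variable regions concrete, here is a minimal sketch that counts how many distinct bases occur at each column of a small alignment. The four aligned sequences are invented for illustration and are not real 16S data; the variable names are likewise only placeholders. A column with one distinct base is conserved, a column with several is variable.

```shell
#!/usr/bin/env bash
# Toy alignment: four invented, equal-length sequences (NOT real 16S data).
alignment="ACGTACGTAC
ACGTTCGTAC
ACGTGCGAAC
ACGTCCGTAC"

# For each alignment column, count the number of distinct bases observed.
# 1 distinct base = conserved column; more than 1 = variable column.
column_variability=$(printf '%s\n' "$alignment" | awk '
{
    for (i = 1; i <= length($0); i++) {
        base = substr($0, i, 1)
        # Count each (column, base) combination only once.
        if (!((i, base) in seen)) {
            seen[i, base] = 1
            distinct[i]++
        }
    }
    if (length($0) > ncol) ncol = length($0)
}
END {
    for (i = 1; i <= ncol; i++)
        printf "%d%s", distinct[i], (i < ncol ? " " : "\n")
}')

echo "Distinct bases per column: $column_variability"
```

In this toy example, columns 5 and 8 are the only variable positions; they are the ones that carry information for telling the four sequences apart, while the remaining conserved columns are what makes the sequences alignable in the first place.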

16S/18S rRNA databases

Since we want to classify an organism by comparing its marker gene sequence to known sequences, we again need a database of known 16S/18S rRNA sequences to align our sequence against. Many such databases exist, but we will focus on a specific 16S database called the Microbe Atlas Project. This database covers microbiomes across diverse environments from global sampling efforts and therefore allows direct comparisons of taxa across many different studies and biogeographic locations. In addition, it contains extensive metadata, that is, information about sample origin and environmental parameters, which makes it possible to study microbiomes in an ecological context.

Exercise 6.2

  • Barrnap is a fast but comparatively simple tool for identifying rRNA genes. Try it out by running barrnap on a genome to retrieve the 16S rRNA sequence. You can use the bacterial genomes in /nfs/teaching/551-0132-00L/1_Unix/genomes/bacteria/ or genomes that you downloaded previously.

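# Load the barrnap module
ml barrnap

# Run barrnap on one of the genomes using the bacterial models (--kingdom bac).
# --outseq writes the predicted rRNA sequences to a FASTA file;
# the GFF3 annotation of the hits is printed to standard output.
barrnap --kingdom bac example_genomic.fna --outseq example_rrna_barrnap.fa

  • After retrieving the 16S sequence from your genome, you want to use a database such as the Microbe Atlas Project for taxonomic classification. How can you do that?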

  • Go to the website, enter your 16S sequences and click on submit. The results show that each taxonomic classification is associated with some uncertainty.

  • In which samples have these organisms been detected? How much do you trust these results?

  • Microbe Atlas can show you the distribution of sample types in which sequences similar to yours have been found, e.g., samples associated with animals, aquatic environments, soils or plants.

  • If you were to search for 16S rRNA sequences of E. coli or Salmonella, you would find that many of the samples containing this classification are derived from animals. When interpreting these results, however, it is important to note that every database can contain errors in its metadata (e.g. in the information on sample origin provided by the scientists who submitted the sequences). Additionally, sampling biases can exist in the database; for instance, animal-associated entries in MicrobeAtlas currently outnumber those from marine environments.
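
The FASTA file written by barrnap in this exercise can contain several rRNA types (for example 5S, 16S and 23S), while only the 16S sequence is needed for the database search. The following is a minimal sketch for keeping just the 16S records. It assumes that the FASTA headers begin with the rRNA name (such as `>16S_rRNA::...`), which should be checked against your own file first. The records, contig names and the output file name `example_16S.fa` are invented stand-ins for illustration.

```shell
#!/usr/bin/env bash
# Stand-in for barrnap's --outseq output. Headers and sequences are invented;
# real headers are ASSUMED to begin with the rRNA name (e.g. 16S_rRNA).
workdir=$(mktemp -d)
cat > "$workdir/example_rrna_barrnap.fa" <<'EOF'
>16S_rRNA::contig_1:100-1650(+)
ACGTACGTACGTACGTACGT
TTGACCGTAGGCATCGATCA
>23S_rRNA::contig_1:2000-4900(+)
GGCATCGATCGATTACGGCA
>5S_rRNA::contig_1:5000-5115(+)
TGCATGCATGCA
EOF

# Keep a record only if its header starts with ">16S_rRNA".
# "keep" is switched on or off at each header line and then applies to the
# sequence lines that follow, so multi-line sequences are handled correctly.
awk '/^>/ { keep = ($0 ~ /^>16S_rRNA/) } keep' \
    "$workdir/example_rrna_barrnap.fa" > "$workdir/example_16S.fa"

# Report how many 16S sequences were found and show them.
n_16s=$(grep -c '^>' "$workdir/example_16S.fa")
echo "Found $n_16s 16S rRNA sequence(s)"
cat "$workdir/example_16S.fa"
```

With a real genome, replace the toy input with your own `example_rrna_barrnap.fa`. A bacterial genome often carries several 16S copies, so more than one record may be kept; the contents of the filtered file are what you would enter on the website.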