Automated annotation

When you have a whole genome to annotate, you want a program to do as much as possible for you automatically. Several pipelines are available, such as the NCBI Prokaryotic Genome Annotation Pipeline, PATRIC and RAST. Here we will show you how to use Prokka, a whole-genome annotation pipeline that combines several feature prediction tools:

  • Prodigal for genes

  • Barrnap for ribosomal RNA

  • Aragorn for transfer RNA

  • SignalP for signal peptides

  • Infernal for non-coding RNA

Prokka then annotates protein function by searching several databases in a fixed order, from the most to the least reliable, and taking the first match:

  1. All bacterial proteins in UniProt that have real protein or transcript evidence and are not a fragment.

  2. All proteins from finished bacterial genomes in RefSeq for a specified genus.

  3. A series of hidden Markov model profile databases, including Pfam and TIGRFAMs.

  4. If no matches can be found, label as ‘hypothetical protein’.
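The search order above amounts to a "first hit wins" lookup. The following is only a toy sketch of that idea and not how Prokka is implemented: the real pipeline runs sequence similarity and HMM searches, whereas here each database tier is represented by a hypothetical tab-separated file (identifier, product name). The function name annotate and the file names are made up for illustration.

```shell
#!/usr/bin/env bash
# Toy model of a tiered annotation lookup (NOT Prokka's actual code).
# Usage: annotate <gene_id> <tier1.tsv> <tier2.tsv> ...
# Tiers are searched in the order given; the first tier with a hit wins.
annotate() {
    local id=$1
    shift
    local db hit
    for db in "$@"; do
        # Print the product name of the first line whose first column matches
        hit=$(awk -F'\t' -v id="$id" '$1 == id { print $2; exit }' "$db")
        if [ -n "$hit" ]; then
            echo "$hit"
            return 0
        fi
    done
    # No tier produced a match
    echo "hypothetical protein"
}

# Small demo with two hypothetical tiers
tmp=$(mktemp -d)
printf 'geneA\tDNA polymerase III subunit alpha\n' > "$tmp/tier1.tsv"
printf 'geneA\tputative polymerase\ngeneB\tABC transporter permease\n' > "$tmp/tier2.tsv"

annotate geneA "$tmp/tier1.tsv" "$tmp/tier2.tsv"  # hit in tier 1, tier 2 is never consulted
annotate geneB "$tmp/tier1.tsv" "$tmp/tier2.tsv"  # hit only in tier 2
annotate geneC "$tmp/tier1.tsv" "$tmp/tier2.tsv"  # no hit anywhere
rm -r "$tmp"
```

The point of the ordering is that a gene found in a high-quality tier never picks up a weaker label from a later tier.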

We have made Prokka available in our module system (ml prokka). There are several recommended ways of running it, in order of increasing complexity:

# Load Prokka from the module system
ml prokka

# Become familiar with the command parameters by opening the help page
prokka
prokka -h
prokka --help

# Beginner
# Vanilla (but with free toppings)
prokka contigs.fa

# Moderate
# Choose the names of the output files
prokka --outdir mydir --prefix mygenome contigs.fa

# Specialist
# Have curated genomes I want to use to annotate from
prokka --proteins MG1655.gbk --outdir mutant --prefix K12_mut contigs.fa

# Expert
# It's not just for bacteria, people
prokka --kingdom Archaea --outdir mydir --genus Pyrococcus --locustag PYCC contigs.fa

# Wizard
# Watch and learn
prokka --outdir mydir --locustag EHEC --proteins NewToxins.faa --evalue 0.001 --gram neg --addgenes contigs.fa

These are only examples, but they show that there are many ways to customise the annotation, especially the output. Since Prokka creates multiple output files, you specify an output directory (--outdir) rather than an output file, plus a file name prefix (--prefix) shared by all files that will be created. The pipeline then writes files with that prefix and standardized file extensions into the output directory. Note that Prokka uses long options, two dashes followed by a whole word (--outdir), instead of one dash followed by a single letter (-o). This more explicit style is common in automated pipelines with many parameters, because it avoids confusion between them.
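To illustrate how the output directory and the prefix combine, the sketch below prints the file names one would expect for a given pair. The list of extensions follows the Prokka README at the time of writing and may differ between versions; the function name expected_outputs is hypothetical.

```shell
#!/usr/bin/env bash
# Print the output paths expected for a given --outdir and --prefix.
# The extension list is an assumption based on the Prokka README.
expected_outputs() {
    local outdir=$1 prefix=$2 ext
    for ext in gff gbk fna faa ffn sqn fsa tbl err log txt tsv; do
        echo "${outdir}/${prefix}.${ext}"
    done
}

# Matches the "Moderate" example: prokka --outdir mydir --prefix mygenome contigs.fa
expected_outputs mydir mygenome
```

After a real run, comparing this list with the output of ls mydir/ is a quick check that the annotation finished and wrote all its files.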

To learn more about the output files that Prokka creates, you can either use the command line help page (by typing prokka) or have a look at the Prokka GitHub page.

Exercise 5.4

  • Run Prokka on one of the genomes you have previously worked with, either in /nfs/teaching/551-0132-00L/1_Unix/genomes/bacteria/ or one you downloaded. How does the annotation differ from the official GenBank record? Are there more or fewer genes?

# Load prokka
ml prokka

# Let's choose the name of the output files
prokka --outdir prokka --prefix my_genome my_genome.fasta

# This exercise shows that annotations can differ in many ways; even the number of predicted genes may not match the official record. The genomes we have worked with so far are very well studied, and many of their annotations are based on direct observation rather than computational inference.

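One rough way to answer the "more or fewer genes" question is to count the CDS features in both files. This is a minimal sketch under a few assumptions: Prokka was run as above, so its GFF is at prokka/my_genome.gff, and the official record has been saved as official.gbk, which is a hypothetical file name. The two function names are also made up for illustration. The counts are only approximate, since the two annotations may treat pseudogenes and other edge cases differently.

```shell
#!/usr/bin/env bash
# Count CDS features in a GFF3 file (third tab-separated column is the feature type).
# Header lines and the trailing FASTA section contain no tabs, so they are skipped.
count_gff_cds() {
    awk -F'\t' '$3 == "CDS" { n++ } END { print n + 0 }' "$1"
}

# Count CDS features in a GenBank flat file.
# Feature keys start after exactly five spaces; qualifier lines are indented further.
count_gbk_cds() {
    awk '/^     CDS[[:space:]]/ { n++ } END { print n + 0 }' "$1"
}

# Only run the comparison if both files are present
if [ -f prokka/my_genome.gff ] && [ -f official.gbk ]; then
    echo "Prokka CDS:  $(count_gff_cds prokka/my_genome.gff)"
    echo "GenBank CDS: $(count_gbk_cds official.gbk)"
fi
```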
  • So far we have seen that we can use diverse databases and algorithms for genome annotation and feature prediction. Can you think of risks associated with the choices we make when annotating a genome or a set of genes?

  • The first and probably biggest risk of misannotation comes from the databases we use. Their entries may be backed by very different levels of evidence: there is a huge difference between experimental validation of a protein function and statistical inference. It is therefore important to always be aware of the level of evidence a database provides and not to interpret results without considering the uncertainty of a prediction.

  • A further risk is the quality of a database and the amount of manual curation done by the people maintaining it. In recent years, with growing amounts of data, curation has become crucial for separating useful data from useless or inaccurate data. It is therefore important to know the quality of a database in order to be aware of additional uncertainties.

  • Besides errors and functional misannotations within databases, algorithms differ considerably from one another, as do the ways their parameters are defined. It is therefore important to know the specific parameter settings of the chosen algorithm in order to understand how it decides whether a functional annotation is well supported.

  • Lastly, when we work with a specific genome of interest, we should always be aware of what is already known and what is still unknown about this organism and its close relatives. If we work with a well-known organism, we can be more confident in the annotation, since the databases probably contain more detailed and thoroughly evaluated data. If we work with a less studied organism, there might not be much evidence for annotation, and even if we find an annotation, the risk is higher that it rests on wrong inferences that have not yet been studied and corrected.
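A quick way to get a feeling for how much of an annotation lacks database support is to count the proteins labelled ‘hypothetical protein’. The sketch below assumes the Prokka run from the exercise, so the protein FASTA is at prokka/my_genome.faa, and that product names appear in the FASTA header lines; the function name hypothetical_fraction is made up for illustration.

```shell
#!/usr/bin/env bash
# Report how many protein records in a FASTA file are labelled "hypothetical protein".
hypothetical_fraction() {
    awk '/^>/ { total++; if ($0 ~ /hypothetical protein/) hyp++ }
         END  { printf "%d of %d proteins are hypothetical\n", hyp, total }' "$1"
}

# Only run if the Prokka output from the exercise exists
if [ -f prokka/my_genome.faa ]; then
    hypothetical_fraction prokka/my_genome.faa
fi
```

A high share of hypothetical proteins is typical for less studied organisms and is a reminder to treat the remaining functional labels with care as well.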