Genomics II#

General information#

The material presented here is an integral part of the lectures. It consists mostly of refreshers and background information that we recommend you review before the corresponding lecture.

In addition, at the bottom of this document, we provide a link to additional exercises. While the background material is meant to be read before the corresponding lecture, the exercises are meant to be worked on after the lecture has presented the concepts. The exercises are not mandatory and complement the project work carried out for this section of the class. The exercise sessions deepen the algorithmic understanding of the concepts discussed in the lecture and have a stronger focus on programming and scripting. The exercise for this lecture explores read alignment filtering and variant calling, using the Python programming language.

Preparation and background material#

The following material will provide background and context information on Bayesian statistics and the concept of linkage.

Bayes' Theorem#

Bayes’ rule is a central concept in statistics and of great relevance for statistical modeling. We will briefly review the theorem here, as it is central for probabilistic variant calling approaches.

Let us first review the formula for conditional probability. For any two events A and B, their joint probability can be computed as

(1)#\[P(A|B)P(B) = P(A,B) = P(B|A)P(A)\]

where P(A|B) represents the conditional probability of event A given event B. Only if A and B are independent does the conditional probability equal the unconditional probability, that is, P(A|B) = P(A). It follows directly that for independent events P(A, B) = P(A)P(B).
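To make Eq. (1) and the independence condition concrete, the following minimal sketch enumerates the outcomes of rolling two fair dice. The dice example and all function names (`prob`, `joint_prob`, `cond_prob`) are our own illustrative choices and are not part of the lecture code.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
OUTCOMES = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event, given as a predicate over single outcomes."""
    return Fraction(sum(1 for o in OUTCOMES if event(o)), len(OUTCOMES))


def joint_prob(event_a, event_b):
    """P(A, B): probability that both events occur."""
    return prob(lambda o: event_a(o) and event_b(o))


def cond_prob(event_a, event_b):
    """P(A|B) = P(A, B) / P(B); requires P(B) > 0."""
    return joint_prob(event_a, event_b) / prob(event_b)


def first_is_six(o):
    return o[0] == 6


def second_is_six(o):
    return o[1] == 6


def sum_is_twelve(o):
    return o[0] + o[1] == 12


# Eq. (1): P(A|B)P(B) = P(A,B) = P(B|A)P(A)
lhs = cond_prob(first_is_six, sum_is_twelve) * prob(sum_is_twelve)
rhs = cond_prob(sum_is_twelve, first_is_six) * prob(first_is_six)
eq1_holds = lhs == joint_prob(first_is_six, sum_is_twelve) == rhs

# The two dice are independent, so conditioning changes nothing.
dice_independent = cond_prob(first_is_six, second_is_six) == prob(first_is_six)

# The sum depends on the first die, so conditioning changes the probability.
sum_depends_on_first = cond_prob(sum_is_twelve, first_is_six) != prob(sum_is_twelve)

print(eq1_holds, dice_independent, sum_depends_on_first)
```

Exact fractions are used instead of floating point numbers so that the equalities can be checked without rounding tolerance.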

If we drop the middle term from Eq. (1) and divide both sides by P(B), which requires P(B) > 0, we directly obtain Bayes' theorem:

(2)#\[\begin{split}P(A|B)P(B) &= P(B|A)P(A) \\ P(A|B) &= \frac{P(B|A)P(A)}{P(B)}\end{split}\]

In Bayesian statistics, the terms of the equation above have specific names. P(A) is the prior probability, describing the uncertainty of the model before seeing any data. This is also called the belief. The term P(B|A) is called the likelihood and expresses the conditional probability of the data given the model. Together, these can be used to compute the posterior probability P(A|B), which represents the probability of the model conditioned on the observed data, or the updated belief. In the interpretation of Bayesian statistics, the (relatively) uninformed model prior is updated through measurement data, resulting in a more accurate model. The remaining term P(B) is the probability of the data and is usually the most difficult to compute. In practice, this term is often estimated.
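As a small numerical illustration of Eq. (2), the sketch below computes a posterior from a prior, a likelihood and the probability of the data. The function name `bayes_posterior` and all numbers are made up for illustration only; they do not come from real sequencing data.

```python
from fractions import Fraction


def bayes_posterior(prior, likelihood, evidence):
    """Return P(A|B) = P(B|A) * P(A) / P(B), following Eq. (2)."""
    if evidence <= 0:
        raise ValueError("P(B) must be positive")
    return likelihood * prior / evidence


# Hypothetical values, chosen only to show the update from prior to posterior.
prior = Fraction(1, 100)      # P(A): belief in the model before seeing data
likelihood = Fraction(9, 10)  # P(B|A): probability of the data under the model
evidence = Fraction(1, 20)    # P(B): overall probability of the data

posterior = bayes_posterior(prior, likelihood, evidence)
print(posterior, float(posterior))  # 9/50 0.18
```

With these assumed values, observing the data raises the belief in the model from 1% to 18%.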

We will use this relationship during variant calling when computing the probability of a genotype given the available alignment information. In this context, we also need Bayes' theorem extended with the law of total probability:

(3)#\[\begin{split} P(A_i|B) &= \frac{P(B|A_i)P(A_i)}{\sum_{j=1}^n P(B|A_j)P(A_j)}\\ &\\ &\mbox{with } i \in \{1, \dots, n\} \mbox{ and } \sum_{j=1}^n P(A_j) = 1\\\end{split}\]

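Here the events \(A_j\) are mutually exclusive and together cover all possibilities, which is why their probabilities sum to 1. In the context of variant calling, the events \(A_j\) will represent the possible alignments of a given read. That is, the probability of the data \(P(B)\) is composed of the contributions of the individual alignment events.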
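Eq. (3) can be sketched in a few lines of Python. The function `posteriors` and the example numbers are hypothetical: we assume three candidate alignments of one read, with invented priors and likelihoods, purely to show how the denominator is assembled from the individual events.

```python
def posteriors(priors, likelihoods):
    """Compute P(A_i|B) for all i, following Eq. (3).

    priors      -- P(A_j) for mutually exclusive, exhaustive events A_j
    likelihoods -- P(B|A_j), in the same order as the priors
    """
    if len(priors) != len(likelihoods):
        raise ValueError("priors and likelihoods must have the same length")
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("priors must sum to 1")

    # Numerators P(B|A_j) * P(A_j), one per event.
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]

    # Law of total probability: P(B) is the sum over all events.
    evidence = sum(joint)
    if evidence == 0:
        raise ValueError("the data has probability zero under all events")

    return [j / evidence for j in joint]


# Hypothetical example: three candidate alignments of a single read.
priors = [0.5, 0.3, 0.2]        # assumed P(A_j)
likelihoods = [0.9, 0.1, 0.05]  # assumed P(B|A_j)

post = posteriors(priors, likelihoods)
print([round(p, 3) for p in post])
```

Because every posterior is divided by the same sum, the returned values always add up to 1, regardless of the scale of the likelihoods.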

Linkage of variants#

In variant calling, one would often like to make the simplifying assumption that any two neighboring variants are fully independent of each other. Unfortunately, this is not the case in reality. As a direct result of how individual haplotypes are composed during meiotic crossover, neighboring variants are often inherited jointly in blocks. These so-called haplotype blocks lead to statistical correlation between variants that lie close to each other on the genome sequence. This structure differs between individuals. From a population statistics perspective, these sites are in linkage disequilibrium.
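The statistical correlation between two sites can be quantified, for example, with the commonly used linkage disequilibrium coefficient D = P(A, B) - P(A)P(B) and the squared correlation r². The following minimal sketch computes both from a list of phased haplotypes at two biallelic sites. The function name, the 0/1 allele encoding and the toy haplotypes are our own assumptions for illustration and are not taken from the lecture.

```python
def linkage_disequilibrium(haplotypes):
    """Return (D, r2) for two biallelic sites.

    haplotypes -- list of (allele_at_site_1, allele_at_site_2) tuples,
                  with alleles encoded as 0 (reference) or 1 (alternative)
    """
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n               # P(A): alt allele at site 1
    p_b = sum(h[1] for h in haplotypes) / n               # P(B): alt allele at site 2
    p_ab = sum(1 for h in haplotypes if h == (1, 1)) / n  # P(A, B)

    # D is zero exactly when P(A, B) = P(A)P(B), i.e. under independence.
    d = p_ab - p_a * p_b

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")  # undefined for fixed sites
    return d, r2


# Toy data: alleles at the two sites always occur together (perfect linkage).
linked = [(1, 1)] * 4 + [(0, 0)] * 4
# Toy data: all four allele combinations are equally frequent (no linkage).
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)] * 2

d_linked, r2_linked = linkage_disequilibrium(linked)
d_unlinked, r2_unlinked = linkage_disequilibrium(unlinked)
print(d_linked, r2_linked)
print(d_unlinked, r2_unlinked)
```

For independent sites D is zero, which is the product rule for independent events from the section on Bayes' theorem; any deviation from zero indicates linkage disequilibrium.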

Additional Exercises#

This additional exercise has been compiled to expand on the concepts presented in the lecture. Completing the exercise is not mandatory, but recommended if you would like to gain a deeper understanding of the algorithms. It will also allow you to practice your programming skills in Python.

Before starting, please review the resource requirements below.

Resources#

This section requires the use of Jupyter on Cousteau. You should therefore have completed the Setup.

You can access Jupyter on our server through this link: http://cousteau-jupyter.ethz.ch/

If you would like to refresh your Python knowledge before starting, please refer to the Python Refresher that is part of the course.

Starting the exercise#

Once you have logged into Jupyter, you should find a folder called genomics. In it, you should find the folder exercises, where all the genomics exercises are located.

Once you open this folder, you can click on the exercise02_alignment.ipynb notebook.

All remaining instructions can be found directly in the Jupyter notebook.
