Bayes' Theorem
Bayes' rule is a central concept in statistics and highly relevant for statistical modeling. We briefly review the theorem here, as probabilistic variant calling approaches build on it.
Let us first review the formula for conditional probability. For any two events A and B, their joint probability can be computed as
\[ P(A \mid B)\, P(B) = P(A, B) = P(B \mid A)\, P(A) \tag{1} \]
where P(A|B) denotes the conditional probability of event A given event B. The conditional probability equals the unconditional probability, P(A|B) = P(A), only if A and B are independent. It follows directly that for independent events P(A, B) = P(A)P(B).
If we remove the middle term from Eq. (1) and divide both sides by P(B), we directly obtain Bayes' theorem:
\[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \]
In Bayesian statistics, the terms of the equation above have specific names. P(A) is the prior probability, describing the uncertainty of the model before seeing any data. This is also called the belief. The term P(B|A) is called the likelihood and expresses the conditional probability of the data given the model. Together, prior and likelihood are used to compute the posterior probability P(A|B), which represents the probability of the model conditioned on the observed data, that is, the updated belief. In the interpretation of Bayesian statistics, the (relatively) uninformed model prior is updated through measurement data, resulting in a more accurate model. The remaining term P(B), the probability of the data (often called the evidence), is usually the most difficult to compute. In practice this term is often estimated.
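The update from prior to posterior can be sketched numerically. The following minimal example assumes a hypothetical model A ("the site carries a variant") and repeated observations B ("a read shows the alternate allele"). All probabilities are invented for illustration, the observations are assumed to be independent given the model, and P(B) is obtained by summing over the two cases A and not A.

```python
def update(prior, p_data_given_a, p_data_given_not_a):
    """Return the posterior P(A|B) for one observation B via Bayes' theorem."""
    # P(B), expanded over the two cases "A" and "not A"
    evidence = p_data_given_a * prior + p_data_given_not_a * (1.0 - prior)
    return p_data_given_a * prior / evidence


# Illustrative numbers only (assumptions, not taken from any real data set)
prior = 0.001              # P(A): belief before seeing any data
p_alt_given_variant = 0.99  # P(B|A): likelihood of the observation under A
p_alt_given_no_var = 0.01   # P(B|not A): e.g. a sequencing error

beliefs = [prior]
for _ in range(3):
    # The posterior after one observation becomes the prior for the next.
    beliefs.append(update(beliefs[-1], p_alt_given_variant, p_alt_given_no_var))

for n_obs, belief in enumerate(beliefs):
    print(f"after {n_obs} observations: P(A | data) = {belief:.4f}")
```

A single supporting observation leaves the posterior low because the prior is small; only repeated evidence moves the belief close to 1.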
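We will use this relationship during variant calling to compute the probability of a genotype given the available alignment information. In this context we also need Bayes' theorem extended with the law of total probability. For mutually exclusive and exhaustive events \(A_j\),
\[ P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{\sum_j P(B \mid A_j)\, P(A_j)} \]
where the denominator is \(P(B) = \sum_j P(B \mid A_j)\, P(A_j)\).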
In the context of variant calling, the partial events \(A_j\) will represent the possible alignments of a given read. That is, the probability of the data \(P(B)\) will be a composition of individual alignment events.
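As a minimal sketch of this composition, assume a read with three hypothetical candidate alignments \(A_j\). The priors \(P(A_j)\) and likelihoods \(P(B \mid A_j)\) below are invented for illustration, and the function names are our own rather than part of any variant calling tool.

```python
def total_probability(priors, likelihoods):
    """P(B) = sum_j P(B|A_j) * P(A_j) for mutually exclusive, exhaustive A_j."""
    return sum(lik * pri for lik, pri in zip(likelihoods, priors))


def alignment_posteriors(priors, likelihoods):
    """P(A_j|B) for every candidate alignment A_j."""
    evidence = total_probability(priors, likelihoods)
    return [lik * pri / evidence for lik, pri in zip(likelihoods, priors)]


# Hypothetical values for three candidate alignments of one read
priors = [0.5, 0.3, 0.2]        # P(A_j), must sum to 1
likelihoods = [0.9, 0.1, 0.05]  # P(B|A_j): how well the data fits alignment j

p_b = total_probability(priors, likelihoods)
post = alignment_posteriors(priors, likelihoods)

print(f"P(B) = {p_b:.3f}")
for j, p in enumerate(post):
    print(f"P(A_{j} | B) = {p:.3f}")
```

Because the same sum appears in the denominator for every \(A_j\), the posteriors add up to 1 by construction.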
Linkage of variants
In variant calling one would often like to make the simplifying assumption that any two neighboring variants are fully independent of each other. Unfortunately, this does not hold in reality. As a direct result of how individual haplotypes are composed during meiotic crossover, neighboring variants are often inherited jointly in blocks. These so-called haplotype blocks lead to statistical correlation between variants in close proximity to each other on the genome sequence. The block structure differs between individuals. From a population statistics perspective, these sites are in linkage disequilibrium.
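The deviation from independence can be quantified. The sketch below uses two common measures that are not introduced in the text above and are therefore an assumption of this example: the coefficient D = P(A, B) - P(A)P(B), which is zero for independent sites, and the squared correlation r². The haplotype counts for two biallelic sites (alleles A/a and B/b) are invented for illustration.

```python
from collections import Counter

# Hypothetical sample of 100 haplotypes over two biallelic sites
haplotypes = ["AB"] * 40 + ["ab"] * 40 + ["Ab"] * 10 + ["aB"] * 10

counts = Counter(haplotypes)
n = sum(counts.values())

# Joint and marginal allele frequencies estimated from the sample
p_joint = counts["AB"] / n                 # P(A, B)
p_a = (counts["AB"] + counts["Ab"]) / n    # P(A)
p_b = (counts["AB"] + counts["aB"]) / n    # P(B)

# D is zero exactly when P(A, B) = P(A) * P(B), i.e. under independence
d = p_joint - p_a * p_b

# r^2 normalizes D^2 by the allele frequency variances at both sites
r2 = d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))

print(f"D = {d:.3f}, r^2 = {r2:.3f}")
```

A nonzero D here reflects that the haplotypes AB and ab occur more often than the marginal allele frequencies alone would predict.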