Background and Theory

This page gives a brief overview of the statistical models that underpin simGL.

Coverage simulation

Let \(n\) be the number of individuals and \(S\) the number of sites. For individual \(i\), the per-individual mean depth is sampled once from a Gamma distribution:

\[\lambda_i \sim \text{Gamma}\!\left(\frac{\mu^2}{\sigma^2},\,\frac{\sigma^2}{\mu}\right)\]

where \(\mu\) = mean_depth and \(\sigma\) = std_depth. The Gamma parameterisation used here has mean \(\mu\) and variance \(\sigma^2\).

For each site \(s\) of individual \(i\), the read count is then:

\[d_{s,i} \sim \text{Poisson}(\lambda_i)\]

Read simulation

Given a diploid genotype \(g \in \{0\text{/}0,\,0\text{/}1,\,1\text{/}1\}\) at a biallelic site, the true allele probability for a randomly sampled read is:

\[p(\text{ref} \mid g) = \frac{\text{number of ref haplotypes in }g}{2}\]

With sequencing error rate \(e\), each read is flipped to a uniformly random wrong base with probability \(e\) and retained as the true base with probability \(1 - e\).

Genotype likelihoods

Given \(d\) observed reads, let \(k_a\) be the count of reads for allele \(a \in \{A, C, G, T\}\). For a diploid genotype \(\{a_1, a_2\}\) the log-likelihood under the ANGSD model is:

\[\log P(\text{reads} \mid a_1 a_2) = \sum_{b \in \{A,C,G,T\}} k_b \log \left[ \frac{1}{2}\, p(b \mid a_1,\, e) + \frac{1}{2}\, p(b \mid a_2,\, e) \right]\]

where

\[\begin{split}p(b \mid a,\, e) = \begin{cases} 1 - e & \text{if } b = a \\ e / 3 & \text{if } b \neq a \end{cases}\end{split}\]

Because \(e\) appears inside a logarithm, it must be strictly positive. Setting \(e = 0\) would give \(\log 0 = -\infty\) for any read that does not match the assumed allele.

The ANGSD default corresponds to Phred quality Q30, i.e. \(e = 10^{-30/10} \approx 0.001\).

Minor allele frequency estimation

simGL.GL_to_Mm() identifies the most likely major (M) and minor (m) allele pair at each site. For every candidate pair \((M, m)\), the function scores individuals by combining their per-individual GL values with the prior probability of each genotype under Hardy–Weinberg equilibrium (equal allele frequencies, \(p = 0.5\)), then multiplies those scores across individuals. The pair with the minimum combined score (highest likelihood) is returned. This approach follows the scoring framework of Skotte et al. (2012) and Nielsen et al. (2012).