Background and Theory
This page gives a brief overview of the statistical models that underpin simGL.
Coverage simulation
Let \(n\) be the number of individuals and \(S\) the number of sites. For individual \(i\), the per-individual mean depth is sampled once from a Gamma distribution:
where \(\mu\) = mean_depth and \(\sigma\) = std_depth.
The Gamma parameterisation used here has mean \(\mu\) and variance
\(\sigma^2\).
For each site \(s\) of individual \(i\), the read count is then:
Read simulation
Given a diploid genotype \(g \in \{0\text{/}0,\,0\text{/}1,\,1\text{/}1\}\) at a biallelic site, the true allele probability for a randomly sampled read is:
With sequencing error rate \(e\), each read is flipped to a uniformly random wrong base with probability \(e\) and retained as the true base with probability \(1 - e\).
Genotype likelihoods
Given \(d\) observed reads, let \(k_a\) be the count of reads for allele \(a \in \{A, C, G, T\}\). For a diploid genotype \(\{a_1, a_2\}\) the log-likelihood under the ANGSD model is:
where
Because \(e\) appears inside a logarithm, it must be strictly positive. Setting \(e = 0\) would give \(\log 0 = -\infty\) for any read that does not match the assumed allele.
The ANGSD default corresponds to Phred quality Q30, i.e. \(e = 10^{-30/10} \approx 0.001\).
Minor allele frequency estimation
simGL.GL_to_Mm() identifies the most likely major (M) and minor (m)
allele pair at each site. For every candidate pair \((M, m)\), the
function scores individuals by combining their per-individual GL values
with the prior probability of each genotype under Hardy–Weinberg equilibrium
(equal allele frequencies, \(p = 0.5\)), then multiplies those scores
across individuals. The pair with the minimum combined score (highest
likelihood) is returned. This approach follows the scoring framework of
Skotte et al. (2012) and
Nielsen et al. (2012).