Background and Theory ===================== This page gives a brief overview of the statistical models that underpin simGL. Coverage simulation ------------------- Let :math:`n` be the number of individuals and :math:`S` the number of sites. For individual :math:`i`, the per-individual mean depth is sampled once from a Gamma distribution: .. math:: \lambda_i \sim \text{Gamma}\!\left(\frac{\mu^2}{\sigma^2},\,\frac{\sigma^2}{\mu}\right) where :math:`\mu` = ``mean_depth`` and :math:`\sigma` = ``std_depth``. The Gamma parameterisation used here has mean :math:`\mu` and variance :math:`\sigma^2`. For each site :math:`s` of individual :math:`i`, the read count is then: .. math:: d_{s,i} \sim \text{Poisson}(\lambda_i) Read simulation --------------- Given a diploid genotype :math:`g \in \{0\text{/}0,\,0\text{/}1,\,1\text{/}1\}` at a biallelic site, the true allele probability for a randomly sampled read is: .. math:: p(\text{ref} \mid g) = \frac{\text{number of ref haplotypes in }g}{2} With sequencing error rate :math:`e`, each read is flipped to a uniformly random wrong base with probability :math:`e` and retained as the true base with probability :math:`1 - e`. Genotype likelihoods -------------------- Given :math:`d` observed reads, let :math:`k_a` be the count of reads for allele :math:`a \in \{A, C, G, T\}`. For a diploid genotype :math:`\{a_1, a_2\}` the log-likelihood under the ANGSD model is: .. math:: \log P(\text{reads} \mid a_1 a_2) = \sum_{b \in \{A,C,G,T\}} k_b \log \left[ \frac{1}{2}\, p(b \mid a_1,\, e) + \frac{1}{2}\, p(b \mid a_2,\, e) \right] where .. math:: p(b \mid a,\, e) = \begin{cases} 1 - e & \text{if } b = a \\ e / 3 & \text{if } b \neq a \end{cases} Because :math:`e` appears inside a logarithm, it must be strictly positive. Setting :math:`e = 0` would give :math:`\log 0 = -\infty` for any read that does not match the assumed allele. The ANGSD default corresponds to Phred quality Q30, i.e. :math:`e = 10^{-30/10} \approx 0.001`. Minor allele frequency estimation ---------------------------------- :func:`simGL.GL_to_Mm` identifies the most likely major (M) and minor (m) allele pair at each site. For every candidate pair :math:`(M, m)`, the function scores individuals by combining their per-individual GL values with the prior probability of each genotype under Hardy–Weinberg equilibrium (equal allele frequencies, :math:`p = 0.5`), then multiplies those scores across individuals. The pair with the minimum combined score (highest likelihood) is returned. This approach follows the scoring framework of `Skotte et al. (2012) `_ and `Nielsen et al. (2012) `_.