User Guide

This page describes the main functions of simGL and the considerations that govern how they are used together.

Genotype matrix requirements

simGL operates on biallelic sites only. The gm array passed to simGL.sim_allelereadcounts() must contain only 0 and 1 values (0 = carries reference allele, 1 = carries alternate allele).

When using msprime to generate input, the raw genotype matrix can contain values greater than 1 at sites where multiple independent mutations occur at the same genomic position (recurrent mutation). Filter these out before calling simGL:

gm_full = ts.genotype_matrix()
gm = gm_full[gm_full.max(axis=1) == 1]

Coverage model

There are two ways to specify coverage for the simulated haplotypes:

Single value. Provide a scalar mean_depth and std_depth. simGL draws a mean depth for each haplotype from a gamma distribution with the specified mean and standard deviation.
Per-haplotype vector. Provide a 1-D array with one entry per haplotype. simGL uses those values directly as the per-haplotype Poisson means, which is useful for reproducing a specific coverage profile from real data.

Once the number of reads per haplotype per site is determined, reads are distributed across the four possible bases (A, C, G, T) according to the genotype of the individual at that site and the sequencing error rate e. Setting e = 0 is valid here — it simply means no read errors are introduced.

Read counts can also be generated in two ways, controlled by the depth_type argument:

Independent (default). Read counts at each site are sampled independently from a Poisson distribution, assuming coverage is uncorrelated across sites.
Linked. simGL places whole reads of a fixed length randomly along each haplotype. Read counts at a site are then obtained by counting how many reads overlap that position. This introduces correlation between coverage at nearby sites and is more realistic for short-read data.

Genotype likelihood computation

simGL.allelereadcounts_to_GL() computes genotype log-likelihoods using the ANGSD/GATK formula. The sequencing error rate e appears inside a logarithm, so e must be strictly greater than zero. The ANGSD default (Phred Q30) is e = 0.001. If the user wants to simulate error-free reads, set e to a very small value (e.g. 1e-6) instead of zero.

The output GL array has shape (n_sites, n_individuals, n_genotypes) where n_genotypes = 10 for diploid data (all ACGT genotype pairs). Values are negative log-likelihoods; a smaller value means a more likely genotype.

Subsetting to biallelic genotypes

The full vector of GL is rarely needed downstream. Use simGL.ref_alt_to_index() and simGL.subset_GL() to reduce it to the three biallelic genotypes (hom-ref, het, hom-alt):

Ra     = simGL.ref_alt_to_index(ref, alt)   # shape (n_sites, 2)
GL_sub = simGL.subset_GL(GL, Ra, ploidy=2)  # shape (n_sites, n_ind, 3)

subset_GL always places the hom-ref genotype at index 0, het at 1, and hom-alt at 2, regardless of whether the reference allele has a higher or lower ACGT index than the alternate allele.

If reference and alternative allele information is not available, use simGL.GL_to_Mm() to identify the major and minor alleles from the GL array and pass the result to simGL.subset_GL(). See below for details.

Normalization

simGL.normalize_GL() shifts each individual’s GL vector so that the best (minimum) value becomes 0. This is required before writing a VCF file and is consistent with the convention used by other software such as GATK and ANGSD.

GL_norm = simGL.normalize_GL(GL)

Minor-allele frequency estimation

simGL.GL_to_Mm() identifies the most likely major and minor alleles at each site by scoring every candidate allele pair with a likelihood-based criterion: for each pair it computes a HWE-weighted combination of the GL values across all individuals, then picks the pair with the best (minimum) combined score. It requires polymorphic sites only (i.e. sites where the alternate allele frequency is strictly between 0 and 1). Fixed sites (all ref or all alt) must be removed before calling this function.

Writing a VCF file

simGL.GL_to_vcf() writes a VCF 4.2 file with the following FORMAT fields:

GT — genotype call derived from the argmin of the normalized GL (missing ./. when all GL values are equal, i.e. no reads).
GL — normalized biallelic genotype log-likelihoods (3 values per sample, rounded to 3 decimal places).
AD — allele depth for all four ACGT alleles, taken directly from the arc read-count array.

Writing a pileup file

simGL.allelereadcounts_to_pileup() converts an arc array to a samtools-style pileup file. Each row corresponds to one genomic site; each individual contributes three columns (depth, base string, quality string). This format can be used as input to tools that expect pileup data.

Monomorphic sites

Genotype matrices produced by simulation software typically contain only polymorphic sites. To simulate sequencing reads for monomorphic positions as well (e.g. to model random sequencing errors at invariant sites), simGL.incorporate_monomorphic() inserts rows in which all haplotypes carry the reference allele (value 0) at every position not already present in the polymorphic matrix. The expanded matrix can then be passed directly to simGL.sim_allelereadcounts().

Be aware that a larger matrix increases runtime and memory use, so insert monomorphic sites only when strictly necessary.

gm_full = simGL.incorporate_monomorphic(gm, pos, start=0, end=100_000)