Quickstart

This guide walks through a minimal end-to-end example: simulate a tree sequence, simulate read counts based on coverage and error rates, compute genotype likelihoods from those, and write a VCF file.

Simulate a tree sequence

The first step is to simulate a tree sequence. Here we use msprime to obtain a haplotypic genotype matrix of shape (n_sites, n_haplotypes) that serves as input to simGL.

simGL requires the input matrix to be biallelic (values 0 or 1 only), so we filter out any site where more than two alleles are present:

import msprime
import numpy as np

ts = msprime.sim_ancestry(
    samples=10, ploidy=2,
    sequence_length=100_000,
    recombination_rate=1e-8,
    population_size=10_000,
    random_seed=1,
)
ts = msprime.sim_mutations(ts, rate=1e-4, random_seed=1)

# Keep only biallelic sites (required by simGL)
gm_full = ts.genotype_matrix()           # shape (n_sites, n_haplotypes)
biallelic = gm_full.max(axis=1) == 1
gm = gm_full[biallelic]                  # shape (n_biallelic_sites, n_haplotypes)

Each row of gm is a genomic site; each column is a haplotype (two per diploid individual). Entries are 0 (reference allele) and 1 (alternative allele). For example, with 3 individuals (6 haplotypes):

gm (first 3 sites):
[[1 1 1 0 0 0],   # site 0: ind0=hom-alt, ind1=het,     ind2=hom-ref
 [1 0 0 1 1 1],   # site 1: ind0=het,     ind1=het,     ind2=hom-alt
 [1 0 1 0 0 0]]   # site 2: ind0=het,     ind1=het,     ind2=hom-ref

Get reference and alternative alleles

Extract the reference and alternative allele labels from the tree sequence before simulating reads. These are needed both for read simulation and for labelling the output VCF:

variants = list(ts.variants())
ref = np.array([v.alleles[0] for v in variants])[biallelic]
alt = np.array([v.alleles[1] for v in variants])[biallelic]

Simulate read counts

Once we have the genotype matrix and allele labels, simulate read counts for each individual at each site. The output arc has shape (n_sites, n_individuals, 4), where the last dimension holds read counts for A, C, G, T in that order:

import simGL

arc = simGL.sim_allelereadcounts(
    gm,
    mean_depth=10.0,
    std_depth=2.0,
    e=0.01,       # sequencing error rate for read simulation
    ploidy=2,
    seed=42,
    ref=ref,
    alt=alt,
)
# arc shape: (n_sites, n_individuals, 4)  — counts for A, C, G, T

For example, arc[0] for site 0 with ref=T and alt=A:

arc[0]:
[[17  0  0  0],   # ind0: 17 alt (A) reads — hom-alt
 [ 7  0  1  9],   # ind1: 7 alt (A), 9 ref (T), 1 error read — het
 [ 0  0  1 15]]   # ind2: 15 ref (T), 1 error read — hom-ref

Specifying coverage per individual. By default simGL draws one mean depth per haplotype from a gamma distribution with the given mean_depth and std_depth. To reproduce a specific per-individual coverage profile, for example, if the user is simulating a specific dataset for which the coverage per individual is known, pass a 1-D array of mean depths with one entry per haplotype. Note that the coverage is per haplotype, therefore, for diploid data, the array should have two entries per individual (each having half the individual’s mean depth):

mean_depths = np.array([5., 5., 10., 10., 3., 3.,   # individuals 0–2
                         5., 5., 10., 10., 3., 3.,   # individuals 3–5
                         5., 5., 10., 10., 3., 3.,   # individuals 6–8
                         5., 5.])                     # individual 9
arc_perind = simGL.sim_allelereadcounts(
    gm, mean_depth=mean_depths, std_depth=0.,
    e=0.01, ploidy=2, seed=1, ref=ref, alt=alt,
)

Simulating coverage correlations across sites (linked mode). By default coverage is sampled independently at each site from a Poisson distribution. Setting depth_type="linked" instead places whole reads of a fixed length (read_length) randomly across the region, so that nearby sites share coverage from overlapping reads:

pos = np.array([int(v.site.position) for v in variants])[biallelic]

arc_linked = simGL.sim_allelereadcounts(
    gm, mean_depth=10., std_depth=2., e=0.01, ploidy=2, seed=1,
    ref=ref, alt=alt,
    depth_type="linked", read_length=100,
    start=0, end=100_000, pos=pos,
)

Compute genotype likelihoods (GL)

With the allele read counts, compute genotype likelihoods (GLs) for each individual at each site. Likelihoods are computed for all possible diploid genotypes (10 for diploids organisms with four-allele ACGT data), giving shape (n_sites, n_individuals, 10):

GL = simGL.allelereadcounts_to_GL(arc, e=0.01, ploidy=2)
# GL shape: (n_sites, n_individuals, 10)  — one value per diploid ACGT genotype

For example, GL[0] for site 0 (ref=T, alt=A). The 10 genotype columns follow the order AA, AC, AG, AT, CC, CG, CT, GG, GT, TT:

GL[0]:
#          AA      AC      AG      AT      CC      CG      CT      GG      GT      TT
ind0:  [  0.0 ,  11.7 ,  11.7 ,  11.7 ,  96.8 ,  96.8 ,  96.8 ,  96.8 ,  96.8 ,  96.8 ]  # hom-alt (AA)
ind1:  [ 40.2 ,  45.0 ,  40.0 ,   0.0 ,  80.1 ,  75.1 ,  35.0 ,  74.4 ,  30.0 ,  28.8 ]  # het (AT)
ind2:  [ 85.4 ,  85.4 ,  80.4 ,  10.3 ,  85.4 ,  80.4 ,  10.3 ,  79.7 ,   5.3 ,   0.0 ]  # hom-ref (TT)

Each row is normalized so that the minimum (most likely genotype) equals 0. This can be prevented by setting normalized=False in the function call.

Subset to biallelic GLs and write a VCF

The full genotype GL is rarely needed downstream. Use simGL.subset_GL() to reduce it to the three biallelic genotypes (e.g., hom-ref, het, hom-alt):

Ra      = simGL.ref_alt_to_index(ref, alt)        # shape (n_sites, 2)
GL_sub  = simGL.subset_GL(GL, Ra, ploidy=2)       # shape (n_sites, n_ind, 3)

For example, GL_sub[0] for site 0:

GL_sub[0]:
#           hom-ref   het   hom-alt
ind0:  [  96.8 ,  11.7 ,   0.0 ]   # hom-alt most likely
ind1:  [  28.8 ,   0.0 ,  40.2 ]   # het most likely
ind2:  [   0.0 ,  10.3 ,  85.4 ]   # hom-ref most likely

If reference and alternative alleles are not available, use simGL.GL_to_Mm() to identify the major and minor alleles directly from the GL array, then pass the result to simGL.subset_GL():

Mm     = simGL.GL_to_Mm(GL, ploidy=2)       # shape (n_sites, 2) — allele indices
GL_sub = simGL.subset_GL(GL, Mm, ploidy=2)
GL_sub = simGL.normalize_GL(GL_sub)

Finally, write a VCF file with genotype calls, likelihoods, and allele depths:

pos   = np.array([int(v.site.position) for v in variants])[biallelic] + 1
names = [f"ind{i}" for i in range(ts.num_individuals)]

simGL.GL_to_vcf(GL_sub, arc, ref, alt, pos, names, "output.vcf")

The resulting output.vcf contains GT:GL:AD fields for every sample at every biallelic site. The first two data lines look like:

#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT   ind0                        ind1                      ind2
1       59   .   T    A    .     .       .     GT:GL:AD  1/1:96.8,11.7,0.0:17,0,0,0  0/1:28.8,0.0,40.2:7,0,1,9  0/0:0.0,10.3,85.4:0,0,1,15
1       114  .   T    A    .     .       .     GT:GL:AD  0/1:62.5,0.0,45.4:14,0,0,11  0/1:39.5,0.0,33.8:9,0,0,8  1/1:113.9,13.8,0.0:20,0,0,0