Quickstart
==========

This guide walks through a minimal end-to-end example: simulate a tree sequence,
simulate read counts based on coverage and error rates, compute genotype likelihoods
from those, and write a VCF file.

Simulate a tree sequence
------------------------

The first step is to simulate a tree sequence. Here we use msprime to obtain a
haplotypic genotype matrix of shape ``(n_sites, n_haplotypes)`` that serves as
input to simGL.

simGL requires the input matrix to be biallelic (values 0 or 1 only), so we
filter out any site where more than two alleles are present:

.. code-block:: python

   import msprime
   import numpy as np

   ts = msprime.sim_ancestry(
       samples=10, ploidy=2,
       sequence_length=100_000,
       recombination_rate=1e-8,
       population_size=10_000,
       random_seed=1,
   )
   ts = msprime.sim_mutations(ts, rate=1e-4, random_seed=1)

   # Keep only biallelic sites (required by simGL)
   gm_full = ts.genotype_matrix()           # shape (n_sites, n_haplotypes)
   biallelic = gm_full.max(axis=1) == 1
   gm = gm_full[biallelic]                  # shape (n_biallelic_sites, n_haplotypes)

Each row of ``gm`` is a genomic site; each column is a haplotype (two per diploid
individual). Entries are 0 (reference allele) and 1 (alternative allele). For
example, with 3 individuals (6 haplotypes):

.. code-block:: text

   gm (first 3 sites):
   [[1 1 1 0 0 0],   # site 0: ind0=hom-alt, ind1=het,     ind2=hom-ref
    [1 0 0 1 1 1],   # site 1: ind0=het,     ind1=het,     ind2=hom-alt
    [1 0 1 0 0 0]]   # site 2: ind0=het,     ind1=het,     ind2=hom-ref

Get reference and alternative alleles
--------------------------------------

Extract the reference and alternative allele labels from the tree sequence before
simulating reads. These are needed both for read simulation and for labelling the
output VCF:

.. code-block:: python

   variants = list(ts.variants())
   ref = np.array([v.alleles[0] for v in variants])[biallelic]
   alt = np.array([v.alleles[1] for v in variants])[biallelic]

Simulate read counts
--------------------

Once we have the genotype matrix and allele labels, simulate read counts for each
individual at each site. The output ``arc`` has shape
``(n_sites, n_individuals, 4)``, where the last dimension holds read counts for
A, C, G, T in that order:

.. code-block:: python

   import simGL

   arc = simGL.sim_allelereadcounts(
       gm,
       mean_depth=10.0,
       std_depth=2.0,
       e=0.01,       # sequencing error rate for read simulation
       ploidy=2,
       seed=42,
       ref=ref,
       alt=alt,
   )
   # arc shape: (n_sites, n_individuals, 4)  — counts for A, C, G, T

For example, ``arc[0]`` for site 0 with ref=T and alt=A:

.. code-block:: text

   arc[0]:
   [[17  0  0  0],   # ind0: 17 alt (A) reads — hom-alt
    [ 7  0  1  9],   # ind1: 7 alt (A), 9 ref (T), 1 error read — het
    [ 0  0  1 15]]   # ind2: 15 ref (T), 1 error read — hom-ref

**Specifying coverage per individual.** By default simGL draws one mean depth per
haplotype from a gamma distribution with the given ``mean_depth`` and ``std_depth``.
To reproduce a specific per-individual coverage profile, for example, if the user
is simulating a specific dataset for which the coverage per individual is known, 
pass a 1-D array of mean depths with one entry per haplotype. Note that the 
coverage is per haplotype, therefore, for diploid data, the array should have two 
entries per individual (each having half the individual's mean depth):

.. code-block:: python

   mean_depths = np.array([5., 5., 10., 10., 3., 3.,   # individuals 0–2
                            5., 5., 10., 10., 3., 3.,   # individuals 3–5
                            5., 5., 10., 10., 3., 3.,   # individuals 6–8
                            5., 5.])                     # individual 9
   arc_perind = simGL.sim_allelereadcounts(
       gm, mean_depth=mean_depths, std_depth=0.,
       e=0.01, ploidy=2, seed=1, ref=ref, alt=alt,
   )

**Simulating coverage correlations across sites (linked mode).** By default
coverage is sampled independently at each site from a Poisson distribution.
Setting ``depth_type="linked"`` instead places whole reads of a fixed length (``read_length``)
randomly across the region, so that nearby sites share coverage from overlapping reads:

.. code-block:: python

   pos = np.array([int(v.site.position) for v in variants])[biallelic]

   arc_linked = simGL.sim_allelereadcounts(
       gm, mean_depth=10., std_depth=2., e=0.01, ploidy=2, seed=1,
       ref=ref, alt=alt,
       depth_type="linked", read_length=100,
       start=0, end=100_000, pos=pos,
   )

Compute genotype likelihoods (GL)
----------------------------------

With the allele read counts, compute genotype likelihoods (GLs) for each
individual at each site. Likelihoods are computed for all possible diploid
genotypes (10 for diploids organisms with four-allele ACGT data), giving shape
``(n_sites, n_individuals, 10)``:

.. code-block:: python

   GL = simGL.allelereadcounts_to_GL(arc, e=0.01, ploidy=2)
   # GL shape: (n_sites, n_individuals, 10)  — one value per diploid ACGT genotype

For example, ``GL[0]`` for site 0 (ref=T, alt=A). The 10 genotype columns follow
the order AA, AC, AG, AT, CC, CG, CT, GG, GT, TT:

.. code-block:: text

   GL[0]:
   #          AA      AC      AG      AT      CC      CG      CT      GG      GT      TT
   ind0:  [  0.0 ,  11.7 ,  11.7 ,  11.7 ,  96.8 ,  96.8 ,  96.8 ,  96.8 ,  96.8 ,  96.8 ]  # hom-alt (AA)
   ind1:  [ 40.2 ,  45.0 ,  40.0 ,   0.0 ,  80.1 ,  75.1 ,  35.0 ,  74.4 ,  30.0 ,  28.8 ]  # het (AT)
   ind2:  [ 85.4 ,  85.4 ,  80.4 ,  10.3 ,  85.4 ,  80.4 ,  10.3 ,  79.7 ,   5.3 ,   0.0 ]  # hom-ref (TT)

Each row is normalized so that the minimum (most likely genotype) equals 0.
This can be prevented by setting ``normalized=False`` in the function call.

Subset to biallelic GLs and write a VCF
-----------------------------------------

The full genotype GL is rarely needed downstream. Use :func:`simGL.subset_GL` to reduce it to the
three biallelic genotypes (e.g., hom-ref, het, hom-alt):

.. code-block:: python

   Ra      = simGL.ref_alt_to_index(ref, alt)        # shape (n_sites, 2)
   GL_sub  = simGL.subset_GL(GL, Ra, ploidy=2)       # shape (n_sites, n_ind, 3)

For example, ``GL_sub[0]`` for site 0:

.. code-block:: text

   GL_sub[0]:
   #           hom-ref   het   hom-alt
   ind0:  [  96.8 ,  11.7 ,   0.0 ]   # hom-alt most likely
   ind1:  [  28.8 ,   0.0 ,  40.2 ]   # het most likely
   ind2:  [   0.0 ,  10.3 ,  85.4 ]   # hom-ref most likely

If reference and alternative alleles are not available, use
:func:`simGL.GL_to_Mm` to identify the major and minor alleles directly from the
GL array, then pass the result to :func:`simGL.subset_GL`:

.. code-block:: python

   Mm     = simGL.GL_to_Mm(GL, ploidy=2)       # shape (n_sites, 2) — allele indices
   GL_sub = simGL.subset_GL(GL, Mm, ploidy=2)
   GL_sub = simGL.normalize_GL(GL_sub)

Finally, write a VCF file with genotype calls, likelihoods, and allele depths:

.. code-block:: python

   pos   = np.array([int(v.site.position) for v in variants])[biallelic] + 1
   names = [f"ind{i}" for i in range(ts.num_individuals)]

   simGL.GL_to_vcf(GL_sub, arc, ref, alt, pos, names, "output.vcf")

The resulting ``output.vcf`` contains ``GT:GL:AD`` fields for every sample at
every biallelic site. The first two data lines look like:

.. code-block:: text

   #CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT   ind0                        ind1                      ind2
   1       59   .   T    A    .     .       .     GT:GL:AD  1/1:96.8,11.7,0.0:17,0,0,0  0/1:28.8,0.0,40.2:7,0,1,9  0/0:0.0,10.3,85.4:0,0,1,15
   1       114  .   T    A    .     .       .     GT:GL:AD  0/1:62.5,0.0,45.4:14,0,0,11  0/1:39.5,0.0,33.8:9,0,0,8  1/1:113.9,13.8,0.0:20,0,0,0