Skip to content

v1.3.0

Latest

Choose a tag to compare

@AdamNogell AdamNogell released this 17 Apr 12:26

ExP Heatmap v1.3.0

This release is a large hardening pass focused on correctness, reproducibility, and portability beyond the canonical 1000 Genomes panel.

Highlights

  • Correctness fix: Pair-specific NaN preservation in compute. A missing value in one population pair no longer erases that locus from every other pair's output.
  • Portability: Plotting, regions, and focus now work on arbitrary custom panels (e.g. GGVP), not just the canonical 1000 Genomes 26-population layout.
  • New CLI commands: filter-vcf, summary, regions, focus, compare.
  • Explicit wide-region downsampling: Static PNG output now uses a documented max/mean/median reducer instead of silent raster compression.
  • First-class test suite: 28 tests, GitHub Actions CI on Python 3.10–3.12.
  • Reproducibility: Dockerfile, environment.yml, GitHub Actions CI.
  • Benchmark harness: Full-pipeline, population-scaling, and local-display benchmark scripts with machine-context JSON.
  • Public-data validation: End-to-end scripts/validation/ pipelines for 1000 Genomes chr15/SLC24A5 (full pipeline) and chr2/LCT (region-scoped reconstruction).

Breaking changes

None of the on-disk TSV column names or CLI argument names were removed. The following behavioral differences may affect downstream tooling that made implicit assumptions:

  1. Per-pair NaN rows are now written to the output TSVs. Previously, a NaN in any single pair removed that locus from all pair files. If your downstream code assumed equal non-NaN row counts across pair files in a compute output directory, update it to filter NaNs per-pair instead of by count.
  2. src/exp_heatmap/tests.py has been removed. The compute engine now lives in src/exp_heatmap/compute_core.py. The public CLI and exp_heatmap.compute wrapper are unchanged, so scripted users are not affected. Python code that imported exp_heatmap.tests directly must switch to exp_heatmap.compute_core.
  3. The legacy exp-selection CLI entry point was removed. Use exp_heatmap (canonical) or exp-heatmap (hyphen alias) instead.
  4. Python 3.8 and 3.9 are no longer tested. Minimum supported version is now Python 3.10 (CI runs 3.10, 3.11, 3.12). Older Python versions may still work but are not verified.
  5. Terminology: The primary name for the reciprocal ordered-pair ranking mode is now directional. The string "2-tailed" continues to work everywhere as a backward-compatible alias. The -log10_p_value_{ascending,descending} TSV column names are retained for compatibility; conceptually they are empirical rank scores, not inferential p-values.

New CLI commands

Command Purpose
filter-vcf Keep only biallelic SNP records. Optional --chrom/--start/--end for region-scoped filtering. Removes the dependency on external vcftools for the standard preprocessing path.
summary Collapse pairwise rows to superpopulation pairs and optionally render an interactive summary heatmap.
regions Extract top-scoring genomic windows into a TSV for downstream review or figure planning.
focus Render an interactive view showing only rows involving a chosen population.
compare Render an interactive side-by-side comparison of two genomic regions from one compute output.

plot and full gained --rank-scores, --max-columns, --column-aggregation, and --dpi options.

Correctness changes

Pair-specific NaN preservation

Before v1.3.0, compute built a combined NaN mask across every population pair and dropped any locus where any pair had a NaN. This silently removed loci from every output file even when the NaN affected only a single pair.

v1.3.0 writes one TSV per pair that preserves that pair's own NaN positions. The rows remain position-aligned across all per-pair files, so downstream plotting keeps the same locus grid but now reflects the real pattern of missingness per pair.

Robust AF extraction

xp_utils.filter_by_AF previously required variants/AF in the input Zarr and crashed on VCFs that did not carry INFO/AF. It now prefers variants/AF when present and otherwise derives alternate allele frequencies from genotypes.

File-safe Matplotlib backend

plot.py now forces the non-interactive Agg backend before importing pyplot. This fixed a macOS crash during static PNG generation and is the more defensible default for a file-producing CLI.

Explicit wide-region downsampling

Static PNG rendering now supports:

  • --max-columns explicit column budget (defaults are derived from figure width × DPI when unset)
  • --column-aggregation max|mean|median reducer
  • --dpi for the saved figure

Vertical marker lines now snap to the nearest rendered column when binning would otherwise drop the exact position.

Custom population panel support

The plotting layer no longer hard-codes the 26-population 1000 Genomes panel. If the compute output directory contains TSVs for the full canonical set, the classical 1000G row ordering and superpopulation annotations are preserved. Otherwise, the population set is inferred from the TSV filenames and a custom-panel heatmap is drawn automatically. This was the fix that unblocked end-to-end GGVP rendering in this release.

New modules

Module Purpose
exp_heatmap.compute_core Core compute loop (replaces the misleadingly named tests.py).
exp_heatmap.vcf_utils Biallelic-SNP VCF filtering with optional region selection.
exp_heatmap.ggvp GGVP dataset helpers: release manifest parsing, integrated panel construction from official IGSR / 1000 Genomes metadata, download with MD5 checksum verification.
exp_heatmap.benchmark_utils Helpers used by the benchmark scripts for region selection, population resolution, and comparison-region construction.

Reproducibility

  • Conda: environment.yml declares an exp-heatmap-dev env with Python 3.11 and the package installed in editable mode with dev extras.
  • Docker: Dockerfile produces a minimal python:3.11-slim image with vcftools and the CLI preinstalled. Entry point is exp_heatmap.
  • CI: .github/workflows/tests.yml runs the pytest suite on Python 3.10, 3.11, and 3.12 on every push and pull request.

Benchmark harness

Three reproducible benchmark scripts live under scripts/benchmarks/:

  • run_local_benchmark.py: local display and reporting-layer benchmark across window sizes, population counts, and output modes.
  • run_pipeline_benchmark.py: controlled full-pipeline benchmark covering prepare, compute, and plot. Writes a machine-specs JSON plus stage-level TSVs.
  • run_population_scaling.py: synthetic ordered-pair display scaling benchmark over 5/10/20/30/50 populations.

Install the optional benchmark extras with:

pip install -e ".[benchmarks]"

Measured results on a reference machine (macOS 14 / Apple M4 Pro / 48 GB RAM / Python 3.12)

Full pipeline on GGVP chr21 (553,906 → 509,924 biallelic SNPs, 505 samples, 5 populations, XP-EHH):

Stage Small (1 Mb, 11,620 SNPs) Full chr21 (509,924 SNPs)
prepare 0.99 s / 0.28 GB 12.64 s / 0.38 GB
compute 2.76 s / 0.21 GB 74.73 s / 0.99 GB
plot 2.34 s / 0.26 GB 2.27 s / 0.74 GB

Display scaling (fixed genomic window, varying ordered-pair row count):

Populations Rows Static PNG Interactive HTML Peak RSS
5 20 0.10 MB 5.4 MB 0.43 GB
50 2,450 2.47 MB 99.5 MB 2.17 GB

Public-data validation

Two reproducible 1000 Genomes validation pipelines ship under scripts/validation/:

run_1kg_chr15_slc24a5.sh (full-pipeline validation)

End-to-end public-data run from the 1000 Genomes Phase 3 chromosome 15 release through filter-vcfpreparecomputeplot, focused on the SLC24A5 pigmentation locus.

  • Filter: 99.59 s (6,456,568 / 6,477,157 records retained)
  • Prepare: 252.98 s
  • Compute (XP-EHH): 3034.42 s
  • Plot static: 28.43 s / interactive: 14.04 s

run_1kg_chr2_lct_reconstruction.sh (region-scoped reconstruction)

Chromosome 2 filtered to the LCT plotting window plus 1 Mb flanking sequence before compute. Completes end-to-end in roughly 7 minutes of wall time, producing the same LCT XP-EHH heatmap as the archived author-prepared bundle.

  • Filter: 268.18 s (81,595 / 7,081,600 records retained)
  • Prepare: 11.18 s
  • Compute: 117.41 s
  • Plot static: 4.61 s / interactive: 2.11 s

Use scripts/validation/summarize_validation_run.py to regenerate JSON / Markdown summaries from log files.

GGVP support

A second public dataset has been validated end-to-end. Under scripts/ggvp/:

  • download_release.py: download per-chromosome GGVP VCF/TBI files with MD5 verification.
  • build_panel.py: construct the integrated population panel from the official GGVP alignment index, 1000 Genomes Phase 3 panel, and 1000 Genomes sequence index. The sequence-index fallback is required because the integrated GGVP VCF contains a handful of HG* samples absent from the legacy 1000 Genomes panel file.

Documentation updates

  • README.md now matches the actual software surface, uses empirical-rank language throughout, includes a benchmarks section with measured numbers, and adds a public-data validation section.

Packaging

  • Python ≥ 3.10 (classifiers cover 3.10, 3.11, 3.12).
  • CLI entry points consolidated: exp_heatmap and exp-heatmap (the legacy exp-selection has been removed).
  • New optional dependency extras:
    • [dev]pytest, pytest-cov
    • [benchmarks]psutil, pysam
    • [all] — both of the above
  • zarr < 3.0.0 constraint retained.
  • Minor metadata polish: added keywords, Documentation URL, more precise classifiers, and a requires-python = ">=3.10" declaration.

Upgrading

pip install --upgrade exp_heatmap

If you previously imported the internal tests module directly:

# before
from exp_heatmap import tests
tests.run(...)

# after
from exp_heatmap import compute_core
compute_core.run(...)

If you previously used the exp-selection CLI command, switch to exp_heatmap or exp-heatmap — the subcommand surface is unchanged.

If you maintained pipelines that relied on the old global-NaN masking semantics, regenerate your compute outputs with v1.3.0 before reusing them with downstream tools.