ExP Heatmap v1.3.0
This release is a large hardening pass focused on correctness, reproducibility, and portability beyond the canonical 1000 Genomes panel.
Highlights
- Correctness fix: Pair-specific NaN preservation in
compute. A missing value in one population pair no longer erases that locus from every other pair's output. - Portability: Plotting,
regions, andfocusnow work on arbitrary custom panels (e.g. GGVP), not just the canonical 1000 Genomes 26-population layout. - New CLI commands:
filter-vcf,summary,regions,focus,compare. - Explicit wide-region downsampling: Static PNG output now uses a documented
max/mean/medianreducer instead of silent raster compression. - First-class test suite: 28 tests, GitHub Actions CI on Python 3.10–3.12.
- Reproducibility:
Dockerfile,environment.yml, GitHub Actions CI. - Benchmark harness: Full-pipeline, population-scaling, and local-display benchmark scripts with machine-context JSON.
- Public-data validation: End-to-end
scripts/validation/pipelines for 1000 Genomes chr15/SLC24A5 (full pipeline) and chr2/LCT (region-scoped reconstruction).
Breaking changes
None of the on-disk TSV column names or CLI argument names were removed. The following behavioral differences may affect downstream tooling that made implicit assumptions:
- Per-pair NaN rows are now written to the output TSVs. Previously, a NaN in any single pair removed that locus from all pair files. If your downstream code assumed equal non-NaN row counts across pair files in a compute output directory, update it to filter NaNs per-pair instead of by count.
src/exp_heatmap/tests.pyhas been removed. The compute engine now lives insrc/exp_heatmap/compute_core.py. The public CLI andexp_heatmap.computewrapper are unchanged, so scripted users are not affected. Python code that importedexp_heatmap.testsdirectly must switch toexp_heatmap.compute_core.- The legacy
exp-selectionCLI entry point was removed. Useexp_heatmap(canonical) orexp-heatmap(hyphen alias) instead. - Python 3.8 and 3.9 are no longer tested. Minimum supported version is now Python 3.10 (CI runs 3.10, 3.11, 3.12). Older Python versions may still work but are not verified.
- Terminology: The primary name for the reciprocal ordered-pair ranking mode is now
directional. The string"2-tailed"continues to work everywhere as a backward-compatible alias. The-log10_p_value_{ascending,descending}TSV column names are retained for compatibility; conceptually they are empirical rank scores, not inferential p-values.
New CLI commands
| Command | Purpose |
|---|---|
filter-vcf |
Keep only biallelic SNP records. Optional --chrom/--start/--end for region-scoped filtering. Removes the dependency on external vcftools for the standard preprocessing path. |
summary |
Collapse pairwise rows to superpopulation pairs and optionally render an interactive summary heatmap. |
regions |
Extract top-scoring genomic windows into a TSV for downstream review or figure planning. |
focus |
Render an interactive view showing only rows involving a chosen population. |
compare |
Render an interactive side-by-side comparison of two genomic regions from one compute output. |
plot and full gained --rank-scores, --max-columns, --column-aggregation, and --dpi options.
Correctness changes
Pair-specific NaN preservation
Before v1.3.0, compute built a combined NaN mask across every population pair and dropped any locus where any pair had a NaN. This silently removed loci from every output file even when the NaN affected only a single pair.
v1.3.0 writes one TSV per pair that preserves that pair's own NaN positions. The rows remain position-aligned across all per-pair files, so downstream plotting keeps the same locus grid but now reflects the real pattern of missingness per pair.
Robust AF extraction
xp_utils.filter_by_AF previously required variants/AF in the input Zarr and crashed on VCFs that did not carry INFO/AF. It now prefers variants/AF when present and otherwise derives alternate allele frequencies from genotypes.
File-safe Matplotlib backend
plot.py now forces the non-interactive Agg backend before importing pyplot. This fixed a macOS crash during static PNG generation and is the more defensible default for a file-producing CLI.
Explicit wide-region downsampling
Static PNG rendering now supports:
--max-columnsexplicit column budget (defaults are derived from figure width × DPI when unset)--column-aggregation max|mean|medianreducer--dpifor the saved figure
Vertical marker lines now snap to the nearest rendered column when binning would otherwise drop the exact position.
Custom population panel support
The plotting layer no longer hard-codes the 26-population 1000 Genomes panel. If the compute output directory contains TSVs for the full canonical set, the classical 1000G row ordering and superpopulation annotations are preserved. Otherwise, the population set is inferred from the TSV filenames and a custom-panel heatmap is drawn automatically. This was the fix that unblocked end-to-end GGVP rendering in this release.
New modules
| Module | Purpose |
|---|---|
exp_heatmap.compute_core |
Core compute loop (replaces the misleadingly named tests.py). |
exp_heatmap.vcf_utils |
Biallelic-SNP VCF filtering with optional region selection. |
exp_heatmap.ggvp |
GGVP dataset helpers: release manifest parsing, integrated panel construction from official IGSR / 1000 Genomes metadata, download with MD5 checksum verification. |
exp_heatmap.benchmark_utils |
Helpers used by the benchmark scripts for region selection, population resolution, and comparison-region construction. |
Reproducibility
- Conda:
environment.ymldeclares anexp-heatmap-devenv with Python 3.11 and the package installed in editable mode with dev extras. - Docker:
Dockerfileproduces a minimalpython:3.11-slimimage withvcftoolsand the CLI preinstalled. Entry point isexp_heatmap. - CI:
.github/workflows/tests.ymlruns the pytest suite on Python 3.10, 3.11, and 3.12 on every push and pull request.
Benchmark harness
Three reproducible benchmark scripts live under scripts/benchmarks/:
run_local_benchmark.py: local display and reporting-layer benchmark across window sizes, population counts, and output modes.run_pipeline_benchmark.py: controlled full-pipeline benchmark coveringprepare,compute, andplot. Writes a machine-specs JSON plus stage-level TSVs.run_population_scaling.py: synthetic ordered-pair display scaling benchmark over 5/10/20/30/50 populations.
Install the optional benchmark extras with:
pip install -e ".[benchmarks]"Measured results on a reference machine (macOS 14 / Apple M4 Pro / 48 GB RAM / Python 3.12)
Full pipeline on GGVP chr21 (553,906 → 509,924 biallelic SNPs, 505 samples, 5 populations, XP-EHH):
| Stage | Small (1 Mb, 11,620 SNPs) | Full chr21 (509,924 SNPs) |
|---|---|---|
| prepare | 0.99 s / 0.28 GB | 12.64 s / 0.38 GB |
| compute | 2.76 s / 0.21 GB | 74.73 s / 0.99 GB |
| plot | 2.34 s / 0.26 GB | 2.27 s / 0.74 GB |
Display scaling (fixed genomic window, varying ordered-pair row count):
| Populations | Rows | Static PNG | Interactive HTML | Peak RSS |
|---|---|---|---|---|
| 5 | 20 | 0.10 MB | 5.4 MB | 0.43 GB |
| 50 | 2,450 | 2.47 MB | 99.5 MB | 2.17 GB |
Public-data validation
Two reproducible 1000 Genomes validation pipelines ship under scripts/validation/:
run_1kg_chr15_slc24a5.sh (full-pipeline validation)
End-to-end public-data run from the 1000 Genomes Phase 3 chromosome 15 release through filter-vcf → prepare → compute → plot, focused on the SLC24A5 pigmentation locus.
- Filter: 99.59 s (6,456,568 / 6,477,157 records retained)
- Prepare: 252.98 s
- Compute (XP-EHH): 3034.42 s
- Plot static: 28.43 s / interactive: 14.04 s
run_1kg_chr2_lct_reconstruction.sh (region-scoped reconstruction)
Chromosome 2 filtered to the LCT plotting window plus 1 Mb flanking sequence before compute. Completes end-to-end in roughly 7 minutes of wall time, producing the same LCT XP-EHH heatmap as the archived author-prepared bundle.
- Filter: 268.18 s (81,595 / 7,081,600 records retained)
- Prepare: 11.18 s
- Compute: 117.41 s
- Plot static: 4.61 s / interactive: 2.11 s
Use scripts/validation/summarize_validation_run.py to regenerate JSON / Markdown summaries from log files.
GGVP support
A second public dataset has been validated end-to-end. Under scripts/ggvp/:
download_release.py: download per-chromosome GGVP VCF/TBI files with MD5 verification.build_panel.py: construct the integrated population panel from the official GGVP alignment index, 1000 Genomes Phase 3 panel, and 1000 Genomes sequence index. The sequence-index fallback is required because the integrated GGVP VCF contains a handful ofHG*samples absent from the legacy 1000 Genomes panel file.
Documentation updates
README.mdnow matches the actual software surface, uses empirical-rank language throughout, includes a benchmarks section with measured numbers, and adds a public-data validation section.
Packaging
- Python ≥ 3.10 (classifiers cover 3.10, 3.11, 3.12).
- CLI entry points consolidated:
exp_heatmapandexp-heatmap(the legacyexp-selectionhas been removed). - New optional dependency extras:
[dev]—pytest,pytest-cov[benchmarks]—psutil,pysam[all]— both of the above
zarr < 3.0.0constraint retained.- Minor metadata polish: added
keywords,DocumentationURL, more preciseclassifiers, and arequires-python = ">=3.10"declaration.
Upgrading
pip install --upgrade exp_heatmapIf you previously imported the internal tests module directly:
# before
from exp_heatmap import tests
tests.run(...)
# after
from exp_heatmap import compute_core
compute_core.run(...)If you previously used the exp-selection CLI command, switch to exp_heatmap or exp-heatmap — the subcommand surface is unchanged.
If you maintained pipelines that relied on the old global-NaN masking semantics, regenerate your compute outputs with v1.3.0 before reusing them with downstream tools.