
Perf/pipeline and animation overhaul #62

Open
LuciNyan wants to merge 52 commits into main from perf/pipeline-and-animation-overhaul

Conversation

@LuciNyan
Owner

No description provided.

LuciNyan added 30 commits April 11, 2026 17:16
… GIF

Phase 1 — Pipeline Abstraction:
- Introduce declarative ShaderPass/Pipeline types in src/pipeline.ts
- Refactor stats-renderer and crt-renderer to use executePipeline()
- Theme-specific shader sequences now configurable as data
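
The "pipeline as data" idea from Phase 1 can be sketched as follows. This is a minimal illustration only: the `ShaderPass` and `Pipeline` shapes, the `executePipeline` signature, and the `scanlinePass` example are assumptions, not the definitions in src/pipeline.ts.

```typescript
// Hypothetical shapes for illustration; the types in src/pipeline.ts may differ.
type Pixels = Uint8Array; // RGBA, 4 bytes per pixel

interface ShaderPass {
  name: string;
  run: (pixels: Pixels, width: number, height: number) => Pixels;
}

type Pipeline = ShaderPass[];

// Runs each pass in order, feeding the output of one pass into the next.
function executePipeline(pipeline: Pipeline, source: Pixels, width: number, height: number): Pixels {
  let current = source;
  for (let i = 0; i < pipeline.length; i++) {
    current = pipeline[i].run(current, width, height);
  }
  return current;
}

// Example pass: halve the RGB values of every odd row (a crude scanline).
const scanlinePass: ShaderPass = {
  name: "scanline",
  run: (pixels, width, height) => {
    const out = new Uint8Array(pixels); // copy so the input stays untouched
    const rowBytes = width * 4;
    for (let y = 1; y < height; y += 2) {
      for (let i = y * rowBytes; i < (y + 1) * rowBytes; i += 4) {
        out[i] >>= 1;
        out[i + 1] >>= 1;
        out[i + 2] >>= 1;
      }
    }
    return out;
  },
};

// A theme's shader sequence is then plain data.
const examplePipeline: Pipeline = [scanlinePass];
```

With this shape, swapping or reordering shaders for a theme means editing an array rather than renderer code.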

Phase 2 — Performance Optimization:
- Glow shader (24-44% faster): pre-compute falloff weight table and
  per-pixel luminance map (Float64Array for precision), inline final
  blending to eliminate render() overhead and ~9M temporary allocations
- Scanline shader: replace render() with direct buffer copy + multiply
- Border shader: replace render() with direct buffer copy + alpha edit
- All optimizations preserve exact numerical output (46 snapshot tests
  pass unchanged)

Animated GIF Output:
- Pure TypeScript GIF89a encoder with LZW compression and color quantization
- Animation orchestration: crt-flicker, glow-pulse, scanline-scroll effects
- New renderAnimatedGif() API and AnimationOptions type

Made-with: Cursor
Extract shared bilinear sampling to utils/bilinear.ts. Inline all
remaining shaders that used the render() framework, eliminating
per-pixel function call chains and temporary array allocations.

- curve: 112ms → 22ms (80% faster)
- orderedBayer: 88ms → 59ms (33% faster)
- pixelate (center): 3.7ms → 0.6ms (84% faster)
- halftone: inlined with direct buffer ops

All 46 snapshot tests pass with zero pixel differences.

Made-with: Cursor
Use row and column prefix sums to skip blur kernel iteration for pixels
with no bright neighbors in range. For typical pixel-profile cards (dark
background with bright text/elements), this skips 40-70% of blur work.

Realistic card benchmark: CRT config 49% faster (301ms → 154ms),
screen effect 37% faster (49ms → 31ms). Vertical pass reordered to
y-outer for cache-friendly prefix lookups.
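
The prefix-sum skip described above can be illustrated on a single row. This is a sketch under assumptions: a simple box blur stands in for the real glow kernel, and `buildRowPrefix`, `hasBrightNeighbor`, and `blurRow` are hypothetical names.

```typescript
// prefix[x + 1] = number of bright pixels in columns [0, x] of one row.
function buildRowPrefix(luminance: Float64Array, threshold: number): Uint32Array {
  const prefix = new Uint32Array(luminance.length + 1);
  for (let x = 0; x < luminance.length; x++) {
    prefix[x + 1] = prefix[x] + (luminance[x] > threshold ? 1 : 0);
  }
  return prefix;
}

// O(1) check: is any pixel within `radius` columns of x above the threshold?
function hasBrightNeighbor(prefix: Uint32Array, x: number, radius: number): boolean {
  const width = prefix.length - 1;
  const lo = Math.max(0, x - radius);
  const hi = Math.min(width - 1, x + radius);
  return prefix[hi + 1] - prefix[lo] > 0;
}

// Box blur of the bright pixels (a stand-in for the real weighted kernel).
function blurRow(luminance: Float64Array, threshold: number, radius: number): Float64Array {
  const width = luminance.length;
  const prefix = buildRowPrefix(luminance, threshold);
  const out = new Float64Array(width); // zero-filled, so skipped pixels stay 0
  for (let x = 0; x < width; x++) {
    if (!hasBrightNeighbor(prefix, x, radius)) continue; // skip the kernel entirely
    let sum = 0;
    for (let dx = -radius; dx <= radius; dx++) {
      const sx = Math.min(width - 1, Math.max(0, x + dx));
      if (luminance[sx] > threshold) sum += luminance[sx];
    }
    out[x] = sum / (2 * radius + 1);
  }
  return out;
}
```

On a mostly dark row, the inner kernel loop runs only for pixels near bright content, which is where the reported savings would come from.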

Made-with: Cursor
Use Resvg's RenderedImage.pixels directly instead of encoding to PNG
(asPng) then decoding back via Jimp. Also replace Jimp's per-pixel
scan callback with Buffer.from() for direct memory copy.

Saves ~22ms per render by eliminating redundant PNG codec work.

Made-with: Cursor
Access Jimp bitmap data directly after resize instead of encoding to
PNG then decoding back. Removes one redundant PNG encode/decode cycle.

Made-with: Cursor
Replace per-pixel render() framework with per-block pre-computation
for all three pixelation sampling modes. The dominant mode was the
single largest bottleneck — it computed the same block color for every
pixel, causing avatar processing to take 6+ seconds. Now computes
once per block (~4900 blocks vs ~212K pixels for a typical avatar).

Results: blue_chill+glow 6.5s → 325ms (95% faster), CRT slow mode
1.4s → 463ms (68% faster), total test suite 45% faster.

No shaders import render() anymore — all have been inlined or
refactored to use direct buffer operations.

Made-with: Cursor
Eliminate function call overhead and tuple allocation for bilinear
sampling in the two shaders that call it most heavily per pixel.

CRT: 11 sampleBilinear calls/pixel (3 RGB channels + 8 bloom) × 527K
pixels = ~5.8M tuple allocations per render. Now zero allocations.
Also hoist per-row invariants (cy², rowBase) out of the inner loop.

Curve: 1 sampleBilinear call/pixel × 527K pixels. Inline eliminates
527K tuple allocations. Hoist ccY², oneMinusUvY per row.
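
The difference between a tuple-returning sampler and an allocation-free one might look like the sketch below. Both functions are illustrative assumptions (the real `sampleBilinear` in utils/bilinear.ts may have a different signature), and coordinates are assumed to be already clamped to the image.

```typescript
// Tuple-returning helper: allocates one array per call.
function sampleBilinear(
  src: Uint8Array, width: number, height: number, x: number, y: number,
): [number, number, number] {
  const x0 = x | 0;
  const y0 = y | 0;
  const x1 = Math.min(x0 + 1, width - 1);
  const y1 = Math.min(y0 + 1, height - 1);
  const fx = x - x0;
  const fy = y - y0;
  const rgb: [number, number, number] = [0, 0, 0];
  for (let c = 0; c < 3; c++) {
    const top = src[(y0 * width + x0) * 4 + c] * (1 - fx) + src[(y0 * width + x1) * 4 + c] * fx;
    const bottom = src[(y1 * width + x0) * 4 + c] * (1 - fx) + src[(y1 * width + x1) * 4 + c] * fx;
    rgb[c] = top * (1 - fy) + bottom * fy;
  }
  return rgb;
}

// Allocation-free variant: hoists row/column offsets and writes into dst.
function sampleBilinearInto(
  src: Uint8Array, width: number, height: number, x: number, y: number,
  dst: Float64Array, dstIndex: number,
): void {
  const x0 = x | 0;
  const y0 = y | 0;
  const x1 = Math.min(x0 + 1, width - 1);
  const y1 = Math.min(y0 + 1, height - 1);
  const fx = x - x0;
  const fy = y - y0;
  const row0 = y0 * width * 4; // computed once instead of per corner
  const row1 = y1 * width * 4;
  const col0 = x0 << 2;
  const col1 = x1 << 2;
  for (let c = 0; c < 3; c++) {
    const top = src[row0 + col0 + c] * (1 - fx) + src[row0 + col1 + c] * fx;
    const bottom = src[row1 + col0 + c] * (1 - fx) + src[row1 + col1 + c] * fx;
    dst[dstIndex + c] = top * (1 - fy) + bottom * fy;
  }
}
```

Because the arithmetic order is unchanged, both versions produce the same values; only the allocation and the repeated index math go away.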

Benchmarks (median, 10 runs):
  CRT  (1226×430): 89.9ms → 41.4ms (54% faster)
  CRT  (900×594):  99.1ms → 45.6ms (54% faster)
  Curve(1226×430): 21.3ms → 14.7ms (31% faster)

All 46 snapshot tests pass with zero pixel differences.

Made-with: Cursor
Jimp's PNG encoder (pure JavaScript) was the biggest non-shader
bottleneck: 23.7ms per final render. Replace with a minimal PNG
encoder that writes the PNG structure directly and delegates
compression to Node.js native zlib.deflateSync.

The encoder supports RGBA 8-bit color (the only format pixel-profile
uses), generating valid PNG files with IHDR + IDAT + IEND chunks.

Also converts getPngBufferFromPixels and getBase64FromPixels from
async (Jimp callback) to synchronous (direct encode), eliminating
Promise overhead for these hot-path operations.

Benchmarks (median, 10 runs):
  Avatar base64 (200×200): 3.2ms → 0.3ms (91% faster)
  Final PNG   (1226×430):  23.7ms → 3.1ms (87% faster)
  CRT PNG      (900×594):  ~23ms → 3.0ms  (87% faster)

Test suite: 8.29s → 6.87s (17% faster overall).
All 46 snapshot tests pass with zero pixel differences.

Made-with: Cursor
… shader

Math.sqrt typically maps to a hardware instruction, while Math.pow is a
general-purpose software routine. Since pow(x, 0.25) = sqrt(sqrt(x)) for
x ≥ 0, this substitution is mathematically equivalent but significantly faster.
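
A small sketch of the substitution, with hypothetical helper names. The two forms are mathematically equal; in floating point they can differ in the last bits, so the helper below measures the relative difference rather than assuming bit equality.

```typescript
function fourthRootPow(x: number): number {
  return Math.pow(x, 0.25);
}

function fourthRootSqrt(x: number): number {
  return Math.sqrt(Math.sqrt(x));
}

// Largest relative difference between the two forms over (0, 1].
function maxRelativeDifference(samples: number): number {
  let worst = 0;
  for (let i = 1; i <= samples; i++) {
    const x = i / samples;
    const a = fourthRootPow(x);
    const b = fourthRootSqrt(x);
    worst = Math.max(worst, Math.abs(a - b) / a);
  }
  return worst;
}
```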

curve shader: 16.4ms → 6.4ms (61% faster)
Screen pipeline: 42.5ms → 32.8ms (23% faster)

Made-with: Cursor
…in bilinear samples

In CRT and curve shaders, bilinear sample coordinates are already
clamped to [0, maxX/maxY] before floor+clamp operations. Since
coords are non-negative, Math.floor(x) === (x | 0), and the outer
Math.min/Math.max are redundant. Removing them eliminates ~6 function
calls per bilinear sample across 11+ samples per pixel in CRT.
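
The equivalence only holds for non-negative values that fit in a signed 32-bit integer, which is why the prior clamp matters. A minimal check (the helper name is hypothetical):

```typescript
// True when the bitwise truncation matches Math.floor for this input.
function truncationMatchesFloor(x: number): boolean {
  return Math.floor(x) === (x | 0);
}

// Non-negative coordinates: both forms agree.
const agreesForPositive = truncationMatchesFloor(3.7); // floor = 3, (3.7 | 0) = 3

// Negative values: | 0 truncates toward zero, floor rounds down.
const agreesForNegative = truncationMatchesFloor(-3.7); // floor = -4, (-3.7 | 0) = -3

// Values at or above 2^31 wrap around under | 0.
const agreesForHuge = truncationMatchesFloor(2 ** 31 + 0.5);
```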

CRT shader: 41.4ms → 36.9ms (11% faster)

Made-with: Cursor
halftone: Remove dead Math.random() call (ditherRange is always 0, so
the result is discarded). Replace Math.floor modulo with bitwise AND
for power-of-2 block size. Inline constant dotDensity=1.
4.3ms → 1.5ms (65% faster).

orderedBayer: Pre-compute dither limit lookup table and replace division
in lightnessStep/saturationStep with multiplication by reciprocal.
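
Both tricks can be sketched together. The standard 4×4 Bayer matrix is used here as an assumption; the actual dither table and step computation in orderedBayer are not shown in this commit message.

```typescript
// Standard 4x4 Bayer matrix, flattened row by row (assumed for illustration).
const BAYER_4X4 = Uint8Array.of(0, 8, 2, 10, 12, 4, 14, 6, 3, 11, 1, 9, 15, 7, 13, 5);

// Pre-computed thresholds: multiply by the reciprocal instead of dividing per pixel.
const INV_16 = 1 / 16;
const THRESHOLDS = new Float64Array(16);
for (let i = 0; i < 16; i++) {
  THRESHOLDS[i] = (BAYER_4X4[i] + 0.5) * INV_16;
}

// For a power-of-two size and a non-negative integer x, x & (size - 1) === x % size.
function maskMod(x: number, size: number): number {
  return x & (size - 1);
}

// Table lookup replaces the per-pixel modulo and division.
function thresholdAt(x: number, y: number): number {
  return THRESHOLDS[(maskMod(y, 4) << 2) | maskMod(x, 4)];
}
```

Multiplying by 1/16 is exact because 16 is a power of two; for other divisors, multiplying by a reciprocal can round differently from a true division, so output should be re-checked against snapshots.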

Made-with: Cursor
All shaders now use inlined pixel iteration loops instead of the
generic render() framework. The render() function, texture-filter
module (linear/nearest), and their associated test file are no
longer referenced by any production code.

Removed:
- renderer/render.ts (generic shader execution framework)
- renderer/texture-filter/ (linear, nearest sampling strategies)
- test/renderer.test.ts (tests for removed code)

Kept:
- renderer/common.ts (RGBA type, still imported by utils)

Made-with: Cursor
…ode (91%↑)

Replace sampleBilinear calls with inline computation to avoid 530K
tuple allocations per frame. Replace Map<string> color tracking with
pre-allocated Float64Array + direct float equality comparison,
eliminating number-to-string conversion overhead.
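
A sketch of counting colors with pre-allocated typed arrays instead of a string-keyed Map. The key packing and the `dominantColor` signature are assumptions made for illustration.

```typescript
// Returns the packed RGB key (r * 65536 + g * 256 + b) of the most frequent color
// among `count` consecutive RGBA pixels starting at byte offset `start`.
// `keys` and `counts` are scratch arrays with capacity >= count, reused across blocks.
function dominantColor(
  pixels: Uint8Array, start: number, count: number,
  keys: Float64Array, counts: Uint32Array,
): number {
  let distinct = 0;
  let bestKey = -1;
  let bestCount = 0;
  for (let p = 0; p < count; p++) {
    const i = start + p * 4;
    const key = pixels[i] * 65536 + pixels[i + 1] * 256 + pixels[i + 2];
    // Linear search with direct numeric equality; no string conversion.
    let slot = 0;
    while (slot < distinct && keys[slot] !== key) slot++;
    if (slot === distinct) {
      keys[slot] = key;
      counts[slot] = 0;
      distinct++;
    }
    const c = ++counts[slot];
    if (c > bestCount) {
      bestCount = c;
      bestKey = key;
    }
  }
  return bestKey;
}
```

The linear search is reasonable here because a small block rarely contains many distinct colors.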

pixelate(dominant): 141ms → 13ms (11× faster)

Made-with: Cursor
Lowering the deflate compression level from 6 to 1 cuts deflateSync time from ~32ms to ~28ms with <0.4%
file size increase. For server-rendered GitHub cards where latency
matters more than bytes, this is a good tradeoff.

Made-with: Cursor
Exploit spatial coherence in pixelated input: when consecutive pixels
share the same RGB value (blockSize=4 → 75% of pixels), skip the
expensive RGB→HSL conversion + palette lookup + step quantization.
Cache previous pixel's HSL and derived dither parameters.
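
The previous-pixel cache can be sketched generically. Here `convert` stands in for the expensive RGB→HSL conversion, palette lookup, and quantization; the function name and the one-byte output are assumptions.

```typescript
// Applies `convert` to each RGBA pixel, reusing the last result when the RGB
// value repeats (common in pixelated input, where runs share one color).
function mapWithPreviousPixelCache(
  pixels: Uint8Array,
  convert: (r: number, g: number, b: number) => number,
): Uint8Array {
  const out = new Uint8Array(pixels.length >> 2);
  let prevKey = -1; // no valid packed RGB is negative
  let prevValue = 0;
  for (let i = 0, p = 0; i < pixels.length; i += 4, p++) {
    const key = (pixels[i] << 16) | (pixels[i + 1] << 8) | pixels[i + 2];
    if (key !== prevKey) {
      prevValue = convert(pixels[i], pixels[i + 1], pixels[i + 2]);
      prevKey = key;
    }
    out[p] = prevValue;
  }
  return out;
}
```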

orderedBayer: 49ms → 39ms on random data; ~60-75% improvement
expected on real pixelated card data.

Made-with: Cursor
…34%↑)

Split bloom into interior (99.3% of pixels) and boundary paths.
Interior path hoists bilinear weights from green channel computation
(fractional parts are identical for integer bloom offsets) and uses
simplified index arithmetic (base + constant offsets instead of full
multiply-add per sample). Boundary path retains original per-sample
computation for correctness at edges.

CRT shader: 45ms → 30ms (34% faster)

Made-with: Cursor
…on in glow threshold

pixelateCenter: inlined sampleBilinear to eliminate function call and
tuple allocation per block (2.55ms → 2.19ms, 14%↑).
Removed unused sampleBilinear import from pixelate.ts.

glow: hoisted per-pixel /255 division out of calculateAdaptiveThreshold
loop — accumulates raw weighted sum, divides once at end. Negligible
standalone impact but correct arithmetic simplification.

Made-with: Cursor
Three optimizations to the glow shader:

1. Interior+all-bright fast path in both horizontal and vertical blur
   passes: when all neighbors within the blur radius are above the
   brightness threshold AND the pixel is far from image edges, skip
   per-sample Math.min/Math.max clamping and luminance threshold
   checks. Uses precomputed totalWeight and sequential memory access
   with running index.

2. Reuse hBlurLuminance Float64Array across layers instead of
   allocating a new one per layer (saves 5× 4MB allocations for
   the CRT pipeline's 5-layer glow).

3. Hoist division in calculateAdaptiveThreshold (from previous commit,
   included in this measurement).

Results (1226×430):
- glow(r3,l2): 26.66ms → 20.34ms (24%↑)
- glow(r5,l5): 115.86ms → 101.03ms (13%↑)
- pipeline CRT+glow: 136.92ms → 118.14ms (14%↑)
- pipeline scanline+glow+curve: 32.86ms → 26.16ms (20%↑)

Made-with: Cursor
…der optimization

All shaders write every output pixel, making zero-initialization wasteful.
Replace Buffer.alloc with Buffer.allocUnsafe across all 8 shader files.

PNG encoder: use allocUnsafe for raw buffer (fully written by filter+copy),
and eliminate temporary crcBuf allocation by computing CRC directly on the
chunk buffer where type+data bytes already reside.
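
Computing the CRC in place can be sketched as below. The CRC-32 polynomial and chunk layout follow the PNG specification; `writeChunk` and its signature are hypothetical, and `Uint8Array` is used in place of Node's `Buffer`.

```typescript
// Standard CRC-32 table (reflected polynomial 0xEDB88320), as used by PNG.
const CRC_TABLE = new Uint32Array(256);
for (let n = 0; n < 256; n++) {
  let c = n;
  for (let k = 0; k < 8; k++) {
    c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
  }
  CRC_TABLE[n] = c >>> 0;
}

// CRC over bytes[start, end), read directly from the output buffer.
function crc32(bytes: Uint8Array, start: number, end: number): number {
  let crc = 0xffffffff;
  for (let i = start; i < end; i++) {
    crc = CRC_TABLE[(crc ^ bytes[i]) & 0xff] ^ (crc >>> 8);
  }
  return (crc ^ 0xffffffff) >>> 0;
}

// Writes length + type + data + CRC at `offset`; returns the next free offset.
function writeChunk(out: Uint8Array, offset: number, type: string, data: Uint8Array): number {
  const view = new DataView(out.buffer, out.byteOffset, out.byteLength);
  view.setUint32(offset, data.length); // big-endian by default
  for (let i = 0; i < 4; i++) out[offset + 4 + i] = type.charCodeAt(i);
  out.set(data, offset + 8);
  const crcOffset = offset + 8 + data.length;
  // The CRC covers type + data, which already sit contiguously in `out`.
  view.setUint32(crcOffset, crc32(out, offset + 4, crcOffset));
  return crcOffset + 4;
}
```

Because the type and data bytes are already adjacent in the output, no temporary buffer is needed to feed the CRC.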

Pipeline benchmarks: CRT+glow -1.4%, screen -2.9%, encodePng -3.3%.

Made-with: Cursor
For each unique (R,G,B), pre-compute all 8 possible dither outcomes
(2 hue × 2 lightness × 2 saturation) and cache as a Uint8Array(24).
Per-pixel work reduces to a cache lookup + 3 comparisons + byte copy,
eliminating repeated HSL→RGB conversion (hue2rgb calls).

Benchmark: 27.1ms → 19.0ms median on 900×594 image.

Made-with: Cursor
Extract fillBlockBands helper that writes the first row of each block
band pixel-by-pixel, then uses Buffer.copy to replicate it to remaining
rows. Correctly handles non-integer blockSize via Math.floor mapping.
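
The "write one row, replicate the rest" approach might look like this for the center sampling mode. It is a simplified sketch: it assumes an integer `blockSize`, uses `Uint8Array.prototype.copyWithin` where the commit uses `Buffer.copy`, and `pixelateCenter` here is not the project's actual function.

```typescript
function pixelateCenter(src: Uint8Array, width: number, height: number, blockSize: number): Uint8Array {
  const out = new Uint8Array(src.length);
  const rowBytes = width * 4;
  const half = blockSize >> 1;
  for (let by = 0; by < height; by += blockSize) {
    const bandHeight = Math.min(blockSize, height - by);
    const cy = Math.min(by + half, height - 1); // sample row for this band
    const firstRow = by * rowBytes;
    // 1. Fill the first row of the band pixel by pixel.
    for (let x = 0; x < width; x++) {
      const bx = x - (x % blockSize);
      const cx = Math.min(bx + half, width - 1);
      const s = (cy * width + cx) * 4;
      const d = firstRow + x * 4;
      out[d] = src[s];
      out[d + 1] = src[s + 1];
      out[d + 2] = src[s + 2];
      out[d + 3] = src[s + 3];
    }
    // 2. Replicate that row to the remaining rows of the band with bulk copies.
    for (let r = 1; r < bandHeight; r++) {
      out.copyWithin(firstRow + r * rowBytes, firstRow, firstRow + rowBytes);
    }
  }
  return out;
}
```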

Benchmark on 900×594 (blockSize=10):
  center:   1.81ms → 0.50ms (72%↑)
  average:  2.46ms → 1.03ms (58%↑)
  dominant: 3.32ms → 1.96ms (41%↑)

Made-with: Cursor
…ns (11%↑)

Write PNG signature, IHDR, IDAT, IEND directly into a pre-sized output
buffer instead of creating separate chunk buffers and concatenating.
Precompute IEND chunk and type byte constants at module load.

Benchmark: 21.2ms → 18.8ms median on 900×594 image.

Made-with: Cursor
Wrap Resvg's RenderedImage.pixels Uint8Array with Buffer.from(buffer,
offset, length) instead of Buffer.from(uint8array) to avoid a 2MB+
copy of the rendered pixel data. Similarly, use Jimp's bitmap.data
directly instead of copying it, since downstream shaders create new
output buffers.

Made-with: Cursor
… 8%↑)

Hoist by0*w4, by1*w4 row offsets and use bx<<2 column offsets instead
of recomputing (byN*width+bxN)*4 for each corner. Applied to curve's
single bilinear sample and CRT's R/G/B channels + bloom boundary path.

Benchmark: curve 6.21→5.52ms, crt 38.3→35.1ms.

Made-with: Cursor
halftone: replace nested JS array with flat Float64Array, use Uint32
writes for the two output colors (38%↑: 1.96→1.22ms).

glow: inline luminance computation after horizontal pass (eliminates
one full-image pass per layer), precompute per-layer intensity/oneMinusT,
add white-color fast path skipping 3 multiplications per pixel per layer
(14%↑ for r3l2: 39.43→33.85ms).

Vertical pass tiling evaluated but not included — hardware prefetcher
handles the regular stride pattern well; tiling added overhead without
measurable cache benefit.

Made-with: Cursor
blendBorder decodes and resizes the constant border PNG via Jimp on every
call (~32ms). Cache the decoded RGBA pixels plus precomputed alpha/oneMinusAlpha
Float64Arrays keyed by target dimensions. Subsequent calls skip Jimp entirely
and run only the blend loop (~2.3ms) — a ~14x speedup on the blendBorder step.
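
A dimension-keyed cache along these lines could look like the sketch below. The `decode` callback stands in for the Jimp decode and resize step, and all names here are hypothetical.

```typescript
interface BorderCacheEntry {
  rgba: Uint8Array;
  alpha: Float64Array;
  oneMinusAlpha: Float64Array;
}

const borderCache = new Map<string, BorderCacheEntry>();

// Decodes once per (width, height); later calls reuse pixels and alpha tables.
function getBorder(
  width: number, height: number,
  decode: (w: number, h: number) => Uint8Array,
): BorderCacheEntry {
  const key = `${width}x${height}`;
  let entry = borderCache.get(key);
  if (entry === undefined) {
    const rgba = decode(width, height);
    const n = width * height;
    const alpha = new Float64Array(n);
    const oneMinusAlpha = new Float64Array(n);
    for (let p = 0; p < n; p++) {
      alpha[p] = rgba[p * 4 + 3] / 255;
      oneMinusAlpha[p] = 1 - alpha[p];
    }
    entry = { rgba, alpha, oneMinusAlpha };
    borderCache.set(key, entry);
  }
  return entry;
}

function blendBorder(
  pixels: Uint8Array, width: number, height: number,
  decode: (w: number, h: number) => Uint8Array,
): Uint8Array {
  const { rgba, alpha, oneMinusAlpha } = getBorder(width, height, decode);
  const out = new Uint8Array(pixels.length);
  for (let p = 0, i = 0; p < alpha.length; p++, i += 4) {
    out[i] = Math.round(pixels[i] * oneMinusAlpha[p] + rgba[i] * alpha[p]);
    out[i + 1] = Math.round(pixels[i + 1] * oneMinusAlpha[p] + rgba[i + 1] * alpha[p]);
    out[i + 2] = Math.round(pixels[i + 2] * oneMinusAlpha[p] + rgba[i + 2] * alpha[p]);
    out[i + 3] = pixels[i + 3];
  }
  return out;
}
```

This works because the border image is constant, so the only variable input to the decode step is the target size.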

Also:
- Use Buffer.allocUnsafe for blend output (every pixel written)
- Add optional level parameter to encodePng for caller-controlled compression

Made-with: Cursor
Replace string-based LZW dictionary with numeric hash table, eliminating
string concatenation per pixel. Build shared palette from all frames using
frequency-based selection (most-used colors first) instead of scan-order.
Cache palette lookups across frames via Map. Use Buffer.allocUnsafe and
Buffer.copy throughout instead of byte-by-byte operations.
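
The move from string keys to numeric keys in the LZW dictionary can be sketched as follows. This is deliberately simplified: a `Map<number, number>` stands in for the hash table named in the commit, and the GIF-specific clear and end-of-information codes, code-width growth, and bit packing are omitted.

```typescript
// Returns the LZW code sequence for a stream of palette indices.
// Each dictionary entry is keyed by (prefixCode, nextSymbol) packed into one
// number, so no strings are built per pixel.
function lzwCodes(indices: Uint8Array, alphabetSize: number): number[] {
  const codes: number[] = [];
  if (indices.length === 0) return codes;
  const dict = new Map<number, number>();
  let nextCode = alphabetSize;
  let prefix: number = indices[0];
  for (let i = 1; i < indices.length; i++) {
    const symbol = indices[i];
    const key = prefix * 256 + symbol; // numeric key instead of string concat
    const existing = dict.get(key);
    if (existing !== undefined) {
      prefix = existing; // extend the current match
    } else {
      codes.push(prefix);
      dict.set(key, nextCode++);
      prefix = symbol;
    }
  }
  codes.push(prefix);
  return codes;
}
```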

Also optimize scanlineWithOffset to use bulk Buffer.copy for non-scanline
rows instead of per-pixel byte copy.

Benchmark (8 frames, 1226x430):
  Before: 1241ms median (155ms/frame)
  After:  149ms median (19ms/frame)

Made-with: Cursor
Wire the existing animation pipeline to the server API. Users can now
request animated GIF cards by adding ?animation=true to the stats URL.
Optionally specify ?animation_effect=crt-flicker|glow-pulse|scanline-scroll
to override the default animation effect.

Content-Type automatically switches to image/gif when animation is enabled.

Made-with: Cursor
…build errors

Add paletteDither shader supporting any user-defined color palette via
ordered dithering in RGB space. Built-in palettes: gameboy, nokia,
grayscale, sepia, neon, cga. Exposed via ?dithering_palette= API param
(accepts palette name or comma-separated hex colors like ff0000,00ff00).
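
Parsing a comma-separated hex palette and dithering against it in RGB space might be sketched like this. The function names, the flat RGB palette layout, and the `spread` parameter are assumptions; input validation is omitted.

```typescript
// "ff0000,00ff00" -> flat RGB triples [255, 0, 0, 0, 255, 0].
function parseHexPalette(param: string): Uint8Array {
  const parts = param.split(",");
  const palette = new Uint8Array(parts.length * 3);
  for (let i = 0; i < parts.length; i++) {
    const value = parseInt(parts[i].trim(), 16);
    palette[i * 3] = (value >> 16) & 0xff;
    palette[i * 3 + 1] = (value >> 8) & 0xff;
    palette[i * 3 + 2] = value & 0xff;
  }
  return palette;
}

// Index of the palette entry closest to (r, g, b) by squared RGB distance.
function nearestIndex(palette: Uint8Array, r: number, g: number, b: number): number {
  let best = 0;
  let bestDist = Infinity;
  for (let i = 0; i < palette.length; i += 3) {
    const dr = palette[i] - r;
    const dg = palette[i + 1] - g;
    const db = palette[i + 2] - b;
    const dist = dr * dr + dg * dg + db * db;
    if (dist < bestDist) {
      bestDist = dist;
      best = i / 3;
    }
  }
  return best;
}

// Ordered dithering: nudge the color by a position-dependent threshold in [0, 1)
// before snapping to the palette.
function ditherIndex(
  palette: Uint8Array, r: number, g: number, b: number,
  threshold: number, spread: number,
): number {
  const offset = (threshold - 0.5) * spread;
  return nearestIndex(palette, r + offset, g + offset, b + offset);
}
```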

Performance: ~3.8ms for 1226×430 — on par with orderedBayer.

Also fixes pre-existing TypeScript 5.8 Buffer/Uint8Array type
incompatibilities that were breaking the DTS build on Vercel.

Made-with: Cursor
New card type at /api/github-contributions?username=<user> renders a
pixel-art contribution heatmap with 53-week calendar grid, day/month
labels, level legend, and total contribution count.

Supports all existing features: themes, screen effects (scanline/curve/
glow), dithering, palette dithering, and animated GIF output.

5 snapshot tests, 44 total tests passing. Build + DTS clean.

Made-with: Cursor
LuciNyan added 22 commits April 11, 2026 22:18
Fix Buffer/Uint8Array type incompatibilities in png-encoder.ts (8 errors)
and MapIterator downlevel iteration in gif-encoder.ts (1 error).

pnpm test-type now passes clean. No visual changes.

Made-with: Cursor
…r 371 cells (2.7x faster)

Replace 371 individual <div> elements with a single pre-rendered PNG
data URI. Satori layout for the contribution grid drops from ~18ms to
~1ms. End-to-end contributions render: 30.7ms → 11.5ms (2.7x↑).

Also fixes all tsc --noEmit type errors in png-encoder.ts and
gif-encoder.ts for TS 5.8 compatibility.

Made-with: Cursor
…paletteDither)

Shaders that read/write same pixel index now modify the input buffer
directly instead of allocating a separate output buffer. This eliminates
~2MB buffer allocation + data copy per in-place shader pass.

- scanline: skip non-scanline row copies, modify only affected rows
- orderedBayer: write dithered values directly to source buffer
- paletteDither: write palette-mapped values directly to source buffer
- animation: clone basePixels per frame to protect multi-frame reuse
- render-utils: clean up unused background image compositing code

Measured improvement (15-run median, Apple Silicon):
  summer+screen (scanline+curve): 25.2ms → 23.3ms (7.5%)
  dithering (orderedBayer):       11.0ms → 9.9ms  (10%)
  paletteDither gameboy:          11.1ms → 10.5ms (5.4%)

44 tests pass, type-checking clean, no snapshot changes.

Made-with: Cursor
Add a new card type for displaying repository statistics in pixel-art
style, completing the README TODO item "Github repo card."

New endpoint: /api/github-repo?username=owner&repo=name
  Supports all existing options: theme, screen_effect, dithering,
  dithering_palette, color, background, animation

Components:
  - RepoData type with name, owner, description, language, stars, forks
  - GraphQL fetcher for repo metadata (repo-fetcher.ts)
  - JSX template with owner/name header, description, language dot,
    stars/forks stats, and [ARCHIVED]/[FORK] badges (repo-template.tsx)
  - Renderer with full pipeline integration (repo-renderer.ts)
  - Hono server handler (github-repo.ts)
  - 6 snapshot tests covering default, theme, screen effect, dithering,
    archived+forked repos, and no-language edge case

50 total tests pass (44 existing + 6 new), type-checking clean.

Made-with: Cursor
Use worker_threads to parallelize the CRT shader computation across
multiple CPU cores. Each worker processes a horizontal row range of
the output image independently via SharedArrayBuffer.

- Extract crtCore() as a closure-free pure function for serialization
  via Function.toString() to eval-based workers
- Worker pool (N-1 cores, max 6) with lazy initialization and
  graceful fallback to sync path on failure
- executePipelineAsync/executePipelineSmart for transparent async
  shader support while keeping sync pipeline unchanged
- All renderers use smart pipeline (auto-detects parallel shaders)
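
The row-range partitioning and pool sizing described above can be sketched without any worker code. The helper names are hypothetical; only the "N-1 cores, max 6" rule and the row-range idea come from the commit message.

```typescript
interface RowRange {
  startRow: number;
  endRow: number; // exclusive
}

// Pool size rule from the commit: one fewer than the core count, capped at 6.
function poolSize(cpuCount: number): number {
  return Math.max(1, Math.min(cpuCount - 1, 6));
}

// Splits [0, height) into contiguous ranges whose sizes differ by at most one row.
function splitRows(height: number, workerCount: number): RowRange[] {
  const n = Math.max(1, Math.min(workerCount, height));
  const base = Math.floor(height / n);
  const extra = height % n;
  const ranges: RowRange[] = [];
  let start = 0;
  for (let i = 0; i < n; i++) {
    const size = base + (i < extra ? 1 : 0);
    ranges.push({ startRow: start, endRow: start + size });
    start += size;
  }
  return ranges;
}
```

Each worker would then run the shader core over its own range and write to disjoint rows of the shared output, so no locking is needed between workers.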

Benchmark (1226x430, 15-run median, 10-core machine):
  CRT sync:     34.3ms
  CRT parallel:  8.6ms  (-75%, 4.0x)

CRT card (900x594):
  sync:     35.3ms
  parallel:  9.4ms  (-73%, 3.8x)

Also: hide contribution heatmap route from server (keep code+tests).
All 50 snapshot tests pass unchanged. Type checking clean.

Made-with: Cursor
Refactor glow.ts to extract closure-free core functions (horizontalPassCore,
verticalPassCore, glowLayerCore) serializable via Function.toString().
Extend the universal worker pool to dispatch glow layers in parallel across
threads using SharedArrayBuffer for zero-copy data transfer.

Benchmark results (1226×430):
- r=5, layers=5: 49ms → 21ms (58% faster, 2.4x)
- r=3, layers=2: 21ms → 13ms (42% faster)
- r=5, layers=1: 12ms → 11ms (10% — single layer, minimal parallelism)

Combined with CRT parallelization, the full CRT+glow pipeline now runs
~3x faster than baseline for the heaviest shader configurations.

Made-with: Cursor
Convert renderAnimatedGif from sync executePipeline to async
executePipelineSmart, enabling Worker thread parallelization for CRT
and glow shaders across animation frames.

Benchmark results (1226×430, 8 frames):
- Screen effect (glow r=3 l=2): 336ms → 263ms (22% faster)
- CRT pipeline (crt + glow r=5 l=1): 532ms → 320ms (40% faster)
- CRT theme (crt + glow r=5 l=5): 822ms → 389ms (53% faster, 2.1x)

This is a "free" speedup — leverages existing Worker infrastructure
with no new complexity, just switching from sync to async pipeline.

Made-with: Cursor
Replace Map-based color counting and lookup with:
- Uint32Array[65536] for O(1) color frequency counting in buildPalette
- Lazy Uint16Array[65536] LUT for color→palette index mapping (array
  index lookup vs Map.get(), populated on first access per color)
- Flat Uint8Array[768] palette for cache-friendly access
- Power-of-2 LZW hash table (bitwise mask instead of modulo)
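
A lazily populated lookup table of this kind can be sketched as follows. The RGB565 key packing and the `createPaletteLookup` API are assumptions; the commit only states that the table has 65536 entries.

```typescript
const UNSET = 0xffff; // sentinel: no palette index computed yet for this key

// palette: flat RGB triples (up to 256 entries, 768 bytes).
function createPaletteLookup(palette: Uint8Array) {
  const lut = new Uint16Array(65536).fill(UNSET);
  let misses = 0;

  function nearest(r: number, g: number, b: number): number {
    let best = 0;
    let bestDist = Infinity;
    for (let i = 0; i < palette.length; i += 3) {
      const dr = palette[i] - r;
      const dg = palette[i + 1] - g;
      const db = palette[i + 2] - b;
      const dist = dr * dr + dg * dg + db * db;
      if (dist < bestDist) {
        bestDist = dist;
        best = i / 3;
      }
    }
    return best;
  }

  return {
    // Array index lookup; the nearest-color search runs once per distinct key.
    indexOf(r: number, g: number, b: number): number {
      const key = ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3); // assumed RGB565
      let index = lut[key];
      if (index === UNSET) {
        index = nearest(r, g, b);
        lut[key] = index;
        misses++;
      }
      return index;
    },
    missCount(): number {
      return misses;
    },
  };
}
```

Colors that share a 16-bit key map to the same palette index, which trades a little accuracy for a constant-time lookup.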

Benchmark results (1226×430, 8 frames):
- encodeGif: 61.5ms → 37.5ms (39% faster)

Made-with: Cursor
Factor the per-pixel vignette calculation sqrt(sqrt(uvX*(1-uvX)*uvY*(1-uvY)*15))
into precomputed per-column array × per-row scalar using the identity
(a*b*c)^0.25 = a^0.25 * b^0.25 * c^0.25. This eliminates ~1M sqrt
operations (527K pixels × 2 sqrt each) and replaces them with
width + height pow(x, 0.25) calls + per-pixel multiply.
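
The factoring can be checked numerically. In the sketch below, the function names and the `px / width` UV mapping are assumptions; the factored form computes one fourth root per column and per row instead of two square roots per pixel.

```typescript
// Direct form: two sqrt calls per pixel.
function vignetteDirect(width: number, height: number): Float64Array {
  const out = new Float64Array(width * height);
  for (let y = 0; y < height; y++) {
    const uvY = y / height;
    for (let x = 0; x < width; x++) {
      const uvX = x / width;
      out[y * width + x] = Math.sqrt(Math.sqrt(uvX * (1 - uvX) * uvY * (1 - uvY) * 15));
    }
  }
  return out;
}

// Factored form: (a*b*c)^0.25 = a^0.25 * (b*c)^0.25, with a depending only on x
// and b*c only on y.
function vignetteFactored(width: number, height: number): Float64Array {
  const invW = 1 / width;
  const invH = 1 / height;
  const columnFactor = new Float64Array(width);
  for (let x = 0; x < width; x++) {
    const uvX = x * invW;
    columnFactor[x] = Math.pow(uvX * (1 - uvX), 0.25);
  }
  const out = new Float64Array(width * height);
  for (let y = 0; y < height; y++) {
    const uvY = y * invH;
    const rowFactor = Math.pow(uvY * (1 - uvY) * 15, 0.25);
    for (let x = 0; x < width; x++) {
      out[y * width + x] = columnFactor[x] * rowFactor;
    }
  }
  return out;
}

function maxAbsDifference(a: Float64Array, b: Float64Array): number {
  let worst = 0;
  for (let i = 0; i < a.length; i++) worst = Math.max(worst, Math.abs(a[i] - b[i]));
  return worst;
}
```

The two forms agree up to floating-point rounding, which is far below the resolution of an 8-bit color channel.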

Also replace division (px/width) with multiplication (px*invW).

Benchmark: 6.4ms → 5.0ms (21% faster)

Made-with: Cursor
Extract curveCore as closure-free function with row-range support.
Add curve as third task type in the universal worker pool (CRT, glow,
curve). Split image rows across worker threads using SharedArrayBuffer.

Benchmark: 4.6ms → 2.0ms (56% faster, 2.3x)

Three compute-heavy shaders (CRT, glow, curve) are now all parallelized.

Made-with: Cursor
Previously each glow layer was assigned to a separate worker, bottlenecked
by the heaviest layer (55ms for radius=25). Now all workers collaborate on
each layer's horizontal and vertical passes via row-splitting with a
2-barrier batch protocol: one barrier after all layers' horizontal passes,
one after all vertical passes.

Real image data (1226x430, 5 layers): 70.5ms → 55.9ms (21% improvement).
Lighter configs (2 layers): 17.3ms → 14.4ms (17% improvement).

Made-with: Cursor
Eliminates the serial main-thread composite loop by integrating layer
blending directly into the vertical pass workers. Each worker now
computes vertical blur AND blends all layers for its row range, writing
the final result to a shared output buffer.

glow r=5 l=5: 55.9ms → 51.0ms (9% faster)
glow r=3 l=2: 14.4ms → 10.2ms (29% faster)
Screen effect pipeline: 17.1ms → 13.8ms (19% faster)

Also removes dead glow-layer handler (replaced by row-parallel batch)
and unused buildLuminanceMap from worker code.

Made-with: Cursor
Replace Buffer.allocUnsafe + copy pattern with direct Buffer.from(sab)
views. The SharedArrayBuffer is kept alive by the returned Buffer
reference, avoiding a ~2MB memcpy per dispatch call.

Applied to dispatchCrt, dispatchCurve, and dispatchGlow.

Made-with: Cursor
When parallel shaders chain (e.g., CRT → glow), the first shader's
output is a Buffer backed by a SharedArrayBuffer. The next shader
previously copied this into a fresh SAB. Now toSharedSource() detects
the existing SAB backing and reuses it, eliminating a ~2MB memcpy
per chained dispatch.

Made-with: Cursor
For glow-pulse animation, the glow blur layers are identical across all
frames — only the composite intensity varies per frame. Previously we
re-computed the full blur for every frame (~55ms × 8 = ~440ms).

Now precomputeGlowLayers() runs the horizontal + vertical blur passes
once, and compositeGlowFromPrecomputed() rapidly composites each frame
with varying intensity (~3ms per frame).
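
The structure of this optimization is to hoist the frame-invariant blur out of the frame loop. A one-dimensional sketch, assuming a box blur as the glow kernel and a sine curve for the pulse (both are stand-ins, and these function names are not the project's `precomputeGlowLayers` or `compositeGlowFromPrecomputed`):

```typescript
// Expensive, frame-invariant step: blur the base signal once.
function precomputeGlow(base: Float64Array, radius: number): Float64Array {
  const n = base.length;
  const glow = new Float64Array(n);
  for (let x = 0; x < n; x++) {
    let sum = 0;
    for (let dx = -radius; dx <= radius; dx++) {
      sum += base[Math.min(n - 1, Math.max(0, x + dx))];
    }
    glow[x] = sum / (2 * radius + 1);
  }
  return glow;
}

// Cheap, per-frame step: add the precomputed glow scaled by this frame's intensity.
function compositeGlow(base: Float64Array, glow: Float64Array, intensity: number): Float64Array {
  const out = new Float64Array(base.length);
  for (let i = 0; i < base.length; i++) {
    out[i] = Math.min(1, base[i] + glow[i] * intensity);
  }
  return out;
}

function renderGlowPulse(base: Float64Array, radius: number, frameCount: number): Float64Array[] {
  const glow = precomputeGlow(base, radius); // runs once, not once per frame
  const frames: Float64Array[] = [];
  for (let f = 0; f < frameCount; f++) {
    const intensity = 0.5 + 0.5 * Math.sin((2 * Math.PI * f) / frameCount);
    frames.push(compositeGlow(base, glow, intensity));
  }
  return frames;
}
```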

CRT glow-pulse 8 frames: ~460ms → 131ms (3.5x faster)
Screen effect glow-pulse 8 frames: ~157ms → 61ms (2.6x faster)

New worker message types: glow-v-only (vertical pass without composite)
and glow-composite (composite only from precomputed layers).

Made-with: Cursor
…mation tests

Remove the contribution heatmap card and repo card features entirely:
- Delete fetchers, renderers, templates, server routes, tests, and snapshots
- Remove buildContributionsPipeline from pipeline.ts
- Remove Contribution/Repo types from types.ts
- Clean up exports from index.ts and server app.ts
- Update both READMEs

Add 10 unit tests for GIF encoder and animation rendering:
- GIF89a header, dimensions, trailer byte, NETSCAPE loop extension
- Multi-frame encoding, gradient data with many colors
- renderAnimatedGif with empty pipeline, frameCount, scanline-scroll, defaults
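
The structural facts these tests check follow from the GIF89a format: a 6-byte signature, little-endian dimensions, an optional NETSCAPE2.0 loop extension, and a 0x3B trailer. The sketch below builds and reads a frameless shell purely for illustration; `gifShell` and `readGifHeader` are hypothetical helpers, not the project's encoder or test code.

```typescript
function pushAscii(bytes: number[], text: string): void {
  for (let i = 0; i < text.length; i++) bytes.push(text.charCodeAt(i));
}

// Header + logical screen descriptor + NETSCAPE loop extension + trailer.
// Contains no frames, so it is not a displayable image.
function gifShell(width: number, height: number, loopCount: number): Uint8Array {
  const bytes: number[] = [];
  pushAscii(bytes, "GIF89a");
  bytes.push(width & 0xff, (width >> 8) & 0xff, height & 0xff, (height >> 8) & 0xff);
  bytes.push(0x00, 0x00, 0x00); // packed fields (no global color table), bg index, aspect
  bytes.push(0x21, 0xff, 0x0b); // application extension, 11-byte identifier follows
  pushAscii(bytes, "NETSCAPE2.0");
  bytes.push(0x03, 0x01, loopCount & 0xff, (loopCount >> 8) & 0xff, 0x00); // 0 = loop forever
  bytes.push(0x3b); // trailer
  return Uint8Array.from(bytes);
}

function readGifHeader(gif: Uint8Array) {
  let signature = "";
  for (let i = 0; i < 6; i++) signature += String.fromCharCode(gif[i]);
  return {
    signature,
    width: gif[6] | (gif[7] << 8),
    height: gif[8] | (gif[9] << 8),
    endsWithTrailer: gif[gif.length - 1] === 0x3b,
  };
}
```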

49 tests passing (39 existing + 10 new).

Made-with: Cursor
Replace object-based sorting with typed-array quickSelect for top-256
color extraction. Precompute a full 65536-entry lookup table mapping
every 16-bit color key to its nearest palette index, making indexFrame
O(1) per pixel. Micro-benchmarks show 7-11% improvement on GIF encoding.

Made-with: Cursor
Workers now send a ready signal on startup. The pool waits up to 3s for
all workers to respond; if they don't (e.g. in Vercel's serverless
bundled environment where eval workers may not function), the pool is
marked as failed and all shaders fall back to synchronous execution.
This prevents the infinite hang that caused 504 GATEWAY_TIMEOUT on
Vercel when screen_effect=true or theme=crt was used.

Made-with: Cursor
scanline-scroll: run base pipeline once instead of per-frame, then
apply only the scanline offset for each frame. 169→72ms (2.34x).

crt-flicker: precompute glow blur layers from the base CRT output,
then for each frame run only the CRT pass (with varying noise/scanline
params) and composite the precomputed glow via a differential blend.
532→232ms (2.29x).

Made-with: Cursor
The glow parameters are constant across crt-flicker frames (only CRT
noise/scanline vary), so compute glow composite once and apply a
differential blend per frame. Saves 7 redundant glow composites.
crt-flicker 8 frames: 232→211ms (9% faster).

Made-with: Cursor
… fused dispatch

- Extract buildColumnPrefixMap for reuse; verticalPassCore accepts
  precomputed prefix to skip redundant O(H×W) scans in workers
- glow() sync path: compute luminance map once, share across all
  layers (was recomputing per-layer — 4 redundant 4.2MB allocations
  eliminated for 5-layer config)
- dispatchGlow fused from 3 worker rounds to 2 using glow-vc-batch
  (vertical + composite in one message)
- buildLuminanceMap/buildColumnPrefixMap accept optional output arrays
  for in-place SAB computation, avoiding temp alloc + memcpy
- precomputeGlowLayers (animation path) also benefits from shared
  colPrefix via SAB

Measured: glow r3 l2 parallel ~17% faster (13.7→11.3ms median),
glow r5 l5 sync ~3% faster (102→98.8ms). Parallel r5 l5 within noise
due to memory bandwidth saturation.

Also fixes bench.mjs pixelate params (was using wrong option names,
effectively a no-op) and adds CRT theme pipeline benchmark.

Made-with: Cursor
horizontalPassFusedCore processes all glow layers row-by-row instead
of layer-by-layer, sharing the row prefix scan across layers and
keeping source pixel data hot in L1 cache. Applied to both sync
glow() path and parallel worker h-batch handler.

For 5-layer config: eliminates 4 redundant row prefix computations
per row (344 prefix scans → 86). Source data read once per row
instead of once per layer per row (5× fewer L2/L3 fetches when
working set exceeds L1).

Made-with: Cursor
@vercel

vercel Bot commented Apr 14, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| pixel-profile | Ready | Preview, Comment | Apr 14, 2026 2:15pm |



1 participant