
linalg,core: SIMD ReduceMin (mirror the max reducer)#2368

Open
czoli1976 wants to merge 2 commits into sonos:main from czoli1976:perf/reduce-min
@czoli1976

What

min_t (the f32 ReduceMin reducer) had no SIMD path — it ran a scalar branchy partial-ord fold, while max_t uses a hand-vectorized max_f32 kernel. This mirrors the max reducer for min:

  • generic SMin4 (fallback), arm64 NEON arm64simd_min_f32_16n (fmin/fminv), and x86 AVX2 x86_64_fma_min_f32_32n (vminps) — same structure/wiring as their max counterparts, registered as Ops::min_f32.
  • min_t routes the contiguous f32 case through min_f32, falling back to the scalar fold for non-f32 / strided / empty slices (same shape as max_t).
  • min_frame_tests! macro (mirror of max_frame_tests!) validates each kernel against the reference; plus a core reduce-min correctness test.
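The routing described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `min_f32_kernel`, `min_scalar`, and `reduce_min` are hypothetical names, the kernel is replaced by a plain fold so the sketch compiles anywhere, and the non-f32 dtype branch is left out.

```rust
/// Scalar reference: a partial-ord fold, standing in for the fallback path.
/// Returns `None` when there is nothing to reduce.
fn min_scalar<'a>(values: impl Iterator<Item = &'a f32>) -> Option<f32> {
    values.copied().fold(None, |acc, x| match acc {
        None => Some(x),
        Some(m) => Some(if x < m { x } else { m }),
    })
}

/// Hypothetical stand-in for the registered `min_f32` kernel.
/// A real build would dispatch to the NEON or AVX2 kernel here.
fn min_f32_kernel(slice: &[f32]) -> f32 {
    slice.iter().copied().fold(f32::INFINITY, f32::min)
}

/// Sketch of the routing: a contiguous, non-empty f32 slice goes to the
/// kernel; strided or empty input falls back to the scalar fold.
fn reduce_min(data: &[f32], stride: usize) -> Option<f32> {
    if stride == 1 && !data.is_empty() {
        Some(min_f32_kernel(data))
    } else {
        // `step_by` panics on a step of 0, so clamp the stride to at least 1.
        min_scalar(data.iter().step_by(stride.max(1)))
    }
}
```

With `stride == 1` the whole slice is reduced by the kernel stand-in; with `stride == 2` only every other element is visited, through the scalar fold.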

Honest note on the generic path

While implementing this I measured that the generic-framework reducer is actually slower than the scalar fold (~3 vs ~8 GB/s) — the per-row framework overhead outweighs its 4-wide inner loop. So the win comes entirely from the hand-written NEON/AVX2 kernels, not the generic path. This matches how max already behaves: its generic SMax4 is a correctness fallback, and the speed comes from the arm64/x86 kernels. (On purely generic/wasm builds, min_f32 uses the generic reducer, same as max_f32.)
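For readers unfamiliar with what a "4-wide inner loop" means here, the idea can be sketched as below. This is only an illustration of the technique under my own assumptions, not the `SMin4` implementation: `min_4_lanes` is a hypothetical name, and returning `f32::INFINITY` for empty input is a choice made for this sketch.

```rust
/// Hypothetical 4-lane min reduction: keep four running minima, fold full
/// groups of four into them, then combine the lanes and handle the tail.
fn min_4_lanes(data: &[f32]) -> f32 {
    let mut lanes = [f32::INFINITY; 4];
    let mut chunks = data.chunks_exact(4);

    // Main loop: each lane only ever sees its own column of the input.
    for chunk in &mut chunks {
        for (lane, &x) in lanes.iter_mut().zip(chunk) {
            if x < *lane {
                *lane = x;
            }
        }
    }

    // Horizontal step: collapse the four lanes into one value.
    let mut m = lanes.iter().copied().fold(f32::INFINITY, f32::min);

    // Tail: the 0 to 3 elements that did not fill a whole group of four.
    for &x in chunks.remainder() {
        if x < m {
            m = x;
        }
    }
    m
}
```

A loop like this only pays off when the per-call overhead around it is small, which is the point the note above makes about the generic path.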

Benchmark

M-series, f32 min over the trailing (contiguous) axis, scalar fold vs NEON, via the added core/examples/reduce_min_bench.rs:

shape          scalar    NEON      speedup
1024 × 4096    2.05 ms   0.34 ms   6.0× (49 GB/s)
4096 × 1024    1.84 ms   0.38 ms   4.8× (44 GB/s)
256 × 65536    7.79 ms   1.05 ms   7.4× (64 GB/s)
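The throughput and speedup columns can be reproduced from the shapes and timings. The sketch below assumes that GB/s means bytes of f32 input read once, divided by wall time, with 1 GB = 10^9 bytes; that reading is my assumption, but it matches the figures in the table. `gb_per_s` and `speedup` are hypothetical helper names.

```rust
/// Effective read bandwidth in GB/s (10^9 bytes per second) for one pass
/// over a `rows x cols` f32 tensor that took `millis` milliseconds.
fn gb_per_s(rows: usize, cols: usize, millis: f64) -> f64 {
    let bytes = (rows * cols * std::mem::size_of::<f32>()) as f64;
    bytes / (millis * 1e-3) / 1e9
}

/// Ratio of the scalar time to the SIMD time.
fn speedup(scalar_ms: f64, simd_ms: f64) -> f64 {
    scalar_ms / simd_ms
}
```

For the first row, `gb_per_s(1024, 4096, 0.34)` is about 49.3 and `speedup(2.05, 0.34)` is about 6.0. Applying the same formula to the scalar timing of that row gives about 8.2 GB/s, consistent with the ~8 GB/s quoted for the scalar fold.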

Benefits ReduceMin and MinPool.

Testing

  • New min_frame_tests! runs against the generic + NEON kernels (and the x86 kernel on x86 CI), each checked vs the reference.
  • New core test reduce_min_f32_contiguous_and_strided (contiguous incl. tail, strided).
  • Full tract-core + tract-linalg suites pass on arm64.
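The kind of check described above (a kernel compared against a reference, including lengths that leave a tail) could look roughly like this. It is a sketch under my own assumptions, not the `min_frame_tests!` macro: `candidate_min` is a hypothetical stand-in that walks 16-element blocks plus a remainder, and `test_input` is a small deterministic generator so that no external crate is needed.

```rust
/// Reference result: a straightforward fold over the whole slice.
fn reference_min(data: &[f32]) -> f32 {
    data.iter().copied().fold(f32::INFINITY, f32::min)
}

/// Hypothetical kernel under test: full 16-element blocks, then the tail.
fn candidate_min(data: &[f32]) -> f32 {
    let mut blocks = data.chunks_exact(16);
    let mut m = f32::INFINITY;
    for block in &mut blocks {
        let block_min = block.iter().copied().fold(f32::INFINITY, f32::min);
        m = m.min(block_min);
    }
    blocks.remainder().iter().copied().fold(m, f32::min)
}

/// Deterministic pseudo-random values in [-100, 100), from a simple LCG.
fn test_input(len: usize, seed: u32) -> Vec<f32> {
    let mut state = seed;
    (0..len)
        .map(|_| {
            state = state.wrapping_mul(1664525).wrapping_add(1013904223);
            (state >> 8) as f32 / (1u32 << 24) as f32 * 200.0 - 100.0
        })
        .collect()
}

/// Returns the first length at which the candidate disagrees with the
/// reference, or `None` if every length from 1 to `max_len` agrees.
fn first_mismatch(max_len: usize) -> Option<usize> {
    (1..=max_len).find(|&len| {
        let data = test_input(len, len as u32);
        candidate_min(&data) != reference_min(&data)
    })
}
```

Sweeping every length from 1 up to a few multiples of the block size covers the pure-tail case, exact multiples of 16, and block-plus-tail inputs in a single loop.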

The x86 AVX2 kernel mirrors the already-validated max kernel (vmaxps → vminps); I can't perf-test it on this Apple-Silicon host, but its correctness is covered by the min frame test in x86 CI and perf parity follows by construction.

Stacked on #2367 (ReduceMax fix) — both touch reduce.rs; this branch includes that commit. Mergeable independently once #2367 lands.

🤖 Generated with Claude Code

czoli1976 and others added 2 commits June 13, 2026 14:18
`max_t` (the f32 ReduceMax reducer) called the vectorized `max_f32` linalg
kernel, *discarded its result*, then unconditionally recomputed the max with a
scalar partial-ord fold over the same slice — so ReduceMax did the reduction
twice and was effectively scalar-bound (the "optimized" path was strictly slower
than having no kernel at all).

Return the SIMD kernel's result for the f32 contiguous case; fall through to the
scalar fold only for non-f32 dtypes, non-contiguous (strided) slices, or empty
slices. Adds a correctness test covering both branches (contiguous + tail,
strided, single-element).
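
The shape of the bug and of the fix can be illustrated with a small sketch. The names are hypothetical (`kernel_max`, `scalar_max`, `max_before`, `max_after`), the kernel is a plain fold standing in for the real `max_f32`, and returning `None` for an empty slice is an assumption made for this illustration only:

```rust
/// Hypothetical stand-in for the vectorized kernel.
fn kernel_max(slice: &[f32]) -> f32 {
    slice.iter().copied().fold(f32::NEG_INFINITY, f32::max)
}

/// Scalar partial-ord fold; `None` for an empty slice.
fn scalar_max(slice: &[f32]) -> Option<f32> {
    slice.iter().copied().fold(None, |acc, x| match acc {
        None => Some(x),
        Some(m) => Some(if x > m { x } else { m }),
    })
}

/// Before: the kernel runs, its result is dropped, and the scalar fold
/// walks the same slice again.
fn max_before(slice: &[f32]) -> Option<f32> {
    if !slice.is_empty() {
        let _discarded = kernel_max(slice);
    }
    scalar_max(slice)
}

/// After: the kernel result is returned directly; the scalar fold only
/// runs when the fast path does not apply (here: the empty slice).
fn max_after(slice: &[f32]) -> Option<f32> {
    if !slice.is_empty() {
        return Some(kernel_max(slice));
    }
    scalar_max(slice)
}
```

Both versions return the same value; the difference is that `max_before` reads the slice twice.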

Benchmark (M-series, f32 max over the trailing axis, via the added
reduce_max_bench example):

  shape          before            after            speedup
  1024 x 4096    2.44 ms / 6.9GB/s 0.32 ms / 52GB/s  7.5x
  4096 x 1024    2.46 ms / 6.8GB/s 0.42 ms / 40GB/s  5.9x
  256  x 65536   9.44 ms / 7.1GB/s 1.04 ms / 65GB/s  9.1x

Identical results. Benefits ReduceMax, MaxPool and the softmax max pre-pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
`min_t` had no SIMD path — it ran a scalar branchy partial-ord fold while
`max_t` uses a hand-vectorized `max_f32` kernel. Mirror the max reducer for min:

- generic `SMin4` (fallback), arm64 NEON `arm64simd_min_f32_16n` (fminv), and
  x86 AVX2 `x86_64_fma_min_f32_32n` (vminps) — same structure/wiring as their
  max counterparts, registered as `Ops::min_f32`.
- `min_t` now routes the contiguous f32 case through `min_f32` and falls back to
  the scalar fold for non-f32 / strided / empty slices (same shape as `max_t`).
- `min_frame_tests!` macro (mirror of `max_frame_tests!`) validates each kernel
  against the reference; + a core reduce-min correctness test.

Note: the generic-framework reducer is actually *slower* than the scalar fold
(measured ~3 vs ~8 GB/s), so the win comes from the hand-written NEON/AVX2
kernels, not the generic path — matching how max behaves (generic is a
correctness fallback only).

Benchmark (M-series, f32 min over the trailing axis, scalar fold vs NEON, via
the added reduce_min_bench example):

  shape          scalar    NEON      speedup
  1024 x 4096    2.05 ms   0.34 ms   6.0x  (49 GB/s)
  4096 x 1024    1.84 ms   0.38 ms   4.8x  (44 GB/s)
  256  x 65536   7.79 ms   1.05 ms   7.4x  (64 GB/s)

(x86 AVX2 kernel mirrors the validated max kernel; correctness covered by the
min frame test in CI, perf parity by construction.)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>