linalg,core: SIMD ReduceMin (mirror the max reducer) by czoli1976 · Pull Request #2368 · sonos/tract

czoli1976 · 2026-06-13T13:44:05Z

What

min_t (the f32 ReduceMin reducer) had no SIMD path — it ran a scalar branchy partial-ord fold, while max_t uses a hand-vectorized max_f32 kernel. This mirrors the max reducer for min:

- generic SMin4 (fallback), arm64 NEON arm64simd_min_f32_16n (fmin/fminv), and x86 AVX2 x86_64_fma_min_f32_32n (vminps), with the same structure/wiring as their max counterparts, registered as Ops::min_f32.
- min_t routes the contiguous f32 case through min_f32, falling back to the scalar fold for non-f32 / strided / empty slices (same shape as max_t).
- min_frame_tests! macro (mirror of max_frame_tests!) validates each kernel against the reference, plus a core reduce-min correctness test.
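The routing described above can be sketched in plain Rust. This is a minimal illustration, not the code in this PR: `min_row`, `min_f32_lanes`, and `min_scalar` are hypothetical names, the "kernel" is an ordinary 4-lane loop standing in for the NEON/AVX2 kernels, only f32 is modelled, and returning `None` for an empty row is a choice made for the sketch.

```rust
/// Scalar fallback: a partial-ord fold over arbitrary (possibly strided) elements.
/// Hypothetical helper; returns `None` when the iterator is empty.
fn min_scalar(xs: impl Iterator<Item = f32>) -> Option<f32> {
    xs.fold(None, |acc, x| match acc {
        Some(m) if m <= x => Some(m),
        _ => Some(x),
    })
}

/// Hypothetical stand-in for a `min_f32` kernel: four independent lane
/// accumulators over full chunks, a horizontal reduction, then the tail.
fn min_f32_lanes(xs: &[f32]) -> f32 {
    let mut lanes = [f32::INFINITY; 4];
    let mut chunks = xs.chunks_exact(4);
    for c in &mut chunks {
        for i in 0..4 {
            lanes[i] = lanes[i].min(c[i]);
        }
    }
    // Horizontal reduction of the four lanes.
    let mut m = lanes[0].min(lanes[1]).min(lanes[2].min(lanes[3]));
    // Tail elements that did not fill a whole chunk.
    for &x in chunks.remainder() {
        m = m.min(x);
    }
    m
}

/// Dispatch: contiguous, non-empty rows take the lane kernel;
/// strided or empty rows take the scalar fold.
fn min_row(data: &[f32], stride: usize, len: usize) -> Option<f32> {
    if len == 0 {
        return None;
    }
    if stride == 1 {
        Some(min_f32_lanes(&data[..len]))
    } else {
        min_scalar((0..len).map(|i| data[i * stride]))
    }
}
```

For example, `min_row(&[3.0, -1.0, 2.0, 5.0, 0.5], 1, 5)` takes the lane path (one full chunk plus a one-element tail), while a stride of 2 over the same kind of buffer takes the scalar fold.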

Honest note on the generic path

While implementing this I measured that the generic-framework reducer is actually slower than the scalar fold (~3 vs ~8 GB/s) — the per-row framework overhead outweighs its 4-wide inner loop. So the win comes entirely from the hand-written NEON/AVX2 kernels, not the generic path. This matches how max already behaves: its generic SMax4 is a correctness fallback, and the speed comes from the arm64/x86 kernels. (On purely generic/wasm builds, min_f32 uses the generic reducer, same as max_f32.)

Benchmark

M-series, f32 min over the trailing (contiguous) axis, scalar fold vs NEON, via the added core/examples/reduce_min_bench.rs:

shape	scalar	NEON	speedup (NEON throughput)
1024 × 4096	2.05 ms	0.34 ms	6.0× (49 GB/s)
4096 × 1024	1.84 ms	0.38 ms	4.8× (44 GB/s)
256 × 65536	7.79 ms	1.05 ms	7.4× (64 GB/s)
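The GB/s figures in the table are consistent with reading each f32 input element once. A small sketch of that arithmetic follows; `throughput_gb_s` is a hypothetical helper written for this note, not part of the added benchmark example, and it assumes GB means 10^9 bytes.

```rust
/// Input bytes read per second, in GB/s (10^9 bytes), for an f32 reduction
/// over a `rows x cols` tensor that took `millis` milliseconds.
fn throughput_gb_s(rows: usize, cols: usize, millis: f64) -> f64 {
    let bytes = (rows * cols * std::mem::size_of::<f32>()) as f64;
    bytes / (millis * 1e-3) / 1e9
}
```

Under these assumptions, `throughput_gb_s(1024, 4096, 0.34)` comes to about 49 GB/s for the NEON path, and `throughput_gb_s(1024, 4096, 2.05)` to about 8 GB/s for the scalar fold, in line with the numbers quoted in this description.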

Benefits ReduceMin and MinPool.

Testing

New min_frame_tests! runs against the generic + NEON kernels (and the x86 kernel on x86 CI), each checked vs the reference.
New core test reduce_min_f32_contiguous_and_strided (contiguous incl. tail, strided).
Full tract-core + tract-linalg suites pass on arm64.

The x86 AVX2 kernel mirrors the already-validated max kernel (vmaxps→vminps); I can't perf-test it on this Apple-Silicon host, but its correctness is covered by the min frame test in x86 CI, and I expect perf parity by construction (not measured).

Stacked on #2367 (ReduceMax fix) — both touch reduce.rs; this branch includes that commit. Mergeable independently once #2367 lands.

🤖 Generated with Claude Code

`max_t` (the f32 ReduceMax reducer) called the vectorized `max_f32` linalg kernel, *discarded its result*, then unconditionally recomputed the max with a scalar partial-ord fold over the same slice, so ReduceMax did the reduction twice and was effectively scalar-bound (the "optimized" path was strictly slower than having no kernel at all).

Return the SIMD kernel's result for the f32 contiguous case; fall through to the scalar fold only for non-f32 dtypes, non-contiguous (strided) slices, or empty slices.

Adds a correctness test covering both branches (contiguous + tail, strided, single-element).

Benchmark (M-series, f32 max over the trailing axis, via the added reduce_max_bench example):

shape	before	after	speedup
1024 x 4096	2.44 ms / 6.9 GB/s	0.32 ms / 52 GB/s	7.5x
4096 x 1024	2.46 ms / 6.8 GB/s	0.42 ms / 40 GB/s	5.9x
256 x 65536	9.44 ms / 7.1 GB/s	1.04 ms / 65 GB/s	9.1x

Identical results. Benefits ReduceMax, MaxPool and the softmax max pre-pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
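The before/after behaviour described in that commit message can be illustrated with a small sketch. It is not the code from reduce.rs: `kernel_max`, `fold_max`, `max_before`, and `max_after` are hypothetical names, the kernel is a plain Rust stand-in, and returning `None` for an empty slice is a choice made for the sketch.

```rust
/// Hypothetical stand-in for the vectorized `max_f32` kernel (plain Rust here).
fn kernel_max(xs: &[f32]) -> f32 {
    xs.iter().copied().fold(f32::NEG_INFINITY, f32::max)
}

/// Scalar partial-ord fold; `None` for an empty slice.
fn fold_max(xs: &[f32]) -> Option<f32> {
    xs.iter().copied().fold(None, |acc, x| match acc {
        Some(m) if m >= x => Some(m),
        _ => Some(x),
    })
}

/// Before: the kernel runs, its result is dropped, and the scalar fold
/// walks the same slice again, so the data is reduced twice.
fn max_before(xs: &[f32]) -> Option<f32> {
    let _discarded = kernel_max(xs);
    fold_max(xs)
}

/// After: return the kernel's result for non-empty contiguous input and
/// keep the scalar fold only as the fallback.
fn max_after(xs: &[f32]) -> Option<f32> {
    if xs.is_empty() {
        fold_max(xs)
    } else {
        Some(kernel_max(xs))
    }
}
```

Both versions return the same value; the difference is only that `max_after` makes a single pass over the data.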

`min_t` had no SIMD path: it ran a scalar branchy partial-ord fold while `max_t` uses a hand-vectorized `max_f32` kernel. Mirror the max reducer for min:

- generic `SMin4` (fallback), arm64 NEON `arm64simd_min_f32_16n` (fminv), and x86 AVX2 `x86_64_fma_min_f32_32n` (vminps), with the same structure/wiring as their max counterparts, registered as `Ops::min_f32`.
- `min_t` now routes the contiguous f32 case through `min_f32` and falls back to the scalar fold for non-f32 / strided / empty slices (same shape as `max_t`).
- `min_frame_tests!` macro (mirror of `max_frame_tests!`) validates each kernel against the reference, plus a core reduce-min correctness test.

Note: the generic-framework reducer is actually *slower* than the scalar fold (measured ~3 vs ~8 GB/s), so the win comes from the hand-written NEON/AVX2 kernels, not the generic path. This matches how max behaves (generic is a correctness fallback only).

Benchmark (M-series, f32 min over the trailing axis, scalar fold vs NEON, via the added reduce_min_bench example):

shape	scalar	NEON	speedup (NEON throughput)
1024 x 4096	2.05 ms	0.34 ms	6.0x (49 GB/s)
4096 x 1024	1.84 ms	0.38 ms	4.8x (44 GB/s)
256 x 65536	7.79 ms	1.05 ms	7.4x (64 GB/s)

(x86 AVX2 kernel mirrors the validated max kernel; correctness covered by the min frame test in CI, perf parity by construction.)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

czoli1976 and others added 2 commits June 13, 2026 14:18

czoli1976 force-pushed the perf/reduce-min branch from 412ad5e to 602c714 Compare June 13, 2026 14:15

czoli1976 wants to merge 2 commits into sonos:main from czoli1976:perf/reduce-min
