Add weight-only quantization MoE example by mykolas-perevicius · Pull Request #4507 · pytorch/ao

mykolas-perevicius · 2026-06-18T01:13:56Z

Why this PR

Issue #729 notes that quantize_() should already support Mixture-of-Experts (MoE) models, but there is no example or tutorial demonstrating weight-only quantization on an MoE — users have no reference for the workflow even though the API reportedly covers it. The existing examples/quantize_llama_4.py quantizes routed experts to float8 w8a8 dynamic (activation + weight), not weight-only, and torchao/prototype/moe_training/ is MoE training, a different feature. So the weight-only-MoE showcase gap is real.

I verified the maintainer's claim before writing anything: applying Int8WeightOnlyConfig to an MoE block's expert linears works unmodified. Expert weights become Int8Tensor, the serialized model shrinks ~3.9x, and outputs reach ~45 dB SQNR against the fp32 baseline. The contribution is therefore purely additive: a runnable example plus a unit test, with no core changes.

What this PR does

Adds examples/quantize_moe.py: it builds a small token-choice top-2 MoE block (a softmax router plus nn.Linear experts) and quantizes only the expert weights via quantize_(model, Int8WeightOnlyConfig(), filter_fn=is_expert_linear). The router is deliberately left in high precision — quantizing it would change token-to-expert routing decisions, not just numerics.
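To illustrate why the router is left alone, here is a minimal pure-Python sketch of token-choice top-2 routing. It is not the code in examples/quantize_moe.py: the logits are made up, the gate renormalization is an assumption, and the small perturbation merely stands in for router quantization error. When two experts are nearly tied, a tiny change in the logits swaps which expert the token is sent to.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(logits):
    """Return (expert_index, gate) pairs for the two most probable experts.

    Gates are renormalized to sum to 1 (an assumption of this sketch).
    """
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


# Hypothetical router logits for one token over 4 experts; experts 1 and 2 nearly tie.
exact_logits = [2.00, 0.51, 0.50, -1.00]
# Same logits after a small perturbation, standing in for router quantization error.
noisy_logits = [2.00, 0.49, 0.51, -1.00]

exact_experts = [i for i, _ in top2_route(exact_logits)]
noisy_experts = [i for i, _ in top2_route(noisy_logits)]
print(exact_experts, noisy_experts)  # [0, 1] [0, 2]
```

Quantizing only the expert weights changes what each expert computes, but not which experts are chosen, which is why the example's filter excludes the router.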

The script is self-verifying: it prints before/after weight types and serialized sizes, runs a forward pass, reports SQNR vs the fp32 baseline, and asserts (experts quantized, router not, ≥1.5x smaller, SQNR > 25 dB), exiting non-zero on any failure.

The script takes --dtype int8|int4 and --device flags; int4 is gated behind a hardware/dependency warning.
The docstring points users with real fused-3D-expert checkpoints (e.g. meta-llama/Llama-4-Scout-17B-16E-Instruct) at the FqnToConfig + PerRow(1) pattern, mirroring examples/quantize_llama_4.py.
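The SQNR figure the script reports can be sketched in plain Python. This is a hedged illustration, not the script's implementation: it assumes symmetric per-row int8 scaling (which may differ from what Int8Tensor does internally), uses random stand-in weights, and measures weight reconstruction error rather than the forward-pass outputs the script compares.

```python
import math
import random


def quantize_row_int8(row):
    """Symmetric per-row int8: one float scale, integer codes in [-127, 127]."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in row]
    return codes, scale


def sqnr_db(reference, approx):
    """Signal-to-quantization-noise ratio in dB: 10 * log10(signal / noise)."""
    signal = sum(r * r for r in reference)
    noise = sum((r - a) ** 2 for r, a in zip(reference, approx))
    return math.inf if noise == 0 else 10.0 * math.log10(signal / noise)


# Stand-in weights (hypothetical): 1024 Gaussian samples with a fixed seed.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(1024)]

codes, scale = quantize_row_int8(weights)
restored = [c * scale for c in codes]  # dequantize back to floats
weight_sqnr = sqnr_db(weights, restored)
print(f"weight SQNR: {weight_sqnr:.1f} dB")
```

A threshold such as the script's 25 dB leaves headroom below what int8 typically achieves, so the assertion catches a broken quantization path without being sensitive to seed or hardware.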

It also adds a unit test, TestQuantFlow.test_int8_weight_only_moe_experts_only in test/quantization/test_quant_api.py (with a module-level ToyMoEModel modeled on the file's ToyLinearModel). CI lints examples but does not execute them, so without a test MoE support stays demonstrated but unguarded.

Relevant issues

Closes #729

Test plan / evidence

CI lints examples/ but does not execute them (only tutorials/ run via run_tutorials.yml), so the example is self-asserting and the unit test provides the regression guard.

CPU (macOS, torch 2.12): example runs twice identically — experts → Int8Tensor, router float32, serialized 8.40 → 2.15 MB (3.90x), SQNR 45.1 dB, all asserts pass. pytest test/quantization/test_quant_api.py -k moe → 1 passed.
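A back-of-the-envelope estimate shows why int8 weight-only lands slightly under 4x rather than exactly 4x. The dimensions below are hypothetical and are not taken from examples/quantize_moe.py. The sketch also assumes one fp32 scale per output row and ignores biases and serialization metadata.

```python
def fp32_bytes(num_params):
    """Bytes needed to store num_params float32 values."""
    return 4 * num_params


def int8_weight_only_bytes(rows, cols):
    """1 byte per weight plus one fp32 scale per output row (assumed layout)."""
    return rows * cols + 4 * rows


# Hypothetical toy dimensions, NOT read from the example script.
dim, hidden, num_experts = 256, 1024, 4
# Two linears per expert, given as (out_features, in_features).
expert_shapes = [(hidden, dim), (dim, hidden)] * num_experts
router_params = num_experts * dim  # router stays in fp32 in both cases

baseline = sum(fp32_bytes(r * c) for r, c in expert_shapes) + fp32_bytes(router_params)
quantized = (
    sum(int8_weight_only_bytes(r, c) for r, c in expert_shapes)
    + fp32_bytes(router_params)
)
ratio = baseline / quantized
print(f"{baseline} -> {quantized} bytes ({ratio:.2f}x smaller)")
```

The unquantized router and the per-row scales are the terms that keep the ratio below 4x; serialization overhead, which this estimate omits, pushes a measured ratio a little lower still.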

CUDA (RTX 3090, SM 8.6, torch 2.12.1+cu130):

$ python examples/quantize_moe.py --device cuda
baseline | expert weight: Parameter (torch.float32) | serialized: 8.40 MB
int8 wo  | expert weight: Int8Tensor | router: Parameter (torch.float32) | serialized: 2.15 MB (3.90x smaller) | SQNR vs float32: 45.1 dB
quantization of the MoE model succeeded

$ pytest test/quantization/test_quant_api.py -k moe -q
1 passed

$ pytest test/quantization/test_quant_api.py
18 passed, 28 skipped, 0 failed

The 28 skips are either hardware-gated (skip reasons "Need SM 8.9+" and "Checkpoints are produced in SM90+", i.e. float8 paths that cannot run on this SM 8.6 card) or pre-existing unconditional skips; none are introduced by this change.

Acceptance criteria

Tests added for new behavior (test_int8_weight_only_moe_experts_only + ToyMoEModel)
All tests passing (CPU and CUDA; pre-existing skips/failures unrelated, verified via git stash)
Follows project style guide (ruff check + ruff format clean, ruff 0.11.6; BSD-3 header via scripts/check_copyright_header.py)
No breaking changes (purely additive — new example + new test, no core changes)
Documentation updated where applicable (example docstring with run instructions, hardware note, and fused-3D-expert pointer)

Open questions for maintainers

Example location: top-level examples/quantize_moe.py (matching examples/quantize_llama_4.py, added in #3408, "add an example for quantizing LLaMa 4 Scout") vs examples/inference/ (hinted by the literalinclude in quant_api.py)?
Keep or drop the unit test? #3408 was single-file. Since CI never runs examples, I added the test so MoE support is regression-checked; happy to drop it if you prefer the single-file shape.
examples/README.md entry: should I add one? #3408 didn't.
Supported public path for int4? The int4 default path requires mslk >= 1.0.0, but the mslk package on public PyPI is a 0.0.0 placeholder and torchao/utils.py:1226 gates real availability on is_fbcode(). On a fresh RTX 3090 with the public wheel, --dtype int4 raises ImportError: Requires mslk >= 1.0.0 (same on CPU and CUDA). What's the supported public way to exercise int4 weight-only? (The example gates int4 behind a flag + warning, so the int8 showcase is unaffected.)
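For reference, a dependency gate of the kind described in the last question could look like the following. This is a hypothetical sketch using only the standard library. The helper names are invented for illustration and it is not how torchao decides availability: per the description above, torchao gates on is_fbcode(), so passing a metadata check like this one would not by itself make the int4 path work.

```python
import warnings
from importlib import metadata


def version_tuple(text, width=3):
    """Parse leading numeric components: '1.2.0rc1' -> (1, 2, 0), '1.0' -> (1, 0, 0)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    parts += [0] * (width - len(parts))  # pad short versions with zeros
    return tuple(parts)


def package_at_least(name, minimum):
    """True only if `name` is installed and its version is >= `minimum`."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return False
    return version_tuple(installed) >= tuple(minimum)


def int4_ready(package="mslk", minimum=(1, 0, 0)):
    """Warn and return False when the int4 dependency looks unusable (hypothetical helper)."""
    if package_at_least(package, minimum):
        return True
    wanted = ".".join(str(n) for n in minimum)
    warnings.warn(f"int4 path needs {package} >= {wanted}; a usable version was not found")
    return False
```

Under this check the 0.0.0 placeholder parses to (0, 0, 0) and is rejected, which matches the behavior reported on the public wheel.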

Note

Minor and separate from this PR: torchao/quantization/quant_api.py:1499 has an invalid escape '\.' in a docstring that triggers SyntaxWarning: invalid escape sequence on import under Python ≥ 3.12. Happy to send a one-line r""" micro-fix as its own PR.
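The effect of the proposed r""" fix can be reproduced in isolation. The two source strings below are illustrative stand-ins, not the actual docstring at quant_api.py:1499. The invalid-escape warning is a SyntaxWarning on Python 3.12 and later and a DeprecationWarning on earlier versions, so the sketch accepts either.

```python
import warnings


def escape_warnings(source):
    """Compile `source` and return the invalid-escape warnings the compiler emits."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, "<docstring-demo>", "exec")
    return [
        w for w in caught
        if issubclass(w.category, (SyntaxWarning, DeprecationWarning))
    ]


# Illustrative sources: the same docstring without and with the raw-string prefix.
plain_source = 'def f():\n    """Matches a literal dot: \\."""\n'
raw_source = 'def f():\n    r"""Matches a literal dot: \\."""\n'

plain_count = len(escape_warnings(plain_source))
raw_count = len(escape_warnings(raw_source))
print(plain_count, raw_count)
```

The raw-string prefix leaves the docstring text unchanged while telling the compiler not to interpret the backslash, so the warning disappears.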

Adds examples/quantize_moe.py: a self-contained, self-verifying script that applies int8 (CPU) or int4 (CUDA + mslk) weight-only quantization to the experts of a small token-choice top-2 MoE block via quantize_(), keeping the router in high precision. Prints weight types, serialized size reduction (~3.9x for int8), and SQNR vs the float32 baseline, and points users with fused-3D-expert checkpoints at the FqnToConfig + PerRow(1) pattern from quantize_llama_4.py. Addresses pytorch#729

Adds ToyMoEModel and a CPU test that quantizes only the expert linears via quantize_(filter_fn=...), asserting expert weights become Int8Tensor, the router weight stays unquantized, and SQNR vs float32 stays above 25 dB. Addresses pytorch#729

pytorch-bot · 2026-06-18T01:14:00Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4507

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

mykolas-perevicius · 2026-06-18T01:14:20Z

Hi @jcaip @msaroufim — this is my first contribution to torchao, addressing #729.

It adds a runnable weight-only quantization example for MoE models (examples/quantize_moe.py) plus a unit test. It quantizes only the expert linears via quantize_(..., filter_fn=...) — the router is kept in high precision so routing decisions are unchanged — and the script self-verifies the size reduction (~3.9x int8) and SQNR. Validated end-to-end on CPU (macOS) and CUDA (RTX 3090).

I left a few open questions in the description that I'd value your steer on — especially the preferred example location (examples/ vs examples/inference/) and whether you'd like the unit test kept (happy to drop it for a single-file PR). Would appreciate a review when you have time. Thank you!

mykolas-perevicius added 2 commits June 17, 2026 21:12

mykolas-perevicius requested review from jerryzh168 and vkuzo as code owners June 18, 2026 01:13

meta-cla Bot added the CLA Signed label Jun 18, 2026

mykolas-perevicius wants to merge 2 commits into pytorch:main from mykolas-perevicius:fix-issue-729-moe-quant-example
