Add weight-only quantization MoE example #4507

Open
mykolas-perevicius wants to merge 2 commits into pytorch:main from mykolas-perevicius:fix-issue-729-moe-quant-example
@mykolas-perevicius

Why this PR

Issue #729 notes that quantize_() should already support Mixture-of-Experts (MoE) models, but there is no example or tutorial demonstrating weight-only quantization on an MoE — users have no reference for the workflow even though the API reportedly covers it. The existing examples/quantize_llama_4.py quantizes routed experts to float8 w8a8 dynamic (activation + weight), not weight-only, and torchao/prototype/moe_training/ is MoE training, a different feature. So the weight-only-MoE showcase gap is real.

I verified the maintainer's claim before writing anything: applying Int8WeightOnlyConfig to an MoE block's expert linears works unmodified. Expert weights become Int8Tensor, the serialized model shrinks ~3.9x, and outputs reach ~45 dB SQNR against the fp32 baseline. The contribution is therefore purely additive: a runnable example plus a unit test, with no core changes.

What this PR does

Adds examples/quantize_moe.py: it builds a small token-choice top-2 MoE block (a softmax router plus nn.Linear experts) and quantizes only the expert weights via quantize_(model, Int8WeightOnlyConfig(), filter_fn=is_expert_linear). The router is deliberately left in high precision — quantizing it would change token-to-expert routing decisions, not just numerics.
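To make the routing argument concrete, here is a minimal pure-Python sketch (not torchao code, and not taken from examples/quantize_moe.py). The router weights, the token, and the helper names `router_logits`, `top2`, and `round_to_grid` are all hypothetical, and snapping weights to a 0.25 grid is a deliberately exaggerated stand-in for real quantization error. It only illustrates that perturbing router weights can change which experts a token is sent to.

```python
# Illustrative sketch only: a hand-rolled linear router, not torchao code.

def router_logits(weights, token):
    """One logit per expert: dot product of that expert's router row with the token."""
    return [sum(w * x for w, x in zip(row, token)) for row in weights]


def top2(logits):
    """Indices of the two largest logits (softmax is monotonic, so it would pick the same two)."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]


def round_to_grid(weights, step):
    """Crude stand-in for quantization: snap every weight to a multiple of `step`."""
    return [[round(w / step) * step for w in row] for row in weights]


# Hypothetical router weights for 3 experts and a 2-feature token.
router = [
    [0.37, 0.37],  # expert 0
    [0.13, 0.55],  # expert 1
    [1.00, 1.00],  # expert 2
]
token = [1.0, 1.0]

exact = top2(router_logits(router, token))
coarse = top2(router_logits(round_to_grid(router, step=0.25), token))

print("high-precision router picks experts:", exact)
print("coarsely rounded router picks experts:", coarse)
```

In this toy case the second routed expert changes after rounding, so the token would be processed by a different expert entirely. That is a different kind of error from the small numeric drift introduced by quantizing the expert weights themselves.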

The script is self-verifying: it prints before/after weight types and serialized sizes, runs a forward pass, and reports SQNR vs the fp32 baseline. It then asserts four conditions (experts quantized, router left unquantized, serialized size ≥1.5x smaller, SQNR > 25 dB) and exits non-zero if any fails.

  • --dtype int8|int4 and --device flags; int4 is gated behind a hardware/dependency warning.
  • The docstring points users with real fused-3D-expert checkpoints (e.g. meta-llama/Llama-4-Scout-17B-16E-Instruct) at the FqnToConfig + PerRow(1) pattern, mirroring examples/quantize_llama_4.py.
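For readers who want a feel for the flag handling described in the first bullet, the following is a rough sketch of what it could look like. It is an assumption about shape, not the script's actual code: the function name `parse_args`, the `cpu` default for `--device`, and the warning text are all hypothetical.

```python
import argparse
import warnings


def parse_args(argv=None):
    """Parse the two flags; warn (but do not fail) when int4 is requested."""
    parser = argparse.ArgumentParser(description="Weight-only MoE quantization (sketch)")
    parser.add_argument("--dtype", choices=["int8", "int4"], default="int8")
    parser.add_argument("--device", default="cpu")  # assumed default, not confirmed by the PR
    args = parser.parse_args(argv)
    if args.dtype == "int4":
        # Hypothetical wording; the point is that int4 is opt-in and flagged.
        warnings.warn(
            "int4 weight-only needs extra hardware/dependency support; "
            "int8 is the portable path."
        )
    return args


# Pass an explicit list so the sketch does not depend on the real command line.
args = parse_args(["--dtype", "int8", "--device", "cuda"])
print(args.dtype, args.device)
```

Keeping int8 as the default means the showcase path runs everywhere, while int4 stays available to users who know their environment supports it.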

It also adds a unit test — TestQuantFlow.test_int8_weight_only_moe_experts_only in test/quantization/test_quant_api.py (with a module-level ToyMoEModel modeled on the file's ToyLinearModel) — since CI lints examples but does not execute them, so without a test MoE support stays demonstrated-but-unguarded.

Relevant issues

Closes #729

Test plan / evidence

CI lints examples/ but does not execute them (only tutorials/ run via run_tutorials.yml), so the example is self-asserting and the unit test provides the regression guard.
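As a rough illustration of what "self-asserting" means here, the sketch below computes SQNR in pure Python and applies the two numeric thresholds stated in this PR (≥1.5x smaller, SQNR > 25 dB). The helper names `sqnr_db` and `failed_checks` are hypothetical and the real script operates on tensors; the formula is the standard signal-to-quantization-noise ratio, 10·log10(signal power / noise power).

```python
import math


def sqnr_db(reference, quantized):
    """Signal-to-quantization-noise ratio in dB between two equal-length sequences."""
    signal = sum(r * r for r in reference)
    noise = sum((r - q) ** 2 for r, q in zip(reference, quantized))
    return math.inf if noise == 0 else 10 * math.log10(signal / noise)


def failed_checks(size_before_mb, size_after_mb, sqnr, min_ratio=1.5, min_sqnr_db=25.0):
    """Return a list of human-readable failures; an empty list means all checks passed."""
    failures = []
    ratio = size_before_mb / size_after_mb
    if ratio < min_ratio:
        failures.append(f"size ratio {ratio:.2f}x is below {min_ratio}x")
    if sqnr <= min_sqnr_db:
        failures.append(f"SQNR {sqnr:.1f} dB is not above {min_sqnr_db} dB")
    return failures


# Toy vectors: signal power 25, noise power 0.25.
print(sqnr_db([3.0, 4.0], [3.0, 4.5]))

# The figures reported in this PR (8.40 MB -> 2.15 MB, 45.1 dB).
print(failed_checks(8.40, 2.15, 45.1))
```

A script built this way can call `sys.exit(1)` when the failure list is non-empty, which is what lets a plain `python examples/...` run act as a smoke test even though CI does not execute it.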

CPU (macOS, torch 2.12): example runs twice identically — experts → Int8Tensor, router float32, serialized 8.40 → 2.15 MB (3.90x), SQNR 45.1 dB, all asserts pass. pytest test/quantization/test_quant_api.py -k moe → 1 passed.

CUDA (RTX 3090, SM 8.6, torch 2.12.1+cu130):

$ python examples/quantize_moe.py --device cuda
baseline | expert weight: Parameter (torch.float32) | serialized: 8.40 MB
int8 wo  | expert weight: Int8Tensor | router: Parameter (torch.float32) | serialized: 2.15 MB (3.90x smaller) | SQNR vs float32: 45.1 dB
quantization of the MoE model succeeded

$ pytest test/quantization/test_quant_api.py -k moe -q
1 passed

$ pytest test/quantization/test_quant_api.py
18 passed, 28 skipped, 0 failed

The 28 skips are all hardware-gated (Need SM 8.9+ / Checkpoints are produced in SM90+ — float8 paths on this SM 8.6 card) or pre-existing unconditional skips; none are introduced by this change.

Acceptance criteria

  • Tests added for new behavior (test_int8_weight_only_moe_experts_only + ToyMoEModel)
  • All tests passing (CPU and CUDA; pre-existing skips/failures unrelated, verified via git stash)
  • Follows project style guide (ruff check + ruff format clean, ruff 0.11.6; BSD-3 header via scripts/check_copyright_header.py)
  • No breaking changes (purely additive — new example + new test, no core changes)
  • Documentation updated where applicable (example docstring with run instructions, hardware note, and fused-3D-expert pointer)

Open questions for maintainers

  1. Example location: top-level examples/quantize_moe.py (matching examples/quantize_llama_4.py, added in #3408, "add an example for quantizing LLaMa 4 Scout") vs examples/inference/ (hinted by the literalinclude in quant_api.py)?
  2. Keep or drop the unit test? #3408 was a single-file PR. Since CI never runs examples, I added the test so MoE support is regression-checked; happy to drop it if you prefer the single-file shape.
  3. examples/README.md entry: should I add one? #3408 didn't.
  4. Supported public path for int4? The int4 default path requires mslk >= 1.0.0, but the mslk package on public PyPI is a 0.0.0 placeholder and torchao/utils.py:1226 gates real availability on is_fbcode(). On a fresh RTX 3090 with the public wheel, --dtype int4 raises ImportError: Requires mslk >= 1.0.0 (same on CPU and CUDA). What's the supported public way to exercise int4 weight-only? (The example gates int4 behind a flag + warning, so the int8 showcase is unaffected.)

Note

Minor and separate from this PR: torchao/quantization/quant_api.py:1499 has an invalid escape '\.' in a docstring that triggers SyntaxWarning: invalid escape sequence on import under Python ≥ 3.12. Happy to send a one-line r""" micro-fix as its own PR.
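The warning and the proposed fix can be reproduced in isolation. The snippet below is a stand-in, not the actual docstring at quant_api.py:1499 (which is not shown here): it compiles two hypothetical one-function sources, one with a plain docstring containing `\.` and one with the same docstring marked raw, and collects any invalid-escape warnings.

```python
import warnings

# Hypothetical sources; only the r prefix on the docstring differs.
bad_source = 'def f():\n    """Match a literal dot with \\. in a pattern."""\n'
good_source = 'def f():\n    r"""Match a literal dot with \\. in a pattern."""\n'


def escape_warnings(source):
    """Compile `source` and return the warnings that mention an invalid escape sequence."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, "<sketch>", "exec")
    return [w for w in caught if "invalid escape sequence" in str(w.message)]


print("plain docstring warns:", bool(escape_warnings(bad_source)))
print("raw docstring warns:", bool(escape_warnings(good_source)))
```

On Python ≥ 3.12 the warning category is SyntaxWarning; older 3.x releases report the same problem as a DeprecationWarning, which is hidden by default and explains why it only became visible recently.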

Adds examples/quantize_moe.py: a self-contained, self-verifying script
that applies int8 (CPU) or int4 (CUDA + mslk) weight-only quantization
to the experts of a small token-choice top-2 MoE block via quantize_(),
keeping the router in high precision. Prints weight types, serialized
size reduction (~3.9x for int8), and SQNR vs the float32 baseline, and
points users with fused-3D-expert checkpoints at the FqnToConfig +
PerRow(1) pattern from quantize_llama_4.py.

Addresses pytorch#729

Adds ToyMoEModel and a CPU test that quantizes only the expert linears
via quantize_(filter_fn=...), asserting expert weights become Int8Tensor,
the router weight stays unquantized, and SQNR vs float32 stays above
25 dB.

Addresses pytorch#729
@pytorch-bot

pytorch-bot Bot commented Jun 18, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4507

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 18, 2026
@mykolas-perevicius

Author

Hi @jcaip @msaroufim — this is my first contribution to torchao, addressing #729.

It adds a runnable weight-only quantization example for MoE models (examples/quantize_moe.py) plus a unit test. It quantizes only the expert linears via quantize_(..., filter_fn=...) — the router is kept in high precision so routing decisions are unchanged — and the script self-verifies the size reduction (~3.9x int8) and SQNR. Validated end-to-end on CPU (macOS) and CUDA (RTX 3090).

I left a few open questions in the description that I'd value your steer on — especially the preferred example location (examples/ vs examples/inference/) and whether you'd like the unit test kept (happy to drop it for a single-file PR). Would appreciate a review when you have time. Thank you!

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MoE example

1 participant