Commit 1619421
Added support for MoE for vllm >= 0.14.0rc1 (#1162)
### What does this PR do?
Type of change: Bug fix
`_QuantFusedMoEBase.forward()` previously swapped out
`vllm_fused_moe_package.invoke_fused_moe_kernel`, but that entrypoint was
renamed starting in vLLM v0.14.0rc1.
There are two paths for FusedMoE forward:
```
Path 1 (Modular — standard CUDA path):
FusedMoE.forward()
→ self.runner.forward()
→ TritonExperts.apply()
→ invoke_fused_moe_triton_kernel() ← called twice (w1, w2)
Path 2 (legacy):
inplace_fused_experts / outplace_fused_experts
→ fused_experts_impl()
→ dispatch_fused_moe_kernel()
→ invoke_fused_moe_triton_kernel()
or invoke_fused_moe_wna16_triton_kernel()
or invoke_fused_moe_wna16_cuda_kernel()
```
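The fragility of the old approach can be illustrated with a minimal sketch. The module object and kernel names below are illustrative stand-ins, not the actual vLLM API: a direct attribute swap raises `AttributeError` when the attribute was renamed, and an unguarded swap can leave the patch applied if an exception fires before restore.

```python
# Hypothetical sketch of the fragile pre-fix pattern: overwrite a module
# attribute by hand, then restore it. The module here is a stand-in.
import types

vllm_fused_moe_package = types.SimpleNamespace(
    invoke_fused_moe_kernel=lambda x: f"original({x})"
)

def quantized_kernel(x):
    # Stand-in for the quantization-aware kernel wrapper.
    return f"quantized({x})"

# Manual swap: if the attribute was renamed (as in vLLM >= 0.14.0rc1),
# the getattr below raises AttributeError; without try/finally, an
# exception mid-forward would leave the patch permanently applied.
original = vllm_fused_moe_package.invoke_fused_moe_kernel
vllm_fused_moe_package.invoke_fused_moe_kernel = quantized_kernel
try:
    result = vllm_fused_moe_package.invoke_fused_moe_kernel("x")
finally:
    vllm_fused_moe_package.invoke_fused_moe_kernel = original
```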
This caused an `AttributeError` / assertion failure for any MoE model
quantized with vLLM ≥ v0.14.0rc1.
The fix refactors the kernel-patching logic into a `_patch_moe_kernel()`
context manager that probes for both attribute names (the two names are
mutually exclusive across vLLM versions, as confirmed by inspecting every
release from v0.10.0 to v0.19.1).
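A minimal sketch of the context-managed approach, assuming only what the description states (the two attribute names and the swap-then-restore behavior); the module object and `replacement` callable below are hypothetical, not the actual ModelOpt implementation:

```python
# Sketch of a _patch_moe_kernel()-style context manager that probes for
# whichever fused-MoE kernel entrypoint the installed vLLM exposes,
# swaps in a replacement, and always restores the original.
import types
from contextlib import contextmanager

# Entrypoint names before and after the vLLM v0.14.0rc1 rename.
_KERNEL_NAMES = ("invoke_fused_moe_kernel", "invoke_fused_moe_triton_kernel")

@contextmanager
def _patch_moe_kernel(module, replacement):
    """Swap `replacement` in for whichever entrypoint exists, then restore."""
    name = next((n for n in _KERNEL_NAMES if hasattr(module, n)), None)
    if name is None:
        raise AttributeError(f"no known fused-MoE kernel on {module!r}")
    original = getattr(module, name)
    setattr(module, name, replacement)
    try:
        yield name
    finally:
        # Restore even if the forward pass raises.
        setattr(module, name, original)

# Usage with a fake module exposing the post-rename name:
fake = types.SimpleNamespace(invoke_fused_moe_triton_kernel=lambda x: x)
with _patch_moe_kernel(fake, lambda x: x * 2) as patched:
    assert patched == "invoke_fused_moe_triton_kernel"
    assert fake.invoke_fused_moe_triton_kernel(3) == 6
```

Caching the original under `try`/`finally` is what makes the swap safe: the patch cannot leak past the context even when the wrapped forward raises.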
### Usage
NA
### Testing
```
docker run --gpus all -it --shm-size=160GB --network host --rm -v <modelopt path>:/home/modelopt \
vllm/vllm-openai:v0.15.0 bash -c "cd /home/modelopt && pip install . && pip install datasets && \
QUANT_CFG=NVFP4_DEFAULT_CFG python3 /home/modelopt/examples/vllm_serve/vllm_serve_fakequant.py \
nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 -tp 1 --served-model-name NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--host 0.0.0.0 --port 8001 --trust-remote-code --disable-custom-all-reduce \
--gpu-memory-utilization 0.8"
```
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A
### Additional Information
<!-- E.g. related issue. -->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Refactor**
* Ensures quantized expert weights are correctly used by the fused-MoE
execution path so inference uses the intended quantized tensors.
* Replaces fragile manual swapping of the runtime kernel with a safer,
context-managed swap that reliably caches and restores the original.
* Adds runtime detection and selection among available fused-MoE kernel
entrypoints to support multiple variants.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
4 files changed (100 additions, 28 deletions):
- examples/vllm_serve
- modelopt/torch/quantization/plugins