
Commit 7233616

Added support for KV cache quantization for vllm fakequant (#686)
## What does this PR do?

**Type of change:** New feature

**Overview:**

- Added support to quantize the KV cache in vLLM fakequant by adding quantization support for [Attention](https://github.com/vllm-project/vllm/blob/v0.12.0/vllm/attention/layer.py#L161).
- Modified initialization of the parallel state to incorporate vLLM parallel-state groups for correct quantization-parameter syncing.

## Usage

Please refer to the [README](https://github.com/NVIDIA/Model-Optimizer/tree/kinjal/vllm_att_quant/examples/vllm_serve#calibrate-and-serve-fake-quant-model-in-vllm).

```
KV_QUANT_CFG=NVFP4_KV_CFG QUANT_CFG=NVFP4_DEFAULT_CFG python vllm_serve_fakequant.py meta-llama/Llama-3.2-1B-Instruct --served-model-name meta-llama/Llama-3.2-1B-Instruct --host 0.0.0.0 --port 8001 --trust-remote-code
```
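Once the server is up, a quick way to sanity-check the fake-quant deployment is to query vLLM's OpenAI-compatible endpoint. A minimal sketch, assuming the `openai` Python package is installed; host, port, and model name are taken from the command above:

```python
# Smoke test against the fake-quant server started above.
# vLLM serves an OpenAI-compatible API, so the standard client works.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")
response = client.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    prompt="The capital of France is",
    max_tokens=8,
)
print(response.choices[0].text)
```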
## Testing

Locally tested KV cache quantization:

```
model.layers.0.self_attn.qkv_proj.input_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=5.0312 calibrator=MaxCalibrator quant)
model.layers.0.self_attn.qkv_proj.weight_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=0.6758 calibrator=MaxCalibrator quant)
model.layers.0.self_attn.qkv_proj.output_quantizer TensorQuantizer(disabled)
model.layers.0.self_attn.o_proj.input_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=1.3438 calibrator=MaxCalibrator quant)
model.layers.0.self_attn.o_proj.weight_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=0.3145 calibrator=MaxCalibrator quant)
model.layers.0.self_attn.o_proj.output_quantizer TensorQuantizer(disabled)
model.layers.0.self_attn.attn.q_bmm_quantizer TensorQuantizer(disabled)
model.layers.0.self_attn.attn.k_bmm_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=13.8125 calibrator=MaxCalibrator quant)
model.layers.0.self_attn.attn.v_bmm_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=1.3438 calibrator=MaxCalibrator quant)
model.layers.0.mlp.gate_up_proj.input_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=3.2812 calibrator=MaxCalibrator quant)
model.layers.0.mlp.gate_up_proj.weight_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=0.5938 calibrator=MaxCalibrator quant)
model.layers.0.mlp.gate_up_proj.output_quantizer TensorQuantizer(disabled)
model.layers.0.mlp.down_proj.input_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=33.7500 calibrator=MaxCalibrator quant)
model.layers.0.mlp.down_proj.weight_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=0.6211 calibrator=MaxCalibrator quant)
model.layers.0.mlp.down_proj.output_quantizer TensorQuantizer(disabled)
```

## Before your PR is "*Ready for review*"

- **Make sure you read and follow the [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: NA
- **Did you add or update any necessary documentation?**: Yes
- **Did you update the [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
1 parent d8d5a29 commit 7233616

4 files changed: 48 additions & 31 deletions

CHANGELOG.rst

Lines changed: 1 addition & 0 deletions
```diff
@@ -9,6 +9,7 @@ NVIDIA Model Optimizer Changelog (Linux)
 - Add support for Transformer Engine quantization for Megatron Core models.
 - Add support for Qwen3-Next model quantization.
 - Add support for dynamically linked TensorRT plugins in the ONNX quantization workflow.
+- Add support for KV Cache Quantization for vLLM FakeQuant PTQ script. See `examples/vllm_serve/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/vllm_serve#Calibrate-and-serve-fake-quant-model-in-vLLM>`__ for more details.
 
 **Deprecations**
 
```

examples/vllm_serve/README.md

Lines changed: 2 additions & 18 deletions
````diff
@@ -24,6 +24,7 @@ You can either edit the `quant_config` dictionary in `vllm_serve_fakequant.py`,
 | QUANT_DATASET | Dataset name for calibration | cnn_dailymail |
 | QUANT_CALIB_SIZE | Number of samples used for calibration | 512 |
 | QUANT_CFG | Quantization format | NVFP4_DEFAULT_CFG |
+| KV_QUANT_CFG | Quantization format for KV Cache | None |
 | AMAX_FILE_PATH | Optional path to amax file (for loading amax) | None |
 
 Set these variables in your shell or Docker environment as needed to customize calibration.
@@ -68,25 +69,8 @@ Step 2: configure <quant_amax.pth> from exported model using AMAX_FILE_PATH envi
 AMAX_FILE_PATH=<vllm_amax.pth> QUANT_CFG=<quant_config> python vllm_serve_fakequant.py <model_path> -tp 8 --host 0.0.0.0 --port 8000
 ```
 
-## Important Notes
-
-**Amax Synchronization across Tensor Parallel (TP):**
-
-- **For non-per-tensor quantization**: It is **recommended** to use an amax file (via `AMAX_FILE_PATH`) because amax synchronization across TP/EP is not automatically handled. Without an amax file, the amax values can be different across different TP ranks, leading to inconsistent results compared to real-quantization.
-
-- **For per-tensor quantization**: If you are not using an amax file, you need to enable amax synchronization across TP ranks. An example implementation is provided in `fakequant_worker.py` (lines 190-198):
-
-```python
-for name, buffer in model.named_buffers():
-    if name.endswith("_amax"):
-        torch.distributed.all_reduce(
-            buffer, op=torch.distributed.ReduceOp.MAX, group=get_tp_group().device_group
-        )
-torch.distributed.barrier()
-```
-
 ## Known Problems
 
 1. AWQ is not yet supported in vLLM.
-2. PTQ/QAT checkpoint doesn't work with KV Cache quantization enabled.
+2. QAT checkpoint export doesn't have KV Cache quantization enabled. KV Cache fake quantization works for PTQ.
 3. Mixed precision checkpoint doesn't work currently.
````
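For context, here is a minimal sketch of how the `QUANT_CFG`/`KV_QUANT_CFG` variables documented above resolve to ModelOpt quantization configs. It mirrors the `fakequant_worker.py` hunks in the next file; `mtq.utils.update_quant_cfg_with_kv_cache_quant` comes from this PR, while the surrounding scaffolding is illustrative:

```python
import os

import modelopt.torch.quantization as mtq

# Config names are looked up as attributes on mtq, e.g. NVFP4_DEFAULT_CFG.
quant_cfg = getattr(mtq, os.environ.get("QUANT_CFG", "NVFP4_DEFAULT_CFG"))

# KV_QUANT_CFG is optional, e.g. "NVFP4_KV_CFG"; leaving it unset disables KV quant.
kv_cfg_name = os.environ.get("KV_QUANT_CFG")
if kv_cfg_name is not None:
    quant_cfg = mtq.utils.update_quant_cfg_with_kv_cache_quant(
        quant_cfg, getattr(mtq, kv_cfg_name)["quant_cfg"]
    )
```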

examples/vllm_serve/fakequant_worker.py

Lines changed: 5 additions & 11 deletions
```diff
@@ -150,6 +150,7 @@ def disable_compilation(model):
         "dataset": os.environ.get("QUANT_DATASET", "cnn_dailymail"),
         "calib_size": int(os.environ.get("QUANT_CALIB_SIZE", 512)),
         "quant_cfg": os.environ.get("QUANT_CFG", "NVFP4_DEFAULT_CFG"),
+        "kv_quant_cfg": os.environ.get("KV_QUANT_CFG", None),
         "amax_file_path": os.environ.get("AMAX_FILE_PATH", None),
     }
 
@@ -236,6 +237,10 @@ def calibrate_loop(model: Any = None) -> None:
                 self.sample_tokens(None)
 
         quant_cfg = getattr(mtq, quant_config["quant_cfg"])
+        if quant_config["kv_quant_cfg"] is not None:
+            quant_cfg = mtq.utils.update_quant_cfg_with_kv_cache_quant(
+                quant_cfg, getattr(mtq, quant_config["kv_quant_cfg"])["quant_cfg"]
+            )
 
         model = self.model_runner.model
         if hasattr(model, "unwrap"):
@@ -290,17 +295,6 @@ def calibrate_loop(model: Any = None) -> None:
         model.load_state_dict(current_state_dict)
         torch.distributed.barrier()
 
-        if amax_file_path is None:
-            # Sync amax across TP can be done here if needed
-            pass
-            # for name, buffer in model.named_buffers():
-            #     if name.endswith("_amax"):
-            #         print("syncing amax across TP for", name)
-            #         torch.distributed.all_reduce(
-            #             buffer, op=torch.distributed.ReduceOp.MAX, group=get_tp_group().device_group
-            #         )
-            # torch.distributed.barrier()
-
         if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
             mtq.print_quant_summary(model)
```
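For per-tensor quantization without an amax file, the synchronization that the removed comment block sketched can be factored into a helper. A hedged standalone version, assuming vLLM's distributed groups are already initialized:

```python
import torch
from vllm.distributed.parallel_state import get_tp_group


def sync_amax_across_tp(model: torch.nn.Module) -> None:
    """Max-reduce every *_amax buffer over the tensor-parallel group."""
    for name, buffer in model.named_buffers():
        if name.endswith("_amax"):
            torch.distributed.all_reduce(
                buffer,
                op=torch.distributed.ReduceOp.MAX,
                group=get_tp_group().device_group,
            )
    torch.distributed.barrier()
```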

modelopt/torch/quantization/plugins/vllm.py

Lines changed: 40 additions & 2 deletions
```diff
@@ -18,8 +18,12 @@
 import importlib
 
 import torch
+import vllm.attention as vllm_attention
 import vllm.model_executor.layers.fused_moe.layer as vllm_fused_moe_layer
 import vllm.model_executor.layers.linear as vllm_linear
+from vllm.attention.layers.cross_attention import CrossAttention
+from vllm.attention.layers.encoder_only_attention import EncoderOnlyAttention
+from vllm.distributed.parallel_state import get_dp_group, get_ep_group, get_tp_group
 
 from ...utils.distributed import ParallelState
 from ..nn import QuantLinearConvBase, QuantModule, QuantModuleRegistry, TensorQuantizer
@@ -90,6 +94,14 @@ def apply(
         return output
 
 
+def create_parallel_state():
+    """Create a parallel state for vLLM."""
+    dp_group = get_dp_group().device_group
+    tp_group = get_tp_group().device_group
+    ep_group = get_ep_group().device_group
+    return ParallelState(dp_group, tp_group, ep_group)
+
+
 class _VLLMParallelLinear(QuantModule):
     def _setup(self):
         self.input_quantizer = TensorQuantizer(QuantLinearConvBase.default_quant_desc_input)
@@ -100,7 +112,7 @@ def _setup(self):
             f"quant_method is {type(self.quant_method)}"
         )
         self.fake_quant_method = FakeQuantMethod(self.quant_method)
-        self.parallel_state = ParallelState(-1, -1)
+        self.parallel_state = create_parallel_state()
 
     def forward(self, input_):
         # This context manager will conflict with torch.compile
@@ -151,7 +163,7 @@ def _setup(self):
         assert type(self.quant_method) is vllm_fused_moe_layer.UnquantizedFusedMoEMethod, (
             f"quant_method is {type(self.quant_method)}"
        )
-        self.parallel_state = ParallelState(-1, -1)
+        self.parallel_state = create_parallel_state()
 
     def invoke_fused_moe_quantized(
         self,
@@ -243,3 +255,29 @@ class _QuantVLLMFusedMoE(_QuantFusedMoEBase):
 )
 class _QuantVLLMSharedFusedMoE(_QuantFusedMoEBase):
     pass
+
+
+@QuantModuleRegistry.register({vllm_attention.Attention: "vllm_Attention"})
+class _QuantVLLMAttention(QuantModule):
+    def _setup(self):
+        self.q_bmm_quantizer = TensorQuantizer()
+        self.k_bmm_quantizer = TensorQuantizer()
+        self.v_bmm_quantizer = TensorQuantizer()
+        self.parallel_state = create_parallel_state()
+
+    def forward(self, query, key, value, *args, **kwargs):
+        query = self.q_bmm_quantizer(query)
+        key = self.k_bmm_quantizer(key)
+        value = self.v_bmm_quantizer(value)
+
+        return super().forward(query, key, value, *args, **kwargs)
+
+
+@QuantModuleRegistry.register({CrossAttention: "vllm_CrossAttention"})
+class _QuantVLLMCrossAttention(_QuantVLLMAttention):
+    pass
+
+
+@QuantModuleRegistry.register({EncoderOnlyAttention: "vllm_EncoderOnlyAttention"})
+class _QuantVLLMEncoderOnlyAttention(_QuantVLLMAttention):
+    pass
```
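After `mtq.quantize(...)` runs with these registrations in place, each vLLM `Attention` module carries `q_bmm_quantizer`/`k_bmm_quantizer`/`v_bmm_quantizer` submodules, matching the Testing output above. A small helper (hypothetical, for illustration) to inspect them:

```python
import torch


def print_kv_quantizers(model: torch.nn.Module) -> None:
    """List the KV-cache quantizers attached to quantized attention modules."""
    for name, module in model.named_modules():
        for attr in ("q_bmm_quantizer", "k_bmm_quantizer", "v_bmm_quantizer"):
            if hasattr(module, attr):
                print(f"{name}.{attr}: {getattr(module, attr)}")
```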
