Commit 7233616
authored
Added support for KV cache quantization for vllm fakequant (#686)
## What does this PR do?
**Type of change:** New feature
**Overview:**
- Added support for quantizing the KV cache in vLLM fakequant by adding
quantization support to vLLM's
[Attention](https://github.com/vllm-project/vllm/blob/v0.12.0/vllm/attention/layer.py#L161)
layer
- Modified parallel-state initialization to incorporate vLLM's parallel
state groups so that quantization parameters are synced correctly
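To illustrate what KV-cache fake quantization does conceptually, here is a minimal sketch of amax-based quantize-dequantize over mock key-cache values. This is not the ModelOpt/vLLM implementation; the function name and the simple per-tensor symmetric int4 scheme are illustrative only (the PR's NVFP4 config uses block-wise scales, as the quantizer summary below shows).

```python
# Conceptual sketch of amax-based fake quantization as applied to KV-cache
# tensors. NOT the ModelOpt implementation; per-tensor symmetric int4 is an
# illustrative simplification of the block-wise NVFP4 scheme.

def fake_quantize(values, num_bits=4, amax=None):
    """Quantize-dequantize a list of floats to num_bits symmetric levels."""
    if amax is None:
        # Max calibration: use the largest absolute value, like MaxCalibrator.
        amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values)
    qmax = 2 ** (num_bits - 1) - 1  # e.g. 7 for 4-bit symmetric
    scale = amax / qmax
    # Round to the nearest representable level, clamp, then dequantize.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

# A "k_bmm_quantizer"-style pass over mock key-cache values.
keys = [0.9, -2.3, 13.8125, 0.05]
quant_keys = fake_quantize(keys)
```

Values at the calibrated amax survive quantization exactly; smaller values snap to the nearest of the few representable levels, which is the error the fakequant path lets you measure end to end.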
## Usage
Please refer to the
[README](https://github.com/NVIDIA/Model-Optimizer/tree/kinjal/vllm_att_quant/examples/vllm_serve#calibrate-and-serve-fake-quant-model-in-vllm)
```
KV_QUANT_CFG=NVFP4_KV_CFG QUANT_CFG=NVFP4_DEFAULT_CFG \
python vllm_serve_fakequant.py meta-llama/Llama-3.2-1B-Instruct \
  --served-model-name meta-llama/Llama-3.2-1B-Instruct \
  --host 0.0.0.0 --port 8001 --trust-remote-code
```
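Once the server is up, it exposes vLLM's OpenAI-compatible API. The sketch below only builds a request body for `/v1/chat/completions` (no request is sent); the model name must match `--served-model-name` from the command above.

```python
import json

# Build (but do not send) a request body for vLLM's OpenAI-compatible
# /v1/chat/completions endpoint. The model field must match the
# --served-model-name passed to vllm_serve_fakequant.py.
payload = {
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 32,
}
body = json.dumps(payload)
```

You can POST this body to `http://0.0.0.0:8001/v1/chat/completions` (e.g. with `curl -H "Content-Type: application/json" -d @-`) to sanity-check generations from the fake-quantized model.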
## Testing
Locally tested KV cache quantization. Sample quantizer summary for `model.layers.0`:
```
model.layers.0.self_attn.qkv_proj.input_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=5.0312 calibrator=MaxCalibrator quant)
model.layers.0.self_attn.qkv_proj.weight_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=0.6758 calibrator=MaxCalibrator quant)
model.layers.0.self_attn.qkv_proj.output_quantizer TensorQuantizer(disabled)
model.layers.0.self_attn.o_proj.input_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=1.3438 calibrator=MaxCalibrator quant)
model.layers.0.self_attn.o_proj.weight_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=0.3145 calibrator=MaxCalibrator quant)
model.layers.0.self_attn.o_proj.output_quantizer TensorQuantizer(disabled)
model.layers.0.self_attn.attn.q_bmm_quantizer TensorQuantizer(disabled)
model.layers.0.self_attn.attn.k_bmm_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=13.8125 calibrator=MaxCalibrator quant)
model.layers.0.self_attn.attn.v_bmm_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=1.3438 calibrator=MaxCalibrator quant)
model.layers.0.mlp.gate_up_proj.input_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=3.2812 calibrator=MaxCalibrator quant)
model.layers.0.mlp.gate_up_proj.weight_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=0.5938 calibrator=MaxCalibrator quant)
model.layers.0.mlp.gate_up_proj.output_quantizer TensorQuantizer(disabled)
model.layers.0.mlp.down_proj.input_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=33.7500 calibrator=MaxCalibrator quant)
model.layers.0.mlp.down_proj.weight_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=0.6211 calibrator=MaxCalibrator quant)
model.layers.0.mlp.down_proj.output_quantizer TensorQuantizer(disabled)
```
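When eyeballing dumps like the one above, it can help to pull out just the calibrated amax per quantizer. The helper below is hypothetical tooling, not part of this PR; it only assumes the `amax=<float>` pattern visible in the output above.

```python
import re

# Hypothetical helper (not part of this PR): extract {quantizer name: amax}
# from summary lines formatted like the Testing output above. Disabled
# quantizers carry no amax and are skipped.
LINE_RE = re.compile(r"^(\S+)\s+TensorQuantizer\(.*?amax=([0-9.]+)")

def parse_amax(lines):
    out = {}
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            out[m.group(1)] = float(m.group(2))
    return out

sample = [
    "model.layers.0.self_attn.attn.k_bmm_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=13.8125 calibrator=MaxCalibrator quant)",
    "model.layers.0.self_attn.attn.q_bmm_quantizer TensorQuantizer(disabled)",
]
amaxes = parse_amax(sample)
```

Unusually large or near-zero amax values in such a summary are a quick signal that calibration data did not exercise a layer well.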
## Before your PR is "*Ready for review*"
- **Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)**
and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: NA
- **Did you add or update any necessary documentation?**: Yes
- **Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**:
Yes
## Additional Information
---------
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>

1 parent d8d5a29
commit 7233616
4 files changed: 48 additions & 31 deletions
File tree:
- examples/vllm_serve
- modelopt/torch/quantization/plugins