Add FP8 support for the ONNX backend #4072

Open
andrey-churkin wants to merge 5 commits into openvinotoolkit:develop from andrey-churkin:ac/fp8_onnx

@andrey-churkin andrey-churkin commented May 15, 2026

Contributor

Changes

  • Add support for nncf.CompressWeightsMode.FP8_E4M3 mode in the nncf.compress_weights() method for the ONNX backend.
  • Add support for quantization using nncf.QuantizationMode.FP8_E4M3 and nncf.QuantizationMode.FP8_E5M2 modes in the nncf.quantize() method for the ONNX backend.
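As a rough illustration of the two entry points listed above, here is a minimal sketch of how they might be called once this change is available. The wrapper names `compress_weights_fp8` and `quantize_fp8` are hypothetical, the FP8 enum members are taken from the description above, and FP8 support on the ONNX backend is assumed to behave like the other modes. The `nncf` import is deferred into the functions so the sketch can be loaded without NNCF installed.

```python
from typing import Any, Iterable


def compress_weights_fp8(model: Any) -> Any:
    """Weight-only compression of an ONNX model to FP8 E4M3 (no calibration data)."""
    import nncf  # deferred import: requires an NNCF build that includes this change

    return nncf.compress_weights(model, mode=nncf.CompressWeightsMode.FP8_E4M3)


def quantize_fp8(model: Any, calibration_items: Iterable[Any], use_e5m2: bool = False) -> Any:
    """Post-training quantization of an ONNX model in one of the two FP8 modes.

    For the ONNX backend each calibration item is expected to be a dict that
    maps model input names to NumPy arrays.
    """
    import nncf  # deferred import, see above

    # Pick E5M2 (wider range) or E4M3 (finer precision) as the FP8 format.
    mode = nncf.QuantizationMode.FP8_E5M2 if use_e5m2 else nncf.QuantizationMode.FP8_E4M3
    dataset = nncf.Dataset(calibration_items)
    return nncf.quantize(model, dataset, mode=mode)
```

A caller would typically load the model with `onnx.load(...)`, pass it to one of these wrappers, and save the result with `onnx.save(...)`.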

Reason for changes

Add support for FP8 quantization and weight compression in the ONNX backend.

Related tickets

Tests

TBD

Weight compression - success

@andrey-churkin andrey-churkin requested a review from a team as a code owner May 15, 2026 08:24
@github-actions github-actions Bot added the NNCF ONNX Pull requests that updates NNCF ONNX label May 15, 2026

@daniil-lyakhov daniil-lyakhov left a comment

Collaborator

No major comments, please add some tests

Comment thread src/nncf/quantization/algorithms/min_max/onnx_backend.py
Comment on lines +367 to +371
    if weight_dtype == onnx.TensorProto.FLOAT8E4M3FN:
        np_dtype = helper.tensor_dtype_to_np_dtype(weight_dtype)
        vals = onnx.numpy_helper.saturate_cast(np.asarray(quantized_weights), np_dtype).flatten()
    else:
        vals = quantized_weights

Collaborator

Two similar code blocks, maybe worth a private method?

Contributor Author

I've rewritten it slightly. Given that it's only two lines, I don't think introducing a separate method provides much value.
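
For background on the `saturate_cast` call in the snippet under review: a saturating cast to FP8 E4M3 clamps out-of-range magnitudes to the largest finite value (448 for `FLOAT8E4M3FN`) instead of overflowing. The pure-Python sketch below illustrates that idea on a single float. It is not the ONNX helper itself, and the function name `saturate_round_e4m3fn` is hypothetical; it only assumes the standard E4M3FN layout (4 exponent bits with bias 7, 3 mantissa bits, no infinities).

```python
import math

E4M3FN_MAX = 448.0      # largest finite magnitude: 1.75 * 2**8
MIN_NORMAL_EXP = -6     # exponent of the smallest normal value (bias 7)
MANTISSA_BITS = 3


def saturate_round_e4m3fn(x: float) -> float:
    """Round x to the nearest E4M3FN-representable value, clamping to +/-448."""
    if math.isnan(x):
        return x
    sign = math.copysign(1.0, x)
    mag = abs(x)
    if mag >= E4M3FN_MAX:           # saturate instead of overflowing
        return sign * E4M3FN_MAX
    if mag == 0.0:
        return x
    # Exponent of the binade containing mag, clamped so that tiny values
    # fall onto the subnormal grid (spacing 2**-9).
    exp = max(math.frexp(mag)[1] - 1, MIN_NORMAL_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)
    # Python's round() rounds halves to even, matching round-to-nearest-even.
    return sign * round(mag / step) * step


print(saturate_round_e4m3fn(1000.0))  # clamped to the format maximum
print(saturate_round_e4m3fn(0.3))     # nearest representable neighbour
```

Values already on the FP8 grid, such as 1.0 or 448.0, pass through unchanged, so applying the function to weights that were quantized correctly is a no-op.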

@daniil-lyakhov daniil-lyakhov left a comment

Collaborator

LGTM

2 participants