[python] Support BlobView fields #7837

Draft
leaves12138 wants to merge 3 commits into apache:master from leaves12138:codex/python-blob-view

@leaves12138
Contributor

Summary

Codex

This PR was prepared and submitted by Codex.

Validation

  • PYTHONPATH=paimon-python pytest -q paimon-python/pypaimon/tests/blob_test.py::BlobTest::test_blob_view_struct_roundtrip
  • PYTHONPATH=paimon-python pytest -q paimon-python/pypaimon/tests/blob_table_test.py::DataBlobWriterTest::test_blob_view_fields_resolve_upstream_blob paimon-python/pypaimon/tests/blob_table_test.py::DataBlobWriterTest::test_blob_view_fields_rejects_non_view_input paimon-python/pypaimon/tests/blob_table_test.py::DataBlobWriterTest::test_blob_inline_fields_reject_overlap_and_unknown_fields paimon-python/pypaimon/tests/blob_table_test.py::DataBlobWriterTest::test_blob_descriptor_fields_mixed_mode paimon-python/pypaimon/tests/blob_table_test.py::DataBlobWriterTest::test_to_arrow_batch_reader paimon-python/pypaimon/tests/blob_table_test.py::DataBlobWriterTest::test_blob_descriptor_fields_rejects_non_descriptor_input
  • PYTHONPATH=paimon-python python3 -m flake8 --config=paimon-python/dev/cfg.ini paimon-python/pypaimon/table/row/blob.py paimon-python/pypaimon/common/options/core_options.py paimon-python/pypaimon/schema/schema.py paimon-python/pypaimon/write/writer/data_blob_writer.py paimon-python/pypaimon/read/reader/data_file_batch_reader.py paimon-python/pypaimon/read/reader/blob_descriptor_convert_reader.py paimon-python/pypaimon/utils/blob_view_lookup.py paimon-python/pypaimon/tests/blob_test.py paimon-python/pypaimon/tests/blob_table_test.py
  • python3 -m compileall -q paimon-python/pypaimon/table/row/blob.py paimon-python/pypaimon/common/options/core_options.py paimon-python/pypaimon/schema/schema.py paimon-python/pypaimon/write/writer/data_blob_writer.py paimon-python/pypaimon/read/reader/data_file_batch_reader.py paimon-python/pypaimon/read/reader/blob_descriptor_convert_reader.py paimon-python/pypaimon/utils/blob_view_lookup.py
  • git diff --check

@leaves12138 leaves12138 marked this pull request as ready for review May 13, 2026 06:38
@leaves12138
Contributor Author

Thanks for the PR. The overall direction looks right to me: the Python BlobViewStruct wire format matches the Java side, and the write/read path for inline blob-view-field plus blob-as-descriptor=true is mostly aligned with the BlobView design.

I found one blocker in the data-evolution read path, though.

DataFileBatchReader already resolves inline blob-view-field values to the actual blob payload when blob-as-descriptor=false. In the data-evolution path, DataEvolutionSplitRead then wraps the merged reader with BlobDescriptorConvertReader, and that reader scans the already-resolved bytes again with BlobViewStruct.is_blob_view_struct() / deserialize(). This means a perfectly valid blob payload whose first bytes happen to match the BlobView magic header can be parsed as a BlobViewStruct and fail, or even be incorrectly resolved as another view.

The problematic chain is:

  • DataFileBatchReader._convert_inline_blob_columns resolves view fields before returning file batches.
  • DataEvolutionSplitRead.create_reader wraps the result with BlobDescriptorConvertReader whenever blob descriptor/view fields exist.
  • BlobDescriptorConvertReader._convert_batch attempts to interpret the final bytes as BlobViewStruct again.

I reproduced this locally by writing a source blob payload starting with the BlobView version + magic bytes and then reading it through a downstream blob-view-field; the default read failed with ValueError: Invalid BlobViewStruct data: too short even though the source payload is just normal blob data.
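To make the failure mode concrete, here is a toy model of the second sniffing pass. It is only a sketch: `TOY_VERSION`, `TOY_MAGIC`, the header layout, and the helper names are hypothetical stand-ins, not the real BlobViewStruct values or the pypaimon API.

```python
import struct

# Hypothetical header layout, for illustration only. The real BlobViewStruct
# version and magic values are not reproduced here.
TOY_VERSION = 1
TOY_MAGIC = 0x424C5657
TOY_HEADER = struct.pack("<BI", TOY_VERSION, TOY_MAGIC)  # 5 bytes, no padding
TOY_MIN_LENGTH = len(TOY_HEADER) + 8  # header plus a toy 8-byte body


def looks_like_view(data: bytes) -> bool:
    """Toy stand-in for a header-sniffing check like is_blob_view_struct()."""
    return data.startswith(TOY_HEADER)


def deserialize_view(data: bytes) -> int:
    """Toy stand-in for deserialize(): returns the 8-byte body as an integer."""
    if len(data) < TOY_MIN_LENGTH:
        raise ValueError("Invalid BlobViewStruct data: too short")
    (body,) = struct.unpack_from("<q", data, len(TOY_HEADER))
    return body


def second_pass(resolved_payload: bytes) -> bytes:
    """Mimics a reader that sniffs bytes which were already resolved upstream."""
    if looks_like_view(resolved_payload):
        # The payload is ordinary blob data, but it is parsed as a view anyway.
        deserialize_view(resolved_payload)
    return resolved_payload
```

With this model, `second_pass(b"ordinary blob bytes")` passes through untouched, while `second_pass(TOY_HEADER + b"xy")` raises the "too short" error even though the input is just payload bytes that happen to start with the header.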

I think we should ensure inline blob values are converted exactly once. One possible fix is to keep conversion in DataFileBatchReader and not wrap data-evolution readers with BlobDescriptorConvertReader for fields that have already been converted; alternatively, disable the file-reader-side conversion in the data-evolution path and do conversion only after merge/filter.
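A minimal sketch of the "convert exactly once" idea, assuming the outer reader can be told which fields the file reader has already resolved. The function name, the dict-based row shape, and the `resolve` callback are hypothetical and do not correspond to existing pypaimon code.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional, Set

Row = Dict[str, Optional[bytes]]


def convert_once(
    rows: Iterable[Row],
    view_fields: Set[str],
    already_converted: Set[str],
    resolve: Callable[[bytes], bytes],
) -> Iterator[Row]:
    """Resolve view fields, skipping any field an upstream reader handled."""
    # Only fields that nobody has resolved yet are candidates for conversion.
    pending = view_fields - already_converted
    for row in rows:
        out = dict(row)  # copy so the caller's row is not mutated
        for name in pending:
            value = out.get(name)
            if value is not None:
                out[name] = resolve(value)
        yield out
```

The point of passing `already_converted` explicitly is that the decision is made from schema-level knowledge rather than by inspecting the bytes, so a payload that resembles a view header cannot trigger a second conversion.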

One more compatibility point: Java validation requires fields listed in blob-descriptor-field / blob-view-field to also be listed in blob-field. The Python schema validation currently only checks that those names are BLOB fields and do not overlap. If this relaxation is intentional because Python treats all large_binary columns as BLOBs, please document it; otherwise it would be better to align the option validation with Java to avoid creating schemas that Java/Flink would reject.
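If we do align with Java, the check could look roughly like the sketch below. It assumes the three options are comma-separated field names in a plain dict; the function names and error messages are hypothetical, and the BLOB type check that Python already performs is left out.

```python
from typing import Dict, Set


def _names(options: Dict[str, str], key: str) -> Set[str]:
    """Parse a comma-separated option value into a set of field names."""
    raw = options.get(key, "")
    return {name.strip() for name in raw.split(",") if name.strip()}


def validate_blob_field_options(options: Dict[str, str]) -> None:
    """Reject descriptor/view fields that overlap or are missing from blob-field."""
    blob_fields = _names(options, "blob-field")
    descriptor_fields = _names(options, "blob-descriptor-field")
    view_fields = _names(options, "blob-view-field")

    overlap = descriptor_fields & view_fields
    if overlap:
        raise ValueError(
            "Fields listed in both blob-descriptor-field and "
            "blob-view-field: %s" % sorted(overlap)
        )

    for key, names in (
        ("blob-descriptor-field", descriptor_fields),
        ("blob-view-field", view_fields),
    ):
        missing = names - blob_fields
        if missing:
            raise ValueError(
                "%s contains fields not listed in blob-field: %s"
                % (key, sorted(missing))
            )
```

For example, `{"blob-field": "img", "blob-view-field": "img,thumb"}` would be rejected because `thumb` is not declared in `blob-field`.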

Local validation I ran:

  • python3.8 -m pytest -q paimon-python/pypaimon/tests/blob_table_test.py paimon-python/pypaimon/tests/blob_test.py -> 87 passed, 1 skipped
  • targeted BlobView tests -> passed
  • flake8 on changed Python files -> passed
  • compileall on changed Python files -> passed
  • git diff --check -> passed

@leaves12138 leaves12138 marked this pull request as draft May 14, 2026 05:46