Refactor Arrow extraction to follow the RecordExtractor contract #18434

Merged
Jackie-Jiang merged 2 commits into apache:master from Jackie-Jiang:arrow_record_extractor
May 7, 2026

@Jackie-Jiang Jackie-Jiang commented May 7, 2026

Summary

Introduce ArrowRecordExtractor (extends BaseRecordExtractor) with schema-driven dispatch by ArrowTypeID; drop the bespoke ArrowToGenericRowConverter. The reader and decoder bind reader-scoped state via setReader(ArrowReader) (caches the dictionary map and pre-resolves the include list) and per-batch state via prepareBatch(Record) (decodes each dictionary-encoded column once into the Record for reuse across the batch's rows).
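
The per-batch hook can be pictured with a small, dependency-free sketch. It is not the PR's code: `decodeColumn`, `extractBatch`, and the plain arrays standing in for Arrow vectors and dictionaries are hypothetical names chosen for illustration. It only shows the idea that the dictionary lookup runs once for the whole column and every row then reads from the decoded result.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for per-batch dictionary decoding; not Pinot or Arrow API.
final class DictionaryBatchSketch {
  private DictionaryBatchSketch() {
  }

  // Decodes a whole dictionary-encoded column: each index is replaced by its dictionary value.
  static String[] decodeColumn(int[] indices, String[] dictionary) {
    String[] decoded = new String[indices.length];
    for (int i = 0; i < indices.length; i++) {
      decoded[i] = dictionary[indices[i]];
    }
    return decoded;
  }

  // Analogous to prepareBatch followed by per-row extraction: decode once, then read per row.
  static List<String> extractBatch(int[] indices, String[] dictionary) {
    String[] decoded = decodeColumn(indices, dictionary); // once per batch, not once per row
    List<String> rows = new ArrayList<>(indices.length);
    for (int row = 0; row < indices.length; row++) {
      rows.add(decoded[row]);
    }
    return rows;
  }
}
```

Decoding inside the row loop instead would repeat the full-column work for every row, which is the quadratic behavior the description attributes to the old converter.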

Add ArrowRecordExtractorConfig with extractRawTimeValues — matches the Avro / Parquet flag; Date / Time / Timestamp surface as raw int / long in the schema's unit instead of the contract Java type.
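
A rough sketch of what such a flag switches between, for a day-unit Date value only. The helper name `extractDateDay` is hypothetical, and returning `LocalDate` in the converted mode is an assumption made for illustration; the description does not name the contract type.

```java
import java.time.LocalDate;

// Hypothetical illustration of a raw-vs-converted switch for a day-unit Date value.
final class RawTimeValueSketch {
  private RawTimeValueSketch() {
  }

  // daysSinceEpoch models what a day-unit Date column stores.
  // Returning LocalDate in converted mode is an assumption, not taken from the PR.
  static Object extractDateDay(int daysSinceEpoch, boolean extractRawTimeValues) {
    if (extractRawTimeValues) {
      return daysSinceEpoch; // boxed to Integer: the raw value in the schema's unit
    }
    return LocalDate.ofEpochDay(daysSinceEpoch);
  }
}
```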

ArrowMessageDecoder.decode now branches on row count:

  • 0 → null
  • 1 → fields populated directly into the destination
  • >1 → wrapped under GenericRow.MULTIPLE_RECORDS_KEY
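
The three output shapes can be sketched without any Pinot dependency. In this sketch a `Map<String, Object>` stands in for `GenericRow`, the `decode` signature is simplified, and the constant's string value is a placeholder rather than the real value of `GenericRow.MULTIPLE_RECORDS_KEY`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model of the row-count branching; Map stands in for GenericRow.
final class DecodeShapeSketch {
  // Placeholder only; the actual constant lives on GenericRow.
  static final String MULTIPLE_RECORDS_KEY = "MULTIPLE_RECORDS_KEY_PLACEHOLDER";

  private DecodeShapeSketch() {
  }

  static Map<String, Object> decode(List<Map<String, Object>> batchRows, Map<String, Object> destination) {
    int rowCount = batchRows.size();
    if (rowCount == 0) {
      return null; // empty batch: nothing to emit
    }
    if (rowCount == 1) {
      destination.putAll(batchRows.get(0)); // single row: fields go straight into the destination
      return destination;
    }
    // Multiple rows: the destination carries the list of rows under one well-known key.
    destination.put(MULTIPLE_RECORDS_KEY, new ArrayList<>(batchRows));
    return destination;
  }
}
```

A caller therefore has to handle a `null` return as well as check for the multi-record key before treating the result as a single row.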

The decoder also validates that a plugin-supplied extractor class extends ArrowRecordExtractor (so the per-batch setReader / prepareBatch hooks are honored) and fails with a clear error if not.
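
One way such a check could look, as a minimal sketch. All class names below are hypothetical stand-ins (`ArrowExtractor` plays the role of `ArrowRecordExtractor`); only the "must be a subclass, otherwise fail with a clear error" behavior comes from the description.

```java
// Hypothetical class names; only the subclass check mirrors what the description states.
final class ExtractorValidationSketch {
  private ExtractorValidationSketch() {
  }

  static class BaseExtractor {
  }

  // Stand-in for ArrowRecordExtractor, which carries the setReader / prepareBatch hooks.
  static class ArrowExtractor extends BaseExtractor {
  }

  static class CustomArrowExtractor extends ArrowExtractor {
  }

  static class UnrelatedExtractor extends BaseExtractor {
  }

  static ArrowExtractor instantiate(Class<?> extractorClass) {
    // Reject early so the hooks are never silently skipped on an incompatible extractor.
    if (!ArrowExtractor.class.isAssignableFrom(extractorClass)) {
      throw new IllegalArgumentException(
          "Extractor class " + extractorClass.getName() + " must extend " + ArrowExtractor.class.getName());
    }
    try {
      return (ArrowExtractor) extractorClass.getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Failed to instantiate " + extractorClass.getName(), e);
    }
  }
}
```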

Bug fixes vs the prior converter

  • DateDayVector returns Integer (not LocalDateTime); the old code cast unconditionally to LocalDateTime and would throw at runtime for DateDay columns.
  • UInt2Vector returns Character (not a Number); the old code passed it through unchanged, violating the Int(16) → Integer contract.
  • UInt1Vector was sign-extended (200 → -56) instead of zero-extended.
  • All three are now schema-aware (dispatch on ArrowType.Int.getIsSigned() / ArrowType.Date.getUnit()).
  • Dictionary-encoded columns are now decoded once per batch (prior behavior decoded the entire vector per row inside extractValue, O(N²) in batch size).
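
The two unsigned-integer fixes come down to plain Java widening rules, illustrated below with hypothetical helper names (this is not the extractor's code). A `byte` widens with sign extension unless it is masked, and a `char`, which is what the description says UInt2Vector yields, widens to a non-negative `int`.

```java
// Hypothetical helpers showing the widening rules behind the UInt1 / UInt2 fixes.
final class UnsignedWideningSketch {
  private UnsignedWideningSketch() {
  }

  // What an unmasked widening does: the top bit is treated as a sign bit.
  static int signExtend(byte raw) {
    return raw;
  }

  // Zero-extension for an unsigned 8-bit value: keep only the low 8 bits.
  static int uint1ToInt(byte raw) {
    return raw & 0xFF;
  }

  // A char is already an unsigned 16-bit value, so widening preserves it.
  static int uint2ToInt(char raw) {
    return raw;
  }
}
```

With the byte pattern for 200, `signExtend` gives -56 while `uint1ToInt` gives 200, matching the example in the list above.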

Tests

  • New ArrowRecordExtractorTest covering every Arrow vector type, raw and contract modes, complex types (List, Struct, Map), dictionary encoding, and include-list filtering. Each test runs through a real ArrowStreamWriter → ArrowStreamReader IPC roundtrip so setReader / prepareBatch are exercised against an actual ArrowReader (no mocks).
  • ArrowMessageDecoderTest slimmed to decoder-specific concerns (lifecycle, error handling, empty / single / multi-row batch shapes).
  • ArrowRecordReaderTest keeps the inherited AbstractRecordReaderTest round-trip; redundant filter test removed.

Jackie-Jiang added the enhancement, ingestion, and refactor labels May 7, 2026
Jackie-Jiang force-pushed the arrow_record_extractor branch 2 times, most recently from 2c0ea0f to 796e0ed May 7, 2026 01:01

codecov-commenter commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.64%. Comparing base (5ccd383) to head (6e34047).

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18434      +/-   ##
============================================
+ Coverage     63.57%   63.64%   +0.06%     
  Complexity     1717     1717              
============================================
  Files          3252     3252              
  Lines        199132   199132              
  Branches      30875    30875              
============================================
+ Hits         126596   126729     +133     
+ Misses        62454    62312     -142     
- Partials      10082    10091       +9     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-21 63.64% <ø> (+0.06%) ⬆️
temurin 63.64% <ø> (+0.06%) ⬆️
unittests 63.63% <ø> (+0.06%) ⬆️
unittests1 55.68% <ø> (+0.02%) ⬆️
unittests2 34.95% <ø> (+0.04%) ⬆️

Copilot AI left a comment

Pull request overview

This PR refactors Arrow ingestion to align with Pinot’s RecordExtractor contract by introducing a schema-driven ArrowRecordExtractor (extending BaseRecordExtractor) and removing the bespoke ArrowToGenericRowConverter. It also updates Arrow decoding behavior to return null for empty batches, write directly into the destination for single-row batches, and wrap multi-row batches under GenericRow.MULTIPLE_RECORDS_KEY.

Changes:

  • Add ArrowRecordExtractor + ArrowRecordExtractorConfig (including extractRawTimeValues) and wire them into ArrowRecordReader and ArrowMessageDecoder.
  • Remove ArrowToGenericRowConverter and migrate tests to contract-based extraction coverage (ArrowRecordExtractorTest) while slimming decoder tests.
  • Update decoder output shape based on Arrow batch row count (0 → null, 1 → direct, >1 → MULTIPLE_RECORDS_KEY list).

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| pinot-plugins/pinot-input-format/pinot-arrow/src/main/java/org/apache/pinot/plugin/inputformat/arrow/ArrowMessageDecoder.java | Switch decoder to use ArrowRecordExtractor; implement 0/1/multi-row output shapes; add extractor/config plugin wiring. |
| pinot-plugins/pinot-input-format/pinot-arrow/src/main/java/org/apache/pinot/plugin/inputformat/arrow/ArrowRecordExtractor.java | New schema-driven extractor implementing Pinot’s extraction contract across Arrow scalar/temporal/complex types. |
| pinot-plugins/pinot-input-format/pinot-arrow/src/main/java/org/apache/pinot/plugin/inputformat/arrow/ArrowRecordExtractorConfig.java | New extractor config supporting extractRawTimeValues. |
| pinot-plugins/pinot-input-format/pinot-arrow/src/main/java/org/apache/pinot/plugin/inputformat/arrow/ArrowRecordReader.java | Replace per-row conversion with ArrowRecordExtractor and bind reader-scoped state via setReader. |
| pinot-plugins/pinot-input-format/pinot-arrow/src/main/java/org/apache/pinot/plugin/inputformat/arrow/ArrowRecordReaderConfig.java | Extend reader config to carry extractRawTimeValues through to the extractor. |
| pinot-plugins/pinot-input-format/pinot-arrow/src/main/java/org/apache/pinot/plugin/inputformat/arrow/ArrowToGenericRowConverter.java | Deleted in favor of ArrowRecordExtractor. |
| pinot-plugins/pinot-input-format/pinot-arrow/src/test/java/org/apache/pinot/plugin/inputformat/arrow/ArrowRecordExtractorTest.java | New comprehensive per-type contract tests (including raw-time mode, complex types, dictionaries, include-list). |
| pinot-plugins/pinot-input-format/pinot-arrow/src/test/java/org/apache/pinot/plugin/inputformat/arrow/ArrowMessageDecoderTest.java | Slim decoder tests to lifecycle/edge-cases + row-count branching behavior. |
| pinot-plugins/pinot-input-format/pinot-arrow/src/test/java/org/apache/pinot/plugin/inputformat/arrow/ArrowRecordReaderTest.java | Remove converter-specific assertions and rely on contract-shaped outputs (MV as Object[]). |
| pinot-plugins/pinot-input-format/pinot-arrow/src/test/java/org/apache/pinot/plugin/inputformat/arrow/ArrowTestDataUtils.java | New minimal test payload helper for decoder-focused tests. |
| pinot-plugins/pinot-input-format/pinot-arrow/src/test/java/org/apache/pinot/plugin/inputformat/arrow/util/ArrowTestDataUtil.java | Deleted legacy test utility. |

Jackie-Jiang added the bug label May 7, 2026
Jackie-Jiang force-pushed the arrow_record_extractor branch from 796e0ed to 7227fd7 May 7, 2026 07:17
Introduce ArrowRecordExtractor (extends BaseRecordExtractor) with schema-driven
dispatch by ArrowTypeID; drop the bespoke ArrowToGenericRowConverter. The reader
and decoder bind reader-scoped state once via setReader(ArrowReader), which
caches the dictionary map and pre-resolves the include list against the
VectorSchemaRoot.

Add ArrowRecordExtractorConfig with extractRawTimeValues — matches the
Avro / Parquet flag; Date / Time / Timestamp surface as raw int / long in the
schema's unit instead of the contract Java type.

ArrowMessageDecoder.decode now branches on row count: 0 → null,
1 → fields populated directly into destination, >1 → wrapped under
GenericRow.MULTIPLE_RECORDS_KEY.

Bug fixes vs the prior converter:
- DateDayVector returns Integer (not LocalDateTime); old cast threw at runtime.
- UInt2Vector returns Character (not a Number); old code passed it through
  unchanged, violating the Int(16) -> Integer contract.
- UInt1Vector was sign-extended (200 -> -56) instead of zero-extended.
  All three are now schema-aware.

Add ArrowRecordExtractorTest covering every Arrow vector type, raw and contract
modes, complex types, dictionary encoding, and include-list filtering. Slim
ArrowMessageDecoderTest to decoder-specific concerns; drop the redundant
type-coverage tests.

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Jackie-Jiang merged commit a8ac165 into apache:master May 7, 2026
11 checks passed
Jackie-Jiang deleted the arrow_record_extractor branch May 7, 2026 09:34

xiangfu0 commented May 7, 2026

Opened a follow-up docs PR for this change: pinot-contrib/pinot-docs#802
