Refactor Arrow extraction to follow the RecordExtractor contract by Jackie-Jiang · Pull Request #18434 · apache/pinot

Jackie-Jiang · 2026-05-07T00:45:58Z

Summary

Introduce ArrowRecordExtractor (extends BaseRecordExtractor) with schema-driven dispatch by ArrowTypeID; drop the bespoke ArrowToGenericRowConverter. The reader and decoder bind reader-scoped state via setReader(ArrowReader) (caches the dictionary map and pre-resolves the include list) and per-batch state via prepareBatch(Record) (decodes each dictionary-encoded column once into the Record for reuse across the batch's rows).
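The per-batch decoding described above can be illustrated with a minimal sketch. It does not use the Arrow or Pinot APIs; `DictionaryBatchSketch`, its fields, and `extractDecodingPerRow` are hypothetical stand-ins that only mirror the idea of decoding a dictionary-encoded column once in `prepareBatch` and reusing the result for every row.

```java
// Hypothetical stand-in for per-batch extractor state; not the Pinot or Arrow API.
final class DictionaryBatchSketch {
  private final String[] dictionary;
  private String[] decoded; // filled once per batch by prepareBatch
  int decodeCalls = 0;      // counts whole-column decodes, for illustration only

  DictionaryBatchSketch(String[] dictionary) {
    this.dictionary = dictionary;
  }

  // Decode the whole index column once per batch.
  void prepareBatch(int[] indices) {
    decoded = decodeColumn(indices);
  }

  // Per-row extraction is then a plain array lookup.
  String extract(int row) {
    return decoded[row];
  }

  // The pattern the PR describes as the prior behavior: a full decode for every row.
  String extractDecodingPerRow(int[] indices, int row) {
    return decodeColumn(indices)[row];
  }

  private String[] decodeColumn(int[] indices) {
    decodeCalls++;
    String[] out = new String[indices.length];
    for (int i = 0; i < indices.length; i++) {
      out[i] = dictionary[indices[i]];
    }
    return out;
  }
}
```

With N rows, the per-batch path performs one column decode, while the per-row path performs N of them, each touching all N entries.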

Add ArrowRecordExtractorConfig with an extractRawTimeValues flag that matches the Avro / Parquet flag; when it is set, Date / Time / Timestamp values surface as raw int / long in the schema's unit instead of the contract Java type.
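As a rough illustration of the two modes for a day-resolution Date column, the sketch below returns either the raw day count or a Java date object. The class and method names are hypothetical, and the use of `LocalDate` as the contract type is an assumption made for the example; the PR text does not name the contract type.

```java
import java.time.LocalDate;

// Hypothetical sketch; not the ArrowRecordExtractor implementation.
final class RawTimeValueSketch {
  // days: days since the Unix epoch, which is how Arrow stores Date(DAY) values.
  static Object extractDateDay(int days, boolean extractRawTimeValues) {
    if (extractRawTimeValues) {
      return days; // raw mode: boxed Integer in the schema's unit
    }
    // Contract mode: LocalDate is an assumed target type for this example.
    return LocalDate.ofEpochDay(days);
  }
}
```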

ArrowMessageDecoder.decode now branches on row count:

0 → null
1 → fields populated directly into the destination
>1 → wrapped under GenericRow.MULTIPLE_RECORDS_KEY
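The row-count branching above could look roughly like the following sketch. It models rows as plain maps instead of GenericRow so that it stays self-contained; `RowCountBranchingSketch`, its `decode` signature, and the literal value of the key constant are assumptions for illustration, not the decoder's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the 0 / 1 / >1 output shapes; rows are plain maps here.
final class RowCountBranchingSketch {
  // Stand-in for GenericRow.MULTIPLE_RECORDS_KEY; the literal is illustrative.
  static final String MULTIPLE_RECORDS_KEY = "$MULTIPLE_RECORDS_KEY$";

  static Map<String, Object> decode(List<Map<String, Object>> batchRows, Map<String, Object> destination) {
    int rowCount = batchRows.size();
    if (rowCount == 0) {
      return null; // empty batch: nothing to emit
    }
    if (rowCount == 1) {
      destination.putAll(batchRows.get(0)); // single row: populate the destination directly
      return destination;
    }
    // Multiple rows: wrap them under the multi-record key.
    destination.put(MULTIPLE_RECORDS_KEY, new ArrayList<>(batchRows));
    return destination;
  }
}
```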

The decoder also validates that a plugin-supplied extractor class extends ArrowRecordExtractor (so the per-batch setReader / prepareBatch hooks are honored) and fails with a clear error if not.
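One way such a check could be written is with `Class.isAssignableFrom` before instantiating the configured class. The sketch below uses hypothetical nested stand-in classes instead of the real Pinot types, and the error message wording is invented for the example.

```java
// Hypothetical sketch of validating a plugin-supplied extractor class.
final class ExtractorClassCheckSketch {
  // Stand-ins for the real class hierarchy.
  static class BaseExtractor {}
  static class ArrowExtractor extends BaseExtractor {}
  static class CustomArrowExtractor extends ArrowExtractor {}
  static class UnrelatedExtractor extends BaseExtractor {}

  static ArrowExtractor create(Class<?> extractorClass) {
    // Reject anything that does not extend the Arrow extractor, so the
    // reader-scoped and per-batch hooks are guaranteed to exist.
    if (!ArrowExtractor.class.isAssignableFrom(extractorClass)) {
      throw new IllegalArgumentException(
          "Extractor class " + extractorClass.getName() + " must extend " + ArrowExtractor.class.getName());
    }
    try {
      return (ArrowExtractor) extractorClass.getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Failed to instantiate " + extractorClass.getName(), e);
    }
  }
}
```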

Bug fixes vs the prior converter

DateDayVector returns Integer (not LocalDateTime); the old code cast unconditionally to LocalDateTime and would throw at runtime for DateDay columns.
UInt2Vector returns Character (not a Number); the old code passed it through unchanged, violating the Int(16) → Integer contract.
UInt1Vector was sign-extended (200 → -56) instead of zero-extended.
All three are now schema-aware (dispatch on ArrowType.Int.getIsSigned() / ArrowType.Date.getUnit()).
Dictionary-encoded columns are now decoded once per batch (prior behavior decoded the entire vector per row inside extractValue, O(N²) in batch size).
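The sign-extension and Character issues listed above come down to standard Java widening rules, which the following sketch reproduces without Arrow. The class and method names are hypothetical; only `Byte.toUnsignedInt` and the `char` to `int` widening are standard Java behavior.

```java
// Hypothetical sketch of unsigned widening; not the ArrowRecordExtractor code.
final class UnsignedWideningSketch {
  // Implicit byte-to-int widening sign-extends, so the unsigned value 200 becomes -56.
  static int signExtended(byte raw) {
    return raw;
  }

  // Zero-extension keeps the unsigned 8-bit value intact.
  static int fromUInt1(byte raw) {
    return Byte.toUnsignedInt(raw);
  }

  // An unsigned 16-bit value held as a Character widens to an int without sign issues.
  static Integer fromUInt2(Character raw) {
    return (int) raw.charValue();
  }
}
```

For example, `signExtended((byte) 200)` yields -56, while `fromUInt1((byte) 200)` yields 200.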

Tests

New ArrowRecordExtractorTest covering every Arrow vector type, raw and contract modes, complex types (List, Struct, Map), dictionary encoding, and include-list filtering. Each test runs through a real ArrowStreamWriter → ArrowStreamReader IPC roundtrip so setReader / prepareBatch are exercised against an actual ArrowReader (no mocks).
ArrowMessageDecoderTest slimmed to decoder-specific concerns (lifecycle, error handling, empty / single / multi-row batch shapes).
ArrowRecordReaderTest keeps the inherited AbstractRecordReaderTest round-trip; redundant filter test removed.

codecov-commenter · 2026-05-07T01:50:55Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.64%. Comparing base (5ccd383) to head (6e34047).

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18434      +/-   ##
============================================
+ Coverage     63.57%   63.64%   +0.06%     
  Complexity     1717     1717              
============================================
  Files          3252     3252              
  Lines        199132   199132              
  Branches      30875    30875              
============================================
+ Hits         126596   126729     +133     
+ Misses        62454    62312     -142     
- Partials      10082    10091       +9

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (ø)`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-21	`63.64% <ø> (+0.06%)`	⬆️
temurin	`63.64% <ø> (+0.06%)`	⬆️
unittests	`63.63% <ø> (+0.06%)`	⬆️
unittests1	`55.68% <ø> (+0.02%)`	⬆️
unittests2	`34.95% <ø> (+0.04%)`	⬆️


Copilot

Pull request overview

This PR refactors Arrow ingestion to align with Pinot’s RecordExtractor contract by introducing a schema-driven ArrowRecordExtractor (extending BaseRecordExtractor) and removing the bespoke ArrowToGenericRowConverter. It also updates Arrow decoding behavior to return null for empty batches, write directly into the destination for single-row batches, and wrap multi-row batches under GenericRow.MULTIPLE_RECORDS_KEY.

Changes:

Add ArrowRecordExtractor + ArrowRecordExtractorConfig (including extractRawTimeValues) and wire them into ArrowRecordReader and ArrowMessageDecoder.
Remove ArrowToGenericRowConverter and migrate tests to contract-based extraction coverage (ArrowRecordExtractorTest) while slimming decoder tests.
Update decoder output shape based on Arrow batch row count (0 → null, 1 → direct, >1 → MULTIPLE_RECORDS_KEY list).

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.


File	Description
pinot-plugins/pinot-input-format/pinot-arrow/src/main/java/org/apache/pinot/plugin/inputformat/arrow/ArrowMessageDecoder.java	Switch decoder to use `ArrowRecordExtractor`; implement 0/1/multi-row output shapes; add extractor/config plugin wiring.
pinot-plugins/pinot-input-format/pinot-arrow/src/main/java/org/apache/pinot/plugin/inputformat/arrow/ArrowRecordExtractor.java	New schema-driven extractor implementing Pinot’s extraction contract across Arrow scalar/temporal/complex types.
pinot-plugins/pinot-input-format/pinot-arrow/src/main/java/org/apache/pinot/plugin/inputformat/arrow/ArrowRecordExtractorConfig.java	New extractor config supporting `extractRawTimeValues`.
pinot-plugins/pinot-input-format/pinot-arrow/src/main/java/org/apache/pinot/plugin/inputformat/arrow/ArrowRecordReader.java	Replace per-row conversion with `ArrowRecordExtractor` and bind reader-scoped state via `setReader`.
pinot-plugins/pinot-input-format/pinot-arrow/src/main/java/org/apache/pinot/plugin/inputformat/arrow/ArrowRecordReaderConfig.java	Extend reader config to carry `extractRawTimeValues` through to the extractor.
pinot-plugins/pinot-input-format/pinot-arrow/src/main/java/org/apache/pinot/plugin/inputformat/arrow/ArrowToGenericRowConverter.java	Deleted in favor of `ArrowRecordExtractor`.
pinot-plugins/pinot-input-format/pinot-arrow/src/test/java/org/apache/pinot/plugin/inputformat/arrow/ArrowRecordExtractorTest.java	New comprehensive per-type contract tests (including raw-time mode, complex types, dictionaries, include-list).
pinot-plugins/pinot-input-format/pinot-arrow/src/test/java/org/apache/pinot/plugin/inputformat/arrow/ArrowMessageDecoderTest.java	Slim decoder tests to lifecycle/edge-cases + row-count branching behavior.
pinot-plugins/pinot-input-format/pinot-arrow/src/test/java/org/apache/pinot/plugin/inputformat/arrow/ArrowRecordReaderTest.java	Remove converter-specific assertions and rely on contract-shaped outputs (MV as `Object[]`).
pinot-plugins/pinot-input-format/pinot-arrow/src/test/java/org/apache/pinot/plugin/inputformat/arrow/ArrowTestDataUtils.java	New minimal test payload helper for decoder-focused tests.
pinot-plugins/pinot-input-format/pinot-arrow/src/test/java/org/apache/pinot/plugin/inputformat/arrow/util/ArrowTestDataUtil.java	Deleted legacy test utility.

Introduce ArrowRecordExtractor (extends BaseRecordExtractor) with schema-driven dispatch by ArrowTypeID; drop the bespoke ArrowToGenericRowConverter. The reader and decoder bind reader-scoped state once via setReader(ArrowReader), which caches the dictionary map and pre-resolves the include list against the VectorSchemaRoot.

Add ArrowRecordExtractorConfig with extractRawTimeValues, which matches the Avro / Parquet flag; Date / Time / Timestamp surface as raw int / long in the schema's unit instead of the contract Java type.

ArrowMessageDecoder.decode now branches on row count: 0 → null, 1 → fields populated directly into destination, >1 → wrapped under GenericRow.MULTIPLE_RECORDS_KEY.

Bug fixes vs the prior converter:

- DateDayVector returns Integer (not LocalDateTime); old cast threw at runtime.
- UInt2Vector returns Character (not a Number); old code passed it through unchanged, violating the Int(16) -> Integer contract.
- UInt1Vector was sign-extended (200 -> -56) instead of zero-extended.

All three are now schema-aware.

Add ArrowRecordExtractorTest covering every Arrow vector type, raw and contract modes, complex types, dictionary encoding, and include-list filtering. Slim ArrowMessageDecoderTest to decoder-specific concerns; drop the redundant type-coverage tests.

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

xiangfu0 · 2026-05-07T16:11:15Z

Opened a follow-up docs PR for this change: pinot-contrib/pinot-docs#802

Jackie-Jiang added the enhancement, ingestion, and refactor labels May 7, 2026

Jackie-Jiang force-pushed the arrow_record_extractor branch 2 times, most recently from 2c0ea0f to 796e0ed May 7, 2026 01:01

Jackie-Jiang requested review from Copilot and xiangfu0 May 7, 2026 02:14

Copilot started reviewing on behalf of Jackie-Jiang May 7, 2026 02:15

Copilot AI reviewed May 7, 2026


xiangfu0 approved these changes May 7, 2026


Jackie-Jiang added the bug label May 7, 2026

Jackie-Jiang force-pushed the arrow_record_extractor branch from 796e0ed to 7227fd7 May 7, 2026 07:17

Jackie-Jiang force-pushed the arrow_record_extractor branch from 7227fd7 to 5a97afc May 7, 2026 07:19

Jackie-Jiang requested a review from Copilot May 7, 2026 07:22

Copilot started reviewing on behalf of Jackie-Jiang May 7, 2026 07:26

Copilot AI reviewed May 7, 2026


Comment thread ...not-arrow/src/test/java/org/apache/pinot/plugin/inputformat/arrow/ArrowRecordReaderTest.java Outdated

Minor fix

6e34047

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Jackie-Jiang merged commit a8ac165 into apache:master May 7, 2026
11 checks passed

Jackie-Jiang deleted the arrow_record_extractor branch May 7, 2026 09:34

Jackie-Jiang merged 2 commits into apache:master from Jackie-Jiang:arrow_record_extractor

4 participants
