Skip to content

Use Boolean and List for ORCRecordReader boolean and list columns#18391

Open
rsrkpatwari1234 wants to merge 8 commits intoapache:masterfrom
rsrkpatwari1234:rsrkpatwari1234-18222
Open

Use Boolean and List for ORCRecordReader boolean and list columns#18391
rsrkpatwari1234 wants to merge 8 commits intoapache:masterfrom
rsrkpatwari1234:rsrkpatwari1234-18222

Conversation

@rsrkpatwari1234
Copy link
Copy Markdown
Contributor

@rsrkpatwari1234 rsrkpatwari1234 commented May 1, 2026

Fixed #18222

Summary

ORC rows are built by ORCRecordExtractor (the reader only reads the file and calls the extractor). This change makes common types match what users expect:

  • Boolean columns are exposed as Boolean, not as a string.
  • List / array columns are exposed as List, not as Object[]. Empty lists stay as empty lists; null entries inside a list are still null.

Segment building and tests already assumed multi-value fields as Object[] in many places. Those paths now accept List as well (via RecordReaderUtils.toObjectArray and small updates in stats collection and special-value handling), so ORC ingestion keeps working.

Test Plan

Unit tests are updated so the exact Java types for boolean and list extraction are checked, including empty lists and null list elements.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 1, 2026

Codecov Report

❌ Patch coverage is 56.00000% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.50%. Comparing base (a4bef7b) to head (bc30d8c).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
...cal/recordtransformer/SpecialValueTransformer.java 0.00% 3 Missing and 1 partial ⚠️
...ache/pinot/spi/data/readers/RecordReaderUtils.java 33.33% 2 Missing and 2 partials ⚠️
.../impl/stats/SegmentPreIndexStatsCollectorImpl.java 66.66% 1 Missing and 1 partial ⚠️
...not/plugin/inputformat/orc/ORCRecordExtractor.java 66.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18391      +/-   ##
============================================
+ Coverage     63.48%   63.50%   +0.02%     
  Complexity     1701     1701              
============================================
  Files          3254     3254              
  Lines        199114   199119       +5     
  Branches      30833    30834       +1     
============================================
+ Hits         126399   126443      +44     
+ Misses        62643    62587      -56     
- Partials      10072    10089      +17     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-21 63.50% <56.00%> (+0.02%) ⬆️
temurin 63.50% <56.00%> (+0.02%) ⬆️
unittests 63.49% <56.00%> (+0.02%) ⬆️
unittests1 55.43% <27.27%> (-0.01%) ⬇️
unittests2 35.01% <56.00%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

values[j] = extractValue(field, listColumnVector.child, childType, offset + j);
values.add(extractValue(field, listColumnVector.child, childType, offset + j));
}
return values;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning List here changes the RecordExtractor/GenericRow multi-value contract from Object[] to List. This PR only patches a few ingestion call sites, but many existing consumers still cast MV values to Object[] or test for instanceof Object[], so ORC rows can still break outside the narrow segment-build path. Please keep the extractor output as Object[] and add any user-facing List adaptation at a higher layer instead of changing the reader contract.

@Jackie-Jiang
Copy link
Copy Markdown
Contributor

Jackie-Jiang commented May 5, 2026

Thanks for the contribution. I'm doing a universal standardization of all record extractors (#18378 and #18400), which should cover this topic.
For MV, no need to convert to list. The contract is defined in RecordExtractor interface

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Normalize ORCRecordReader BOOLEAN and LIST output types

4 participants