
Auto-create dictionary when secondary indexes need it (inverted, FST, range) #17269

Merged
xiangfu0 merged 40 commits into apache:master from xiangfu0:dictionary-auto-creation-when-corresponding-indexes-enabled
May 7, 2026


@xiangfu0 xiangfu0 commented Nov 25, 2025

Summary

When a column is configured with encodingType: RAW (or appears in noDictionaryColumns) but a secondary index that needs a dictionary is also configured (inverted, fst, ifst, dict-id range), Pinot now auto-creates a standalone dictionary for that column so the secondary index can function. The forward index stays RAW; the dictionary lives alongside it.

Previously, such configs either silently produced wrong results, fell back to a deprecated raw-value bitmap inverted index, or required the user to manually add a "dictionary": {} block to the FieldConfig.indexes map. This PR makes the common case work without explicit configuration and removes the legacy raw-value bitmap inverted index path entirely.

Builds on the SPI surfaces introduced by #18364 (encoding) and #18365 (requiresDictionary / shouldInvalidateOnDictionaryChange).

What's covered

Auto-create dictionary

  • Config deserialization (DictionaryIndexType.fromFieldConfigs): when a FieldConfig declares encodingType=RAW AND any enabled index returns IndexType.requiresDictionary()=true (FST, IFST, INVERTED, etc.), the dictionary config is left at its default-enabled state instead of being marked DISABLED. Validation and the rest of the pipeline see a dict-enabled column from the start.
  • Segment creation (BaseSegmentCreator): DictionaryIndexConfig.requiresDictionary() drives auto-dict creation when a dict-requiring index is configured on a RAW column. If the user explicitly disabled the dictionary, a warning is logged and the dictionary is still created (the alternative would be a hard segment-build failure or silent index loss).
  • Realtime mutable segments (MutableSegmentImpl): respects the encoding plumbing.
  • Reload (ForwardIndexHandler.ENABLE_DICTIONARY op): if an existing RAW segment is loaded under a config that now requests a dict-requiring index, ForwardIndexHandler materializes a standalone dictionary; subsequent handlers (InvertedIndexHandler, RangeIndexHandler) build their dict-id-based indexes from it.
  • Default columns (BaseDefaultColumnHandler): same logic for newly-added columns at reload.
  • Reconciliation (FieldIndexConfigsUtil): fail-fast if the user-supplied ForwardIndexConfig.encodingType disagrees with the column-level FieldConfig.encodingType / noDictionaryColumns.
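
The config-resolution rule from the list above can be sketched roughly as follows. This is a simplified model, not Pinot's code: `DictionaryAutoCreateSketch`, `IndexKind`, and `dictionaryEnabled` are hypothetical names, and the `BLOOM_FILTER` and `JSON` entries are included only as illustrative examples of indexes assumed not to need a dictionary. The only idea taken from the PR is that each index type reports `requiresDictionary()` and that a RAW column keeps its dictionary enabled when any enabled index reports true.

```java
import java.util.EnumSet;
import java.util.Set;

final class DictionaryAutoCreateSketch {
  enum Encoding { DICTIONARY, RAW }

  // Hypothetical stand-in for index types; the flag mirrors the requiresDictionary() idea.
  enum IndexKind {
    INVERTED(true), FST(true), IFST(true), BLOOM_FILTER(false), JSON(false);

    private final boolean requiresDictionary;

    IndexKind(boolean requiresDictionary) {
      this.requiresDictionary = requiresDictionary;
    }

    boolean requiresDictionary() {
      return requiresDictionary;
    }
  }

  // A dictionary is wanted when the forward index is dict-encoded,
  // or when any enabled secondary index needs one.
  static boolean dictionaryEnabled(Encoding encoding, Set<IndexKind> enabledIndexes) {
    if (encoding == Encoding.DICTIONARY) {
      return true;
    }
    for (IndexKind kind : enabledIndexes) {
      if (kind.requiresDictionary()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(dictionaryEnabled(Encoding.RAW, EnumSet.of(IndexKind.INVERTED)));
    System.out.println(dictionaryEnabled(Encoding.RAW, EnumSet.of(IndexKind.BLOOM_FILTER)));
  }
}
```

In this model the forward encoding is never changed by the decision; only the presence of the standalone dictionary is.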

ForwardIndexHandler.computeOperations simplification

The reload-time decision logic in ForwardIndexHandler.computeOperations was rewritten as four orthogonal questions per column, with the dictionary state expressed as a single rule that drives every dict-toggle decision:

desiredDict = newIsDict || any-enabled-index-requires-dict

State is tracked as FieldConfig.EncodingType (DICTIONARY, RAW, or null for "forward index disabled / not on disk"), making the state space explicit. The transition steps run in this order:

  1. Forward-index transition — based on existingFwdEncoding vs newFwdEncoding. Three sub-cases: forward disabled, forward re-enabled (rebuild from dict + inverted), or encoding flips with forward staying on.
  2. Dictionary transition — based on existingHasDict vs desiredDict. Auto-enables the dict whenever a secondary index requires it; refuses to remove the dict if any enabled index in the new config still needs it.
  3. Compression-type change — only when no encoding change happened.
  4. Cross-cutting guards — sorted-column toggles ignored, range-index format compatibility check, enable-forward needs dict + inverted on disk, enable-dict needs forward to be on.

The ENABLE_FORWARD_INDEX operation is split into ENABLE_DICT_FORWARD_INDEX and ENABLE_RAW_FORWARD_INDEX so the intent is explicit. Both variants cover two on-disk shapes:

  • Forward absent → rebuild from dict + inverted as the requested encoding.
  • Forward present but encoding flips DICT⇄RAW → rewrite in place. The new helper convertDictForwardToRawKeepingDictionary handles the DICT→RAW + dict-kept case (e.g. when an inverted index is added to a dict-encoded column without explicitly enabling dict).
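
A minimal sketch of the dictionary rule and the encoding-flip decision, under stated assumptions: it covers only columns whose forward index stays enabled on both sides, and it leaves out compression-type changes and the cross-cutting guards. `ReloadPlanSketch` and `plan` are hypothetical names; the operation names and the `desiredDict` rule are taken from the description above.

```java
import java.util.Optional;

final class ReloadPlanSketch {
  enum Encoding { DICTIONARY, RAW }

  enum Op { ENABLE_DICTIONARY, DISABLE_DICTIONARY, ENABLE_DICT_FORWARD_INDEX, ENABLE_RAW_FORWARD_INDEX }

  // desiredDict = newIsDict || any-enabled-index-requires-dict
  static boolean desiredDict(Encoding newFwdEncoding, boolean anyIndexRequiresDict) {
    return newFwdEncoding == Encoding.DICTIONARY || anyIndexRequiresDict;
  }

  // Simplified: assumes the forward index exists before and after the reload.
  static Optional<Op> plan(Encoding existingFwdEncoding, boolean existingHasDict,
      Encoding newFwdEncoding, boolean anyIndexRequiresDict) {
    boolean desired = desiredDict(newFwdEncoding, anyIndexRequiresDict);
    if (existingHasDict != desired) {
      // The dictionary transition also covers any accompanying encoding change.
      return Optional.of(desired ? Op.ENABLE_DICTIONARY : Op.DISABLE_DICTIONARY);
    }
    if (existingFwdEncoding != newFwdEncoding) {
      // Dictionary state is unchanged, so the encoding flip needs its own operation.
      return Optional.of(newFwdEncoding == Encoding.DICTIONARY
          ? Op.ENABLE_DICT_FORWARD_INDEX : Op.ENABLE_RAW_FORWARD_INDEX);
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    // Dict-encoded column with a dictionary, reloaded as RAW while an inverted index is enabled.
    System.out.println(plan(Encoding.DICTIONARY, true, Encoding.RAW, true));
    // RAW column without a dictionary gains an inverted index.
    System.out.println(plan(Encoding.RAW, false, Encoding.RAW, true));
  }
}
```

The first call models the DICT→RAW flip with the dictionary kept, and the second models auto-creating the dictionary for a RAW column.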

Range index correctness gates (RAW + dict combination)

Shared-dict + RAW exposes a subtle correctness hazard around range indexes: when a dictionary exists at index build time, both RangeIndexCreator (v1) and BitSlicedRangeIndexCreator (v2) build over dict IDs. v1 is non-exact — its query-time partial matches fall back to ScanBasedFilterOperator, which on a RAW forward column would have to apply a dict-based evaluator against raw values, silently producing wrong results.

  • RangeIndexType.validate: rejects RAW forward + dictionary + RangeIndexCreator.VERSION (v1). Forces the BitSliced v2 (exact) range index for shared-dict + RAW columns so the failure mode is config validation rather than wrong query answers.
  • RangeIndexHandler.needUpdateIndices / updateIndices: read the on-disk range index version (first int of the buffer) and rebuild when it differs from the configured version. v1 and v2 have incompatible on-disk layouts and different exact/non-exact semantics, so a version change requires a full rebuild.
  • RangeIndexBasedFilterOperator.canEvaluate: returns false when dataSource.getDictionary() != null and the predicate evaluator isn't dict-based — falls through to ScanBasedFilterOperator, which correctly applies raw values against the raw forward index. Defends against any other path that produces a raw-value evaluator on a column whose range index was built over dict IDs.
  • PredicateEvaluatorProvider.getDictionaryUsableForFiltering: per-predicate-type decision. For RANGE on a RAW forward column, the dictionary is kept only when the range index is exact (isExact() == true) so non-exact range readers can't be paired with a dict-based evaluator that scan-fallback can't apply.
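
The first two gates above could look roughly like this. It is a sketch under assumptions, not the PR's implementation: `RangeIndexGateSketch` and its methods are hypothetical, the version constants `1` and `2` simply mirror the "v1"/"v2" naming used here, and the buffer is assumed to be big-endian (the `ByteBuffer` default) with the version stored as its first int.

```java
import java.nio.ByteBuffer;

final class RangeIndexGateSketch {
  static final int V1_NON_EXACT = 1;  // assumed value, mirrors "v1" in the text
  static final int V2_BIT_SLICED = 2; // assumed value, mirrors "v2" in the text

  // Validation gate: RAW forward + dictionary + non-exact v1 range index is rejected.
  static void validate(boolean rawForward, boolean hasDictionary, int configuredVersion) {
    if (rawForward && hasDictionary && configuredVersion == V1_NON_EXACT) {
      throw new IllegalStateException(
          "RAW forward index with a dictionary requires the exact (v2) range index");
    }
  }

  // Reload gate: rebuild when the on-disk version (first int of the buffer)
  // differs from the configured version.
  static boolean needsRebuild(ByteBuffer rangeIndexBuffer, int configuredVersion) {
    return rangeIndexBuffer.getInt(0) != configuredVersion;
  }

  public static void main(String[] args) {
    ByteBuffer onDisk = ByteBuffer.allocate(8).putInt(0, V1_NON_EXACT);
    System.out.println(needsRebuild(onDisk, V2_BIT_SLICED));
    validate(true, true, V2_BIT_SLICED); // accepted: exact range index
  }
}
```

Failing at validation time keeps the problem visible in the table config rather than in query results.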

Remove legacy raw-value bitmap inverted index

The pre-shared-dict format embedded its own dictionary inline inside the <col>.bitmap.inv.idx file (written by the now-deleted RawValueBitmapInvertedIndexCreator). Reads went through RawValueBitmapInvertedIndexReader and RawValueInvertedIndexFilterOperator. All three are deleted in favor of the standard BitmapInvertedIndexReader over a real standalone dictionary.

  • SegmentPreProcessor.removeLegacyRawValueInvertedIndexes pre-pass detects the legacy 44-byte big-endian header (version=1 + cardinality + offsets) on segment load, deletes the file, and lets the new ForwardIndexHandler + InvertedIndexHandler chain rebuild the index in dict-id format.
  • InvertedIndexType.ReaderFactory drops the hasDictionary branch; asserts a dictionary exists.
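
As a rough illustration of the pre-pass detection, the sketch below checks only what the description states: a header of at least 44 bytes, read big-endian, whose version field equals 1. It assumes the version is the first int of the file; the positions of the cardinality and offset fields are not given here, so they are not checked, and a real check would presumably validate more. `LegacyHeaderSketch` and `looksLikeLegacyFormat` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class LegacyHeaderSketch {
  static final int LEGACY_HEADER_BYTES = 44;
  static final int LEGACY_VERSION = 1;

  // Assumption: the version is the first int of the header.
  static boolean looksLikeLegacyFormat(ByteBuffer fileContents) {
    if (fileContents.remaining() < LEGACY_HEADER_BYTES) {
      return false;
    }
    // Work on a duplicate so the caller's position and byte order are untouched.
    ByteBuffer view = fileContents.duplicate().order(ByteOrder.BIG_ENDIAN);
    return view.getInt(view.position()) == LEGACY_VERSION;
  }

  public static void main(String[] args) {
    ByteBuffer legacy = ByteBuffer.allocate(LEGACY_HEADER_BYTES).putInt(0, LEGACY_VERSION);
    System.out.println(looksLikeLegacyFormat(legacy));
    System.out.println(looksLikeLegacyFormat(ByteBuffer.allocate(10)));
  }
}
```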

Dict-id-based rebuild path

  • DictionaryBasedIndexBuilder (new, ~200 lines): shared helper extracted from InvertedIndexHandler and RangeIndexHandler — reads raw forward values, looks each up in the dictionary, feeds (value, dictId) pairs into a DictionaryBasedInvertedIndexCreator. Single per-data-type dispatch handles SV, MV, INT/LONG/FLOAT/DOUBLE/BIG_DECIMAL/STRING/BYTES.
  • InvertedIndexAndDictionaryBasedForwardIndexCreator: split _dictionaryEnabled into two flags — _dictionaryPresent (a standalone dict file exists) and _dictionaryBasedForwardIndex (the forward index stores dict IDs). The two are now independent (RAW forward + standalone dict is the new third state).
  • BaseIndexHandler: wires the two-flag model.
  • Handler ordering contract: InvertedIndexHandler's class-level Javadoc documents that it requires a dictionary by the time updateIndices runs, and that SegmentPreProcessor enforces this by always running ForwardIndexHandler first.
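
The core loop of the rebuild path (read each raw forward value, look up its dict id, record the document under that id) can be sketched for a single-value INT column as below. This is an illustration only: `DictIdRebuildSketch` is a hypothetical class, the dictionary is modeled as a sorted `int[]` so that the array position serves as the dict id, and posting lists are plain `List<Integer>` rather than bitmaps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DictIdRebuildSketch {
  // Returns one posting list per dict id; postings.get(dictId) holds the matching doc ids.
  static List<List<Integer>> buildInverted(int[] sortedDictionary, int[] rawForwardValues) {
    List<List<Integer>> postings = new ArrayList<>();
    for (int i = 0; i < sortedDictionary.length; i++) {
      postings.add(new ArrayList<>());
    }
    for (int docId = 0; docId < rawForwardValues.length; docId++) {
      int value = rawForwardValues[docId];
      // Dictionary lookup: position in the sorted array is the dict id.
      int dictId = Arrays.binarySearch(sortedDictionary, value);
      if (dictId < 0) {
        throw new IllegalStateException("Value missing from dictionary: " + value);
      }
      postings.get(dictId).add(docId);
    }
    return postings;
  }

  public static void main(String[] args) {
    int[] dictionary = {10, 20, 30};
    int[] forward = {20, 10, 20, 30};
    System.out.println(buildInverted(dictionary, forward)); // prints [[1], [0, 2], [3]]
  }
}
```

Because the dictionary must exist before this loop runs, the sketch also shows why the forward-index handler has to run before the inverted-index handler.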

Query path: predicate evaluator selection on shared-dict columns

With shared-dict columns, predicates on the same column may need different evaluators: an EQ can use the dict-id-based inverted index, while a LIKE '%x%' on the same column must scan the raw forward.

  • PredicateEvaluatorProvider.getDictionaryUsableForFiltering: per-predicate-type decision returning the dictionary only when a dict-consuming filter operator (sorted, inverted, exact range) is actually available and enabled for that specific predicate type. Inverted-only is dropped for RANGE; non-exact range is dropped for RANGE/EQ. RAW forward + scan reads raw values directly.
  • FilterMvTransformFunction: per-value evaluation via transformToDictIdsMV requires forward.getDictIdMV() to actually serve dict ids cheaply. RAW forward indexes throw UnsupportedOperationException from that method, so when the inner Identifier wraps a RAW column (with or without a shared dictionary), the dictionary is dropped here and filterMv falls back to per-value raw matching.
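
A hedged sketch of the per-predicate decision follows. It models only the inverted and range cases named above and omits sorted indexes entirely; whether an exact range index can also serve EQ is not stated here, so the sketch assumes it does not. `DictionaryForFilteringSketch` and `keepDictionary` are hypothetical names.

```java
final class DictionaryForFilteringSketch {
  enum PredicateType { EQ, NOT_EQ, IN, NOT_IN, RANGE, REGEXP_LIKE }

  // Returns true when the dictionary should be handed to the predicate evaluator.
  static boolean keepDictionary(PredicateType type, boolean forwardDictEncoded,
      boolean hasInvertedIndex, boolean hasExactRangeIndex) {
    if (forwardDictEncoded) {
      return true; // a dict-encoded forward index can always serve dict ids
    }
    switch (type) {
      case EQ:
      case NOT_EQ:
      case IN:
      case NOT_IN:
        return hasInvertedIndex;   // a non-exact range index alone does not count
      case RANGE:
        return hasExactRangeIndex; // an inverted index alone does not count
      default:
        return false;              // e.g. REGEXP_LIKE scans the raw forward values
    }
  }

  public static void main(String[] args) {
    System.out.println(keepDictionary(PredicateType.EQ, false, true, false));
    System.out.println(keepDictionary(PredicateType.RANGE, false, true, false));
  }
}
```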

DataFetcher

DataFetcher.addDataSource drops the (shared) dictionary when the forward index is RAW — the existing per-method branches in ColumnValueReader then take the raw-value paths uniformly without touching the read methods. Callers that genuinely need dict ids on a RAW + shared-dict column read raw values and consult the dictionary directly. DefaultGroupByExecutor similarly gates on forwardIndex.isDictionaryEncoded() so shared-dict + RAW columns route to the no-dict GROUP BY path.
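
In outline, the gate amounts to handing the value reader a null dictionary whenever the forward index is RAW. The interfaces below are hypothetical stand-ins for illustration, not Pinot's `ForwardIndexReader` or `Dictionary` types.

```java
final class DataFetcherDictionarySketch {
  // Hypothetical stand-ins for the forward index reader and the dictionary.
  interface ForwardIndex {
    boolean isDictionaryEncoded();
  }

  interface Dictionary {
    int indexOf(String value);
  }

  // A shared dictionary on a RAW forward index is hidden from the value reader,
  // so every read takes the raw-value path.
  static Dictionary dictionaryForReads(ForwardIndex forwardIndex, Dictionary dictionary) {
    return forwardIndex.isDictionaryEncoded() ? dictionary : null;
  }

  public static void main(String[] args) {
    Dictionary shared = value -> 0;
    System.out.println(dictionaryForReads(() -> false, shared) == null);
    System.out.println(dictionaryForReads(() -> true, shared) == shared);
  }
}
```

Making the decision once at construction is what lets the existing per-method branches stay unchanged.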

addRaw(Object) SPI extraction

  • ForwardIndexCreator.addRaw(Object) (new default method): extracted from the add(Object, dictId) body so handlers can write raw values without going through the dict-id routing branch. Required by the rebuild path that converts dict-id forward back to raw forward when dropping a dictionary.
  • 7 concrete creators (CLPV1, CLPV2, MV/SV Fixed/VarByte, SameValue) implement add() by delegating to addRaw().
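
The shape of this extraction can be sketched as below. It is a simplified model rather than the actual SPI: the `ForwardCreator` interface, its routing condition, and `RawCreator` are hypothetical, and only the idea (a default `addRaw(Object)` that raw creators implement, with `add` delegating to it) comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;

final class AddRawSketch {
  // Hypothetical, simplified creator interface.
  interface ForwardCreator {
    boolean isDictionaryEncoded();

    void putDictId(int dictId);

    // New entry point: write a raw value without any dict-id routing.
    default void addRaw(Object value) {
      throw new UnsupportedOperationException("raw values not supported by this creator");
    }

    // Routing entry point: dict-encoded creators store the dict id, others store the value.
    default void add(Object value, int dictId) {
      if (isDictionaryEncoded()) {
        putDictId(dictId);
      } else {
        addRaw(value);
      }
    }
  }

  // A raw creator overrides add() to go straight to addRaw(), ignoring the dict id.
  static final class RawCreator implements ForwardCreator {
    final List<Object> values = new ArrayList<>();

    @Override
    public boolean isDictionaryEncoded() {
      return false;
    }

    @Override
    public void putDictId(int dictId) {
      throw new UnsupportedOperationException("raw creator does not store dict ids");
    }

    @Override
    public void addRaw(Object value) {
      values.add(value);
    }

    @Override
    public void add(Object value, int dictId) {
      addRaw(value);
    }
  }

  public static void main(String[] args) {
    RawCreator creator = new RawCreator();
    creator.add("a", 7);
    creator.addRaw("b");
    System.out.println(creator.values); // prints [a, b]
  }
}
```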

Tests

  • IndexCombinationValidationTest (~640 lines, new) — exhaustive matrix over (encoding, dictionary, secondary index, column type, compression codec) combinations.
  • LegacyRawValueInvertedIndexMigrationIntegrationTest (new, in pinot-integration-tests) — full-cluster integration test with a synthetically-constructed legacy segment (raw forward + legacy embedded-dict inverted file written by a resurrected LegacyRawValueBitmapInvertedIndexCreator). Tars it, uploads to a real cluster, verifies the server preprocessor migrates the segment and that EQ / IN / NOT_EQ queries return correct counts.
  • RawForwardIndexWithDictionaryTest (~580 lines) — 33 cases covering SV/MV × EQ/RANGE/IN/REGEXP_LIKE on RAW forward + dict, plus comprehensive mixed-predicate matrix combining inverted-index dict path and raw-scan path on the same column with explicit non-zero expected counts.
  • RawForwardIndexInvertedIndexTest — query-result equivalence vs DICTIONARY-encoded baseline.
  • SegmentPreProcessorTest.testRangeIndexRebuiltOnDictionaryToggle (new) — asserts the range index is rebuilt with the right format every time the dictionary state toggles, covering both the auto path (driven by inverted-index add/remove) and the explicit toggle path.
  • SegmentPreProcessorTest.testFlipDictForwardToRawForwardKeepingDictionaryForInvertedIndex (new) — encoding-flip end-to-end test asserting that a dict-encoded INT column with an inverted index, when reloaded under a config that puts the column in noDictionaryColumns, ends up as dict + inverted + raw forward (dict kept, encoding flipped, inverted intact).
  • InvertedIndexHandlerTest: isLegacyRawValueInvertedIndexFormat detection + SegmentPreProcessor rebuild path.
  • ForwardIndexHandlerTest (+477 lines) — reload-time auto-create-dict coverage.
  • PredicateEvaluatorProviderTest (new) — per-predicate-type dict-drop decisions, including non-exact range and dict-required-by-inverted scenarios.
  • FilterMvTransformFunctionTest — RAW + dict + inverted column filterMv parity with dict-encoded baseline; per-value path drops dict for RAW forward.
  • SegmentPreProcessorTest, ColumnMetadataImplTest (new), SegmentGeneratorConfigTest, TableConfigUtilsTest, DictionaryIndexTypeTest, LazyRowTest, ForwardIndexHandlerReloadQueriesTest, TableIndexingTest + CSV — combination coverage and reload regressions.
  • RealtimeSegmentConverterTest + CrcUtilsTest — CRC values updated for the new metadata key footprint.

Backward compatibility

  • Old segments without FORWARD_INDEX_ENCODING in metadata.properties: ColumnMetadataImpl.fromPropertiesConfiguration falls back to inferring encoding from HAS_DICTIONARY (the field added by #18364, "Forward-index encoding: introduce Encoding SPI surface and use it for raw vs dict checks", which this PR depends on).
  • Old segments with the legacy raw-value bitmap inverted format on disk: detected by the 44-byte header signature and migrated by SegmentPreProcessor on first load under the new code. Covered by LegacyRawValueInvertedIndexMigrationIntegrationTest.
  • Old segments with v1 (non-exact) range index on a column that now needs a shared dictionary: RangeIndexHandler detects the on-disk version mismatch and rebuilds in v2 format on reload.
  • Existing valid configs: noDictionaryColumns + invertedIndexColumns overlap, previously implicit, now produces a real dict + standard inverted via auto-create. No config change required.
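
The metadata fallback in the first bullet might be pictured as follows. The property key strings here are hypothetical placeholders, not the real metadata.properties keys, and treating a missing HAS_DICTIONARY entry as `true` is an assumption made only for this sketch.

```java
import java.util.Properties;

final class EncodingFallbackSketch {
  enum Encoding { DICTIONARY, RAW }

  // Hypothetical key names, used only for illustration.
  static final String FORWARD_INDEX_ENCODING = "forwardIndexEncoding";
  static final String HAS_DICTIONARY = "hasDictionary";

  static Encoding encodingOf(Properties columnMetadata) {
    String explicit = columnMetadata.getProperty(FORWARD_INDEX_ENCODING);
    if (explicit != null) {
      return Encoding.valueOf(explicit); // new segments carry the encoding explicitly
    }
    // Old segments: infer the forward encoding from the dictionary flag.
    boolean hasDictionary = Boolean.parseBoolean(columnMetadata.getProperty(HAS_DICTIONARY, "true"));
    return hasDictionary ? Encoding.DICTIONARY : Encoding.RAW;
  }

  public static void main(String[] args) {
    Properties oldSegment = new Properties();
    oldSegment.setProperty(HAS_DICTIONARY, "false");
    System.out.println(encodingOf(oldSegment)); // prints RAW

    Properties newSegment = new Properties();
    newSegment.setProperty(HAS_DICTIONARY, "true");
    newSegment.setProperty(FORWARD_INDEX_ENCODING, "RAW");
    System.out.println(encodingOf(newSegment)); // prints RAW
  }
}
```

The second case is the new RAW forward + standalone dictionary state, which the dictionary flag alone cannot express.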

Example FieldConfig

Implicit (auto-create dictionary because inverted needs it):

{
  "fieldConfigList": [
    {
      "name": "myColumn",
      "encodingType": "RAW",
      "indexes": { "inverted": {} }
    }
  ]
}

Explicit (also valid; both dictionary and inverted listed):

{
  "fieldConfigList": [
    {
      "name": "myColumn",
      "encodingType": "RAW",
      "compressionCodec": "LZ4",
      "indexes": { "dictionary": {}, "inverted": {} }
    }
  ]
}

Out of scope (follow-up PR)

The full operator-routing migration to gate on BlockValSet.isDictionaryEncoded() (instead of BlockValSet.getDictionary() != null) is split into a separate branch blockvalset-isDictionaryEncoded-routing-for-shared-dict-raw. That branch wires the gate through DistinctExecutorFactory, NoDictionaryMultiColumnGroupKeyGenerator, and all DistinctCount* aggregation functions. This PR includes only the DefaultGroupByExecutor, BinaryOperatorTransformFunction, and FilterMvTransformFunction gate updates needed to keep the new shared-dict + RAW tests green.

🤖 Generated with Claude Code

@xiangfu0 xiangfu0 requested a review from Copilot November 25, 2025 02:51
@xiangfu0 xiangfu0 added the inverted-index, index, and enhancement labels Nov 25, 2025

Copilot AI left a comment


Pull request overview

This PR enables dictionary inference for indexes that require dictionaries (inverted, FST, range) and enhances raw forward index handling throughout the codebase. It introduces automatic dictionary creation when required by certain index types, plumbs raw-encoding awareness through forward index creator/reader factories, and extends the forward index handler to support raw-forward columns in various operations.

Key changes:

  • Automatic dictionary inference when indexes requiring dictionaries are configured
  • Raw-encoding flag in ForwardIndexConfig to distinguish raw vs. dictionary-encoded forward indexes
  • Enhanced ForwardIndexHandler to support raw forward indexes with dictionary add/remove operations
  • New integration test verifying raw forward index + inverted index functionality

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| githubComplexTypeEvents_offline_table_config.json | Adds raw-encoded forward index with inverted index configuration example |
| ForwardIndexCreator.java | Refactors raw value handling into separate addRaw method |
| IndexType.java | Adds requiresDictionary method to indicate dictionary requirement |
| ForwardIndexUtils.java | New utility to identify raw forward index columns from table config |
| ForwardIndexConfig.java | Adds rawEncoding flag and getter/builder methods |
| FieldIndexConfigsUtil.java | Implements automatic dictionary enablement for required indexes |
| DictionaryIndexConfig.java | Adds utility methods to check if dictionary is required by any index |
| ForwardIndexTypeTest.java | Updates tests to include rawEncoding flag in expected configs |
| ColumnMinMaxValueGenerator.java | Passes ForwardIndexConfig when creating readers |
| InvertedIndexAndDictionaryBasedForwardIndexCreator.java | Respects rawEncoding flag when determining dictionary usage |
| ForwardIndexHandler.java | Adds ENABLE_DICTIONARY_FOR_RAW_FORWARD_INDEX operation and handling logic |
| InvertedIndexType.java | Implements requiresDictionary to return true |
| IFSTIndexType.java | Implements requiresDictionary to return true |
| FstIndexType.java | Implements requiresDictionary to return true |
| ForwardIndexType.java | Updates getFileExtension to return multiple possible extensions |
| ForwardIndexReaderFactory.java | Uses rawEncoding flag to determine reader type |
| ForwardIndexCreatorFactory.java | Uses rawEncoding flag to determine creator type |
| SingleValueVarByteRawIndexCreator.java | Overrides add method to delegate to addRaw |
| SingleValueFixedByteRawIndexCreator.java | Overrides add method to delegate to addRaw |
| MultiValueVarByteRawIndexCreator.java | Overrides add method to delegate to addRaw |
| MultiValueFixedByteRawIndexCreator.java | Overrides add method to delegate to addRaw |
| CLPForwardIndexCreatorV2.java | Overrides add method for CLP encoding |
| CLPForwardIndexCreatorV1.java | Overrides add method for CLP encoding |
| SegmentIndexCreationDriverImpl.java | Fixes inverted logic bug (isDisabled → isEnabled) |
| SegmentColumnarIndexCreator.java | Adds warning when dictionary required but explicitly disabled |
| SameValueForwardIndexCreator.java | Overrides add method to delegate to addRaw |
| RawForwardIndexInvertedIndexTest.java | New integration test for raw forward index with inverted index |
| ForwardIndexHandlerReloadQueriesTest.java | Updates test expectations and adds raw forward index test coverage |
| DataFetcher.java | Adds _useDictionary field for proper dictionary usage check |

@xiangfu0 xiangfu0 force-pushed the dictionary-auto-creation-when-corresponding-indexes-enabled branch from 880c72e to e042d90 Compare November 25, 2025 03:41
@xiangfu0 xiangfu0 requested a review from Copilot November 25, 2025 03:41

Copilot AI left a comment


Pull request overview

Copilot reviewed 29 out of 29 changed files in this pull request and generated no new comments.

@xiangfu0 xiangfu0 force-pushed the dictionary-auto-creation-when-corresponding-indexes-enabled branch 2 times, most recently from 74727cd to d202d42 Compare November 25, 2025 12:36
@codecov-commenter

codecov-commenter commented Nov 25, 2025

Codecov Report

❌ Patch coverage is 64.31298% with 187 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.65%. Comparing base (5ccd383) to head (0e11efc).
⚠️ Report is 6 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...der/invertedindex/DictionaryBasedIndexBuilder.java | 12.19% | 69 Missing and 3 partials ⚠️ |
| ...ocal/segment/index/loader/ForwardIndexHandler.java | 79.12% | 14 Missing and 24 partials ⚠️ |
| ...ator/transform/function/ItemTransformFunction.java | 0.00% | 19 Missing ⚠️ |
| ...segment/spi/index/creator/ForwardIndexCreator.java | 0.00% | 15 Missing ⚠️ |
| ...dex/loader/invertedindex/InvertedIndexHandler.java | 62.96% | 7 Missing and 3 partials ⚠️ |
| .../segment/index/dictionary/DictionaryIndexType.java | 86.53% | 1 Missing and 6 partials ⚠️ |
| ...ertedindex/LegacyRawValueInvertedIndexCleanup.java | 75.00% | 3 Missing and 3 partials ⚠️ |
| ...r/filter/predicate/PredicateEvaluatorProvider.java | 80.95% | 1 Missing and 3 partials ⚠️ |
| .../index/loader/invertedindex/RangeIndexHandler.java | 82.60% | 2 Missing and 2 partials ⚠️ |
| ...operator/filter/RangeIndexBasedFilterOperator.java | 50.00% | 1 Missing and 2 partials ⚠️ |

... and 5 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17269      +/-   ##
============================================
+ Coverage     63.57%   63.65%   +0.08%     
- Complexity     1717     1723       +6     
============================================
  Files          3252     3254       +2     
  Lines        199132   199459     +327     
  Branches      30875    30977     +102     
============================================
+ Hits         126596   126966     +370     
+ Misses        62454    62361      -93     
- Partials      10082    10132      +50     
| Flag | Coverage Δ |
| --- | --- |
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-21 | 63.65% <64.31%> (+0.08%) ⬆️ |
| temurin | 63.65% <64.31%> (+0.08%) ⬆️ |
| unittests | 63.65% <64.31%> (+0.08%) ⬆️ |
| unittests1 | 55.72% <39.50%> (+0.06%) ⬆️ |
| unittests2 | 34.95% <55.34%> (+0.04%) ⬆️ |


@xiangfu0 xiangfu0 force-pushed the dictionary-auto-creation-when-corresponding-indexes-enabled branch 2 times, most recently from c2dfdcc to 4b4bcd7 Compare November 27, 2025 12:26
@xiangfu0 xiangfu0 force-pushed the dictionary-auto-creation-when-corresponding-indexes-enabled branch 4 times, most recently from 9e63630 to 677c60f Compare December 10, 2025 11:15

Copilot AI left a comment


Pull request overview

Copilot reviewed 37 out of 37 changed files in this pull request and generated 7 comments.


@Jackie-Jiang Jackie-Jiang left a comment


For future-proofing, how do you plan to support both a dictionary-encoded and a raw forward index on the same column? Specifically, what would the FieldConfig look like?

Comment thread pinot-core/src/main/java/org/apache/pinot/core/common/DataFetcher.java Outdated
Comment thread pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/IndexType.java Outdated
@xiangfu0 xiangfu0 force-pushed the dictionary-auto-creation-when-corresponding-indexes-enabled branch from 677c60f to 7463dd1 Compare December 19, 2025 00:31
@xiangfu0 xiangfu0 force-pushed the dictionary-auto-creation-when-corresponding-indexes-enabled branch 2 times, most recently from a664045 to e936ca7 Compare December 31, 2025 05:28
@xiangfu0 xiangfu0 requested a review from Copilot December 31, 2025 09:21

Copilot AI left a comment


Pull request overview

Copilot reviewed 39 out of 39 changed files in this pull request and generated 6 comments.

@xiangfu0 xiangfu0 force-pushed the dictionary-auto-creation-when-corresponding-indexes-enabled branch 3 times, most recently from f023298 to 78efe79 Compare January 2, 2026 06:19
xiangfu0 and others added 27 commits May 6, 2026 23:41
testFetchDictIdsFromRawForwardIndexWithSharedDictionary previously called
DataFetcher#fetchDictIds and expected the implicit per-row dictionary
lookup. With the new contract, fetchDictIds throws on a RAW forward
index and callers must opt in via fetchDictIdsFromRawValues. The test
now asserts both: the throw (1) and the explicit-path dict ids (2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per @Jackie-Jiang's PR comment: readDictIds should focus on reading dict
ids directly from a dictionary-encoded forward index;
readDictIdsFromRawValues handles the explicit per-row dictionary lookup
when the forward index is RAW but a (shared) dictionary exists. Each
ColumnValueReader method now does exactly one thing.

The dispatch between the two paths moves up to the public
DataFetcher#fetchDictIds entry point so existing callers are unaffected.
A new public DataFetcher#fetchDictIdsFromRawValues (SV + MV) lets future
callers opt explicitly into the per-row dictionary lookup when they have
already verified the column is RAW + shared-dict.

Also rename the ColumnValueReader field _useDictionary to
_dictionaryEncoded to match the boolean's actual meaning, and remove the
hot-path Preconditions guards in readDictIdsFromRawValues* — the caller
contract documents the precondition and the underlying readers will
throw if invariants are violated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fetchDictIds (SV + MV) now reads dict ids directly from the forward
index with no implicit fallback. On a RAW + shared-dict column the
underlying ForwardIndexReader#readDictIds throws
UnsupportedOperationException — callers that genuinely need dict ids on
such a column must call the new public fetchDictIdsFromRawValues
(SV + MV) explicitly. This makes the per-row Dictionary#indexOf cost a
deliberate caller decision instead of a hidden one.

Update DataFetcherTest to assert the new contract: fetchDictIds throws
on a RAW reader, and fetchDictIdsFromRawValues returns the dict ids
produced by per-row dictionary lookups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…OUP BY

The previous gate (columnContext.getDictionary() == null) routed
shared-dict + RAW columns to DictionaryBasedGroupKeyGenerator, which
calls BlockValSet#getDictionaryIdsSV → DataFetcher#fetchDictIds. With
fetchDictIds no longer falling back to per-row dictionary lookups, that
path now throws.

Extend the gate to also require the underlying forward index to be
dictionary-encoded so columns with a shared standalone dictionary on a
RAW forward index take the NoDictionarySingleColumnGroupKeyGenerator
path. Other operators (DistinctExecutorFactory, aggregation functions,
NoDictionaryMultiColumnGroupKeyGenerator, IdentifierTransformFunction)
will be migrated to the same gate in a follow-up PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndex need

Refactor computeOperations into a per-column helper that decomposes into
four orthogonal questions: forward-index transition, dictionary
transition, compression-only change, and cross-cutting guards (sorted
columns, range-index format compatibility, enable-forward prerequisites,
enable-dictionary needs forward).

The dictionary decision is now a single rule: desiredDict = newIsDict ||
DictionaryIndexConfig.requiresDictionary(fieldSpec, newConf). The "force
dict on if any index requires it" path means adding an inverted index
to a raw column now auto-creates the dictionary (and the inverted index
gets built against it by InvertedIndexHandler in the same reload),
where the pre-PR behavior silently skipped this combo and left the
inverted-index request orphaned. Removing that index later auto-removes
the dictionary again.

shouldDisableDictionary now consults DictionaryIndexConfig.requiresDictionary
to refuse dict removal whenever any enabled index in the new config
still needs it (covers inverted, FST, IFST, and any future
dict-requiring index type). The on-disk inverted/FST hasIndex check was
removed — InvertedIndexHandler runs after ForwardIndexHandler and
removes orphaned indexes, so refusing dict removal based on transient
on-disk state was overly conservative.

getStatsCollector gains a requireUniqueValues parameter so the
dict-creation paths skip the no-dict-optimized collector and use the
type-specific collector that tracks unique values. Without this, the
auto-create-dict path NPEs when ClusterConfigForTable's optimized
no-dict collector is in effect.

Tests:
- testRangeIndexRebuiltOnDictionaryToggle: new SegmentPreProcessorTest
  case asserting the range index is rebuilt with the right format every
  time the dictionary state toggles, covering both the auto path
  (driven by inverted-index add/remove) and the explicit toggle path
  (noDictionaryColumns add/remove).
- testIfNeedProcess: updated to expect ENABLE_DICTIONARY when adding
  inverted to a raw column on v3 (the new auto path); v1 still skips.
- testComputeOperationDisableForwardIndex TEST13: updated to expect
  no operation queued when inverted is in the new config (dictionary
  must stay).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…FORWARD_INDEX

State representation:
Replace the existingHasFwd / newIsFwd booleans with FieldConfig.EncodingType
values where null means "forward index disabled / not on disk", DICTIONARY
means dict-encoded forward, and RAW means raw forward. This makes the state
space explicit and lets the compression-only branch test
existingFwdEncoding == FieldConfig.EncodingType.RAW directly instead of
calling isForwardIndexDictionaryEncoded(column).

Operation split:
ENABLE_FORWARD_INDEX is split into ENABLE_DICT_FORWARD_INDEX and
ENABLE_RAW_FORWARD_INDEX. computeColumnOperations picks the right variant
based on newFwdEncoding so the intent is explicit at the operation level.
Both variants flow through createForwardIndexIfNeeded (which already reads
the target encoding from the new config), but the post-rebuild assertions
in updateIndices are now type-specific: ENABLE_DICT_FORWARD_INDEX requires
the dictionary to remain after rebuild, ENABLE_RAW_FORWARD_INDEX requires
it to be absent.

Updated test cases in ForwardIndexHandlerTest accordingly: TESTs that
explicitly set FieldConfig.EncodingType.RAW for a forward-index-disabled
column now expect ENABLE_RAW_FORWARD_INDEX, while default-dict cases keep
ENABLE_DICT_FORWARD_INDEX.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fix the wrong assertion in ENABLE_RAW_FORWARD_INDEX (dict + inverted +
raw forward is a valid post-rebuild state when an inverted index is
enabled — the assertion that the dictionary must be absent contradicted
the auto-keep-dictionary-when-required-by-index logic from the previous
commits) and extend the operation set to actually perform the encoding
flip when the dict transition step won't.

The forward-index transition path now has three sub-cases:
- Existing forward, new disables it → DISABLE_FORWARD_INDEX.
- Existing disabled, new re-enables it → ENABLE_DICT/RAW_FORWARD_INDEX,
  reusing createForwardIndexIfNeeded.
- Forward stays on but encoding flips DICT⇄RAW → queued only when the
  dict transition won't already cover the conversion (i.e. existingHasDict
  == desiredDict). The DICT→RAW + drop-dict and RAW→DICT + create-dict
  combos still fall out of DISABLE_DICTIONARY / ENABLE_DICTIONARY
  unchanged, so the encoding-flip op is reserved for the dict-stays cases:
    - DICT→RAW with dict kept (e.g. inverted index added to a dict-encoded
      column without explicitly enabling dict)
    - RAW→DICT with shared dict already present

For the DICT→RAW + dict kept case, add convertDictForwardToRawKeepingDictionary:
the rewrite reuses the existing rewriteDictToRawForwardIndex helper but
keeps the dictionary on disk and does not call removeDictRelatedIndexes
(secondary indexes against unchanged dict ids stay valid).

Tests:
- testFlipDictForwardToRawForwardKeepingDictionaryForInvertedIndex:
  end-to-end test asserting the encoding flip on a dict-encoded INT column
  with an inverted index leaves the column as dict + inverted + raw forward.
- testComputeOperationDisableDictionary TEST3: previously asserted no-op
  for "disable dict + add inverted on dict-encoded column" (the pre-PR
  silent-skip behavior). Updated to expect the correct op
  ENABLE_RAW_FORWARD_INDEX, which performs the encoding flip while the
  inverted-index requirement keeps the dictionary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…urce

Revert all the DataFetcher / DataFetcherTest churn from earlier commits
(API split, _dictionaryEncoded rename, Preconditions on hot paths, etc.)
back to master and replace it with the minimal change that solves the
shared-dict + RAW correctness issue: when forwardIndexReader.isDictionaryEncoded()
is false, pass null instead of dataSource.getDictionary() to
ColumnValueReader.

ColumnValueReader's existing per-method branches all gate on
_dictionary != null, so dropping the dictionary at construction makes
every value-read method take the raw-value path uniformly without
touching the read methods themselves. Callers that need dict ids on a
RAW + shared-dict column still propagate UnsupportedOperationException
from the underlying RAW reader's readDictIds default — that surfaces as
a clear failure for any operator that hasn't been migrated to gate on
forward-index encoding (DefaultGroupByExecutor was migrated in an earlier
commit; the remaining migrations live in the BlockValSet follow-up
branch).
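
A minimal sketch of the idea, using simplified stand-in interfaces rather than Pinot's real ones. The nested Dictionary, ForwardIndexReader and ColumnValueReader types, the create and rawReader helpers, and the getDictId default are all hypothetical; only the "pass null when the forward index is not dictionary-encoded" rule comes from the commit above.

```java
// Hypothetical stand-ins; Pinot's real DataFetcher and ColumnValueReader have richer APIs.
final class ValueReaderSketch {
  interface Dictionary {
    int getIntValue(int dictId);
  }

  interface ForwardIndexReader {
    boolean isDictionaryEncoded();

    int getInt(int docId);

    // A RAW reader has no dict ids, so the default refuses the call.
    default int getDictId(int docId) {
      throw new UnsupportedOperationException("RAW forward index has no dict ids");
    }
  }

  static final class ColumnValueReader {
    private final ForwardIndexReader _reader;
    private final Dictionary _dictionary; // null means: always take the raw-value path

    ColumnValueReader(ForwardIndexReader reader, Dictionary dictionary) {
      _reader = reader;
      _dictionary = dictionary;
    }

    int readIntValue(int docId) {
      if (_dictionary != null) {
        return _dictionary.getIntValue(_reader.getDictId(docId));
      }
      return _reader.getInt(docId);
    }
  }

  // The gate: drop the dictionary at construction when the forward index is RAW.
  static ColumnValueReader create(ForwardIndexReader reader, Dictionary dataSourceDictionary) {
    return new ColumnValueReader(reader, reader.isDictionaryEncoded() ? dataSourceDictionary : null);
  }

  // Helper for experiments: a RAW reader backed by an int array.
  static ForwardIndexReader rawReader(final int[] values) {
    return new ForwardIndexReader() {
      @Override
      public boolean isDictionaryEncoded() {
        return false;
      }

      @Override
      public int getInt(int docId) {
        return values[docId];
      }
    };
  }
}
```

Because the gate sits in the factory, the read method itself needs no knowledge of the shared-dict + RAW shape.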

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Centralize the "no dictionary path on RAW forward index" rule in
ColumnContext.fromDataSource. Every consumer of columnContext.getDictionary()
now naturally routes shared-dict + RAW columns to the raw-value path
without needing a per-call check:

- IdentifierTransformFunction: simplified — its inline check
  (forwardIndex.isDictionaryEncoded() ? columnContext.getDictionary() : null)
  is now redundant because ColumnContext returns null directly.
- DefaultGroupByExecutor: revert the extra forward-index-encoding check
  added in an earlier commit; the master gate
  (columnContext.getDictionary() == null) is correct again now that
  ColumnContext drops the dictionary at the source.
- DictionaryBasedGroupKeyGenerator, DistinctExecutorFactory,
  NoDictionaryMultiColumnGroupKeyGenerator: no change needed — they
  already gate on columnContext.getDictionary() and now see null for
  shared-dict + RAW.

Callers that legitimately need the underlying dictionary (e.g.
DictionaryBasedDistinctOperator iterating dict values directly) can
still get it via columnContext.getDataSource().getDictionary().

This is the same pattern applied earlier to DataFetcher.addDataSource;
moving it to ColumnContext makes the rule the single source of truth
for the operator/transform-function tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per @xiangfu0's review: ColumnContext should expose both the data source
and the dictionary so upper functions can decide for themselves. Some
operators (e.g. DictionaryBasedDistinctOperator) iterate the dictionary
directly and don't care about forward-index encoding — they would lose
the dictionary if it were dropped at the source.

Restore ColumnContext.fromDataSource to master behavior. Apply the
"forward must be dict-encoded for dict-id reads" rule at each call site
that does dict-id reads:

- IdentifierTransformFunction: drop dict for shared-dict + RAW so
  transformToDictIdsSV/MV doesn't get advertised.
- DefaultGroupByExecutor: route shared-dict + RAW columns to the
  no-dict GROUP BY path so DictionaryBasedGroupKeyGenerator's dict-id
  reads don't fire.

Both checks are null-safe against DataSource without a forward index
(test mock case).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add INT_MV_DICT_RAW_COLUMN to BaseTransformFunctionTest, configured as
EncodingType.RAW with both an explicit dictionary and inverted index in
its FieldConfig — the canonical shared-dict + RAW shape introduced by
this PR. The values mirror INT_MV_COLUMN so callers can compare results
against the dict-encoded baseline.

testFilterMvOnSharedDictRawForwardColumn in FilterMvTransformFunctionTest
exercises every predicate in the existing IntPredicate matrix (EQ, NEQ,
RANGE, IN, NOT_IN, BETWEEN, AND, OR, NOT) and asserts:
- The data source actually has a RAW forward index plus a dictionary
  on disk (sanity: the configuration produced the intended shape).
- The TransformFunction reports getDictionary() == null and
  TransformResultMetadata.hasDictionary() == false. This is the signal
  that drives FilterMvPredicateEvaluator -> PredicateEvaluatorProvider
  to construct a raw-value matching evaluator (matchesInt etc.) instead
  of a matchesDictId evaluator.
- transformToIntValuesMV produces results that match the expected
  matcher applied to _intMVValues, confirming the predicate evaluator
  built without a dictionary still returns correct rows.
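
To illustrate why the getDictionary() == null signal matters, here is a hedged sketch of an equality matcher that is built either over dict ids or over raw values. EqEvaluatorSketch, IntMatcher and forEquals are hypothetical names, and the dictionary is modeled as a sorted int array rather than Pinot's Dictionary interface.

```java
import java.util.Arrays;

// Hypothetical sketch; Pinot's real evaluators come from PredicateEvaluatorProvider.
final class EqEvaluatorSketch {
  interface IntMatcher {
    boolean matches(int input);
  }

  // sortedDictValues == null means no dictionary is advertised, so match raw values.
  static IntMatcher forEquals(int target, int[] sortedDictValues) {
    if (sortedDictValues == null) {
      return rawValue -> rawValue == target;
    }
    // Resolve the target to a dict id once; a negative result means "absent",
    // which can never equal a real (non-negative) dict id.
    int dictId = Arrays.binarySearch(sortedDictValues, target);
    return inputDictId -> inputDictId == dictId;
  }
}
```

Feeding raw values into the dict-id matcher (or dict ids into the raw matcher) compiles fine but compares unrelated numbers, which is the silent-wrong-result case the test guards against.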

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tor overload

Split the existing INT_MV_DICT_RAW_COLUMN coverage into two distinct on-disk
shapes and exercise filterMv against both:
- INT_MV_DICT_RAW_COLUMN: RAW forward + explicit shared dictionary, no
  secondary index.
- INT_MV_DICT_RAW_INV_COLUMN (new): RAW forward + shared dictionary + inverted
  index — covers the case the user wanted documented in tests.

Both columns hold the same values as INT_MV_COLUMN. The new
testFilterMvOnSharedDictRawForwardWithInvertedColumn parameterizes over
the existing IntPredicate matrix and asserts the filterMv result is
identical for all three columns (dict-encoded baseline, RAW + dict, RAW +
dict + inverted) for every predicate. The inverted index is irrelevant
to filterMv's per-value evaluation but having it on disk must not change
the result.

Also add a DataSource-aware overload to FilterMvPredicateEvaluator so the
evaluator can pull the dictionary only when the underlying forward index
is dictionary-encoded:

  public static FilterMvPredicateEvaluator forPredicate(String predicate, DataSource dataSource)

FilterMvTransformFunction now uses this overload when the inner argument
is a direct column reference (IdentifierTransformFunction); for transform
arguments it keeps the dictionary-only path. The control flow is fixed
to use an else branch so the DataSource-aware evaluator isn't overwritten
by the legacy call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per @xiangfu0's review: each external caller should evaluate whether it
has a DataSource. Consolidate PredicateEvaluatorProvider's public surface
to one method:

  public static PredicateEvaluator getPredicateEvaluator(Predicate predicate,
      @Nullable DataSource dataSource, DataType dataType, @Nullable QueryContext queryContext)

When the caller has a column DataSource, pass it so the gating logic in
getDictionaryUsableForFiltering can pick between dict-based and raw-value
evaluation for shared-dict + RAW columns. When the caller doesn't (post-
reduction matchers, intermediate-result aggregators, computed transforms),
pass null and the evaluator is built from raw values using the supplied
data type.

The two dictionary-based getPredicateEvaluator overloads (3-arg and 4-arg)
are now private — they remain as the internal dispatch implementation but
are no longer part of the public API.

Migrations:
- FilterPlanNode: pass dataSource.getDataSourceMetadata().getDataType() through.
- BinaryOperatorTransformFunction: drill into IdentifierTransformFunction
  for the column's DataSource; for non-Identifier transforms pass null.
- ExpressionFilterOperator: same Identifier-or-null pattern.
- PredicateRowMatcher, DistinctCountThetaSketchAggregationFunction: pass
  null DataSource (post-reduction / intermediate-result paths have no
  column to point at).
- PredicateEvaluatorProviderTest: pass FieldSpec.DataType.STRING through.

FilterMvPredicateEvaluator deliberately bypasses PredicateEvaluatorProvider
now: filterMv evaluates per-value at transform time, so the filter-plan-time
gating PredicateEvaluatorProvider applies (which would route shared-dict +
RAW columns through the dict-based evaluator) is the wrong policy at this
layer. FilterMvPredicateEvaluator builds its evaluator directly via the
Equals/NotEquals/In/NotIn/Range/RegexpLike factories based solely on
whether a dictionary is supplied — IdentifierTransformFunction already
returns null for shared-dict + RAW columns, so the existing forPredicate
(predicate, dataType, dictionary) signature is the right call surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PredicateEvaluatorProvider:
- For RANGE on RAW forward, drop the dictionary unless the range index is
  exact (Jackie: "we need to ensure the predicate can solely be resolved
  with range index. Mixing dictionary encoded range index and raw forward
  index scan will break"). Non-exact (legacy) range readers fall back to
  ScanBasedFilterOperator for partial matches; that scan applies the
  predicate evaluator on raw forward values, which would break a dict-based
  evaluator.
- Remove the sortedAvailable check inside the RAW-forward branch
  (Jackie: "this is dead code. Sorted index is run-length dictionary
  encoded forward index"). A sorted forward index is dict-encoded by
  definition, so it can never coexist with a RAW forward in this branch.
- Make getDictionaryUsableForFiltering package-private and annotate it
  with @VisibleForTesting (Jackie's minor suggestion).
- Update the public method's docstring to clarify caller responsibilities:
  leaf-filter callers may pass DataSource freely; transform-layer callers
  must only pass DataSource when the inner transform's getDictionary() is
  non-null (Jackie: "double check all callers of this, make sure it is
  not using forward index").
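
The RANGE rule above can be sketched as a single boolean gate. This is a simplified assumption-laden illustration: keepDictionaryForRange and its three flags are hypothetical, and the real getDictionaryUsableForFiltering inspects the DataSource and handles other predicate types as well.

```java
// Hypothetical sketch of the RANGE gating rule only.
final class RangeDictionaryGateSketch {
  static boolean keepDictionaryForRange(boolean forwardDictEncoded, boolean hasRangeIndex,
      boolean rangeIndexExact) {
    if (forwardDictEncoded) {
      // Any fallback scan reads dict ids, so a dict-based evaluator is consistent.
      return true;
    }
    // RAW forward: keep the dictionary only if the range index alone can resolve
    // the predicate. A non-exact index would fall back to scanning raw values.
    return hasRangeIndex && rangeIndexExact;
  }
}
```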

BinaryOperatorTransformFunction: pass leftDataSource only when the inner
transform exposes a dictionary, mirroring IdentifierTransformFunction's
"forward index is dict-encoded" contract. For RAW (including shared-dict
+ RAW) the inner transform returns null from getDictionary(), so the
predicate evaluator is built against raw values and applySV(value) on
the raw forward output stays consistent.

ExpressionFilterOperator: always pass null DataSource. The inner is
always a function (FilterPlanNode dispatches direct column refs through
the leaf-filter path) and ExpressionScanDocIdIterator scans the transform
output applying applySV(value) on raw values, so a dict-based evaluator
would never be safe here.

DataFetcher: add Preconditions check that dictionary != null when the
forward index claims to be dictionary-encoded (Jackie: "is it possible
that forward index is dictionary encoded but dictionary doesn't exist?").
Defensive against an impossible state — fails fast at DataFetcher
construction rather than silently producing a ColumnValueReader that
will NPE later in the read path.

BaseSegmentCreator.createDictionaryForColumn: when an index requires a
dictionary, create one regardless of the user's explicit dictionary
setting (Jackie: "should we always create dictionary when any index
requires it?"). Previously, explicitly disabling the dictionary while
enabling an index that requires it would lead to segment creation
failure (e.g. inverted index can't be built without a dict) or silent
index loss. Now we log a warning and proceed.
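
A rough sketch of the resulting decision, assuming two booleans already derived from the column's config. DictionaryDecisionSketch and shouldCreateDictionary are hypothetical names, and java.util.logging stands in for whatever logger the real BaseSegmentCreator uses.

```java
import java.util.logging.Logger;

// Hypothetical sketch; not the actual BaseSegmentCreator.createDictionaryForColumn.
final class DictionaryDecisionSketch {
  private static final Logger LOGGER = Logger.getLogger(DictionaryDecisionSketch.class.getName());

  static boolean shouldCreateDictionary(String column, boolean dictionaryEnabledByUser,
      boolean anyIndexRequiresDictionary) {
    if (anyIndexRequiresDictionary) {
      if (!dictionaryEnabledByUser) {
        // Warn instead of failing the segment build or silently dropping the index.
        LOGGER.warning("Dictionary disabled for column " + column
            + " but an enabled index requires it; creating the dictionary anyway");
      }
      return true;
    }
    return dictionaryEnabledByUser;
  }
}
```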

DefaultGroupByExecutor: add a comment explaining that
ColumnContext.getDataSource() can be null for computed (non-Identifier)
transforms (Jackie: "could datasource ever be null?"); in that case
getDictionary() == null already routes those columns onto the no-dict
GROUP BY path via the first condition.

Tests: drop two tests that exercised an impossible RAW + sorted scenario
and add a new test for the non-exact range index case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DataFetcher.addDataSource: drop the Preconditions check that asserted a
dictionary exists when forward index claims to be dictionary-encoded.
The segment loader sets isDictionaryEncoded() in lockstep with the
dictionary file's presence — the check was over-protecting an invariant
that's already guaranteed.

IdentifierTransformFunction.forwardIndexIsDictEncoded: drop the
ColumnContext.getDataSource() null guard. IdentifierTransformFunction is
only constructed via TransformFunctionFactory with ColumnContexts built
from ColumnContext.fromDataSource(...), where the DataSource is always
non-null. The remaining `forwardIndex != null` check is the real one —
it guards forward-index-disabled columns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ctory dispatch

Address Jackie-Jiang's remaining review comments:

DictionaryIndexType: when a FieldConfig with encoding=RAW also configures
an index that always requires a dictionary (FST, IFST, INVERTED), do NOT
mark the dictionary as DISABLED in the deserialized config. The dict
falls through to its default-enabled state, validation passes, and the
runtime auto-creation paths (BaseSegmentCreator.createDictionaryForColumn
/ ForwardIndexHandler) build a shared-dict + RAW forward index.

This closes the loop on Jackie's "Dictionary should be generated on FST
without explicit configuration" comment — users no longer need to inject
a dictionary entry alongside FST/IFST/INVERTED on a RAW column.

PredicateEvaluatorProvider:
- Replace the now-private factory-dispatch overload with a public
  buildEvaluator(predicate, dictionary, dataType, queryContext) entry
  point. This is for callers whose value stream is statically known
  (e.g. per-value transform evaluation in filterMv); filter-plan-time
  callers still go through getPredicateEvaluator(predicate, DataSource,
  ...) which applies gating.

FilterMvPredicateEvaluator: drop the duplicated buildPerValueEvaluator
switch — it was a copy of the factory dispatch. Now delegates to
PredicateEvaluatorProvider.buildEvaluator.

IdentifierTransformFunction: revert the encoding-based dictionary drop.
The Identifier always exposes the column dictionary if one exists; it is
the consumer's responsibility to additionally check forward-index
encoding when deciding whether dict-id reads will be cheap. This matches
the principle that callers should make decisions with all available
information.

FilterMvTransformFunction: now performs its own forward-encoding gate
via columnContextMap to decide whether to use the dict-id matching path
or fall back to per-value matching. Also rebuilds the result metadata so
hasDictionary() reflects the gated dictionary actually used here, rather
than the inner Identifier's (which always reports the underlying dict).

BinaryOperatorTransformFunction: same gate inline — only pass DataSource
to the predicate evaluator when the LHS Identifier wraps a dict-encoded
forward index.

ForwardIndexHandler: inline the unused two-arg getStatsCollector
overload at its single call site.

TableConfigUtilsTest: update two assertions that previously expected
"Cannot create inverted index ... without dictionary" failures for
encoding=RAW + INVERTED. With auto-enable that combination now
validates and produces a shared-dict + RAW forward index.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…parameter

TableIndexingTest.java: the FST encoding=RAW dictionary workaround,
indexes.put→set rewrites, and getErrorMessage null-safety changes were
all unrelated to this PR's scope. Revert the file to match master.
The DictionaryIndexType.fromFieldConfigs auto-enable now handles the
FST + encoding=RAW case at config time, so the test no longer needs
the manual dictionary block.

ForwardIndexHandler.getStatsCollector: drop the requireUniqueValues
boolean parameter. The flag was redundant — when the auto-dict-creation
path runs (the new code paths that flip a column from no-dict to dict),
_fieldIndexConfigs already carries the new config with dict enabled, so
hasIndex(column, dictionary) returns true and the no-dict optimization
is already skipped, producing a per-type collector that tracks unique
values. Restoring the original two-arg signature.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…raw internally

PredicateEvaluatorProvider:
- Single public entry: getPredicateEvaluator(predicate, dataSource, dictionary, dataType, queryContext).
  When dataSource is non-null the gating in getDictionaryUsableForFiltering derives the dictionary;
  when null the dictionary parameter is used directly. buildEvaluator (factory dispatch) is now private.

FilterMvPredicateEvaluator:
- Accept (predicate, dataType, dictionary, dataSource). Drop the dictionary internally if the inner
  forward index is RAW — filterMv evaluates per-value and forward.getDictIdMV would need expensive
  Dictionary#indexOf for RAW. Expose isDictionaryBased() so callers can sync their result metadata.
- forPredicate now delegates to the unified PredicateEvaluatorProvider; no more duplicated factory
  dispatch.

FilterMvTransformFunction:
- Drop the dictionaryUsableForFilterMv helper; let FilterMvPredicateEvaluator do the gate. Pass the
  inner Identifier's DataSource through. Build resultMetadata from the gated dictionary so downstream
  consumers (ExpressionScanDocIdIterator) take the right value-stream path.

DictionaryIndexType.hasIndexRequiringDictionary:
- Replace the hardcoded FST/IFST/INVERTED list with an iteration over IndexService.getInstance()
  .getAllIndexes(), checking IndexType.requiresDictionary against the IndexType's default config and
  honoring the FieldConfig's per-index "disabled":true flag in the indexes JsonNode. Generalizes to
  any IndexType (including future plugins) that declares requiresDictionary.
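
The iteration described above might look roughly like the following. Everything here is a simplified assumption: the nested IndexType interface, the type factory and the Map-based config stand in for Pinot's IndexService, IndexType and the FieldConfig indexes JsonNode, and the default-config lookup is collapsed into a single requiresDictionary() flag.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the real code walks IndexService.getInstance().getAllIndexes().
final class RequiresDictionarySketch {
  interface IndexType {
    String id();

    boolean requiresDictionary();
  }

  // Convenience factory for experiments.
  static IndexType type(final String id, final boolean requiresDictionary) {
    return new IndexType() {
      @Override
      public String id() {
        return id;
      }

      @Override
      public boolean requiresDictionary() {
        return requiresDictionary;
      }
    };
  }

  // indexes maps an index id to its per-index config, mirroring FieldConfig.indexes.
  static boolean hasIndexRequiringDictionary(List<IndexType> registered,
      Map<String, Map<String, Object>> indexes) {
    for (IndexType type : registered) {
      Map<String, Object> config = indexes.get(type.id());
      if (config == null) {
        continue; // index not configured on this column
      }
      boolean disabled = Boolean.TRUE.equals(config.get("disabled"));
      if (!disabled && type.requiresDictionary()) {
        return true;
      }
    }
    return false;
  }
}
```

Driving the check from the registered index types rather than a fixed list is what lets a plugin index opt in simply by declaring that it requires a dictionary.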

TableIndexingTest: drop the now-redundant manual dictionary block in the FST + RAW case. The
DictionaryIndexType auto-enable now handles it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ct, getStatsCollector flag

Two regressions surfaced in GitHub Actions Pinot Unit Test Sets:

1) NullHandlingEnabledQueriesTest.testExpressionFilterOperatorNotFilterOnMultiValue:
   ExpressionFilterOperator was passing a null dictionary to the unified
   getPredicateEvaluator. ExpressionScanDocIdIterator follows the inner
   transform's resultMetadata.hasDictionary() to choose between the dict-id
   and raw-value paths, so when the inner transform reports hasDictionary=true
   it feeds dict ids into a raw-value evaluator — wrong results. Hand the
   transform's getDictionary() through directly so the evaluator matches the
   value stream.

2) SegmentPreProcessorTest.testRangeIndexRebuiltOnDictionaryToggle and
   testIfNeedProcess[v3]: NPE in createDictionaryForRawForwardIndex because
   getStatsCollector returned a NoDictColumnStatisticsCollector (no unique
   values) when _fieldIndexConfigs still reported dict=disabled. This happens
   on the auto-toggle path: legacy invertedIndexColumns triggers
   ENABLE_DICTIONARY in computeOperations, but the deserialized
   FieldIndexConfigs hasn't been mutated to reflect the auto-derived dict
   requirement. Restore the requireUniqueValues parameter on getStatsCollector
   and pass true at the two dict-building call sites; this forces a per-type
   collector that tracks unique values regardless of the stale config flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FilterMvTransformFunction: gate the dictionary on the inner forward-index
encoding. IdentifierTransformFunction surfaces the column's dictionary
unchanged, but the dict-id matching path requires forward.getDictIdMV() to
serve dict ids cheaply — RAW forward indexes throw
UnsupportedOperationException from that method, so for shared-dict + RAW
columns we drop the dictionary internally and the predicate evaluator falls
back to per-value raw matching.

RangeIndexBasedFilterOperator.canEvaluate: also check that the predicate
evaluator's encoding matches the range index's encoding. RangeIndexCreator
and BitSlicedRangeIndexCreator both switch on hasDictionary at build/rebuild
time — the on-disk index stores dict IDs whenever a dictionary exists. Pairing
that index with a raw-value evaluator (e.g. our gating drops the dict on
shared-dict + RAW with a non-exact range index) silently returns wrong matches:
raw values would be compared against dict IDs by RangeIndexBasedFilterOperator.
Fall through to ScanBasedFilterOperator instead so raw values are scanned
against the raw forward index.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rsion change

RangeIndexType.validate: reject the combination of RAW forward index +
dictionary + range version 1 (legacy RangeIndexCreator). The on-disk v1
range index over dict IDs is non-exact — query-time partial matches fall
back to ScanBasedFilterOperator, which on a RAW forward column would have
to apply a dict-based evaluator against raw values, silently producing
wrong results. Force the BitSliced range index (version 2, exact) for this
combination so the failure mode is config validation rather than wrong
query answers.
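
As a sketch of the validation rule, assuming the three relevant facts are already known for the column. RangeValidationSketch, its validate signature and the exception type are hypothetical; the real RangeIndexType.validate works on Pinot's config objects.

```java
// Hypothetical sketch of the rejected combination.
final class RangeValidationSketch {
  static final int LEGACY_VERSION = 1; // non-exact range index

  static void validate(String column, boolean forwardIsRaw, boolean hasDictionary,
      int rangeIndexVersion) {
    if (forwardIsRaw && hasDictionary && rangeIndexVersion == LEGACY_VERSION) {
      // Fail at config time rather than risk wrong query results later.
      throw new IllegalStateException("Column " + column
          + ": RAW forward index + dictionary requires range index version 2");
    }
  }
}
```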

RangeIndexHandler.needUpdateIndices / updateIndices: also detect when the
on-disk range index version differs from the configured version. v1 and v2
have incompatible on-disk layouts and serve different query semantics
(non-exact vs exact), so a version change requires a full rebuild. Read the
version from the index buffer's first int (matching RangeIndexType.read's
dispatch) and rebuild when it doesn't match. The buffer is owned by
SegmentDirectory — don't close it from the handler, since the mmap region
is shared and closing it would crash other readers.
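
A minimal sketch of the version check, with a plain java.nio.ByteBuffer standing in for the segment's index buffer (an assumption; the real handler reads from the buffer owned by SegmentDirectory). RangeVersionCheckSketch, readVersion and needsRebuild are hypothetical names.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; a ByteBuffer stands in for the shared index buffer.
final class RangeVersionCheckSketch {
  static int readVersion(ByteBuffer indexBuffer) {
    // Absolute read of the first int: leaves the buffer position untouched,
    // and the buffer is never closed or released here because the caller owns it.
    return indexBuffer.getInt(0);
  }

  static boolean needsRebuild(ByteBuffer indexBuffer, int configuredVersion) {
    return readVersion(indexBuffer) != configuredVersion;
  }
}
```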

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@xiangfu0
Contributor Author

xiangfu0 commented May 8, 2026

Opened a follow-up docs PR for this change: pinot-contrib/pinot-docs#804

Labels

- enhancement: Improvement to existing functionality
- index: Related to indexing (general)
- inverted-index: Related to inverted index implementation
- testing: Related to tests or test infrastructure

4 participants