Auto-create dictionary when secondary indexes need it (inverted, FST, range) by xiangfu0 · Pull Request #17269 · apache/pinot

xiangfu0 · 2025-11-25T02:47:27Z

Summary

When a column is configured with encodingType: RAW (or appears in noDictionaryColumns) but a secondary index that needs a dictionary is also configured (inverted, fst, ifst, dict-id range), Pinot now auto-creates a standalone dictionary for that column so the secondary index can function. The forward index stays RAW; the dictionary lives alongside it.

Previously, such configs either silently produced wrong results, fell back to a deprecated raw-value bitmap inverted index, or required the user to manually add a "dictionary": {} block to the FieldConfig.indexes map. This PR makes the common case work without explicit configuration and removes the legacy raw-value bitmap inverted index path entirely.

Builds on the SPI surfaces introduced by #18364 (encoding) and #18365 (requiresDictionary / shouldInvalidateOnDictionaryChange).

What's covered

Auto-create dictionary

Config deserialization (DictionaryIndexType.fromFieldConfigs): when a FieldConfig declares encodingType=RAW AND any enabled index returns IndexType.requiresDictionary()=true (FST, IFST, INVERTED, etc.), the dictionary config is left at its default-enabled state instead of being marked DISABLED. Validation and the rest of the pipeline see a dict-enabled column from the start.
Segment creation (BaseSegmentCreator): DictionaryIndexConfig.requiresDictionary() drives auto-dict creation when a dict-requiring index is configured on a RAW column. If the user explicitly disabled the dictionary, a warning is logged and the dictionary is still created, because the alternatives are a hard segment-build failure or silent index loss.
Realtime mutable segments (MutableSegmentImpl): respects the encoding plumbing.
Reload (ForwardIndexHandler.ENABLE_DICTIONARY op): if an existing RAW segment is loaded under a config that now requests a dict-requiring index, ForwardIndexHandler materializes a standalone dictionary; subsequent handlers (InvertedIndexHandler, RangeIndexHandler) build their dict-id-based indexes from it.
Default columns (BaseDefaultColumnHandler): same logic for newly-added columns at reload.
Reconciliation (FieldIndexConfigsUtil): fail-fast if the user-supplied ForwardIndexConfig.encodingType disagrees with the column-level FieldConfig.encodingType / noDictionaryColumns.

`ForwardIndexHandler.computeOperations` simplification

The reload-time decision logic in ForwardIndexHandler.computeOperations was rewritten as four orthogonal questions per column, with the dictionary state expressed as a single rule that drives every dict-toggle decision:

desiredDict = newIsDict || any-enabled-index-requires-dict

State is tracked as FieldConfig.EncodingType (DICTIONARY, RAW, or null for "forward index disabled / not on disk"), making the state space explicit. The transition steps run in this order:

Forward-index transition — based on existingFwdEncoding vs newFwdEncoding. Three sub-cases: forward disabled, forward re-enabled (rebuild from dict + inverted), or encoding flips with forward staying on.
Dictionary transition — based on existingHasDict vs desiredDict. Auto-enables the dict whenever a secondary index requires it; refuses to remove the dict if any enabled index in the new config still needs it.
Compression-type change — only when no encoding change happened.
Cross-cutting guards — sorted-column toggles ignored, range-index format compatibility check, enable-forward needs dict + inverted on disk, enable-dict needs forward to be on.
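
The dictionary rule and the dictionary transition step can be sketched as below. This is a minimal illustration under stated assumptions, not the code in this PR: the IndexKind enum, its per-index flags, and the method names are hypothetical stand-ins for IndexType.requiresDictionary() and the dictionary operations queued by ForwardIndexHandler.

```java
import java.util.Set;

final class DictionaryTransitionSketch {
  /** Hypothetical stand-in for index types; the flag mirrors IndexType.requiresDictionary(). */
  enum IndexKind {
    INVERTED(true),
    FST(true),
    IFST(true),
    TEXT(false); // assumed not to need a dictionary, for illustration only

    final boolean requiresDictionary;

    IndexKind(boolean requiresDictionary) {
      this.requiresDictionary = requiresDictionary;
    }
  }

  /** Dictionary operation queued for a column; NONE means disk already matches the config. */
  enum DictOp { ENABLE_DICTIONARY, DISABLE_DICTIONARY, NONE }

  /** desiredDict = newIsDict || any-enabled-index-requires-dict */
  static boolean desiredDict(boolean newIsDict, Set<IndexKind> enabledIndexes) {
    for (IndexKind kind : enabledIndexes) {
      if (kind.requiresDictionary) {
        return true;
      }
    }
    return newIsDict;
  }

  /** Compares the on-disk dictionary state with the desired state and picks the operation. */
  static DictOp dictionaryTransition(boolean existingHasDict, boolean newIsDict,
      Set<IndexKind> enabledIndexes) {
    boolean desired = desiredDict(newIsDict, enabledIndexes);
    if (desired == existingHasDict) {
      return DictOp.NONE;
    }
    return desired ? DictOp.ENABLE_DICTIONARY : DictOp.DISABLE_DICTIONARY;
  }
}
```

In this sketch, a RAW column with no dictionary on disk and an inverted index in the new config yields ENABLE_DICTIONARY, while an existing dictionary is left alone for as long as any enabled index still requires it.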

The ENABLE_FORWARD_INDEX operation is split into ENABLE_DICT_FORWARD_INDEX and ENABLE_RAW_FORWARD_INDEX so the intent is explicit. Both variants cover two on-disk shapes:

Forward absent → rebuild from dict + inverted as the requested encoding.
Forward present but encoding flips DICT⇄RAW → rewrite in place. The new helper convertDictForwardToRawKeepingDictionary handles the DICT→RAW + dict-kept case (e.g. when an inverted index is added to a dict-encoded column without explicitly enabling dict).
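
A rough sketch of how the forward-index transition could pick between the split operations, assuming the encoding is modelled as a nullable enum (null meaning "forward index disabled / not on disk"). The class, enum, and parameter names are hypothetical, and the cross-cutting guards (for example, enable-forward needing dict + inverted on disk) are deliberately left out.

```java
final class ForwardTransitionSketch {
  /** null is used for "forward index disabled / not on disk". */
  enum Encoding { DICTIONARY, RAW }

  enum FwdOp { DISABLE_FORWARD_INDEX, ENABLE_DICT_FORWARD_INDEX, ENABLE_RAW_FORWARD_INDEX, NONE }

  static FwdOp forwardTransition(Encoding existing, Encoding desired,
      boolean existingHasDict, boolean desiredDict) {
    if (existing == desired) {
      return FwdOp.NONE; // nothing to do for the forward index
    }
    if (desired == null) {
      return FwdOp.DISABLE_FORWARD_INDEX;
    }
    // Forward absent on disk: rebuild it in the requested encoding.
    boolean reEnable = existing == null;
    // Forward present and encoding flips while the dictionary state stays the same:
    // the dictionary transition will not perform the conversion, so queue it here.
    boolean flipWithDictUnchanged = existing != null && existingHasDict == desiredDict;
    if (reEnable || flipWithDictUnchanged) {
      return desired == Encoding.DICTIONARY
          ? FwdOp.ENABLE_DICT_FORWARD_INDEX
          : FwdOp.ENABLE_RAW_FORWARD_INDEX;
    }
    // Encoding flips together with the dictionary state: covered by the
    // ENABLE_DICTIONARY / DISABLE_DICTIONARY operations instead.
    return FwdOp.NONE;
  }
}
```

Under these assumptions, a dict-encoded column that keeps its dictionary (because an inverted index still needs it) but moves to RAW gets ENABLE_RAW_FORWARD_INDEX, which corresponds to the convertDictForwardToRawKeepingDictionary case.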

Range index correctness gates (RAW + dict combination)

Shared-dict + RAW exposes a subtle correctness hazard around range indexes: when a dictionary exists at index build time, both RangeIndexCreator (v1) and BitSlicedRangeIndexCreator (v2) build over dict IDs. v1 is non-exact — its query-time partial matches fall back to ScanBasedFilterOperator, which on a RAW forward column would have to apply a dict-based evaluator against raw values, silently producing wrong results.

RangeIndexType.validate: rejects RAW forward + dictionary + RangeIndexCreator.VERSION (v1). Forces the BitSliced v2 (exact) range index for shared-dict + RAW columns so the failure mode is config validation rather than wrong query answers.
RangeIndexHandler.needUpdateIndices / updateIndices: read the on-disk range index version (first int of the buffer) and rebuild when it differs from the configured version. v1 and v2 have incompatible on-disk layouts and different exact/non-exact semantics, so a version change requires a full rebuild.
RangeIndexBasedFilterOperator.canEvaluate: returns false when dataSource.getDictionary() != null and the predicate evaluator isn't dict-based — falls through to ScanBasedFilterOperator, which correctly applies raw values against the raw forward index. Defends against any other path that produces a raw-value evaluator on a column whose range index was built over dict IDs.
PredicateEvaluatorProvider.getDictionaryUsableForFiltering: per-predicate-type decision. For RANGE on a RAW forward column, the dictionary is kept only when the range index is exact (isExact() == true) so non-exact range readers can't be paired with a dict-based evaluator that scan-fallback can't apply.
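
The first three gates can be illustrated with a small sketch. Everything here is a simplified model rather than the PR's implementation: the class and method names are hypothetical, the version constants 1 and 2 are assumed values for the v1 and v2 formats, the buffer is read with java.nio's default big-endian order, and canEvaluate models only the new guard, not the operator's other conditions.

```java
import java.nio.ByteBuffer;

final class RangeIndexGateSketch {
  static final int V1_NON_EXACT = 1;  // assumed value for the v1 format
  static final int V2_BIT_SLICED = 2; // assumed value for the v2 format

  /** Config-time gate: reject RAW forward + dictionary + v1 range index. */
  static void validate(boolean forwardIsRaw, boolean hasDictionary, int configuredVersion) {
    if (forwardIsRaw && hasDictionary && configuredVersion == V1_NON_EXACT) {
      throw new IllegalStateException(
          "v1 (non-exact) range index is not allowed on a RAW forward column with a dictionary");
    }
  }

  /** Reload-time gate: the first int of the index buffer is treated as the on-disk version. */
  static boolean needsRebuild(ByteBuffer rangeIndexBuffer, int configuredVersion) {
    int onDiskVersion = rangeIndexBuffer.getInt(0); // absolute read, position is untouched
    return onDiskVersion != configuredVersion;
  }

  /** Query-time gate: a range index built over dict IDs cannot serve a raw-value evaluator. */
  static boolean canEvaluate(boolean columnHasDictionary, boolean evaluatorIsDictBased) {
    return !columnHasDictionary || evaluatorIsDictBased;
  }
}
```

The point of the three checks is to move the failure as early as possible: a bad combination is rejected at validation, a stale on-disk format is rebuilt at reload, and anything that still slips through falls back to a scan at query time.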

Remove legacy raw-value bitmap inverted index

The pre-shared-dict format embedded its own dictionary inline inside the <col>.bitmap.inv.idx file (written by the now-deleted RawValueBitmapInvertedIndexCreator). Reads went through RawValueBitmapInvertedIndexReader and RawValueInvertedIndexFilterOperator. All three are deleted in favor of the standard BitmapInvertedIndexReader over a real standalone dictionary.

SegmentPreProcessor.removeLegacyRawValueInvertedIndexes pre-pass detects the legacy 44-byte big-endian header (version=1 + cardinality + offsets) on segment load, deletes the file, and lets the new ForwardIndexHandler + InvertedIndexHandler chain rebuild the index in dict-id format.
InvertedIndexType.ReaderFactory drops the hasDictionary branch; asserts a dictionary exists.

Dict-id-based rebuild path

DictionaryBasedIndexBuilder (new, ~200 lines): shared helper extracted from InvertedIndexHandler and RangeIndexHandler — reads raw forward values, looks each up in the dictionary, feeds (value, dictId) pairs into a DictionaryBasedInvertedIndexCreator. Single per-data-type dispatch handles SV, MV, INT/LONG/FLOAT/DOUBLE/BIG_DECIMAL/STRING/BYTES.
InvertedIndexAndDictionaryBasedForwardIndexCreator: split _dictionaryEnabled into two flags — _dictionaryPresent (a standalone dict file exists) and _dictionaryBasedForwardIndex (the forward index stores dict IDs). The two are now independent (RAW forward + standalone dict is the new third state).
BaseIndexHandler: wires the two-flag model.
Handler ordering contract: InvertedIndexHandler's class-level Javadoc documents that it requires a dictionary by the time updateIndices runs, and that SegmentPreProcessor enforces this by always running ForwardIndexHandler first.
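
The rebuild idea behind DictionaryBasedIndexBuilder (read raw forward values, look each one up in the dictionary, index by dict ID) can be sketched for a single-value INT column as follows. This is a simplified model: the dictionary is a plain sorted int array and the posting lists are Java lists, both chosen for illustration rather than taken from the PR, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DictIdInvertedIndexSketch {
  /**
   * Builds posting lists indexed by dict ID. The dict ID of a value is its position
   * in the sorted dictionary; each posting list holds the doc IDs containing that value.
   */
  static List<List<Integer>> build(int[] sortedDictionary, int[] rawForwardValues) {
    List<List<Integer>> postings = new ArrayList<>();
    for (int i = 0; i < sortedDictionary.length; i++) {
      postings.add(new ArrayList<>());
    }
    for (int docId = 0; docId < rawForwardValues.length; docId++) {
      // Per-row dictionary lookup: raw value -> dict ID.
      int dictId = Arrays.binarySearch(sortedDictionary, rawForwardValues[docId]);
      if (dictId < 0) {
        throw new IllegalStateException("Value missing from dictionary at doc " + docId);
      }
      postings.get(dictId).add(docId);
    }
    return postings;
  }
}
```

With the dictionary {10, 20, 30} and raw values {20, 10, 20, 30}, dict ID 1 (value 20) maps to docs 0 and 2. The forward index is never rewritten in this path, which is why it stays RAW while the inverted index speaks dict IDs.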

Query path: predicate evaluator selection on shared-dict columns

With shared-dict columns, predicates on the same column may need different evaluators: an EQ can use the dict-id-based inverted index, while a LIKE '%x%' on the same column must scan the raw forward.

PredicateEvaluatorProvider.getDictionaryUsableForFiltering: per-predicate-type decision returning the dictionary only when a dict-consuming filter operator (sorted, inverted, exact range) is actually available and enabled for that specific predicate type. Inverted-only is dropped for RANGE; non-exact range is dropped for RANGE/EQ. RAW forward + scan reads raw values directly.
FilterMvTransformFunction: per-value evaluation via transformToDictIdsMV requires forward.getDictIdMV() to actually serve dict ids cheaply. RAW forward indexes throw UnsupportedOperationException from that method, so when the inner Identifier wraps a RAW column (with or without a shared dictionary), the dictionary is dropped here and filterMv falls back to per-value raw matching.
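
One possible reading of the per-predicate-type decision for a RAW forward column with a shared dictionary, restricted to EQ and RANGE. The names are hypothetical, and treating an exact range index as dict-consuming for EQ is an inference from "non-exact range is dropped for RANGE/EQ", not something this description states directly.

```java
final class DictionaryForFilteringSketch {
  enum PredicateType { EQ, RANGE }

  /**
   * Returns true when the dictionary should be handed to the predicate evaluator,
   * i.e. when some dict-consuming index can actually serve this predicate type.
   * Otherwise the evaluator is built on raw values and the RAW forward index is scanned.
   */
  static boolean keepDictionary(PredicateType type, boolean sorted, boolean inverted,
      boolean range, boolean rangeIsExact) {
    boolean exactRange = range && rangeIsExact;
    switch (type) {
      case EQ:
        // Non-exact range alone is not enough (assumption: exact range is).
        return sorted || inverted || exactRange;
      case RANGE:
        // Inverted-only is dropped for RANGE; non-exact range is dropped too.
        return sorted || exactRange;
      default:
        return false;
    }
  }
}
```

So on a column with only an inverted index, EQ keeps the dictionary and uses the inverted index, while RANGE drops it and scans raw values.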

`DataFetcher`

DataFetcher.addDataSource drops the (shared) dictionary when the forward index is RAW — the existing per-method branches in ColumnValueReader then take the raw-value paths uniformly without touching the read methods. Callers that genuinely need dict ids on a RAW + shared-dict column read raw values and consult the dictionary directly. DefaultGroupByExecutor similarly gates on forwardIndex.isDictionaryEncoded() so shared-dict + RAW columns route to the no-dict GROUP BY path.
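
The gate itself is a one-liner. A minimal sketch, with a hypothetical single-method interface standing in for the forward index reader and a generic type standing in for the dictionary:

```java
final class RawForwardDictionaryGateSketch {
  /** Hypothetical one-method view of a forward index reader. */
  interface ForwardIndexView {
    boolean isDictionaryEncoded();
  }

  /**
   * Returns the dictionary to use for value reads: the shared dictionary only when
   * the forward index stores dict IDs, otherwise null so raw-value paths are taken.
   */
  static <D> D dictionaryForValueReads(ForwardIndexView forwardIndex, D sharedDictionary) {
    return forwardIndex.isDictionaryEncoded() ? sharedDictionary : null;
  }
}
```

Because the existing read methods already branch on whether a dictionary is present, dropping it at construction is enough to route every read through the raw-value path.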

`addRaw(Object)` SPI extraction

ForwardIndexCreator.addRaw(Object) (new default method): extracted from the add(Object, dictId) body so handlers can write raw values without going through the dict-id routing branch. Required by the rebuild path that converts dict-id forward back to raw forward when dropping a dictionary.
7 concrete creators (CLPV1, CLPV2, MV/SV Fixed/VarByte, SameValue) implement add() → addRaw().
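
The shape of the extraction, as a simplified sketch. The interface and class below are hypothetical miniatures: the single putRaw(Object) method stands in for the typed put methods a real creator exposes, and only the add → addRaw delegation is meant to mirror the change described above.

```java
import java.util.ArrayList;
import java.util.List;

interface ForwardCreatorSketch {
  boolean isDictionaryEncoded();

  void putDictId(int dictId);

  void putRaw(Object value); // hypothetical stand-in for typed put methods

  /** Routes to dict IDs or raw values depending on the forward index encoding. */
  default void add(Object value, int dictId) {
    if (isDictionaryEncoded()) {
      putDictId(dictId);
    } else {
      addRaw(value);
    }
  }

  /** Writes the raw value directly, bypassing the dict-id routing branch. */
  default void addRaw(Object value) {
    putRaw(value);
  }
}

/** A raw creator: add() ignores the dict ID and delegates straight to addRaw(). */
final class RawCreatorSketch implements ForwardCreatorSketch {
  final List<Object> written = new ArrayList<>();

  @Override
  public boolean isDictionaryEncoded() {
    return false;
  }

  @Override
  public void putDictId(int dictId) {
    throw new UnsupportedOperationException("raw forward index does not store dict ids");
  }

  @Override
  public void putRaw(Object value) {
    written.add(value);
  }

  @Override
  public void add(Object value, int dictId) {
    addRaw(value);
  }
}
```

A handler converting a dict-id forward index back to raw can then call addRaw(value) on the creator without supplying a dict ID at all.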

Tests

IndexCombinationValidationTest (~640 lines, new) — exhaustive matrix over (encoding, dictionary, secondary index, column type, compression codec) combinations.
LegacyRawValueInvertedIndexMigrationIntegrationTest (new, in pinot-integration-tests) — full-cluster integration test with a synthetically-constructed legacy segment (raw forward + legacy embedded-dict inverted file written by a resurrected LegacyRawValueBitmapInvertedIndexCreator). Tars it, uploads to a real cluster, verifies the server preprocessor migrates the segment and that EQ / IN / NOT_EQ queries return correct counts.
RawForwardIndexWithDictionaryTest (~580 lines) — 33 cases covering SV/MV × EQ/RANGE/IN/REGEXP_LIKE on RAW forward + dict, plus comprehensive mixed-predicate matrix combining inverted-index dict path and raw-scan path on the same column with explicit non-zero expected counts.
RawForwardIndexInvertedIndexTest — query-result equivalence vs DICTIONARY-encoded baseline.
SegmentPreProcessorTest.testRangeIndexRebuiltOnDictionaryToggle (new) — asserts the range index is rebuilt with the right format every time the dictionary state toggles, covering both the auto path (driven by inverted-index add/remove) and the explicit toggle path.
SegmentPreProcessorTest.testFlipDictForwardToRawForwardKeepingDictionaryForInvertedIndex (new) — encoding-flip end-to-end test asserting that a dict-encoded INT column with an inverted index, when reloaded under a config that puts the column in noDictionaryColumns, ends up as dict + inverted + raw forward (dict kept, encoding flipped, inverted intact).
InvertedIndexHandlerTest — isLegacyRawValueInvertedIndexFormat detection + SegmentPreProcessor rebuild path.
ForwardIndexHandlerTest (+477 lines) — reload-time auto-create-dict coverage.
PredicateEvaluatorProviderTest (new) — per-predicate-type dict-drop decisions, including non-exact range and dict-required-by-inverted scenarios.
FilterMvTransformFunctionTest — RAW + dict + inverted column filterMv parity with dict-encoded baseline; per-value path drops dict for RAW forward.
SegmentPreProcessorTest, ColumnMetadataImplTest (new), SegmentGeneratorConfigTest, TableConfigUtilsTest, DictionaryIndexTypeTest, LazyRowTest, ForwardIndexHandlerReloadQueriesTest, TableIndexingTest + CSV — combination coverage and reload regressions.
RealtimeSegmentConverterTest + CrcUtilsTest — CRC values updated for the new metadata key footprint.

Backward compatibility

Old segments without FORWARD_INDEX_ENCODING in metadata.properties: ColumnMetadataImpl.fromPropertiesConfiguration falls back to inferring encoding from HAS_DICTIONARY (the field added by #18364, "Forward-index encoding: introduce Encoding SPI surface and use it for raw vs dict checks", which this PR depends on).
Old segments with the legacy raw-value bitmap inverted format on disk: detected by the 44-byte header signature and migrated by SegmentPreProcessor on first load under the new code. Covered by LegacyRawValueInvertedIndexMigrationIntegrationTest.
Old segments with v1 (non-exact) range index on a column that now needs a shared dictionary: RangeIndexHandler detects the on-disk version mismatch and rebuilds in v2 format on reload.
Existing valid configs: noDictionaryColumns + invertedIndexColumns overlap, previously implicit, now produces a real dict + standard inverted via auto-create. No config change required.

Example FieldConfig

Implicit (auto-create dictionary because inverted needs it):

{
  "fieldConfigList": [
    {
      "name": "myColumn",
      "encodingType": "RAW",
      "indexes": { "inverted": {} }
    }
  ]
}

Explicit (also valid; both dictionary and inverted listed):

{
  "fieldConfigList": [
    {
      "name": "myColumn",
      "encodingType": "RAW",
      "compressionCodec": "LZ4",
      "indexes": { "dictionary": {}, "inverted": {} }
    }
  ]
}

Out of scope (follow-up PR)

The full operator-routing migration to gate on BlockValSet.isDictionaryEncoded() (instead of BlockValSet.getDictionary() != null) is split into a separate branch blockvalset-isDictionaryEncoded-routing-for-shared-dict-raw. That branch wires the gate through DistinctExecutorFactory, NoDictionaryMultiColumnGroupKeyGenerator, and all DistinctCount* aggregation functions. This PR includes only the DefaultGroupByExecutor, BinaryOperatorTransformFunction, and FilterMvTransformFunction gate updates needed to keep the new shared-dict + RAW tests green.

🤖 Generated with Claude Code

Copilot

Pull request overview

This PR enables dictionary inference for indexes that require dictionaries (inverted, FST, range) and enhances raw forward index handling throughout the codebase. It introduces automatic dictionary creation when required by certain index types, plumbs raw-encoding awareness through forward index creator/reader factories, and extends the forward index handler to support raw-forward columns in various operations.

Key changes:

Automatic dictionary inference when indexes requiring dictionaries are configured
Raw-encoding flag in ForwardIndexConfig to distinguish raw vs. dictionary-encoded forward indexes
Enhanced ForwardIndexHandler to support raw forward indexes with dictionary add/remove operations
New integration test verifying raw forward index + inverted index functionality

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 4 comments.

File	Description
githubComplexTypeEvents_offline_table_config.json	Adds raw-encoded forward index with inverted index configuration example
ForwardIndexCreator.java	Refactors raw value handling into separate `addRaw` method
IndexType.java	Adds `requiresDictionary` method to indicate dictionary requirement
ForwardIndexUtils.java	New utility to identify raw forward index columns from table config
ForwardIndexConfig.java	Adds `rawEncoding` flag and getter/builder methods
FieldIndexConfigsUtil.java	Implements automatic dictionary enablement for required indexes
DictionaryIndexConfig.java	Adds utility methods to check if dictionary is required by any index
ForwardIndexTypeTest.java	Updates tests to include rawEncoding flag in expected configs
ColumnMinMaxValueGenerator.java	Passes ForwardIndexConfig when creating readers
InvertedIndexAndDictionaryBasedForwardIndexCreator.java	Respects rawEncoding flag when determining dictionary usage
ForwardIndexHandler.java	Adds ENABLE_DICTIONARY_FOR_RAW_FORWARD_INDEX operation and handling logic
InvertedIndexType.java	Implements requiresDictionary to return true
IFSTIndexType.java	Implements requiresDictionary to return true
FstIndexType.java	Implements requiresDictionary to return true
ForwardIndexType.java	Updates getFileExtension to return multiple possible extensions
ForwardIndexReaderFactory.java	Uses rawEncoding flag to determine reader type
ForwardIndexCreatorFactory.java	Uses rawEncoding flag to determine creator type
SingleValueVarByteRawIndexCreator.java	Overrides add method to delegate to addRaw
SingleValueFixedByteRawIndexCreator.java	Overrides add method to delegate to addRaw
MultiValueVarByteRawIndexCreator.java	Overrides add method to delegate to addRaw
MultiValueFixedByteRawIndexCreator.java	Overrides add method to delegate to addRaw
CLPForwardIndexCreatorV2.java	Overrides add method for CLP encoding
CLPForwardIndexCreatorV1.java	Overrides add method for CLP encoding
SegmentIndexCreationDriverImpl.java	Fixes inverted logic bug (isDisabled → isEnabled)
SegmentColumnarIndexCreator.java	Adds warning when dictionary required but explicitly disabled
SameValueForwardIndexCreator.java	Overrides add method to delegate to addRaw
RawForwardIndexInvertedIndexTest.java	New integration test for raw forward index with inverted index
ForwardIndexHandlerReloadQueriesTest.java	Updates test expectations and adds raw forward index test coverage
DataFetcher.java	Adds _useDictionary field for proper dictionary usage check

Copilot

Pull request overview

Copilot reviewed 29 out of 29 changed files in this pull request and generated no new comments.

codecov-commenter · 2025-11-25T13:36:55Z

Codecov Report

❌ Patch coverage is 64.31298% with 187 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.65%. Comparing base (5ccd383) to head (0e11efc).
⚠️ Report is 6 commits behind head on master.

Files with missing lines	Patch %	Lines
...der/invertedindex/DictionaryBasedIndexBuilder.java	12.19%	69 Missing and 3 partials ⚠️
...ocal/segment/index/loader/ForwardIndexHandler.java	79.12%	14 Missing and 24 partials ⚠️
...ator/transform/function/ItemTransformFunction.java	0.00%	19 Missing ⚠️
...segment/spi/index/creator/ForwardIndexCreator.java	0.00%	15 Missing ⚠️
...dex/loader/invertedindex/InvertedIndexHandler.java	62.96%	7 Missing and 3 partials ⚠️
.../segment/index/dictionary/DictionaryIndexType.java	86.53%	1 Missing and 6 partials ⚠️
...ertedindex/LegacyRawValueInvertedIndexCleanup.java	75.00%	3 Missing and 3 partials ⚠️
...r/filter/predicate/PredicateEvaluatorProvider.java	80.95%	1 Missing and 3 partials ⚠️
.../index/loader/invertedindex/RangeIndexHandler.java	82.60%	2 Missing and 2 partials ⚠️
...operator/filter/RangeIndexBasedFilterOperator.java	50.00%	1 Missing and 2 partials ⚠️
... and 5 more

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #17269      +/-   ##
============================================
+ Coverage     63.57%   63.65%   +0.08%     
- Complexity     1717     1723       +6     
============================================
  Files          3252     3254       +2     
  Lines        199132   199459     +327     
  Branches      30875    30977     +102     
============================================
+ Hits         126596   126966     +370     
+ Misses        62454    62361      -93     
- Partials      10082    10132      +50

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (ø)`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-21	`63.65% <64.31%> (+0.08%)`	⬆️
temurin	`63.65% <64.31%> (+0.08%)`	⬆️
unittests	`63.65% <64.31%> (+0.08%)`	⬆️
unittests1	`55.72% <39.50%> (+0.06%)`	⬆️
unittests2	`34.95% <55.34%> (+0.04%)`	⬆️


Copilot

Pull request overview

Copilot reviewed 37 out of 37 changed files in this pull request and generated 7 comments.

Jackie-Jiang

To future-proof this: how do you plan to support both dictionary-encoded and raw forward indexes on the same column? Specifically, what would the FieldConfig look like?

Copilot

Pull request overview

Copilot reviewed 39 out of 39 changed files in this pull request and generated 6 comments.

testFetchDictIdsFromRawForwardIndexWithSharedDictionary previously called DataFetcher#fetchDictIds and expected the implicit per-row dictionary lookup. With the new contract, fetchDictIds throws on a RAW forward index and callers must opt in via fetchDictIdsFromRawValues. The test now asserts both: the throw (1) and the explicit-path dict ids (2). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rward index" This reverts commit 7e66df4.

This reverts commit c8290ec.

…alue path" This reverts commit 1b45308.

@Jackie-Jiang

Per @Jackie-Jiang's PR comment: readDictIds should focus on reading dict ids directly from a dictionary-encoded forward index; readDictIdsFromRawValues handles the explicit per-row dictionary lookup when the forward index is RAW but a (shared) dictionary exists. Each ColumnValueReader method now does exactly one thing. The dispatch between the two paths moves up to the public DataFetcher#fetchDictIds entry point so existing callers are unaffected. A new public DataFetcher#fetchDictIdsFromRawValues (SV + MV) lets future callers opt explicitly into the per-row dictionary lookup when they have already verified the column is RAW + shared-dict. Also rename the ColumnValueReader field _useDictionary to _dictionaryEncoded to match the boolean's actual meaning, and remove the hot-path Preconditions guards in readDictIdsFromRawValues* — the caller contract documents the precondition and the underlying readers will throw if invariants are violated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fetchDictIds (SV + MV) now reads dict ids directly from the forward index with no implicit fallback. On a RAW + shared-dict column the underlying ForwardIndexReader#readDictIds throws UnsupportedOperationException — callers that genuinely need dict ids on such a column must call the new public fetchDictIdsFromRawValues (SV + MV) explicitly. This makes the per-row Dictionary#indexOf cost a deliberate caller decision instead of a hidden one. Update DataFetcherTest to assert the new contract: fetchDictIds throws on a RAW reader, and fetchDictIdsFromRawValues returns the dict ids produced by per-row dictionary lookups. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…OUP BY The previous gate (columnContext.getDictionary() == null) routed shared-dict + RAW columns to DictionaryBasedGroupKeyGenerator, which calls BlockValSet#getDictionaryIdsSV → DataFetcher#fetchDictIds. With fetchDictIds no longer falling back to per-row dictionary lookups, that path now throws. Extend the gate to also require the underlying forward index to be dictionary-encoded so columns with a shared standalone dictionary on a RAW forward index take the NoDictionarySingleColumnGroupKeyGenerator path. Other operators (DistinctExecutorFactory, aggregation functions, NoDictionaryMultiColumnGroupKeyGenerator, IdentifierTransformFunction) will be migrated to the same gate in a follow-up PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ndex need

Refactor computeOperations into a per-column helper that decomposes into four orthogonal questions: forward-index transition, dictionary transition, compression-only change, and cross-cutting guards (sorted columns, range-index format compatibility, enable-forward prerequisites, enable-dictionary needs forward).

The dictionary decision is now a single rule: desiredDict = newIsDict || DictionaryIndexConfig.requiresDictionary(fieldSpec, newConf). The "force dict on if any index requires it" path means adding an inverted index to a raw column now auto-creates the dictionary (and the inverted index gets built against it by InvertedIndexHandler in the same reload), where the pre-PR behavior silently skipped this combo and left the inverted-index request orphaned. Removing that index later auto-removes the dictionary again.

shouldDisableDictionary now consults DictionaryIndexConfig.requiresDictionary to refuse dict removal whenever any enabled index in the new config still needs it (covers inverted, FST, IFST, and any future dict-requiring index type). The on-disk inverted/FST hasIndex check was removed: InvertedIndexHandler runs after ForwardIndexHandler and removes orphaned indexes, so refusing dict removal based on transient on-disk state was overly conservative.

getStatsCollector gains a requireUniqueValues parameter so the dict-creation paths skip the no-dict-optimized collector and use the type-specific collector that tracks unique values. Without this, the auto-create-dict path NPEs when ClusterConfigForTable's optimized no-dict collector is in effect.

Tests:
- testRangeIndexRebuiltOnDictionaryToggle: new SegmentPreProcessorTest case asserting the range index is rebuilt with the right format every time the dictionary state toggles, covering both the auto path (driven by inverted-index add/remove) and the explicit toggle path (noDictionaryColumns add/remove).
- testIfNeedProcess: updated to expect ENABLE_DICTIONARY when adding inverted to a raw column on v3 (the new auto path); v1 still skips.
- testComputeOperationDisableForwardIndex TEST13: updated to expect no operation queued when inverted is in the new config (dictionary must stay).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…FORWARD_INDEX

State representation: Replace the existingHasFwd / newIsFwd booleans with FieldConfig.EncodingType values where null means "forward index disabled / not on disk", DICTIONARY means dict-encoded forward, and RAW means raw forward. This makes the state space explicit and lets the compression-only branch test existingFwdEncoding == FieldConfig.EncodingType.RAW directly instead of calling isForwardIndexDictionaryEncoded(column).

Operation split: ENABLE_FORWARD_INDEX is split into ENABLE_DICT_FORWARD_INDEX and ENABLE_RAW_FORWARD_INDEX. computeColumnOperations picks the right variant based on newFwdEncoding so the intent is explicit at the operation level. Both variants flow through createForwardIndexIfNeeded (which already reads the target encoding from the new config), but the post-rebuild assertions in updateIndices are now type-specific: ENABLE_DICT_FORWARD_INDEX requires the dictionary to remain after rebuild, ENABLE_RAW_FORWARD_INDEX requires it to be absent.

Updated test cases in ForwardIndexHandlerTest accordingly: TESTs that explicitly set FieldConfig.EncodingType.RAW for a forward-index-disabled column now expect ENABLE_RAW_FORWARD_INDEX, while default-dict cases keep ENABLE_DICT_FORWARD_INDEX.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fix the wrong assertion in ENABLE_RAW_FORWARD_INDEX (dict + inverted + raw forward is a valid post-rebuild state when an inverted index is enabled; the assertion that the dictionary must be absent contradicted the auto-keep-dictionary-when-required-by-index logic from the previous commits) and extend the operation set to actually perform the encoding flip when the dict transition step won't.

The forward-index transition path now has three sub-cases:
- Existing forward, new disables it → DISABLE_FORWARD_INDEX.
- Existing disabled, new re-enables it → ENABLE_DICT/RAW_FORWARD_INDEX, reusing createForwardIndexIfNeeded.
- Forward stays on but encoding flips DICT⇄RAW → queued only when the dict transition won't already cover the conversion (i.e. existingHasDict == desiredDict).

The DICT→RAW + drop-dict and RAW→DICT + create-dict combos still fall out of DISABLE_DICTIONARY / ENABLE_DICTIONARY unchanged, so the encoding-flip op is reserved for the dict-stays cases:
- DICT→RAW with dict kept (e.g. inverted index added to a dict-encoded column without explicitly enabling dict)
- RAW→DICT with shared dict already present

For the DICT→RAW + dict kept case, add convertDictForwardToRawKeepingDictionary: the rewrite reuses the existing rewriteDictToRawForwardIndex helper but keeps the dictionary on disk and does not call removeDictRelatedIndexes (secondary indexes against unchanged dict ids stay valid).

Tests:
- testFlipDictForwardToRawForwardKeepingDictionaryForInvertedIndex: end-to-end test asserting the encoding flip on a dict-encoded INT column with an inverted index leaves the column as dict + inverted + raw forward.
- testComputeOperationDisableDictionary TEST3: previously asserted no-op for "disable dict + add inverted on dict-encoded column" (the pre-PR silent-skip behavior). Updated to expect the correct op ENABLE_RAW_FORWARD_INDEX, which performs the encoding flip while the inverted-index requirement keeps the dictionary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…urce Revert all the DataFetcher / DataFetcherTest churn from earlier commits (API split, _dictionaryEncoded rename, Preconditions on hot paths, etc.) back to master and replace it with the minimal change that solves the shared-dict + RAW correctness issue: when forwardIndexReader.isDictionaryEncoded() is false, pass null instead of dataSource.getDictionary() to ColumnValueReader. ColumnValueReader's existing per-method branches all gate on _dictionary != null, so dropping the dictionary at construction makes every value-read method take the raw-value path uniformly without touching the read methods themselves. Callers that need dict ids on a RAW + shared-dict column still propagate UnsupportedOperationException from the underlying RAW reader's readDictIds default — that surfaces as a clear failure for any operator that hasn't been migrated to gate on forward-index encoding (DefaultGroupByExecutor was migrated in an earlier commit; the remaining migrations live in the BlockValSet follow-up branch). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Centralize the "no dictionary path on RAW forward index" rule in ColumnContext.fromDataSource. Every consumer of columnContext.getDictionary() now naturally routes shared-dict + RAW columns to the raw-value path without needing a per-call check:
- IdentifierTransformFunction: simplified. Its inline check (forwardIndex.isDictionaryEncoded() ? columnContext.getDictionary() : null) is now redundant because ColumnContext returns null directly.
- DefaultGroupByExecutor: revert the extra forward-index-encoding check added in an earlier commit; the master gate (columnContext.getDictionary() == null) is correct again now that ColumnContext drops the dictionary at the source.
- DictionaryBasedGroupKeyGenerator, DistinctExecutorFactory, NoDictionaryMultiColumnGroupKeyGenerator: no change needed. They already gate on columnContext.getDictionary() and now see null for shared-dict + RAW.

Callers that legitimately need the underlying dictionary (e.g. DictionaryBasedDistinctOperator iterating dict values directly) can still get it via columnContext.getDataSource().getDictionary().

This is the same pattern applied earlier to DataFetcher.addDataSource; moving it to ColumnContext makes the rule the single source of truth for the operator/transform-function tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@xiangfu0

Per @xiangfu0's review: ColumnContext should expose both the data source and the dictionary so upper functions can decide for themselves. Some operators (e.g. DictionaryBasedDistinctOperator) iterate the dictionary directly and don't care about forward-index encoding; they would lose the dictionary if it were dropped at the source.

Restore ColumnContext.fromDataSource to master behavior. Apply the "forward must be dict-encoded for dict-id reads" rule at each call site that does dict-id reads:
- IdentifierTransformFunction: drop dict for shared-dict + RAW so transformToDictIdsSV/MV doesn't get advertised.
- DefaultGroupByExecutor: route shared-dict + RAW columns to the no-dict GROUP BY path so DictionaryBasedGroupKeyGenerator's dict-id reads don't fire.

Both checks are null-safe against DataSource without a forward index (test mock case).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add INT_MV_DICT_RAW_COLUMN to BaseTransformFunctionTest, configured as EncodingType.RAW with both an explicit dictionary and inverted index in its FieldConfig: the canonical shared-dict + RAW shape introduced by this PR. The values mirror INT_MV_COLUMN so callers can compare results against the dict-encoded baseline.

testFilterMvOnSharedDictRawForwardColumn in FilterMvTransformFunctionTest exercises every predicate in the existing IntPredicate matrix (EQ, NEQ, RANGE, IN, NOT_IN, BETWEEN, AND, OR, NOT) and asserts:

- The data source actually has a RAW forward index plus a dictionary on disk (sanity: the configuration produced the intended shape).
- The TransformFunction reports getDictionary() == null and TransformResultMetadata.hasDictionary() == false. This is the signal that drives FilterMvPredicateEvaluator -> PredicateEvaluatorProvider to construct a raw-value matching evaluator (matchesInt etc.) instead of a matchesDictId evaluator.
- transformToIntValuesMV produces results that match the expected matcher applied to _intMVValues, confirming the predicate evaluator built without a dictionary still returns correct rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tor overload

Split the existing INT_MV_DICT_RAW_COLUMN coverage into two distinct on-disk shapes and exercise filterMv against both:

- INT_MV_DICT_RAW_COLUMN: RAW forward + explicit shared dictionary, no secondary index.
- INT_MV_DICT_RAW_INV_COLUMN (new): RAW forward + shared dictionary + inverted index. This covers the case the user wanted documented in tests.

Both columns hold the same values as INT_MV_COLUMN. The new testFilterMvOnSharedDictRawForwardWithInvertedColumn parameterizes over the existing IntPredicate matrix and asserts the filterMv result is identical for all three columns (dict-encoded baseline, RAW + dict, RAW + dict + inverted) for every predicate. The inverted index is irrelevant to filterMv's per-value evaluation but having it on disk must not change the result.

Also add a DataSource-aware overload to FilterMvPredicateEvaluator so the evaluator can pull the dictionary only when the underlying forward index is dictionary-encoded:

    public static FilterMvPredicateEvaluator forPredicate(String predicate, DataSource dataSource)

FilterMvTransformFunction now uses this overload when the inner argument is a direct column reference (IdentifierTransformFunction); for transform arguments it keeps the dictionary-only path. The control flow is fixed to use an else branch so the DataSource-aware evaluator isn't overwritten by the legacy call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@xiangfu0

Per @xiangfu0's review: each external caller should evaluate whether it has a DataSource. Consolidate PredicateEvaluatorProvider's public surface to one method:

    public static PredicateEvaluator getPredicateEvaluator(Predicate predicate, @Nullable DataSource dataSource, DataType dataType, @Nullable QueryContext queryContext)

When the caller has a column DataSource, pass it so the gating logic in getDictionaryUsableForFiltering can pick between dict-based and raw-value evaluation for shared-dict + RAW columns. When the caller doesn't (post-reduction matchers, intermediate-result aggregators, computed transforms), pass null and the evaluator is built from raw values using the supplied data type.

The two dictionary-based getPredicateEvaluator overloads (3-arg and 4-arg) are now private. They remain as the internal dispatch implementation but are no longer part of the public API.

Migrations:

- FilterPlanNode: pass dataSource.getDataSourceMetadata().getDataType() through.
- BinaryOperatorTransformFunction: drill into IdentifierTransformFunction for the column's DataSource; for non-Identifier transforms pass null.
- ExpressionFilterOperator: same Identifier-or-null pattern.
- PredicateRowMatcher, DistinctCountThetaSketchAggregationFunction: pass null DataSource (post-reduction / intermediate-result paths have no column to point at).
- PredicateEvaluatorProviderTest: pass FieldSpec.DataType.STRING through.

FilterMvPredicateEvaluator deliberately bypasses PredicateEvaluatorProvider now: filterMv evaluates per-value at transform time, so the filter-plan-time gating PredicateEvaluatorProvider applies (which would route shared-dict + RAW columns through the dict-based evaluator) is the wrong policy at this layer. FilterMvPredicateEvaluator builds its evaluator directly via the Equals/NotEquals/In/NotIn/Range/RegexpLike factories based solely on whether a dictionary is supplied. IdentifierTransformFunction already returns null for shared-dict + RAW columns, so the existing forPredicate(predicate, dataType, dictionary) signature is the right call surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@VisibleForTesting

PredicateEvaluatorProvider:

- For RANGE on RAW forward, drop the dictionary unless the range index is exact (Jackie: "we need to ensure the predicate can solely be resolved with range index. Mixing dictionary encoded range index and raw forward index scan will break"). Non-exact (legacy) range readers fall back to ScanBasedFilterOperator for partial matches; that scan applies the predicate evaluator on raw forward values, which would break a dict-based evaluator.
- Remove the sortedAvailable check inside the RAW-forward branch (Jackie: "this is dead code. Sorted index is run-length dictionary encoded forward index"). A sorted forward index is dict-encoded by definition, so it can never coexist with a RAW forward in this branch.
- Make getDictionaryUsableForFiltering package-private and annotate it with @VisibleForTesting (Jackie's minor suggestion).
- Update the public method's docstring to clarify caller responsibilities: leaf-filter callers may pass DataSource freely; transform-layer callers must only pass DataSource when the inner transform's getDictionary() is non-null (Jackie: "double check all callers of this, make sure it is not using forward index").

BinaryOperatorTransformFunction: pass leftDataSource only when the inner transform exposes a dictionary, mirroring IdentifierTransformFunction's "forward index is dict-encoded" contract. For RAW (including shared-dict + RAW) the inner transform returns null from getDictionary(), so the predicate evaluator is built against raw values and applySV(value) on the raw forward output stays consistent.

ExpressionFilterOperator: always pass null DataSource. The inner is always a function (FilterPlanNode dispatches direct column refs through the leaf-filter path) and ExpressionScanDocIdIterator scans the transform output applying applySV(value) on raw values, so a dict-based evaluator would never be safe here.

DataFetcher: add Preconditions check that dictionary != null when the forward index claims to be dictionary-encoded (Jackie: "is it possible that forward index is dictionary encoded by dictionary doesn't exist?"). Defensive against an impossible state: it fails fast at DataFetcher construction rather than silently producing a ColumnValueReader that will NPE later in the read path.

BaseSegmentCreator.createDictionaryForColumn: when an index requires a dictionary, create one regardless of the user's explicit dictionary setting (Jackie: "should we always create dictionary when any index requires it?"). Previously, explicitly disabling the dictionary while enabling an index that requires it would lead to segment creation failure (e.g. inverted index can't be built without a dict) or silent index loss. Now we log a warning and proceed.

DefaultGroupByExecutor: add a comment explaining that ColumnContext.getDataSource() can be null for computed (non-Identifier) transforms (Jackie: "could datasource ever be null?"); in that case getDictionary() == null already routes those columns onto the no-dict GROUP BY path via the first condition.

Tests: drop two tests that exercised an impossible RAW + sorted scenario and add a new test for the non-exact range index case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
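
The RANGE gating rule from this commit can be written as a small decision function. This is a sketch under stated assumptions: `RangeDictionaryGate` and `useDictionaryForRange` are hypothetical names, the four boolean inputs stand in for whatever the real method derives from the DataSource, and non-RANGE predicates are not covered.

```java
// Hypothetical sketch; the name and parameters do not mirror Pinot's real signatures.
final class RangeDictionaryGate {
  /**
   * Decides whether a RANGE predicate evaluator may be built against dict ids.
   *
   * @param hasDictionary      a dictionary exists for the column
   * @param forwardDictEncoded the forward index stores dict ids (false means RAW)
   * @param hasRangeIndex      a range index exists for the column
   * @param rangeIndexExact    the range index never needs a scan for partial matches
   */
  static boolean useDictionaryForRange(boolean hasDictionary, boolean forwardDictEncoded,
      boolean hasRangeIndex, boolean rangeIndexExact) {
    if (!hasDictionary) {
      return false; // nothing to build a dict-based evaluator from
    }
    if (forwardDictEncoded) {
      return true; // any scan reads dict ids, so a dict-based evaluator stays consistent
    }
    // RAW forward index: a scan fallback would feed raw values into the evaluator,
    // so the range index must be able to resolve the predicate entirely on its own.
    return hasRangeIndex && rangeIndexExact;
  }
}
```

The last branch is the one the review comment is about: a non-exact range index over dict ids combined with a raw-value scan mixes two value spaces.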

DataFetcher.addDataSource: drop the Preconditions check that asserted a dictionary exists when forward index claims to be dictionary-encoded. The segment loader sets isDictionaryEncoded() in lockstep with the dictionary file's presence; the check was over-protecting an invariant that's already guaranteed.

IdentifierTransformFunction.forwardIndexIsDictEncoded: drop the ColumnContext.getDataSource() null guard. IdentifierTransformFunction is only constructed via TransformFunctionFactory with ColumnContexts built from ColumnContext.fromDataSource(...), where the DataSource is always non-null. The remaining `forwardIndex != null` check is the real one: it guards forward-index-disabled columns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ctory dispatch

Address Jackie-Jiang's remaining review comments:

DictionaryIndexType: when a FieldConfig with encoding=RAW also configures an index that always requires a dictionary (FST, IFST, INVERTED), do NOT mark the dictionary as DISABLED in the deserialized config. The dict falls through to its default-enabled state, validation passes, and the runtime auto-creation paths (BaseSegmentCreator.createDictionaryForColumn / ForwardIndexHandler) build a shared-dict + RAW forward index. This closes the loop on Jackie's "Dictionary should be generated on FST without explicit configuration" comment: users no longer need to inject a dictionary entry alongside FST/IFST/INVERTED on a RAW column.

PredicateEvaluatorProvider:

- Replace the now-private factory-dispatch overload with a public buildEvaluator(predicate, dictionary, dataType, queryContext) entry point. This is for callers whose value stream is statically known (e.g. per-value transform evaluation in filterMv); filter-plan-time callers still go through getPredicateEvaluator(predicate, DataSource, ...) which applies gating.

FilterMvPredicateEvaluator: drop the duplicated buildPerValueEvaluator switch, which was a copy of the factory dispatch. Now delegates to PredicateEvaluatorProvider.buildEvaluator.

IdentifierTransformFunction: revert the encoding-based dictionary drop. The Identifier always exposes the column dictionary if one exists; it is the consumer's responsibility to additionally check forward-index encoding when deciding whether dict-id reads will be cheap. This matches the principle that callers should make decisions with all available information.

FilterMvTransformFunction: now performs its own forward-encoding gate via columnContextMap to decide whether to use the dict-id matching path or fall back to per-value matching. Also rebuilds the result metadata so hasDictionary() reflects the gated dictionary actually used here, rather than the inner Identifier's (which always reports the underlying dict).

BinaryOperatorTransformFunction: same gate inline. Only pass DataSource to the predicate evaluator when the LHS Identifier wraps a dict-encoded forward index.

ForwardIndexHandler: inline the unused two-arg getStatsCollector overload at its single call site.

TableConfigUtilsTest: update two assertions that previously expected "Cannot create inverted index ... without dictionary" failures for encoding=RAW + INVERTED. With auto-enable that combination now validates and produces a shared-dict + RAW forward index.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…parameter

TableIndexingTest.java: the FST encoding=RAW dictionary workaround, indexes.put→set rewrites, and getErrorMessage null-safety changes were all unrelated to this PR's scope. Revert the file to match master. The DictionaryIndexType.fromFieldConfigs auto-enable now handles the FST + encoding=RAW case at config time, so the test no longer needs the manual dictionary block.

ForwardIndexHandler.getStatsCollector: drop the requireUniqueValues boolean parameter. The flag was redundant: when the auto-dict-creation path runs (the new code paths that flip a column from no-dict to dict), _fieldIndexConfigs already carries the new config with dict enabled, so hasIndex(column, dictionary) returns true and the no-dict optimization is already skipped, producing a per-type collector that tracks unique values. Restoring the original two-arg signature.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…raw internally

PredicateEvaluatorProvider:

- Single public entry: getPredicateEvaluator(predicate, dataSource, dictionary, dataType, queryContext). When dataSource is non-null the gating in getDictionaryUsableForFiltering derives the dictionary; when null the dictionary parameter is used directly. buildEvaluator (factory dispatch) is now private.

FilterMvPredicateEvaluator:

- Accept (predicate, dataType, dictionary, dataSource). Drop the dictionary internally if the inner forward index is RAW: filterMv evaluates per-value and forward.getDictIdMV would need expensive Dictionary#indexOf for RAW. Expose isDictionaryBased() so callers can sync their result metadata.
- forPredicate now delegates to the unified PredicateEvaluatorProvider; no more duplicated factory dispatch.

FilterMvTransformFunction:

- Drop the dictionaryUsableForFilterMv helper; let FilterMvPredicateEvaluator do the gate. Pass the inner Identifier's DataSource through. Build resultMetadata from the gated dictionary so downstream consumers (ExpressionScanDocIdIterator) take the right value-stream path.

DictionaryIndexType.hasIndexRequiringDictionary:

- Replace the hardcoded FST/IFST/INVERTED list with an iteration over IndexService.getInstance().getAllIndexes(), checking IndexType.requiresDictionary against the IndexType's default config and honoring the FieldConfig's per-index "disabled":true flag in the indexes JsonNode. Generalizes to any IndexType (including future plugins) that declares requiresDictionary.

TableIndexingTest: drop the now-redundant manual dictionary block in the FST + RAW case. The DictionaryIndexType auto-enable now handles it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
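
The move from a hardcoded FST/IFST/INVERTED list to an iteration over registered index types might look roughly like this. A simplified sketch, not the PR's code: `DictionaryRequirement` and `IndexKind` are hypothetical, a plain `Map` stands in for the `indexes` JsonNode, and `requiresDictionary` is reduced to a boolean, whereas the commit describes checking it against the index type's default config.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch; IndexKind stands in for a registered index type.
final class DictionaryRequirement {
  static final class IndexKind {
    final String id;                  // key used in the per-column "indexes" config
    final boolean requiresDictionary; // simplified: the real check takes an index config

    IndexKind(String id, boolean requiresDictionary) {
      this.id = id;
      this.requiresDictionary = requiresDictionary;
    }
  }

  /**
   * Returns true if any configured, non-disabled index needs a dictionary.
   *
   * @param registered all known index kinds (including plugin-provided ones)
   * @param indexes    per-index config keyed by index id; a config may carry "disabled" -> true
   */
  static boolean hasIndexRequiringDictionary(List<IndexKind> registered,
      Map<String, Map<String, Object>> indexes) {
    for (IndexKind kind : registered) {
      Map<String, Object> config = indexes.get(kind.id);
      if (config == null) {
        continue; // this index is not configured on the column
      }
      if (Boolean.TRUE.equals(config.get("disabled"))) {
        continue; // configured but explicitly turned off
      }
      if (kind.requiresDictionary) {
        return true;
      }
    }
    return false;
  }
}
```

Because the loop runs over whatever is registered, an index type added later only has to declare that it requires a dictionary; no list in the dictionary code needs updating.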

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ct, getStatsCollector flag

Two regressions surfaced in GitHub Actions Pinot Unit Test Sets:

1) NullHandlingEnabledQueriesTest.testExpressionFilterOperatorNotFilterOnMultiValue: ExpressionFilterOperator was passing a null dictionary to the unified getPredicateEvaluator. ExpressionScanDocIdIterator follows the inner transform's resultMetadata.hasDictionary() to choose between the dict-id and raw-value paths, so when the inner transform reports hasDictionary=true it feeds dict ids into a raw-value evaluator, which gives wrong results. Hand the transform's getDictionary() through directly so the evaluator matches the value stream.

2) SegmentPreProcessorTest.testRangeIndexRebuiltOnDictionaryToggle and testIfNeedProcess[v3]: NPE in createDictionaryForRawForwardIndex because getStatsCollector returned a NoDictColumnStatisticsCollector (no unique values) when _fieldIndexConfigs still reported dict=disabled. This happens on the auto-toggle path: legacy invertedIndexColumns triggers ENABLE_DICTIONARY in computeOperations, but the deserialized FieldIndexConfigs hasn't been mutated to reflect the auto-derived dict requirement. Restore the requireUniqueValues parameter on getStatsCollector and pass true at the two dict-building call sites; this forces a per-type collector that tracks unique values regardless of the stale config flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
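
The second regression comes down to which collector is chosen when the config flag is stale. A minimal sketch of that choice, assuming simplified collector types: `StatsCollectorChoice`, `Collector`, `NoDictCollector`, and `UniqueValueCollector` are hypothetical stand-ins, and only `int` values are handled.

```java
import java.util.TreeSet;

// Hypothetical sketch; the collector types are simplified stand-ins, not Pinot's classes.
final class StatsCollectorChoice {
  interface Collector {
    void collect(int value);

    /** Sorted unique values, or null when the collector does not track them. */
    int[] sortedUniqueValues();
  }

  /** Cheap collector for no-dictionary columns: it keeps no unique values. */
  static final class NoDictCollector implements Collector {
    @Override
    public void collect(int value) {
      // intentionally empty: unique values are not tracked on this path
    }

    @Override
    public int[] sortedUniqueValues() {
      return null;
    }
  }

  /** Collector that tracks unique values so a dictionary can be built from them. */
  static final class UniqueValueCollector implements Collector {
    private final TreeSet<Integer> _values = new TreeSet<>();

    @Override
    public void collect(int value) {
      _values.add(value);
    }

    @Override
    public int[] sortedUniqueValues() {
      return _values.stream().mapToInt(Integer::intValue).toArray();
    }
  }

  /** The explicit flag wins over a possibly stale "dictionary enabled" config value. */
  static Collector getStatsCollector(boolean dictEnabledInConfig, boolean requireUniqueValues) {
    if (dictEnabledInConfig || requireUniqueValues) {
      return new UniqueValueCollector();
    }
    return new NoDictCollector();
  }
}
```

With the flag passed as `true` at a dictionary-building call site, the unique values are available even when the config still says the dictionary is disabled, which is the situation that produced the NPE.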

FilterMvTransformFunction: gate the dictionary on the inner forward-index encoding. IdentifierTransformFunction surfaces the column's dictionary unchanged, but the dict-id matching path requires forward.getDictIdMV() to serve dict ids cheaply. RAW forward indexes throw UnsupportedOperationException from that method, so for shared-dict + RAW columns we drop the dictionary internally and the predicate evaluator falls back to per-value raw matching.

RangeIndexBasedFilterOperator.canEvaluate: also check that the predicate evaluator's encoding matches the range index's encoding. RangeIndexCreator and BitSlicedRangeIndexCreator both switch on hasDictionary at build/rebuild time, so the on-disk index stores dict IDs whenever a dictionary exists. Pairing that index with a raw-value evaluator (e.g. our gating drops the dict on shared-dict + RAW with a non-exact range index) silently returns wrong matches: raw values would be compared against dict IDs by RangeIndexBasedFilterOperator. Fall through to ScanBasedFilterOperator instead so raw values are scanned against the raw forward index.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
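
A toy illustration of why the evaluator's encoding must match the range index's encoding. Everything here is hypothetical (`RangeEncodingCheck` and its methods are not Pinot APIs): the "index" is just an array of dict ids, and the predicate is a simple "greater than".

```java
import java.util.Arrays;

// Hypothetical toy model; it only demonstrates the value-space mismatch.
final class RangeEncodingCheck {
  /** The range index is usable only when it and the evaluator agree on the value space. */
  static boolean encodingsMatch(boolean evaluatorDictionaryBased, boolean indexStoresDictIds) {
    return evaluatorDictionaryBased == indexStoresDictIds;
  }

  /** Counts indexed entries strictly greater than the bound, in whatever space they are stored. */
  static int countGreaterThan(int[] indexedValues, int bound) {
    int count = 0;
    for (int value : indexedValues) {
      if (value > bound) {
        count++;
      }
    }
    return count;
  }

  /**
   * Translates "value > rawBound" into dict-id space: returns the largest dict id whose
   * value is <= rawBound, or -1 if every dictionary value is greater than rawBound.
   */
  static int toDictIdBound(int[] sortedDictionary, int rawBound) {
    int pos = Arrays.binarySearch(sortedDictionary, rawBound);
    return pos >= 0 ? pos : -pos - 2;
  }
}
```

For a dictionary `{10, 20, 30}` the index holds dict ids `{0, 1, 2}`. Applying the raw bound 15 directly to the dict ids matches nothing, while translating the bound first (to dict id 0) matches the two entries for 20 and 30. The commit avoids the first case by refusing the range index when the encodings differ.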

…rsion change

RangeIndexType.validate: reject the combination of RAW forward index + dictionary + range version 1 (legacy RangeIndexCreator). The on-disk v1 range index over dict IDs is non-exact: query-time partial matches fall back to ScanBasedFilterOperator, which on a RAW forward column would have to apply a dict-based evaluator against raw values, silently producing wrong results. Force the BitSliced range index (version 2, exact) for this combination so the failure mode is config validation rather than wrong query answers.

RangeIndexHandler.needUpdateIndices / updateIndices: also detect when the on-disk range index version differs from the configured version. v1 and v2 have incompatible on-disk layouts and serve different query semantics (non-exact vs exact), so a version change requires a full rebuild. Read the version from the index buffer's first int (matching RangeIndexType.read's dispatch) and rebuild when it doesn't match. The buffer is owned by SegmentDirectory; don't close it from the handler, since the mmap region is shared and closing it would crash other readers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
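
Both checks from this commit can be sketched in a few lines. This is an assumption-laden illustration: `RangeIndexVersionCheck`, `validate`, and `needsRebuild` are hypothetical names, a `java.nio.ByteBuffer` stands in for the segment's index buffer, and big-endian byte order for the leading version int is assumed rather than taken from the source.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; a ByteBuffer stands in for the segment's index buffer.
final class RangeIndexVersionCheck {
  static final int LEGACY_VERSION = 1;     // non-exact range index
  static final int BIT_SLICED_VERSION = 2; // exact range index

  /** Rejects RAW forward + dictionary + legacy range index at config-validation time. */
  static void validate(boolean forwardRaw, boolean hasDictionary, int rangeVersion) {
    if (forwardRaw && hasDictionary && rangeVersion == LEGACY_VERSION) {
      throw new IllegalStateException(
          "Range index version " + LEGACY_VERSION + " is not supported on a RAW forward index "
              + "with a dictionary; use version " + BIT_SLICED_VERSION);
    }
  }

  /**
   * Returns true when the on-disk version differs from the configured one.
   * The absolute getInt(0) does not move the buffer position, and the buffer is
   * never closed or released here because the caller does not own it.
   */
  static boolean needsRebuild(ByteBuffer indexBuffer, int configuredVersion) {
    int onDiskVersion = indexBuffer.getInt(0);
    return onDiskVersion != configuredVersion;
  }
}
```

Reading with an absolute index keeps the check side-effect free, which matters when the same buffer is shared with other readers.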

xiangfu0 · 2026-05-08T16:13:58Z

Opened a follow-up docs PR for this change: pinot-contrib/pinot-docs#804

xiangfu0 requested a review from Copilot November 25, 2025 02:51

xiangfu0 added inverted-index, index, enhancement labels Nov 25, 2025

Copilot AI reviewed Nov 25, 2025

xiangfu0 force-pushed the dictionary-auto-creation-when-corresponding-indexes-enabled branch from 880c72e to e042d90 Compare November 25, 2025 03:41

xiangfu0 requested a review from Copilot November 25, 2025 03:41

Copilot AI reviewed Nov 25, 2025

xiangfu0 force-pushed the dictionary-auto-creation-when-corresponding-indexes-enabled branch 2 times, most recently from 74727cd to d202d42 Compare November 25, 2025 12:36

xiangfu0 force-pushed the dictionary-auto-creation-when-corresponding-indexes-enabled branch 2 times, most recently from c2dfdcc to 4b4bcd7 Compare November 27, 2025 12:26

xiangfu0 force-pushed the dictionary-auto-creation-when-corresponding-indexes-enabled branch 4 times, most recently from 9e63630 to 677c60f Compare December 10, 2025 11:15

xiangfu0 requested review from Jackie-Jiang, Copilot and raghavyadav01 December 11, 2025 02:39

Copilot AI reviewed Dec 11, 2025

Jackie-Jiang reviewed Dec 18, 2025

xiangfu0 force-pushed the dictionary-auto-creation-when-corresponding-indexes-enabled branch from 677c60f to 7463dd1 Compare December 19, 2025 00:31

xiangfu0 force-pushed the dictionary-auto-creation-when-corresponding-indexes-enabled branch 2 times, most recently from a664045 to e936ca7 Compare December 31, 2025 05:28

xiangfu0 requested a review from Copilot December 31, 2025 09:21

Copilot AI reviewed Dec 31, 2025

xiangfu0 force-pushed the dictionary-auto-creation-when-corresponding-indexes-enabled branch 3 times, most recently from f023298 to 78efe79 Compare January 2, 2026 06:19

xiangfu0 and others added 27 commits May 6, 2026 23:41

Revert "Update DataFetcherTest for the new dict-id contract on RAW fo…

ac9672c

…rward index" This reverts commit 7e66df4.

Revert "Restore readDictIdsFromRawValues as opt-in DataFetcher API"

6dc6a12

This reverts commit c8290ec.

Revert "DataFetcher cleanup; route shared-dict + RAW columns to raw-v…

1c220f9

…alue path" This reverts commit 1b45308.

Revert TableIndexingTest.java to master — keep this file out of the PR

7777581

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fix predicate evaluator provider

f48a3e1

Fix TransformFunction

0df7762

Jackie-Jiang approved these changes May 7, 2026

xiangfu0 mentioned this pull request May 8, 2026

docs: document RAW columns with dictionary-backed secondary indexes pinot-contrib/pinot-docs#804

Merged

Conversation

xiangfu0 commented Nov 25, 2025 • edited

Summary

What's covered

Auto-create dictionary

ForwardIndexHandler.computeOperations simplification

Range index correctness gates (RAW + dict combination)

Remove legacy raw-value bitmap inverted index

Dict-id-based rebuild path

Query path: predicate evaluator selection on shared-dict columns

DataFetcher

addRaw(Object) SPI extraction

Tests

Backward compatibility

Example FieldConfig

Out of scope (follow-up PR)

Copilot AI left a comment

Pull request overview

Reviewed changes

Copilot AI left a comment

Pull request overview

codecov-commenter commented Nov 25, 2025 • edited

Codecov Report

Copilot AI left a comment

Pull request overview

Jackie-Jiang left a comment

Copilot AI left a comment

Pull request overview

xiangfu0 commented May 8, 2026

4 participants
