Auto-create dictionary when secondary indexes need it (inverted, FST, range) #17269
Pull request overview
This PR enables dictionary inference for indexes that require dictionaries (inverted, FST, range) and enhances raw forward index handling throughout the codebase. It introduces automatic dictionary creation when required by certain index types, plumbs raw-encoding awareness through forward index creator/reader factories, and extends the forward index handler to support raw-forward columns in various operations.
Key changes:
- Automatic dictionary inference when indexes requiring dictionaries are configured
- Raw-encoding flag in ForwardIndexConfig to distinguish raw vs. dictionary-encoded forward indexes
- Enhanced ForwardIndexHandler to support raw forward indexes with dictionary add/remove operations
- New integration test verifying raw forward index + inverted index functionality
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| githubComplexTypeEvents_offline_table_config.json | Adds raw-encoded forward index with inverted index configuration example |
| ForwardIndexCreator.java | Refactors raw value handling into separate addRaw method |
| IndexType.java | Adds requiresDictionary method to indicate dictionary requirement |
| ForwardIndexUtils.java | New utility to identify raw forward index columns from table config |
| ForwardIndexConfig.java | Adds rawEncoding flag and getter/builder methods |
| FieldIndexConfigsUtil.java | Implements automatic dictionary enablement for required indexes |
| DictionaryIndexConfig.java | Adds utility methods to check if dictionary is required by any index |
| ForwardIndexTypeTest.java | Updates tests to include rawEncoding flag in expected configs |
| ColumnMinMaxValueGenerator.java | Passes ForwardIndexConfig when creating readers |
| InvertedIndexAndDictionaryBasedForwardIndexCreator.java | Respects rawEncoding flag when determining dictionary usage |
| ForwardIndexHandler.java | Adds ENABLE_DICTIONARY_FOR_RAW_FORWARD_INDEX operation and handling logic |
| InvertedIndexType.java | Implements requiresDictionary to return true |
| IFSTIndexType.java | Implements requiresDictionary to return true |
| FstIndexType.java | Implements requiresDictionary to return true |
| ForwardIndexType.java | Updates getFileExtension to return multiple possible extensions |
| ForwardIndexReaderFactory.java | Uses rawEncoding flag to determine reader type |
| ForwardIndexCreatorFactory.java | Uses rawEncoding flag to determine creator type |
| SingleValueVarByteRawIndexCreator.java | Overrides add method to delegate to addRaw |
| SingleValueFixedByteRawIndexCreator.java | Overrides add method to delegate to addRaw |
| MultiValueVarByteRawIndexCreator.java | Overrides add method to delegate to addRaw |
| MultiValueFixedByteRawIndexCreator.java | Overrides add method to delegate to addRaw |
| CLPForwardIndexCreatorV2.java | Overrides add method for CLP encoding |
| CLPForwardIndexCreatorV1.java | Overrides add method for CLP encoding |
| SegmentIndexCreationDriverImpl.java | Fixes inverted logic bug (isDisabled → isEnabled) |
| SegmentColumnarIndexCreator.java | Adds warning when dictionary required but explicitly disabled |
| SameValueForwardIndexCreator.java | Overrides add method to delegate to addRaw |
| RawForwardIndexInvertedIndexTest.java | New integration test for raw forward index with inverted index |
| ForwardIndexHandlerReloadQueriesTest.java | Updates test expectations and adds raw forward index test coverage |
| DataFetcher.java | Adds _useDictionary field for proper dictionary usage check |
Codecov Report
❌ Patch coverage is …
Coverage diff, master vs. #17269:
| Metric | master | #17269 | +/- |
|---|---|---|---|
| Coverage | 63.57% | 63.65% | +0.08% |
| Complexity | 1717 | 1723 | +6 |
| Files | 3252 | 3254 | +2 |
| Lines | 199132 | 199459 | +327 |
| Branches | 30875 | 30977 | +102 |
| Hits | 126596 | 126966 | +370 |
| Misses | 62454 | 62361 | -93 |
| Partials | 10082 | 10132 | +50 |
Jackie-Jiang left a comment:
To be future-proof, how do you plan to support both a dictionary-encoded and a raw forward index on the same column? Specifically, what does the FieldConfig look like?
testFetchDictIdsFromRawForwardIndexWithSharedDictionary previously called DataFetcher#fetchDictIds and expected the implicit per-row dictionary lookup. With the new contract, fetchDictIds throws on a RAW forward index and callers must opt in via fetchDictIdsFromRawValues. The test now asserts both: the throw (1) and the explicit-path dict ids (2). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rward index" This reverts commit 7e66df4.
This reverts commit c8290ec.
…alue path" This reverts commit 1b45308.
Per @Jackie-Jiang's PR comment: readDictIds should focus on reading dict ids directly from a dictionary-encoded forward index; readDictIdsFromRawValues handles the explicit per-row dictionary lookup when the forward index is RAW but a (shared) dictionary exists. Each ColumnValueReader method now does exactly one thing. The dispatch between the two paths moves up to the public DataFetcher#fetchDictIds entry point so existing callers are unaffected. A new public DataFetcher#fetchDictIdsFromRawValues (SV + MV) lets future callers opt explicitly into the per-row dictionary lookup when they have already verified the column is RAW + shared-dict. Also rename the ColumnValueReader field _useDictionary to _dictionaryEncoded to match the boolean's actual meaning, and remove the hot-path Preconditions guards in readDictIdsFromRawValues* — the caller contract documents the precondition and the underlying readers will throw if invariants are violated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fetchDictIds (SV + MV) now reads dict ids directly from the forward index with no implicit fallback. On a RAW + shared-dict column the underlying ForwardIndexReader#readDictIds throws UnsupportedOperationException — callers that genuinely need dict ids on such a column must call the new public fetchDictIdsFromRawValues (SV + MV) explicitly. This makes the per-row Dictionary#indexOf cost a deliberate caller decision instead of a hidden one. Update DataFetcherTest to assert the new contract: fetchDictIds throws on a RAW reader, and fetchDictIdsFromRawValues returns the dict ids produced by per-row dictionary lookups. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
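The contract in this commit can be illustrated with a minimal, self-contained sketch. The classes below (SketchDictionary, SketchRawForwardReader, SketchDataFetcher) are hypothetical stand-ins, not Pinot's actual classes; only the method names fetchDictIds and fetchDictIdsFromRawValues and the UnsupportedOperationException come from the commit message.

```java
import java.util.Arrays;

/** Hypothetical sorted dictionary; indexOf is the per-value lookup the commit refers to. */
final class SketchDictionary {
  private final int[] _sortedValues;

  SketchDictionary(int[] sortedValues) {
    _sortedValues = sortedValues.clone();
  }

  /** Returns the dict id of the value, or -1 when the value is absent. */
  int indexOf(int value) {
    int idx = Arrays.binarySearch(_sortedValues, value);
    return idx >= 0 ? idx : -1;
  }
}

/** Hypothetical RAW forward index reader: it stores values, not dict ids. */
final class SketchRawForwardReader {
  private final int[] _values;

  SketchRawForwardReader(int[] values) {
    _values = values.clone();
  }

  int readValue(int docId) {
    return _values[docId];
  }

  int readDictId(int docId) {
    throw new UnsupportedOperationException("RAW forward index does not store dict ids");
  }
}

final class SketchDataFetcher {
  private final SketchRawForwardReader _reader;
  private final SketchDictionary _dictionary;

  SketchDataFetcher(SketchRawForwardReader reader, SketchDictionary dictionary) {
    _reader = reader;
    _dictionary = dictionary;
  }

  /** Direct read with no fallback: propagates the reader's exception on RAW. */
  void fetchDictIds(int[] docIds, int length, int[] out) {
    for (int i = 0; i < length; i++) {
      out[i] = _reader.readDictId(docIds[i]);
    }
  }

  /** Explicit opt-in: one dictionary lookup per row. */
  void fetchDictIdsFromRawValues(int[] docIds, int length, int[] out) {
    for (int i = 0; i < length; i++) {
      out[i] = _dictionary.indexOf(_reader.readValue(docIds[i]));
    }
  }
}
```

Splitting the two methods keeps the per-row lookup cost visible at the call site, which is the stated goal of the change.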
…OUP BY The previous gate (columnContext.getDictionary() == null) routed shared-dict + RAW columns to DictionaryBasedGroupKeyGenerator, which calls BlockValSet#getDictionaryIdsSV → DataFetcher#fetchDictIds. With fetchDictIds no longer falling back to per-row dictionary lookups, that path now throws. Extend the gate to also require the underlying forward index to be dictionary-encoded so columns with a shared standalone dictionary on a RAW forward index take the NoDictionarySingleColumnGroupKeyGenerator path. Other operators (DistinctExecutorFactory, aggregation functions, NoDictionaryMultiColumnGroupKeyGenerator, IdentifierTransformFunction) will be migrated to the same gate in a follow-up PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndex need
Refactor computeOperations into a per-column helper that decomposes into four orthogonal questions: forward-index transition, dictionary transition, compression-only change, and cross-cutting guards (sorted columns, range-index format compatibility, enable-forward prerequisites, enable-dictionary needs forward).
The dictionary decision is now a single rule: desiredDict = newIsDict || DictionaryIndexConfig.requiresDictionary(fieldSpec, newConf). The "force dict on if any index requires it" path means adding an inverted index to a raw column now auto-creates the dictionary (and the inverted index gets built against it by InvertedIndexHandler in the same reload), where the pre-PR behavior silently skipped this combo and left the inverted-index request orphaned. Removing that index later auto-removes the dictionary again.
shouldDisableDictionary now consults DictionaryIndexConfig.requiresDictionary to refuse dict removal whenever any enabled index in the new config still needs it (covers inverted, FST, IFST, and any future dict-requiring index type). The on-disk inverted/FST hasIndex check was removed: InvertedIndexHandler runs after ForwardIndexHandler and removes orphaned indexes, so refusing dict removal based on transient on-disk state was overly conservative.
getStatsCollector gains a requireUniqueValues parameter so the dict-creation paths skip the no-dict-optimized collector and use the type-specific collector that tracks unique values. Without this, the auto-create-dict path NPEs when ClusterConfigForTable's optimized no-dict collector is in effect.
Tests:
- testRangeIndexRebuiltOnDictionaryToggle: new SegmentPreProcessorTest case asserting the range index is rebuilt with the right format every time the dictionary state toggles, covering both the auto path (driven by inverted-index add/remove) and the explicit toggle path (noDictionaryColumns add/remove).
- testIfNeedProcess: updated to expect ENABLE_DICTIONARY when adding inverted to a raw column on v3 (the new auto path); v1 still skips.
- testComputeOperationDisableForwardIndex TEST13: updated to expect no operation queued when inverted is in the new config (dictionary must stay).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
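The single dictionary rule from this commit (desiredDict = newIsDict || requiresDictionary) can be sketched as follows. This is an illustration under stated assumptions, not the ForwardIndexHandler code: the IndexKind enum, the OTHER member, and the decide method are hypothetical names, and only INVERTED, FST, and IFST are taken from the commit message as dictionary-requiring.

```java
import java.util.EnumSet;
import java.util.Set;

final class DictionaryDecisionSketch {
  /** Hypothetical index kinds; OTHER stands for any index that works without a dictionary. */
  enum IndexKind {
    INVERTED(true), FST(true), IFST(true), OTHER(false);

    private final boolean _requiresDictionary;

    IndexKind(boolean requiresDictionary) {
      _requiresDictionary = requiresDictionary;
    }

    boolean requiresDictionary() {
      return _requiresDictionary;
    }
  }

  enum DictOp { ENABLE_DICTIONARY, DISABLE_DICTIONARY, NONE }

  /** True when any index enabled in the new config needs a dictionary. */
  static boolean requiresDictionary(Set<IndexKind> enabledIndexes) {
    return enabledIndexes.stream().anyMatch(IndexKind::requiresDictionary);
  }

  /**
   * Applies the rule: the dictionary is wanted if it is configured explicitly
   * or if an enabled index forces it; an operation is queued only on a change.
   */
  static DictOp decide(boolean existingHasDict, boolean newIsDict, Set<IndexKind> newIndexes) {
    boolean desiredDict = newIsDict || requiresDictionary(newIndexes);
    if (desiredDict == existingHasDict) {
      return DictOp.NONE;
    }
    return desiredDict ? DictOp.ENABLE_DICTIONARY : DictOp.DISABLE_DICTIONARY;
  }

  public static void main(String[] args) {
    // Adding an inverted index to a raw column forces dictionary creation.
    System.out.println(decide(false, false, EnumSet.of(IndexKind.INVERTED)));
    // Removing it later lets the dictionary go away again.
    System.out.println(decide(true, false, EnumSet.noneOf(IndexKind.class)));
  }
}
```

The same predicate serves both directions: it forces creation when an index is added and blocks removal while such an index remains enabled.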
…FORWARD_INDEX
State representation: Replace the existingHasFwd / newIsFwd booleans with FieldConfig.EncodingType values where null means "forward index disabled / not on disk", DICTIONARY means dict-encoded forward, and RAW means raw forward. This makes the state space explicit and lets the compression-only branch test existingFwdEncoding == FieldConfig.EncodingType.RAW directly instead of calling isForwardIndexDictionaryEncoded(column).
Operation split: ENABLE_FORWARD_INDEX is split into ENABLE_DICT_FORWARD_INDEX and ENABLE_RAW_FORWARD_INDEX. computeColumnOperations picks the right variant based on newFwdEncoding so the intent is explicit at the operation level. Both variants flow through createForwardIndexIfNeeded (which already reads the target encoding from the new config), but the post-rebuild assertions in updateIndices are now type-specific: ENABLE_DICT_FORWARD_INDEX requires the dictionary to remain after rebuild, ENABLE_RAW_FORWARD_INDEX requires it to be absent.
Updated test cases in ForwardIndexHandlerTest accordingly: TESTs that explicitly set FieldConfig.EncodingType.RAW for a forward-index-disabled column now expect ENABLE_RAW_FORWARD_INDEX, while default-dict cases keep ENABLE_DICT_FORWARD_INDEX.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fix the wrong assertion in ENABLE_RAW_FORWARD_INDEX (dict + inverted +
raw forward is a valid post-rebuild state when an inverted index is
enabled — the assertion that the dictionary must be absent contradicted
the auto-keep-dictionary-when-required-by-index logic from the previous
commits) and extend the operation set to actually perform the encoding
flip when the dict transition step won't.
The forward-index transition path now has three sub-cases:
- Existing forward, new disables it → DISABLE_FORWARD_INDEX.
- Existing disabled, new re-enables it → ENABLE_DICT/RAW_FORWARD_INDEX,
reusing createForwardIndexIfNeeded.
- Forward stays on but encoding flips DICT⇄RAW → queued only when the
dict transition won't already cover the conversion (i.e. existingHasDict
== desiredDict). The DICT→RAW + drop-dict and RAW→DICT + create-dict
combos still fall out of DISABLE_DICTIONARY / ENABLE_DICTIONARY
unchanged, so the encoding-flip op is reserved for the dict-stays cases:
- DICT→RAW with dict kept (e.g. inverted index added to a dict-encoded
column without explicitly enabling dict)
- RAW→DICT with shared dict already present
For the DICT→RAW + dict kept case, add convertDictForwardToRawKeepingDictionary:
the rewrite reuses the existing rewriteDictToRawForwardIndex helper but
keeps the dictionary on disk and does not call removeDictRelatedIndexes
(secondary indexes against unchanged dict ids stay valid).
Tests:
- testFlipDictForwardToRawForwardKeepingDictionaryForInvertedIndex:
end-to-end test asserting the encoding flip on a dict-encoded INT column
with an inverted index leaves the column as dict + inverted + raw forward.
- testComputeOperationDisableDictionary TEST3: previously asserted no-op
for "disable dict + add inverted on dict-encoded column" (the pre-PR
silent-skip behavior). Updated to expect the correct op
ENABLE_RAW_FORWARD_INDEX, which performs the encoding flip while the
inverted-index requirement keeps the dictionary.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
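The three forward-index sub-cases in this commit can be read as a small decision function. The sketch below is hypothetical: the class and method names are invented for illustration, a null Encoding stands for "forward index disabled" as in the earlier state-representation commit, and the choice of ENABLE_DICT_FORWARD_INDEX for the RAW→DICT flip with a shared dictionary is an assumption (the commit only confirms ENABLE_RAW_FORWARD_INDEX for the DICT→RAW flip).

```java
final class ForwardTransitionSketch {
  enum Encoding { DICTIONARY, RAW }

  enum Op { DISABLE_FORWARD_INDEX, ENABLE_DICT_FORWARD_INDEX, ENABLE_RAW_FORWARD_INDEX, NONE }

  /**
   * Picks the forward-index operation for one column.
   * A null encoding means the forward index is disabled / not on disk.
   */
  static Op forwardOp(Encoding existing, Encoding desired, boolean existingHasDict, boolean desiredDict) {
    if (existing == desired) {
      return Op.NONE;
    }
    if (desired == null) {
      // Existing forward index, new config disables it.
      return Op.DISABLE_FORWARD_INDEX;
    }
    if (existing == null) {
      // Forward index is re-enabled with the configured encoding.
      return desired == Encoding.DICTIONARY ? Op.ENABLE_DICT_FORWARD_INDEX : Op.ENABLE_RAW_FORWARD_INDEX;
    }
    // Forward index stays on but the encoding flips.
    if (existingHasDict != desiredDict) {
      // ENABLE_DICTIONARY / DISABLE_DICTIONARY already performs the conversion.
      return Op.NONE;
    }
    // Dictionary state is unchanged, so the flip needs its own operation.
    return desired == Encoding.RAW ? Op.ENABLE_RAW_FORWARD_INDEX : Op.ENABLE_DICT_FORWARD_INDEX;
  }

  public static void main(String[] args) {
    // DICT -> RAW while an inverted index keeps the dictionary.
    System.out.println(forwardOp(Encoding.DICTIONARY, Encoding.RAW, true, true));
  }
}
```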
…urce Revert all the DataFetcher / DataFetcherTest churn from earlier commits (API split, _dictionaryEncoded rename, Preconditions on hot paths, etc.) back to master and replace it with the minimal change that solves the shared-dict + RAW correctness issue: when forwardIndexReader.isDictionaryEncoded() is false, pass null instead of dataSource.getDictionary() to ColumnValueReader. ColumnValueReader's existing per-method branches all gate on _dictionary != null, so dropping the dictionary at construction makes every value-read method take the raw-value path uniformly without touching the read methods themselves. Callers that need dict ids on a RAW + shared-dict column still propagate UnsupportedOperationException from the underlying RAW reader's readDictIds default — that surfaces as a clear failure for any operator that hasn't been migrated to gate on forward-index encoding (DefaultGroupByExecutor was migrated in an earlier commit; the remaining migrations live in the BlockValSet follow-up branch). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
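A minimal sketch of the "drop the dictionary at construction" idea, assuming a simplified value reader whose read method branches only on whether a dictionary is present. ColumnValueReaderSketch and its members are hypothetical and use int arrays in place of Pinot's reader and dictionary types.

```java
final class ColumnValueReaderSketch {
  private final int[] _forward;    // dict ids when dictionary-encoded, raw values otherwise
  private final int[] _dictionary; // null routes every read to the raw-value path

  private ColumnValueReaderSketch(int[] forward, int[] dictionary) {
    _forward = forward;
    _dictionary = dictionary;
  }

  /**
   * Mirrors the minimal change: when the forward index is not
   * dictionary-encoded, pass null instead of the column's dictionary.
   */
  static ColumnValueReaderSketch create(int[] forward, boolean forwardIsDictEncoded, int[] columnDictionary) {
    return new ColumnValueReaderSketch(forward, forwardIsDictEncoded ? columnDictionary : null);
  }

  /** Unchanged read method: it gates only on the dictionary being non-null. */
  int readIntValue(int docId) {
    return _dictionary != null ? _dictionary[_forward[docId]] : _forward[docId];
  }

  public static void main(String[] args) {
    int[] sharedDict = {10, 20, 30};
    // RAW forward index plus a shared dictionary: values are read as-is.
    ColumnValueReaderSketch raw = create(new int[]{30, 10}, false, sharedDict);
    System.out.println(raw.readIntValue(0));
  }
}
```

Without the gate in create, the raw value 30 would be treated as a dict id and indexed into the three-entry dictionary, which is the kind of mismatch the commit avoids.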
Centralize the "no dictionary path on RAW forward index" rule in ColumnContext.fromDataSource. Every consumer of columnContext.getDictionary() now naturally routes shared-dict + RAW columns to the raw-value path without needing per-call check: - IdentifierTransformFunction: simplified — its inline check (forwardIndex.isDictionaryEncoded() ? columnContext.getDictionary() : null) is now redundant because ColumnContext returns null directly. - DefaultGroupByExecutor: revert the extra forward-index-encoding check added in an earlier commit; the master gate (columnContext.getDictionary() == null) is correct again now that ColumnContext drops the dictionary at the source. - DictionaryBasedGroupKeyGenerator, DistinctExecutorFactory, NoDictionaryMultiColumnGroupKeyGenerator: no change needed — they already gate on columnContext.getDictionary() and now see null for shared-dict + RAW. Callers that legitimately need the underlying dictionary (e.g. DictionaryBasedDistinctOperator iterating dict values directly) can still get it via columnContext.getDataSource().getDictionary(). This is the same pattern applied earlier to DataFetcher.addDataSource; moving it to ColumnContext makes the rule the single source of truth for the operator/transform-function tree. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per @xiangfu0's review: ColumnContext should expose both the data source and the dictionary so upper functions can decide for themselves. Some operators (e.g. DictionaryBasedDistinctOperator) iterate the dictionary directly and don't care about forward-index encoding — they would lose the dictionary if it were dropped at the source. Restore ColumnContext.fromDataSource to master behavior. Apply the "forward must be dict-encoded for dict-id reads" rule at each call site that does dict-id reads: - IdentifierTransformFunction: drop dict for shared-dict + RAW so transformToDictIdsSV/MV doesn't get advertised. - DefaultGroupByExecutor: route shared-dict + RAW columns to the no-dict GROUP BY path so DictionaryBasedGroupKeyGenerator's dict-id reads don't fire. Both checks are null-safe against DataSource without a forward index (test mock case). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add INT_MV_DICT_RAW_COLUMN to BaseTransformFunctionTest, configured as EncodingType.RAW with both an explicit dictionary and inverted index in its FieldConfig — the canonical shared-dict + RAW shape introduced by this PR. The values mirror INT_MV_COLUMN so callers can compare results against the dict-encoded baseline. testFilterMvOnSharedDictRawForwardColumn in FilterMvTransformFunctionTest exercises every predicate in the existing IntPredicate matrix (EQ, NEQ, RANGE, IN, NOT_IN, BETWEEN, AND, OR, NOT) and asserts: - The data source actually has a RAW forward index plus a dictionary on disk (sanity: the configuration produced the intended shape). - The TransformFunction reports getDictionary() == null and TransformResultMetadata.hasDictionary() == false. This is the signal that drives FilterMvPredicateEvaluator -> PredicateEvaluatorProvider to construct a raw-value matching evaluator (matchesInt etc.) instead of a matchesDictId evaluator. - transformToIntValuesMV produces results that match the expected matcher applied to _intMVValues, confirming the predicate evaluator built without a dictionary still returns correct rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tor overload Split the existing INT_MV_DICT_RAW_COLUMN coverage into two distinct on-disk shapes and exercise filterMv against both: - INT_MV_DICT_RAW_COLUMN: RAW forward + explicit shared dictionary, no secondary index. - INT_MV_DICT_RAW_INV_COLUMN (new): RAW forward + shared dictionary + inverted index — covers the case the user wanted documented in tests. Both columns hold the same values as INT_MV_COLUMN. The new testFilterMvOnSharedDictRawForwardWithInvertedColumn parameterizes over the existing IntPredicate matrix and asserts the filterMv result is identical for all three columns (dict-encoded baseline, RAW + dict, RAW + dict + inverted) for every predicate. The inverted index is irrelevant to filterMv's per-value evaluation but having it on disk must not change the result. Also add a DataSource-aware overload to FilterMvPredicateEvaluator so the evaluator can pull the dictionary only when the underlying forward index is dictionary-encoded: public static FilterMvPredicateEvaluator forPredicate(String predicate, DataSource dataSource) FilterMvTransformFunction now uses this overload when the inner argument is a direct column reference (IdentifierTransformFunction); for transform arguments it keeps the dictionary-only path. The control flow is fixed to use an else branch so the DataSource-aware evaluator isn't overwritten by the legacy call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per @xiangfu0's review: each external caller should evaluate whether it has a DataSource. Consolidate PredicateEvaluatorProvider's public surface to one method:
public static PredicateEvaluator getPredicateEvaluator(Predicate predicate, @Nullable DataSource dataSource, DataType dataType, @Nullable QueryContext queryContext)
When the caller has a column DataSource, pass it so the gating logic in getDictionaryUsableForFiltering can pick between dict-based and raw-value evaluation for shared-dict + RAW columns. When the caller doesn't (post-reduction matchers, intermediate-result aggregators, computed transforms), pass null and the evaluator is built from raw values using the supplied data type.
The two dictionary-based getPredicateEvaluator overloads (3-arg and 4-arg) are now private; they remain as the internal dispatch implementation but are no longer part of the public API.
Migrations:
- FilterPlanNode: pass dataSource.getDataSourceMetadata().getDataType() through.
- BinaryOperatorTransformFunction: drill into IdentifierTransformFunction for the column's DataSource; for non-Identifier transforms pass null.
- ExpressionFilterOperator: same Identifier-or-null pattern.
- PredicateRowMatcher, DistinctCountThetaSketchAggregationFunction: pass null DataSource (post-reduction / intermediate-result paths have no column to point at).
- PredicateEvaluatorProviderTest: pass FieldSpec.DataType.STRING through.
FilterMvPredicateEvaluator deliberately bypasses PredicateEvaluatorProvider now: filterMv evaluates per-value at transform time, so the filter-plan-time gating PredicateEvaluatorProvider applies (which would route shared-dict + RAW columns through the dict-based evaluator) is the wrong policy at this layer. FilterMvPredicateEvaluator builds its evaluator directly via the Equals/NotEquals/In/NotIn/Range/RegexpLike factories based solely on whether a dictionary is supplied. IdentifierTransformFunction already returns null for shared-dict + RAW columns, so the existing forPredicate(predicate, dataType, dictionary) signature is the right call surface.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PredicateEvaluatorProvider:
- For RANGE on RAW forward, drop the dictionary unless the range index is exact (Jackie: "we need to ensure the predicate can solely be resolved with range index. Mixing dictionary encoded range index and raw forward index scan will break"). Non-exact (legacy) range readers fall back to ScanBasedFilterOperator for partial matches; that scan applies the predicate evaluator on raw forward values, which would break a dict-based evaluator.
- Remove the sortedAvailable check inside the RAW-forward branch (Jackie: "this is dead code. Sorted index is run-length dictionary encoded forward index"). A sorted forward index is dict-encoded by definition, so it can never coexist with a RAW forward in this branch.
- Make getDictionaryUsableForFiltering package-private and annotate it with @VisibleForTesting (Jackie's minor suggestion).
- Update the public method's docstring to clarify caller responsibilities: leaf-filter callers may pass DataSource freely; transform-layer callers must only pass DataSource when the inner transform's getDictionary() is non-null (Jackie: "double check all callers of this, make sure it is not using forward index").
BinaryOperatorTransformFunction: pass leftDataSource only when the inner transform exposes a dictionary, mirroring IdentifierTransformFunction's "forward index is dict-encoded" contract. For RAW (including shared-dict + RAW) the inner transform returns null from getDictionary(), so the predicate evaluator is built against raw values and applySV(value) on the raw forward output stays consistent.
ExpressionFilterOperator: always pass null DataSource. The inner is always a function (FilterPlanNode dispatches direct column refs through the leaf-filter path) and ExpressionScanDocIdIterator scans the transform output applying applySV(value) on raw values, so a dict-based evaluator would never be safe here.
DataFetcher: add Preconditions check that dictionary != null when the forward index claims to be dictionary-encoded (Jackie: "is it possible that forward index is dictionary encoded but dictionary doesn't exist?"). Defensive against an impossible state: it fails fast at DataFetcher construction rather than silently producing a ColumnValueReader that will NPE later in the read path.
BaseSegmentCreator.createDictionaryForColumn: when an index requires a dictionary, create one regardless of the user's explicit dictionary setting (Jackie: "should we always create dictionary when any index requires it?"). Previously, explicitly disabling the dictionary while enabling an index that requires it would lead to segment creation failure (e.g. inverted index can't be built without a dict) or silent index loss. Now we log a warning and proceed.
DefaultGroupByExecutor: add a comment explaining that ColumnContext.getDataSource() can be null for computed (non-Identifier) transforms (Jackie: "could datasource ever be null?"); in that case getDictionary() == null already routes those columns onto the no-dict GROUP BY path via the first condition.
Tests: drop two tests that exercised an impossible RAW + sorted scenario and add a new test for the non-exact range index case.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
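The RANGE gating rule in this commit can be sketched as a small pure function. This is a simplified illustration, not the body of getDictionaryUsableForFiltering: the class name, the RangeIndexView interface, and the method signature are hypothetical, and the sketch covers only the RANGE case described above.

```java
final class RangeFilterDictionaryGateSketch {
  /** Hypothetical view of a range index; only exactness matters for the gate. */
  interface RangeIndexView {
    boolean isExact();
  }

  /**
   * Returns the dictionary to build a RANGE predicate evaluator against,
   * or null when the evaluator must work on raw values.
   */
  static <D> D dictionaryUsableForRange(boolean forwardIsDictEncoded, D dictionary, RangeIndexView rangeIndex) {
    if (dictionary == null || forwardIsDictEncoded) {
      // Dict-encoded forward index: a scan reads dict ids, so the dictionary is safe.
      return dictionary;
    }
    // RAW forward index: keep the dictionary only if the range index can resolve
    // the predicate alone. A non-exact index falls back to a scan over raw values.
    boolean resolvedByIndexAlone = rangeIndex != null && rangeIndex.isExact();
    return resolvedByIndexAlone ? dictionary : null;
  }

  public static void main(String[] args) {
    String dictionary = "shared-dictionary";
    System.out.println(dictionaryUsableForRange(false, dictionary, () -> true));  // kept
    System.out.println(dictionaryUsableForRange(false, dictionary, () -> false)); // dropped
  }
}
```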
DataFetcher.addDataSource: drop the Preconditions check that asserted a dictionary exists when forward index claims to be dictionary-encoded. The segment loader sets isDictionaryEncoded() in lockstep with the dictionary file's presence — the check was over-protecting an invariant that's already guaranteed. IdentifierTransformFunction.forwardIndexIsDictEncoded: drop the ColumnContext.getDataSource() null guard. IdentifierTransformFunction is only constructed via TransformFunctionFactory with ColumnContexts built from ColumnContext.fromDataSource(...), where the DataSource is always non-null. The remaining `forwardIndex != null` check is the real one — it guards forward-index-disabled columns. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ctory dispatch
Address Jackie-Jiang's remaining review comments:
DictionaryIndexType: when a FieldConfig with encoding=RAW also configures an index that always requires a dictionary (FST, IFST, INVERTED), do NOT mark the dictionary as DISABLED in the deserialized config. The dict falls through to its default-enabled state, validation passes, and the runtime auto-creation paths (BaseSegmentCreator.createDictionaryForColumn / ForwardIndexHandler) build a shared-dict + RAW forward index. This closes the loop on Jackie's "Dictionary should be generated on FST without explicit configuration" comment: users no longer need to inject a dictionary entry alongside FST/IFST/INVERTED on a RAW column.
PredicateEvaluatorProvider:
- Replace the now-private factory-dispatch overload with a public buildEvaluator(predicate, dictionary, dataType, queryContext) entry point. This is for callers whose value stream is statically known (e.g. per-value transform evaluation in filterMv); filter-plan-time callers still go through getPredicateEvaluator(predicate, DataSource, ...) which applies gating.
FilterMvPredicateEvaluator: drop the duplicated buildPerValueEvaluator switch; it was a copy of the factory dispatch. Now delegates to PredicateEvaluatorProvider.buildEvaluator.
IdentifierTransformFunction: revert the encoding-based dictionary drop. The Identifier always exposes the column dictionary if one exists; it is the consumer's responsibility to additionally check forward-index encoding when deciding whether dict-id reads will be cheap. This matches the principle that callers should make decisions with all available information.
FilterMvTransformFunction: now performs its own forward-encoding gate via columnContextMap to decide whether to use the dict-id matching path or fall back to per-value matching. Also rebuilds the result metadata so hasDictionary() reflects the gated dictionary actually used here, rather than the inner Identifier's (which always reports the underlying dict).
BinaryOperatorTransformFunction: same gate inline: only pass DataSource to the predicate evaluator when the LHS Identifier wraps a dict-encoded forward index.
ForwardIndexHandler: inline the unused two-arg getStatsCollector overload at its single call site.
TableConfigUtilsTest: update two assertions that previously expected "Cannot create inverted index ... without dictionary" failures for encoding=RAW + INVERTED. With auto-enable that combination now validates and produces a shared-dict + RAW forward index.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…parameter TableIndexingTest.java: the FST encoding=RAW dictionary workaround, indexes.put→set rewrites, and getErrorMessage null-safety changes were all unrelated to this PR's scope. Revert the file to match master. The DictionaryIndexType.fromFieldConfigs auto-enable now handles the FST + encoding=RAW case at config time, so the test no longer needs the manual dictionary block. ForwardIndexHandler.getStatsCollector: drop the requireUniqueValues boolean parameter. The flag was redundant — when the auto-dict-creation path runs (the new code paths that flip a column from no-dict to dict), _fieldIndexConfigs already carries the new config with dict enabled, so hasIndex(column, dictionary) returns true and the no-dict optimization is already skipped, producing a per-type collector that tracks unique values. Restoring the original two-arg signature. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…raw internally
PredicateEvaluatorProvider:
- Single public entry: getPredicateEvaluator(predicate, dataSource, dictionary, dataType, queryContext). When dataSource is non-null the gating in getDictionaryUsableForFiltering derives the dictionary; when null the dictionary parameter is used directly. buildEvaluator (factory dispatch) is now private.
FilterMvPredicateEvaluator:
- Accept (predicate, dataType, dictionary, dataSource). Drop the dictionary internally if the inner forward index is RAW: filterMv evaluates per-value and forward.getDictIdMV would need expensive Dictionary#indexOf for RAW. Expose isDictionaryBased() so callers can sync their result metadata.
- forPredicate now delegates to the unified PredicateEvaluatorProvider; no more duplicated factory dispatch.
FilterMvTransformFunction:
- Drop the dictionaryUsableForFilterMv helper; let FilterMvPredicateEvaluator do the gate. Pass the inner Identifier's DataSource through. Build resultMetadata from the gated dictionary so downstream consumers (ExpressionScanDocIdIterator) take the right value-stream path.
DictionaryIndexType.hasIndexRequiringDictionary:
- Replace the hardcoded FST/IFST/INVERTED list with an iteration over IndexService.getInstance().getAllIndexes(), checking IndexType.requiresDictionary against the IndexType's default config and honoring the FieldConfig's per-index "disabled":true flag in the indexes JsonNode. Generalizes to any IndexType (including future plugins) that declares requiresDictionary.
TableIndexingTest: drop the now-redundant manual dictionary block in the FST + RAW case. The DictionaryIndexType auto-enable now handles it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
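The generalized hasIndexRequiringDictionary scan can be sketched as below. This is a hedged illustration only: IndexTypeSketch and the nested-map config shape are hypothetical stand-ins for IndexType and the indexes JsonNode, and the index ids used in main are example strings, not confirmed Pinot identifiers.

```java
import java.util.List;
import java.util.Map;

final class RequiresDictionaryScanSketch {
  /** Hypothetical stand-in for an index type that declares whether it needs a dictionary. */
  static final class IndexTypeSketch {
    final String _id;
    final boolean _requiresDictionary;

    IndexTypeSketch(String id, boolean requiresDictionary) {
      _id = id;
      _requiresDictionary = requiresDictionary;
    }
  }

  /**
   * Iterates every registered index type instead of a hardcoded list.
   * The per-index config map plays the role of the "indexes" JSON object:
   * a missing entry means "not configured", and "disabled": true is honored.
   */
  static boolean hasIndexRequiringDictionary(List<IndexTypeSketch> allIndexes,
      Map<String, Map<String, Object>> indexes) {
    for (IndexTypeSketch type : allIndexes) {
      Map<String, Object> conf = indexes.get(type._id);
      if (conf == null || Boolean.TRUE.equals(conf.get("disabled"))) {
        continue;
      }
      if (type._requiresDictionary) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<IndexTypeSketch> registry = List.of(
        new IndexTypeSketch("inverted", true),
        new IndexTypeSketch("other", false));
    Map<String, Map<String, Object>> indexes = Map.of("inverted", Map.of("disabled", true));
    // The inverted index is configured but disabled, so no dictionary is forced.
    System.out.println(hasIndexRequiringDictionary(registry, indexes));
  }
}
```

Driving the check from the registry means a plugin index type only has to declare the requirement once to participate.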
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ct, getStatsCollector flag
Two regressions surfaced in GitHub Actions Pinot Unit Test Sets:
1) NullHandlingEnabledQueriesTest.testExpressionFilterOperatorNotFilterOnMultiValue: ExpressionFilterOperator was passing a null dictionary to the unified getPredicateEvaluator. ExpressionScanDocIdIterator follows the inner transform's resultMetadata.hasDictionary() to choose between the dict-id and raw-value paths, so when the inner transform reports hasDictionary=true it feeds dict ids into a raw-value evaluator, producing wrong results. Hand the transform's getDictionary() through directly so the evaluator matches the value stream.
2) SegmentPreProcessorTest.testRangeIndexRebuiltOnDictionaryToggle and testIfNeedProcess[v3]: NPE in createDictionaryForRawForwardIndex because getStatsCollector returned a NoDictColumnStatisticsCollector (no unique values) when _fieldIndexConfigs still reported dict=disabled. This happens on the auto-toggle path: legacy invertedIndexColumns triggers ENABLE_DICTIONARY in computeOperations, but the deserialized FieldIndexConfigs hasn't been mutated to reflect the auto-derived dict requirement. Restore the requireUniqueValues parameter on getStatsCollector and pass true at the two dict-building call sites; this forces a per-type collector that tracks unique values regardless of the stale config flag.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FilterMvTransformFunction: gate the dictionary on the inner forward-index encoding. IdentifierTransformFunction surfaces the column's dictionary unchanged, but the dict-id matching path requires forward.getDictIdMV() to serve dict ids cheaply. RAW forward indexes throw UnsupportedOperationException from that method, so for shared-dict + RAW columns we drop the dictionary internally and the predicate evaluator falls back to per-value raw matching.
RangeIndexBasedFilterOperator.canEvaluate: also check that the predicate evaluator's encoding matches the range index's encoding. RangeIndexCreator and BitSlicedRangeIndexCreator both switch on hasDictionary at build/rebuild time, so the on-disk index stores dict IDs whenever a dictionary exists. Pairing that index with a raw-value evaluator (e.g. our gating drops the dict on shared-dict + RAW with a non-exact range index) silently returns wrong matches: raw values would be compared against dict IDs by RangeIndexBasedFilterOperator. Fall through to ScanBasedFilterOperator instead so raw values are scanned against the raw forward index.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
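The encoding-match check in that commit message can be illustrated with a minimal Java sketch. This is not the actual RangeIndexBasedFilterOperator code: the class name, method signature, and boolean parameters are hypothetical stand-ins for the real DataSource and PredicateEvaluator lookups. It assumes only what the message states, namely that the on-disk range index stores dict IDs whenever a dictionary exists.

```java
// Hypothetical sketch of the "evaluator encoding must match index encoding" gate.
// Names and parameters are illustrative assumptions, not Pinot APIs.
final class RangeIndexGateSketch {
  private RangeIndexGateSketch() {
  }

  /**
   * @param hasRangeIndex              whether the column has a range index at all
   * @param columnHasDictionary        whether a dictionary exists for the column
   * @param evaluatorIsDictionaryBased whether the predicate evaluator matches on dict IDs
   * @return true only when the range index can safely serve the predicate
   */
  static boolean canEvaluate(boolean hasRangeIndex, boolean columnHasDictionary,
      boolean evaluatorIsDictionaryBased) {
    if (!hasRangeIndex) {
      return false;
    }
    // Per the commit message, the index is built over dict IDs whenever a dictionary exists.
    boolean indexStoresDictIds = columnHasDictionary;
    // A mismatch would compare raw values against dict IDs, so decline and let the scan path run.
    return indexStoresDictIds == evaluatorIsDictionaryBased;
  }
}
```

When the method returns false, the caller would be expected to fall through to a scan-based operator, as the commit message describes.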
…rsion change
RangeIndexType.validate: reject the combination of RAW forward index + dictionary + range version 1 (legacy RangeIndexCreator). The on-disk v1 range index over dict IDs is non-exact: query-time partial matches fall back to ScanBasedFilterOperator, which on a RAW forward column would have to apply a dict-based evaluator against raw values, silently producing wrong results. Force the BitSliced range index (version 2, exact) for this combination so the failure mode is config validation rather than wrong query answers.
RangeIndexHandler.needUpdateIndices / updateIndices: also detect when the on-disk range index version differs from the configured version. v1 and v2 have incompatible on-disk layouts and serve different query semantics (non-exact vs exact), so a version change requires a full rebuild. Read the version from the index buffer's first int (matching RangeIndexType.read's dispatch) and rebuild when it doesn't match. The buffer is owned by SegmentDirectory, so don't close it from the handler: the mmap region is shared and closing it would crash other readers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
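As a rough illustration of the version check described in that commit message, the sketch below reads the first int of an index buffer and compares it with the configured version. It is an assumption-laden stand-in: it uses java.nio.ByteBuffer rather than Pinot's own buffer type, assumes big-endian byte order, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: detect a range index version mismatch from the first int of the buffer.
// ByteBuffer and big-endian order are assumptions; the real handler works on Pinot's buffer type.
final class RangeIndexVersionCheckSketch {
  private RangeIndexVersionCheckSketch() {
  }

  /** Reads the version without moving the caller's position and without closing the buffer. */
  static int readOnDiskVersion(ByteBuffer indexBuffer) {
    // duplicate() shares the bytes but has its own position and byte-order setting,
    // so the shared buffer is left untouched for other readers.
    return indexBuffer.duplicate().order(ByteOrder.BIG_ENDIAN).getInt(0);
  }

  /** A differing version means the on-disk layout is incompatible and needs a full rebuild. */
  static boolean needsRebuild(ByteBuffer indexBuffer, int configuredVersion) {
    return readOnDiskVersion(indexBuffer) != configuredVersion;
  }
}
```

The absolute read on a duplicate mirrors the ownership note in the message: the handler inspects the buffer but never closes or repositions it.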
Opened a follow-up docs PR for this change: pinot-contrib/pinot-docs#804
Summary
When a column is configured with encodingType: RAW (or appears in noDictionaryColumns) but a secondary index that needs a dictionary is also configured (inverted, fst, ifst, dict-id range), Pinot now auto-creates a standalone dictionary for that column so the secondary index can function. The forward index stays RAW; the dictionary lives alongside it.
Previously, such configs either silently produced wrong results, fell back to a deprecated raw-value bitmap inverted index, or required the user to manually add a "dictionary": {} block to the FieldConfig.indexes map. This PR makes the common case work without explicit configuration and removes the legacy raw-value bitmap inverted index path entirely.
Builds on the SPI surfaces introduced by #18364 (encoding) and #18365 (requiresDictionary / shouldInvalidateOnDictionaryChange).
What's covered
Auto-create dictionary
- DictionaryIndexType.fromFieldConfigs: when a FieldConfig declares encodingType=RAW and any enabled index returns IndexType.requiresDictionary()=true (FST, IFST, INVERTED, etc.), the dictionary config is left at its default-enabled state instead of being marked DISABLED. Validation and the rest of the pipeline see a dict-enabled column from the start.
- BaseSegmentCreator: DictionaryIndexConfig.requiresDictionary() drives auto-dict creation when a dict-requiring index is configured on a RAW column. If the user explicitly disabled the dictionary, a warning is logged and the dictionary is still created (the alternative is a hard segment-build failure or silent index loss).
- MutableSegmentImpl: respects the encoding plumbing.
- ForwardIndexHandler.ENABLE_DICTIONARY op: if an existing RAW segment is loaded under a config that now requests a dict-requiring index, ForwardIndexHandler materializes a standalone dictionary; subsequent handlers (InvertedIndexHandler, RangeIndexHandler) build their dict-id-based indexes from it.
- BaseDefaultColumnHandler: same logic for newly added columns at reload.
- FieldIndexConfigsUtil: fails fast if the user-supplied ForwardIndexConfig.encodingType disagrees with the column-level FieldConfig.encodingType / noDictionaryColumns.
ForwardIndexHandler.computeOperations simplification
The reload-time decision logic in ForwardIndexHandler.computeOperations was rewritten as four orthogonal questions per column, with the dictionary state expressed as a single rule that drives every dict-toggle decision.
State is tracked as FieldConfig.EncodingType (DICTIONARY, RAW, or null for "forward index disabled / not on disk"), making the state space explicit. The transition steps run in this order:
- existingFwdEncoding vs newFwdEncoding. Three sub-cases: forward disabled, forward re-enabled (rebuild from dict + inverted), or encoding flips with forward staying on.
- existingHasDict vs desiredDict. Auto-enables the dict whenever a secondary index requires it; refuses to remove the dict if any enabled index in the new config still needs it.
The ENABLE_FORWARD_INDEX operation is split into ENABLE_DICT_FORWARD_INDEX and ENABLE_RAW_FORWARD_INDEX so the intent is explicit. Both variants cover two on-disk shapes.
convertDictForwardToRawKeepingDictionary handles the DICT→RAW + dict-kept case (e.g. when an inverted index is added to a dict-encoded column without explicitly enabling dict).
Range index correctness gates (RAW + dict combination)
Shared-dict + RAW exposes a subtle correctness hazard around range indexes: when a dictionary exists at index build time, both RangeIndexCreator (v1) and BitSlicedRangeIndexCreator (v2) build over dict IDs. v1 is non-exact: its query-time partial matches fall back to ScanBasedFilterOperator, which on a RAW forward column would have to apply a dict-based evaluator against raw values, silently producing wrong results.
- RangeIndexType.validate: rejects RAW forward + dictionary + RangeIndexCreator.VERSION (v1). Forces the BitSliced v2 (exact) range index for shared-dict + RAW columns so the failure mode is config validation rather than wrong query answers.
- RangeIndexHandler.needUpdateIndices / updateIndices: read the on-disk range index version (first int of the buffer) and rebuild when it differs from the configured version. v1 and v2 have incompatible on-disk layouts and different exact/non-exact semantics, so a version change requires a full rebuild.
- RangeIndexBasedFilterOperator.canEvaluate: returns false when dataSource.getDictionary() != null and the predicate evaluator isn't dict-based, falling through to ScanBasedFilterOperator, which correctly applies raw values against the raw forward index. Defends against any other path that produces a raw-value evaluator on a column whose range index was built over dict IDs.
- PredicateEvaluatorProvider.getDictionaryUsableForFiltering: per-predicate-type decision. For RANGE on a RAW forward column, the dictionary is kept only when the range index is exact (isExact() == true), so non-exact range readers can't be paired with a dict-based evaluator that scan-fallback can't apply.
Remove legacy raw-value bitmap inverted index
The pre-shared-dict format embedded its own dictionary inline inside the <col>.bitmap.inv.idx file (written by the now-deleted RawValueBitmapInvertedIndexCreator). Reads went through RawValueBitmapInvertedIndexReader and RawValueInvertedIndexFilterOperator. All three are deleted in favor of the standard BitmapInvertedIndexReader over a real standalone dictionary.
- SegmentPreProcessor.removeLegacyRawValueInvertedIndexes pre-pass detects the legacy 44-byte big-endian header (version=1 + cardinality + offsets) on segment load, deletes the file, and lets the new ForwardIndexHandler + InvertedIndexHandler chain rebuild the index in dict-id format.
- InvertedIndexType.ReaderFactory drops the hasDictionary branch; asserts a dictionary exists.
Dict-id-based rebuild path
- DictionaryBasedIndexBuilder (new, ~200 lines): shared helper extracted from InvertedIndexHandler and RangeIndexHandler. It reads raw forward values, looks each up in the dictionary, and feeds (value, dictId) pairs into a DictionaryBasedInvertedIndexCreator. Single per-data-type dispatch handles SV, MV, INT/LONG/FLOAT/DOUBLE/BIG_DECIMAL/STRING/BYTES.
- InvertedIndexAndDictionaryBasedForwardIndexCreator: split _dictionaryEnabled into two flags, _dictionaryPresent (a standalone dict file exists) and _dictionaryBasedForwardIndex (the forward index stores dict IDs). The two are now independent (RAW forward + standalone dict is the new third state).
- BaseIndexHandler: wires the two-flag model.
- InvertedIndexHandler's class-level Javadoc documents that it requires a dictionary by the time updateIndices runs, and that SegmentPreProcessor enforces this by always running ForwardIndexHandler first.
Query path: predicate evaluator selection on shared-dict columns
With shared-dict columns, predicates on the same column may need different evaluators: an EQ can use the dict-id-based inverted index, while a LIKE '%x%' on the same column must scan the raw forward.
- PredicateEvaluatorProvider.getDictionaryUsableForFiltering: per-predicate-type decision returning the dictionary only when a dict-consuming filter operator (sorted, inverted, exact range) is actually available and enabled for that specific predicate type. Inverted-only is dropped for RANGE; non-exact range is dropped for RANGE/EQ. RAW forward + scan reads raw values directly.
- FilterMvTransformFunction: per-value evaluation via transformToDictIdsMV requires forward.getDictIdMV() to actually serve dict ids cheaply. RAW forward indexes throw UnsupportedOperationException from that method, so when the inner Identifier wraps a RAW column (with or without a shared dictionary), the dictionary is dropped here and filterMv falls back to per-value raw matching.
DataFetcher
- DataFetcher.addDataSource drops the (shared) dictionary when the forward index is RAW; the existing per-method branches in ColumnValueReader then take the raw-value paths uniformly without touching the read methods. Callers that genuinely need dict ids on a RAW + shared-dict column read raw values and consult the dictionary directly.
- DefaultGroupByExecutor similarly gates on forwardIndex.isDictionaryEncoded() so shared-dict + RAW columns route to the no-dict GROUP BY path.
addRaw(Object) SPI extraction
- ForwardIndexCreator.addRaw(Object) (new default method): extracted from the add(Object, dictId) body so handlers can write raw values without going through the dict-id routing branch. Required by the rebuild path that converts dict-id forward back to raw forward when dropping a dictionary.
- … add() → addRaw().
Tests
- IndexCombinationValidationTest (~640 lines, new): exhaustive matrix over (encoding, dictionary, secondary index, column type, compression codec) combinations.
- LegacyRawValueInvertedIndexMigrationIntegrationTest (new, in pinot-integration-tests): full-cluster integration test with a synthetically constructed legacy segment (raw forward + legacy embedded-dict inverted file written by a resurrected LegacyRawValueBitmapInvertedIndexCreator). Tars it, uploads to a real cluster, verifies the server preprocessor migrates the segment and that EQ / IN / NOT_EQ queries return correct counts.
- RawForwardIndexWithDictionaryTest (~580 lines): 33 cases covering SV/MV × EQ/RANGE/IN/REGEXP_LIKE on RAW forward + dict, plus a comprehensive mixed-predicate matrix combining the inverted-index dict path and the raw-scan path on the same column with explicit non-zero expected counts.
- RawForwardIndexInvertedIndexTest: query-result equivalence vs DICTIONARY-encoded baseline.
- SegmentPreProcessorTest.testRangeIndexRebuiltOnDictionaryToggle (new): asserts the range index is rebuilt with the right format every time the dictionary state toggles, covering both the auto path (driven by inverted-index add/remove) and the explicit toggle path.
- SegmentPreProcessorTest.testFlipDictForwardToRawForwardKeepingDictionaryForInvertedIndex (new): encoding-flip end-to-end test asserting that a dict-encoded INT column with an inverted index, when reloaded under a config that puts the column in noDictionaryColumns, ends up as dict + inverted + raw forward (dict kept, encoding flipped, inverted intact).
- InvertedIndexHandlerTest: isLegacyRawValueInvertedIndexFormat detection + SegmentPreProcessor rebuild path.
- ForwardIndexHandlerTest (+477 lines): reload-time auto-create-dict coverage.
- PredicateEvaluatorProviderTest (new): per-predicate-type dict-drop decisions, including non-exact range and dict-required-by-inverted scenarios.
- FilterMvTransformFunctionTest: RAW + dict + inverted column filterMv parity with dict-encoded baseline; per-value path drops dict for RAW forward.
- SegmentPreProcessorTest, ColumnMetadataImplTest (new), SegmentGeneratorConfigTest, TableConfigUtilsTest, DictionaryIndexTypeTest, LazyRowTest, ForwardIndexHandlerReloadQueriesTest, TableIndexingTest + CSV: combination coverage and reload regressions.
- RealtimeSegmentConverterTest + CrcUtilsTest: CRC values updated for the new metadata key footprint.
Backward compatibility
- … FORWARD_INDEX_ENCODING in metadata.properties: ColumnMetadataImpl.fromPropertiesConfiguration falls back to inferring encoding from HAS_DICTIONARY (the field added by "Forward-index encoding: introduce Encoding SPI surface and use it for raw vs dict checks" #18364, which this PR depends on).
- … SegmentPreProcessor on first load under the new code. Covered by LegacyRawValueInvertedIndexMigrationIntegrationTest.
- … RangeIndexHandler detects the on-disk version mismatch and rebuilds in v2 format on reload.
- … noDictionaryColumns + invertedIndexColumns overlap, previously implicit, now produces a real dict + standard inverted via auto-create. No config change required.
Example FieldConfig
Implicit (auto-create dictionary because inverted needs it):
{
  "fieldConfigList": [
    { "name": "myColumn", "encodingType": "RAW", "indexes": { "inverted": {} } }
  ]
}
Explicit (also valid; both dictionary and inverted listed):
{
  "fieldConfigList": [
    { "name": "myColumn", "encodingType": "RAW", "compressionCodec": "LZ4", "indexes": { "dictionary": {}, "inverted": {} } }
  ]
}
Out of scope (follow-up PR)
The full operator-routing migration to gate on BlockValSet.isDictionaryEncoded() (instead of BlockValSet.getDictionary() != null) is split into a separate branch, blockvalset-isDictionaryEncoded-routing-for-shared-dict-raw. That branch wires the gate through DistinctExecutorFactory, NoDictionaryMultiColumnGroupKeyGenerator, and all DistinctCount* aggregation functions. This PR includes only the DefaultGroupByExecutor, BinaryOperatorTransformFunction, and FilterMvTransformFunction gate updates needed to keep the new shared-dict + RAW tests green.
🤖 Generated with Claude Code
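For readers who want the auto-create rule from the Summary in concrete form, here is a minimal Java sketch. It is not Pinot's implementation: IndexSpec, Encoding, and desiredDictionary are hypothetical names, and the sketch covers only the auto-enable part of the rule (it ignores an explicitly configured "dictionary": {} block and the forward-index-disabled state).

```java
import java.util.List;

// Hypothetical sketch of the auto-create rule: a dictionary is wanted when the forward
// index is dictionary-encoded, or when any enabled secondary index requires one.
final class DictionaryRuleSketch {
  private DictionaryRuleSketch() {
  }

  enum Encoding { DICTIONARY, RAW }

  /** Illustrative stand-in for one entry of a column's index configuration. */
  static final class IndexSpec {
    final String _name;
    final boolean _enabled;
    final boolean _requiresDictionary;

    IndexSpec(String name, boolean enabled, boolean requiresDictionary) {
      _name = name;
      _enabled = enabled;
      _requiresDictionary = requiresDictionary;
    }
  }

  static boolean desiredDictionary(Encoding forwardEncoding, List<IndexSpec> indexes) {
    if (forwardEncoding == Encoding.DICTIONARY) {
      return true;
    }
    // RAW forward index: keep or create a standalone dictionary only if something needs it.
    for (IndexSpec index : indexes) {
      if (index._enabled && index._requiresDictionary) {
        return true;
      }
    }
    return false;
  }
}
```

Under these assumptions, the implicit example config above (RAW encoding plus an enabled inverted index) evaluates to true, while a RAW column whose only dict-requiring index is disabled evaluates to false.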