feat(columnar_map): SPI and data model for COLUMNAR_MAP index#18368
tarun11Mavani wants to merge 11 commits into apache:master
Introduces the public API surface for the columnar MAP index type:

pinot-spi:
- ColumnarMapIndexConfig: table/field config for enabling columnar MAP
- ColumnarMapNaming: canonical naming helpers for virtual key/value columns
- FieldConfig: add COLUMNAR_MAP to the index type enum
- ComplexFieldSpec: expose MAP key/value type accessors
- Schema: wire ComplexFieldSpec into schema validation

pinot-segment-spi:
- ColumnarMapIndexCreator: creator interface (add/seal/close lifecycle)
- ColumnarMapIndexReader: reader interface (key enumeration, per-key access)
- ColumnMetadataImpl: persist isColumnarMap flag and key/value types
- StandardIndexes: register COLUMNAR_MAP as a known index type
- NullDataSource: datasource stub for absent MAP keys (moved from segment-local)
- V1Constants: metadata keys for columnar MAP column properties
- SegmentGeneratorConfig: propagate columnar MAP field configs

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #18368 +/- ##
============================================
+ Coverage 63.44% 63.51% +0.07%
- Complexity 1683 1709 +26
============================================
Files 3253 3258 +5
Lines 198854 199066 +212
Branches 30796 30835 +39
============================================
+ Hits 126154 126436 +282
+ Misses 62627 62527 -100
- Partials 10073 10103 +30
…isVirtualColumn -> isMapVirtualColumn
…lumn and setMapVirtualColumn
…MapIndexConfig and usages
@raghavagrawal @Jackie-Jiang please take a look.
xiangfu0 left a comment
Flagging two compatibility risks on the current head: one shared-schema wire-format break and one existing-schema validation break.
…UMNAR_MAP
ComplexFieldSpec.toJsonObject(): always emit childFieldSpecs for MAP
columns. When _childFieldSpecs is empty (new COLUMNAR_MAP schemas with
no explicit children), synthesise the legacy {key:STRING, value:STRING}
defaults so older brokers/servers deserialising via toMapFieldSpec()
continue to work across rolling upgrades.
Schema.validate(): remove global $__ rejection. The check fired on
every schema regardless of whether COLUMNAR_MAP was configured, breaking
existing tables whose column names happened to contain that separator.
The guard is now in TableConfigUtils.validate() — only fires when
COLUMNAR_MAP is actually enabled for a table.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
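The serialisation fallback described in this commit can be sketched roughly as follows. This is only an illustration, not the actual ComplexFieldSpec code: the class name MapFieldSpecSketch, the plain-Map JSON model, and the "dataType" property are assumptions made for the example. Only the legacy {key:STRING, value:STRING} defaults and the "always emit childFieldSpecs" rule come from the commit message.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a MAP field spec; NOT Pinot's ComplexFieldSpec.
final class MapFieldSpecSketch {
  // Child name -> data type name; empty for schemas with no explicit children.
  private final Map<String, String> childTypes;

  MapFieldSpecSketch(Map<String, String> childTypes) {
    this.childTypes = childTypes;
  }

  // Always emits "childFieldSpecs". When no children were declared, the legacy
  // STRING/STRING defaults are synthesised so that older readers still find them.
  Map<String, Object> toJson() {
    Map<String, String> children = new LinkedHashMap<>(childTypes);
    if (children.isEmpty()) {
      children.put("key", "STRING");
      children.put("value", "STRING");
    }
    Map<String, Object> json = new LinkedHashMap<>();
    json.put("dataType", "MAP"); // assumed property name, for illustration only
    json.put("childFieldSpecs", children);
    return json;
  }
}
```

Explicitly declared children pass through unchanged, so the fallback should only affect schemas that omit them.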
/**
 * Returns true if {@code key} is present in this MAP column for at least one document in
 * this segment. Call this before {@link #getKeyDataSource(String)} to avoid undefined
 * behaviour on absent keys.
Shouldn't absent keys fall into the sparse key reader path? Do we need this?
containsKey is currently used as an early-exit guard in MapFilterOperator and the aggregation/group-by plan nodes to skip processing when a key is absent from the segment. It is backed by the map key set, so the lookup is O(1).
Removing it would mean every caller replaces containsKey(key) with getDataSource(key) != null, which triggers a full NullDataSource construction just to check presence.
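A minimal sketch of the early-exit guard described in this reply. The class MapKeyGuardSketch, its int[] value model, and the countEquals method are hypothetical and exist only for illustration. They are not MapFilterOperator or the real reader interface.

```java
import java.util.Map;

// Hypothetical reader: a key set plus per-key values, with a counter that
// stands in for the cost of building a data source.
final class MapKeyGuardSketch {
  private final Map<String, int[]> valuesByKey;
  int dataSourceLookups = 0;

  MapKeyGuardSketch(Map<String, int[]> valuesByKey) {
    this.valuesByKey = valuesByKey;
  }

  // Cheap presence check backed by the key set (a hash lookup here).
  boolean containsKey(String key) {
    return valuesByKey.containsKey(key);
  }

  // Absent keys yield an empty placeholder, so the result is never null.
  int[] getKeyDataSource(String key) {
    dataSourceLookups++;
    int[] values = valuesByKey.get(key);
    return values != null ? values : new int[0];
  }

  // Filter-style caller: skips data source construction when the key is absent.
  int countEquals(String key, int target) {
    if (!containsKey(key)) {
      return 0; // early exit, no data source is built
    }
    int count = 0;
    for (int value : getKeyDataSource(key)) {
      if (value == target) {
        count++;
      }
    }
    return count;
  }
}
```

In this sketch a query on an absent key never reaches getKeyDataSource, which is the saving the guard is meant to provide.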
// Optional, default false
public static final String IS_AUTO_GENERATED = "isAutoGenerated";

// Optional, default false. True for virtual columns materialized from a COLUMNAR_MAP parent column.
Can you share what columnMetadata and index_map look like for virtual columns?
Sure, I have it documented here:
https://docs.google.com/document/d/14kPmjDTKbO8l0ql4rrN7I5Yki5pqMw6GeGmxxc9grsU/edit?tab=t.0#bookmark=id.kycvq78ioe5g
}

/** Maximum number of MAP keys to materialise as dense virtual columns. Default: 1000. */
public int getMaxDenseKeys() {
How do we choose which 1000 keys to keep in case more than 1000 keys qualify as dense? Is it based on frequency?
When more keys qualify as dense than maxDenseKeys allows, the top maxDenseKeys keys ranked by fill rate are materialized; the remainder fall back to the sparse MAP column.
I have also added this to the Javadoc.
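The top-N cutoff described above could look roughly like this. It is a sketch under assumptions, not the PR's implementation: the class and method names are hypothetical, fill rates are assumed to be precomputed per key, and breaking ties by key name is an assumption added here to keep the result deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of the PR.
final class DenseKeyCutoffSketch {
  // Returns at most maxDenseKeys keys, highest fill rate first.
  // Keys not returned would fall back to the sparse MAP column.
  static List<String> selectDenseKeys(Map<String, Double> fillRates, int maxDenseKeys) {
    List<Map.Entry<String, Double>> ranked = new ArrayList<>(fillRates.entrySet());
    ranked.sort((a, b) -> {
      int byRate = Double.compare(b.getValue(), a.getValue()); // descending fill rate
      return byRate != 0 ? byRate : a.getKey().compareTo(b.getKey()); // assumed tie-break
    });
    List<String> selected = new ArrayList<>();
    for (int i = 0; i < ranked.size() && i < maxDenseKeys; i++) {
      selected.add(ranked.get(i).getKey());
    }
    return selected;
  }
}
```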
/**
 * Configuration for the COLUMNAR_MAP index on a MAP column.
 *
 * <p>Dense keys (above {@code denseKeyMinFillRate} or explicitly listed in {@code denseKeys}) are
How do you know whether a key is dense or sparse? Does it use a static list in the config?
It's determined at segment seal time, based on the fill rate.
A key can be dense in one segment and sparse in another unless it is explicitly listed under denseKeys in the config. In that case, it will always be dense regardless of fill rate.
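The per-segment decision described in this reply can be sketched as below. The names denseKeyMinFillRate and denseKeys come from the Javadoc under review; everything else is an assumption for illustration: the class and method names, defining fill rate as the fraction of documents in the segment that contain the key, and treating the threshold as inclusive.

```java
import java.util.Set;

// Hypothetical helper; not part of the PR.
final class DenseOrSparseSketch {
  static boolean isDense(String key, int docsWithKey, int totalDocs,
      double denseKeyMinFillRate, Set<String> denseKeys) {
    if (denseKeys.contains(key)) {
      return true; // explicitly listed keys are always dense
    }
    if (totalDocs == 0) {
      return false; // empty segment: nothing to measure
    }
    // Assumed definition: share of documents in this segment that contain the key.
    double fillRate = (double) docsWithKey / totalDocs;
    return fillRate >= denseKeyMinFillRate; // inclusive threshold is an assumption
  }
}
```

Because the inputs are per-segment counts, the same key can come out dense in one segment and sparse in another, matching the behaviour described above.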
 * Get the Data Source representation of a single key within this map column.
 * Only call after confirming the key exists via {@link #containsKey(String)}.
 */
DataSource getKeyDataSource(String key);
Not introduced in this PR, but let's rename it to:
- DataSource getKeyDataSource(String key);
+ @Nullable
+ DataSource getDataSource(String key);
It is very confusing now because the data source is for the value, not the key.
I suggest having it return @Nullable to indicate that the key does not exist.
Agreed on the naming: getDataSource(key) is cleaner. I'll do the full getKeyXXX → getXXX rename across the SPI in a separate refactoring PR so it's one clean sweep rather than scattered across the stack. Will raise a new PR for this.
For @Nullable: I'd prefer to keep the current non-null contract backed by NullDataSource. The callers in ProjectionBlock and ItemTransformFunction dereference the result immediately without a null check; NullDataSource lets them do that safely by returning the column's default value for every doc.
If we switch to @Nullable, those two callers need null guards, and so does any future caller. NullDataSource gives the same semantics (absent key → null/default for all rows) without pushing null handling into every call site.
Happy to revisit this if you strongly feel we should add @Nullable.
Let me know your thoughts.
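The non-null contract argued for here is essentially the null-object pattern. A simplified sketch follows. All types below are hypothetical stand-ins rather than Pinot's DataSource or NullDataSource, and the "null" default value is an assumption chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified view of a per-key data source.
interface KeyValuesSketch {
  String valueAt(int docId);
  boolean isNull(int docId);
}

// Null-object stand-in for an absent key: default value for every doc, every row null.
final class AbsentKeyValuesSketch implements KeyValuesSketch {
  private final String defaultValue;

  AbsentKeyValuesSketch(String defaultValue) {
    this.defaultValue = defaultValue;
  }

  @Override public String valueAt(int docId) { return defaultValue; }
  @Override public boolean isNull(int docId) { return true; }
}

// Data source for a key that exists in the segment.
final class PresentKeyValuesSketch implements KeyValuesSketch {
  private final String[] values;

  PresentKeyValuesSketch(String[] values) {
    this.values = values;
  }

  @Override public String valueAt(int docId) { return values[docId]; }
  @Override public boolean isNull(int docId) { return false; }
}

final class MapColumnSketch {
  private static final String DEFAULT_VALUE = "null"; // assumed default for the example
  private final Map<String, String[]> valuesByKey = new HashMap<>();

  void put(String key, String... values) {
    valuesByKey.put(key, values);
  }

  // Never returns null, so callers can dereference the result immediately.
  KeyValuesSketch getKeyDataSource(String key) {
    String[] values = valuesByKey.get(key);
    if (values == null) {
      return new AbsentKeyValuesSketch(DEFAULT_VALUE);
    }
    return new PresentKeyValuesSketch(values);
  }
}
```

With a @Nullable return type, each call such as getKeyDataSource(key).valueAt(docId) would need a null guard first; the null object keeps that handling in one place.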
 * <p>The default implementation delegates to {@link #getKeyDataSources()}, which may be
 * expensive for large key sets. Implementations should override for O(1) performance.
 */
default boolean containsKey(String key) {
(optional) This is probably not needed if we make getDataSource return @Nullable
replied above.
public static final String TEXT_ID = "text_index";
public static final String H3_ID = "h3_index";
public static final String VECTOR_ID = "vector_index";
public static final String COLUMNAR_MAP_ID = "columnar_map";
Should we just call it MAP? Do you foresee other map types being added in the future that wouldn't fall under this?
There's already a MAP concept in the codebase. The existing MapIndexReader / MapDataSource / MapColumnPreIndexStatsCollector family represents the non-columnar MAP storage path. Naming the new index type MAP would collide with that established concept and make it unclear which path a given piece of code refers to.
Naming it columnar_map also makes clear how it's stored.
- getMaterializedColumnMetadata replaces getVirtualColumnMetadata
- getValueType replaces getKeyValueType
- isMaterializedMapColumn replaces isMapVirtualColumn; _isMapVirtualColumn removed
- IS_MAP_MATERIALIZED_COLUMN replaces IS_MAP_VIRTUAL_COLUMN (string: mapMaterializedColumn)
- FieldConfig.IndexType enum Javadoc
- ColumnarMapIndexConfig dense/sparse + maxDenseKeys Javadoc
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…e and ColumnarMapDataSource
containsKey() is allowed to return true conservatively when key presence cannot be determined without a full scan (e.g. when a sparse column exists). Document this in the interface Javadoc so callers know to handle absent keys even when containsKey returns true. Also document that getKeyDataSources() returns dense keys only.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ssion test
- MapDataSource.containsKey() Javadoc: replace {@link NullDataSource} (a
pinot-segment-local class) with prose to fix illegal cross-module reference
from pinot-segment-spi.
- SegmentV1V2ToV3FormatConverterTest: add testVirtualColumnsNotDoubleWrittenDuringV3Conversion
that builds a V1 MAP segment via the full SegmentIndexCreationDriverImpl pipeline
(virtual columns now in COMPLEX_COLUMNS), converts to V3, and asserts each virtual
column's forward_index.startOffset appears exactly once in index_map — catching any
regression in the dedup logic that guards against double-write when virtual columns
are present in both getAllColumns() and getVirtualColumnsFromMetadata().
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Clarify that the DataSource returned for an absent key returns the column default value for forward-index reads and marks all rows as null via the null-value bitmap.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- getMaterializedColumnMetadata replaces getVirtualColumnMetadata
- getValueType replaces getKeyValueType
- isMaterializedMapColumn replaces isMapVirtualColumn; _isMapVirtualColumn field removed (derived as parentMapColumn != null)
- IS_MAP_MATERIALIZED_COLUMN replaces IS_MAP_VIRTUAL_COLUMN (on-disk key: mapMaterializedColumn)
- FieldConfig.IndexType: add one-line Javadoc to all 11 enum constants
- ColumnarMapIndexConfig: document dense/sparse key selection criteria and top-N-by-fill-rate semantics for maxDenseKeys cutoff
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
75f588a to
b098f20
b098f20 to
59a62b9
59a62b9 to
f12d50d
f12d50d to
a0e8451
Summary
This is the first PR in a 4-part stack introducing the COLUMNAR_MAP index type, which stores MAP columns in a columnar format (one sub-column per key) enabling per-key dictionary lookup, inverted-index-based filtering, and dict-based GROUP BY without expression evaluation.
RFC: https://docs.google.com/document/d/14kPmjDTKbO8l0ql4rrN7I5Yki5pqMw6GeGmxxc9grsU/edit?tab=t.0
This PR covers the public API surface only — config, naming, interfaces, and metadata. No implementation code.
pinot-spi changes
- ColumnarMapIndexConfig: new table/field config object controlling columnar MAP behavior (key limit, value types, encoding)
- ColumnarMapNaming: canonical naming helpers that derive the virtual column name for each MAP key (e.g. value_map_string$__status)
- FieldConfig: add COLUMNAR_MAP to the IndexType enum
- ComplexFieldSpec: expose getMapKeyType() / getMapValueType() accessors
- Schema: wire ComplexFieldSpec key/value types into schema validation

pinot-segment-spi changes
- ColumnarMapIndexCreator: creator interface (add(docId, map), seal(), close())
- ColumnarMapIndexReader: reader interface (getKeys(), getKeyDataSource(key))
- ColumnMetadataImpl: persist isColumnarMap, mapKeyType, mapValueType flags in segment metadata
- StandardIndexes: register COLUMNAR_MAP as a known index type
- NullDataSource: DataSource stub returned for MAP keys absent from a given segment, avoids null-checks throughout the query path
- V1Constants: metadata property keys for columnar MAP column properties
- SegmentGeneratorConfig: propagate ColumnarMapIndexConfig from field configs into segment generation

Follow-up PRs
- pinot-segment-local: ColumnarMapColumnSplitter, MutableColumnarMapIndex, ColumnarMapIndexType
- pinot-core: MapFilterOperator, AggregationPlanNode, GroupByPlanNode, MapKeyAwareDictionaryGroupKeyGenerator
- pinot-integration-tests: realtime consuming and committed segment end-to-end tests

Test plan
- ColumnarMapIndexConfigTest: serialisation/deserialisation, defaults, validation
- ColumnarMapNamingTest: virtual column name round-trip for string/int/long key types
- ColumnarMapDataTypeTest: key/value type compatibility matrix
- ColumnMetadataImplTest: metadata persistence for columnar MAP columns
- SchemaTest: schema validation accepts/rejects MAP field configs correctly
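
The naming convention from the summary (value_map_string$__status) and the round-trip mentioned in the test plan can be illustrated with a short sketch. The $__ separator and the example name come from this PR; the class and method names below are hypothetical and do not reflect the actual ColumnarMapNaming API.

```java
// Hypothetical helper; NOT the ColumnarMapNaming class from this PR.
final class ColumnarMapNamingSketch {
  static final String SEPARATOR = "$__";

  // Derives the per-key column name from the parent MAP column and a key.
  static String columnNameFor(String mapColumn, String key) {
    return mapColumn + SEPARATOR + key;
  }

  // Returns the parent MAP column, or null if the name has no separator.
  static String parentOf(String columnName) {
    int index = columnName.indexOf(SEPARATOR);
    return index < 0 ? null : columnName.substring(0, index);
  }

  // Returns the MAP key, or null if the name has no separator.
  static String keyOf(String columnName) {
    int index = columnName.indexOf(SEPARATOR);
    return index < 0 ? null : columnName.substring(index + SEPARATOR.length());
  }
}
```

Splitting on the first separator means a parent column whose own name contains $__ would be parsed incorrectly in this sketch, which is consistent with the review feedback that such names are only rejected when COLUMNAR_MAP is enabled for the table.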