Skip to content

feat(columnar_map): SPI and data model for COLUMNAR_MAP index#18368

Open
tarun11Mavani wants to merge 11 commits intoapache:masterfrom
tarun11Mavani:columnar-map-v2-spi
Open

feat(columnar_map): SPI and data model for COLUMNAR_MAP index#18368
tarun11Mavani wants to merge 11 commits intoapache:masterfrom
tarun11Mavani:columnar-map-v2-spi

Conversation

@tarun11Mavani
Copy link
Copy Markdown
Contributor

@tarun11Mavani tarun11Mavani commented Apr 29, 2026

Summary

This is the first PR in a 4-part stack introducing the COLUMNAR_MAP index type, which stores MAP columns in a columnar format (one sub-column per key) enabling per-key dictionary lookup, inverted-index-based filtering, and dict-based GROUP BY without expression evaluation.
RFC: https://docs.google.com/document/d/14kPmjDTKbO8l0ql4rrN7I5Yki5pqMw6GeGmxxc9grsU/edit?tab=t.0

This PR covers the public API surface only — config, naming, interfaces, and metadata. No implementation code.

pinot-spi changes

  • ColumnarMapIndexConfig — new table/field config object controlling columnar MAP behavior (key limit, value types, encoding)
  • ColumnarMapNaming — canonical naming helpers that derive the virtual column name for each MAP key (e.g. value_map_string$__status)
  • FieldConfig — add COLUMNAR_MAP to the IndexType enum
  • ComplexFieldSpec — expose getMapKeyType() / getMapValueType() accessors
  • Schema — wire ComplexFieldSpec key/value types into schema validation

pinot-segment-spi changes

  • ColumnarMapIndexCreator — creator interface (add(docId, map), seal(), close())
  • ColumnarMapIndexReader — reader interface (getKeys(), getKeyDataSource(key))
  • ColumnMetadataImpl — persist isColumnarMap, mapKeyType, mapValueType flags in segment metadata
  • StandardIndexes — register COLUMNAR_MAP as a known index type
  • NullDataSourceDataSource stub returned for MAP keys absent from a given segment, avoids null-checks throughout the query path
  • V1Constants — metadata property keys for columnar MAP column properties
  • SegmentGeneratorConfig — propagate ColumnarMapIndexConfig from field configs into segment generation

Follow-up PRs

  • PR 2: Segment storage layer (pinot-segment-local) — ColumnarMapColumnSplitter, MutableColumnarMapIndex, ColumnarMapIndexType
  • PR 3: Query engine (pinot-core) — MapFilterOperator, AggregationPlanNode, GroupByPlanNode, MapKeyAwareDictionaryGroupKeyGenerator
  • PR 4: Integration tests (pinot-integration-tests) — realtime consuming and committed segment end-to-end tests

Test plan

  • ColumnarMapIndexConfigTest — serialisation/deserialisation, defaults, validation
  • ColumnarMapNamingTest — virtual column name round-trip for string/int/long key types
  • ColumnarMapDataTypeTest — key/value type compatibility matrix
  • ColumnMetadataImplTest — metadata persistence for columnar MAP columns
  • SchemaTest — schema validation accepts/rejects MAP field configs correctly

Introduces the public API surface for the columnar MAP index type:

pinot-spi:
- ColumnarMapIndexConfig: table/field config for enabling columnar MAP
- ColumnarMapNaming: canonical naming helpers for virtual key/value columns
- FieldConfig: add COLUMNAR_MAP to the index type enum
- ComplexFieldSpec: expose MAP key/value type accessors
- Schema: wire ComplexFieldSpec into schema validation

pinot-segment-spi:
- ColumnarMapIndexCreator: creator interface (add/seal/close lifecycle)
- ColumnarMapIndexReader: reader interface (key enumeration, per-key access)
- ColumnMetadataImpl: persist isColumnarMap flag and key/value types
- StandardIndexes: register COLUMNAR_MAP as a known index type
- NullDataSource: datasource stub for absent MAP keys (moved from segment-local)
- V1Constants: metadata keys for columnar MAP column properties
- SegmentGeneratorConfig: propagate columnar MAP field configs

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 43 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.51%. Comparing base (ac47494) to head (16e05bd).
⚠️ Report is 13 commits behind head on master.

Files with missing lines Patch % Lines
...ot/segment/spi/creator/SegmentGeneratorConfig.java 0.00% 10 Missing ⚠️
...segment/spi/index/metadata/ColumnMetadataImpl.java 59.09% 5 Missing and 4 partials ⚠️
...pinot/spi/config/table/ColumnarMapIndexConfig.java 80.43% 0 Missing and 9 partials ⚠️
...va/org/apache/pinot/spi/data/ComplexFieldSpec.java 80.00% 4 Missing and 3 partials ⚠️
...pache/pinot/segment/spi/index/StandardIndexes.java 0.00% 2 Missing ⚠️
...gment/spi/index/reader/ColumnarMapIndexReader.java 0.00% 2 Missing ⚠️
...a/org/apache/pinot/spi/data/ColumnarMapNaming.java 75.00% 0 Missing and 2 partials ⚠️
...he/pinot/segment/spi/datasource/MapDataSource.java 0.00% 1 Missing ⚠️
...ent/spi/index/creator/ColumnarMapIndexCreator.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18368      +/-   ##
============================================
+ Coverage     63.44%   63.51%   +0.07%     
- Complexity     1683     1709      +26     
============================================
  Files          3253     3258       +5     
  Lines        198854   199066     +212     
  Branches      30796    30835      +39     
============================================
+ Hits         126154   126436     +282     
+ Misses        62627    62527     -100     
- Partials      10073    10103      +30     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-21 63.51% <66.66%> (+0.07%) ⬆️
temurin 63.51% <66.66%> (+0.07%) ⬆️
unittests 63.51% <66.66%> (+0.07%) ⬆️
unittests1 55.46% <66.66%> (+0.09%) ⬆️
unittests2 34.98% <2.32%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@tarun11Mavani
Copy link
Copy Markdown
Contributor Author

@raghavagrawal @Jackie-Jiang please take a look.

Copy link
Copy Markdown
Contributor

@xiangfu0 xiangfu0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flagging two compatibility risks on the current head: one shared-schema wire-format break and one existing-schema validation break.

Comment thread pinot-spi/src/main/java/org/apache/pinot/spi/data/ComplexFieldSpec.java Outdated
Comment thread pinot-spi/src/main/java/org/apache/pinot/spi/data/Schema.java Outdated
…UMNAR_MAP

ComplexFieldSpec.toJsonObject(): always emit childFieldSpecs for MAP
columns. When _childFieldSpecs is empty (new COLUMNAR_MAP schemas with
no explicit children), synthesise the legacy {key:STRING, value:STRING}
defaults so older brokers/servers deserialising via toMapFieldSpec()
continue to work across rolling upgrades.

Schema.validate(): remove global \$__ rejection. The check fired on
every schema regardless of whether COLUMNAR_MAP was configured, breaking
existing tables whose column names happened to contain that separator.
The guard is now in TableConfigUtils.validate() — only fires when
COLUMNAR_MAP is actually enabled for a table.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
/**
* Returns true if {@code key} is present in this MAP column for at least one document in
* this segment. Call this before {@link #getKeyDataSource(String)} to avoid undefined
* behaviour on absent keys.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should not absent keys fall into Sparse Key reader Path? Do we need this ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ContainsKey is currently used as an early-exit guard in MapFilterOperator and the aggregation/group-by plan nodes to skip processing when a key is absent from the segment (backed by the map key set) which is O(1) lookup.

Removing it would mean every caller replaces containsKey(key) with getDataSource(key) != null, which triggers a full NullDataSource construction just to check presence.

// Optional, default false
public static final String IS_AUTO_GENERATED = "isAutoGenerated";

// Optional, default false. True for virtual columns materialized from a COLUMNAR_MAP parent column.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you share how does columnMetadata and index_map looks like for virtual columns?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}

/** Maximum number of MAP keys to materialise as dense virtual columns. Default: 1000. */
public int getMaxDenseKeys() {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we choose which 1000 keys incase we have more than 1000 dense keys? is it based on frequency ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When more keys qualify as dense than maxDenseKeys allows, the top maxDenseKeys keys ranked by fill rate are materialized; the remainder fall back to the sparse MAP column.
I have also added this in the java doc.

/**
* Configuration for the COLUMNAR_MAP index on a MAP column.
*
* <p>Dense keys (above {@code denseKeyMinFillRate} or explicitly listed in {@code denseKeys}) are
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you know if a key is dense or sparse? is it using static list in Config ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's determined at the segment seal time based on the fill rate.
A key can be dense in one segment and sparse in another unless specifically added as denseKeys in the config. In that case, it will always be dense regardless of fill rate.

@Jackie-Jiang Jackie-Jiang added the feature New functionality label May 6, 2026
* Get the Data Source representation of a single key within this map column.
* Only call after confirming the key exists via {@link #containsKey(String)}.
*/
DataSource getKeyDataSource(String key);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not introduced in this PR, but let's rename it to:

Suggested change
DataSource getKeyDataSource(String key);
@Nullable
DataSource getDataSource(String key);

It is very confusing now because the data source is for value, not key.
Suggest letting it return @Nullable to represent key not exist

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on the naming — getDataSource(key) is cleaner. I'll do the full getKeyXXX → getXXX rename across the SPI in a separate refactoring PR so it's one clean sweep rather than scattered across the stack. Will raise a new PR for this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For @nullable: I'd prefer to keep the current non-null contract backed by NullDataSource. The callers in ProjectionBlock and ItemTransformFunction dereference the result immediately without a null-check — NullDataSource lets them do that safely by returning the column's default value for every doc.

If we switch to @nullable, those two callers need null-guards, and so does any future caller. NullDataSource gives the same semantic (absent key → null/default for all rows) without pushing null-handling into every call site.
Happy to revisit this if you strognly feel we should add @nullable.

Let me know your thoughts.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the API names here - #18437

* <p>The default implementation delegates to {@link #getKeyDataSources()}, which may be
* expensive for large key sets. Implementations should override for O(1) performance.
*/
default boolean containsKey(String key) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(optional) This is probably not needed if we make getDataSource return @Nullable

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replied above.

For https://github.com/nullable: I'd prefer to keep the current non-null contract backed by NullDataSource. The callers in ProjectionBlock and ItemTransformFunction dereference the result immediately without a null-check — NullDataSource lets them do that safely by returning the column's default value for every doc.

If we switch to https://github.com/nullable, those two callers need null-guards, and so does any future caller. NullDataSource gives the same semantic (absent key → null/default for all rows) without pushing null-handling into every call site.
Happy to revisit this if you strognly feel we should add https://github.com/nullable.

Let me know your thoughts.

public static final String TEXT_ID = "text_index";
public static final String H3_ID = "h3_index";
public static final String VECTOR_ID = "vector_index";
public static final String COLUMNAR_MAP_ID = "columnar_map";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just call it MAP? Do you foresee other map types to be added in the future that doesn't go under this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's already a MAP concept in the codebase. The existing MapIndexReader / MapDataSource / MapColumnPreIndexStatsCollector family represents the non-columnar MAP storage path. Naming the new index type MAP would collide with that established concept and make it unclear which path a given piece of code refers to.
Naming it columnar_map also calls out how it's stored clearly.

Comment thread pinot-spi/src/main/java/org/apache/pinot/spi/config/table/FieldConfig.java Outdated
tarun11Mavani added a commit to tarun11Mavani/pinot that referenced this pull request May 7, 2026
- getMaterializedColumnMetadata replaces getVirtualColumnMetadata
- getValueType replaces getKeyValueType
- isMaterializedMapColumn replaces isMapVirtualColumn; _isMapVirtualColumn removed
- IS_MAP_MATERIALIZED_COLUMN replaces IS_MAP_VIRTUAL_COLUMN (string: mapMaterializedColumn)
- FieldConfig.IndexType enum Javadoc
- ColumnarMapIndexConfig dense/sparse + maxDenseKeys Javadoc

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
tarun11Mavani and others added 3 commits May 7, 2026 07:29
…e and ColumnarMapDataSource

containsKey() is allowed to return true conservatively when key presence cannot be determined
without a full scan (e.g. when a sparse column exists). Document this in the interface Javadoc
so callers know to handle absent keys even when containsKey returns true. Also document that
getKeyDataSources() returns dense keys only.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ssion test

- MapDataSource.containsKey() Javadoc: replace {@link NullDataSource} (a
  pinot-segment-local class) with prose to fix illegal cross-module reference
  from pinot-segment-spi.
- SegmentV1V2ToV3FormatConverterTest: add testVirtualColumnsNotDoubleWrittenDuringV3Conversion
  that builds a V1 MAP segment via the full SegmentIndexCreationDriverImpl pipeline
  (virtual columns now in COMPLEX_COLUMNS), converts to V3, and asserts each virtual
  column's forward_index.startOffset appears exactly once in index_map — catching any
  regression in the dedup logic that guards against double-write when virtual columns
  are present in both getAllColumns() and getVirtualColumnsFromMetadata().

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Clarify that the DataSource returned for an absent key returns the
column default value for forward-index reads and marks all rows as null
via the null-value bitmap.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
tarun11Mavani added a commit to tarun11Mavani/pinot that referenced this pull request May 7, 2026
- getMaterializedColumnMetadata replaces getVirtualColumnMetadata
- getValueType replaces getKeyValueType
- isMaterializedMapColumn replaces isMapVirtualColumn; _isMapVirtualColumn
  field removed (derived as parentMapColumn != null)
- IS_MAP_MATERIALIZED_COLUMN replaces IS_MAP_VIRTUAL_COLUMN (on-disk key:
  mapMaterializedColumn)
- FieldConfig.IndexType: add one-line Javadoc to all 11 enum constants
- ColumnarMapIndexConfig: document dense/sparse key selection criteria and
  top-N-by-fill-rate semantics for maxDenseKeys cutoff

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@tarun11Mavani tarun11Mavani force-pushed the columnar-map-v2-spi branch from 75f588a to b098f20 Compare May 7, 2026 07:31
tarun11Mavani added a commit to tarun11Mavani/pinot that referenced this pull request May 7, 2026
- getMaterializedColumnMetadata replaces getVirtualColumnMetadata
- getValueType replaces getKeyValueType
- isMaterializedMapColumn replaces isMapVirtualColumn; _isMapVirtualColumn
  field removed (derived as parentMapColumn != null)
- IS_MAP_MATERIALIZED_COLUMN replaces IS_MAP_VIRTUAL_COLUMN (on-disk key:
  mapMaterializedColumn)
- FieldConfig.IndexType: add one-line Javadoc to all 11 enum constants
- ColumnarMapIndexConfig: document dense/sparse key selection criteria and
  top-N-by-fill-rate semantics for maxDenseKeys cutoff

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@tarun11Mavani tarun11Mavani force-pushed the columnar-map-v2-spi branch from b098f20 to 59a62b9 Compare May 7, 2026 07:41
tarun11Mavani added a commit to tarun11Mavani/pinot that referenced this pull request May 7, 2026
- getMaterializedColumnMetadata replaces getVirtualColumnMetadata
- getValueType replaces getKeyValueType
- isMaterializedMapColumn replaces isMapVirtualColumn; _isMapVirtualColumn
  field removed (derived as parentMapColumn != null)
- IS_MAP_MATERIALIZED_COLUMN replaces IS_MAP_VIRTUAL_COLUMN (on-disk key:
  mapMaterializedColumn)
- FieldConfig.IndexType: add one-line Javadoc to all 11 enum constants
- ColumnarMapIndexConfig: document dense/sparse key selection criteria and
  top-N-by-fill-rate semantics for maxDenseKeys cutoff

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@tarun11Mavani tarun11Mavani force-pushed the columnar-map-v2-spi branch from 59a62b9 to f12d50d Compare May 7, 2026 07:43
- getMaterializedColumnMetadata replaces getVirtualColumnMetadata
- getValueType replaces getKeyValueType
- isMaterializedMapColumn replaces isMapVirtualColumn; _isMapVirtualColumn
  field removed (derived as parentMapColumn != null)
- IS_MAP_MATERIALIZED_COLUMN replaces IS_MAP_VIRTUAL_COLUMN (on-disk key:
  mapMaterializedColumn)
- FieldConfig.IndexType: add one-line Javadoc to all 11 enum constants
- ColumnarMapIndexConfig: document dense/sparse key selection criteria and
  top-N-by-fill-rate semantics for maxDenseKeys cutoff

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@tarun11Mavani tarun11Mavani force-pushed the columnar-map-v2-spi branch from f12d50d to a0e8451 Compare May 7, 2026 09:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature New functionality

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants