feat(columnar_map): SPI and data model for COLUMNAR_MAP index by tarun11Mavani · Pull Request #18368 · apache/pinot

tarun11Mavani · 2026-04-29T05:56:54Z

Summary

This is the first PR in a 4-part stack introducing the COLUMNAR_MAP index type, which stores MAP columns in a columnar format (one sub-column per key) enabling per-key dictionary lookup, inverted-index-based filtering, and dict-based GROUP BY without expression evaluation.
RFC: https://docs.google.com/document/d/14kPmjDTKbO8l0ql4rrN7I5Yki5pqMw6GeGmxxc9grsU/edit?tab=t.0

This PR covers the public API surface only — config, naming, interfaces, and metadata. No implementation code.

`pinot-spi` changes

ColumnarMapIndexConfig — new table/field config object controlling columnar MAP behavior (key limit, value types, encoding)
ColumnarMapNaming — canonical naming helpers that derive the virtual column name for each MAP key (e.g. value_map_string$__status)
FieldConfig — add COLUMNAR_MAP to the IndexType enum
ComplexFieldSpec — expose getMapKeyType() / getMapValueType() accessors
Schema — wire ComplexFieldSpec key/value types into schema validation

`pinot-segment-spi` changes

ColumnarMapIndexCreator — creator interface (add(docId, map), seal(), close())
ColumnarMapIndexReader — reader interface (getKeys(), getKeyDataSource(key))
ColumnMetadataImpl — persist isColumnarMap, mapKeyType, mapValueType flags in segment metadata
StandardIndexes — register COLUMNAR_MAP as a known index type
NullDataSource — DataSource stub returned for MAP keys absent from a given segment, avoids null-checks throughout the query path
V1Constants — metadata property keys for columnar MAP column properties
SegmentGeneratorConfig — propagate ColumnarMapIndexConfig from field configs into segment generation

Follow-up PRs

PR 2: Segment storage layer (pinot-segment-local) — ColumnarMapColumnSplitter, MutableColumnarMapIndex, ColumnarMapIndexType
PR 3: Query engine (pinot-core) — MapFilterOperator, AggregationPlanNode, GroupByPlanNode, MapKeyAwareDictionaryGroupKeyGenerator
PR 4: Integration tests (pinot-integration-tests) — realtime consuming and committed segment end-to-end tests

Test plan

ColumnarMapIndexConfigTest — serialisation/deserialisation, defaults, validation
ColumnarMapNamingTest — virtual column name round-trip for string/int/long key types
ColumnarMapDataTypeTest — key/value type compatibility matrix
ColumnMetadataImplTest — metadata persistence for columnar MAP columns
SchemaTest — schema validation accepts/rejects MAP field configs correctly

Introduces the public API surface for the columnar MAP index type: pinot-spi: - ColumnarMapIndexConfig: table/field config for enabling columnar MAP - ColumnarMapNaming: canonical naming helpers for virtual key/value columns - FieldConfig: add COLUMNAR_MAP to the index type enum - ComplexFieldSpec: expose MAP key/value type accessors - Schema: wire ComplexFieldSpec into schema validation pinot-segment-spi: - ColumnarMapIndexCreator: creator interface (add/seal/close lifecycle) - ColumnarMapIndexReader: reader interface (key enumeration, per-key access) - ColumnMetadataImpl: persist isColumnarMap flag and key/value types - StandardIndexes: register COLUMNAR_MAP as a known index type - NullDataSource: datasource stub for absent MAP keys (moved from segment-local) - V1Constants: metadata keys for columnar MAP column properties - SegmentGeneratorConfig: propagate columnar MAP field configs Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

codecov-commenter · 2026-04-29T06:45:49Z

Codecov Report

❌ Patch coverage is 66.66667% with 43 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.51%. Comparing base (ac47494) to head (16e05bd).
⚠️ Report is 13 commits behind head on master.

Files with missing lines	Patch %	Lines
...ot/segment/spi/creator/SegmentGeneratorConfig.java	0.00%	10 Missing ⚠️
...segment/spi/index/metadata/ColumnMetadataImpl.java	59.09%	5 Missing and 4 partials ⚠️
...pinot/spi/config/table/ColumnarMapIndexConfig.java	80.43%	0 Missing and 9 partials ⚠️
...va/org/apache/pinot/spi/data/ComplexFieldSpec.java	80.00%	4 Missing and 3 partials ⚠️
...pache/pinot/segment/spi/index/StandardIndexes.java	0.00%	2 Missing ⚠️
...gment/spi/index/reader/ColumnarMapIndexReader.java	0.00%	2 Missing ⚠️
...a/org/apache/pinot/spi/data/ColumnarMapNaming.java	75.00%	0 Missing and 2 partials ⚠️
...he/pinot/segment/spi/datasource/MapDataSource.java	0.00%	1 Missing ⚠️
...ent/spi/index/creator/ColumnarMapIndexCreator.java	0.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18368      +/-   ##
============================================
+ Coverage     63.44%   63.51%   +0.07%     
- Complexity     1683     1709      +26     
============================================
  Files          3253     3258       +5     
  Lines        198854   199066     +212     
  Branches      30796    30835      +39     
============================================
+ Hits         126154   126436     +282     
+ Misses        62627    62527     -100     
- Partials      10073    10103      +30

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (ø)`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-21	`63.51% <66.66%> (+0.07%)`	⬆️
temurin	`63.51% <66.66%> (+0.07%)`	⬆️
unittests	`63.51% <66.66%> (+0.07%)`	⬆️
unittests1	`55.46% <66.66%> (+0.09%)`	⬆️
unittests2	`34.98% <2.32%> (-0.03%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…isVirtualColumn -> isMapVirtualColumn

…lumn and setMapVirtualColumn

…MapIndexConfig and usages

tarun11Mavani · 2026-04-30T06:44:46Z

@raghavagrawal @Jackie-Jiang please take a look.

xiangfu0

Flagging two compatibility risks on the current head: one shared-schema wire-format break and one existing-schema validation break.

…UMNAR_MAP ComplexFieldSpec.toJsonObject(): always emit childFieldSpecs for MAP columns. When _childFieldSpecs is empty (new COLUMNAR_MAP schemas with no explicit children), synthesise the legacy {key:STRING, value:STRING} defaults so older brokers/servers deserialising via toMapFieldSpec() continue to work across rolling upgrades. Schema.validate(): remove global \$__ rejection. The check fired on every schema regardless of whether COLUMNAR_MAP was configured, breaking existing tables whose column names happened to contain that separator. The guard is now in TableConfigUtils.validate() — only fires when COLUMNAR_MAP is actually enabled for a table. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

raghavyadav01 · 2026-05-06T20:15:19Z

+  /**
+   * Returns true if {@code key} is present in this MAP column for at least one document in
+   * this segment. Call this before {@link #getKeyDataSource(String)} to avoid undefined
+   * behaviour on absent keys.


Should not absent keys fall into Sparse Key reader Path? Do we need this ?

ContainsKey is currently used as an early-exit guard in MapFilterOperator and the aggregation/group-by plan nodes to skip processing when a key is absent from the segment (backed by the map key set) which is O(1) lookup.

Removing it would mean every caller replaces containsKey(key) with getDataSource(key) != null, which triggers a full NullDataSource construction just to check presence.

raghavyadav01 · 2026-05-06T20:19:25Z

      // Optional, default false
      public static final String IS_AUTO_GENERATED = "isAutoGenerated";

+      // Optional, default false. True for virtual columns materialized from a COLUMNAR_MAP parent column.


Can you share how does columnMetadata and index_map looks like for virtual columns?

sure. I have it documented here.
https://docs.google.com/document/d/14kPmjDTKbO8l0ql4rrN7I5Yki5pqMw6GeGmxxc9grsU/edit?tab=t.0#bookmark=id.kycvq78ioe5g

raghavyadav01 · 2026-05-06T20:21:04Z

+  }
+
+  /** Maximum number of MAP keys to materialise as dense virtual columns. Default: 1000. */
+  public int getMaxDenseKeys() {


How do we choose which 1000 keys incase we have more than 1000 dense keys? is it based on frequency ?

When more keys qualify as dense than maxDenseKeys allows, the top maxDenseKeys keys ranked by fill rate are materialized; the remainder fall back to the sparse MAP column.
I have also added this in the java doc.

raghavyadav01 · 2026-05-06T20:31:14Z

+/**
+ * Configuration for the COLUMNAR_MAP index on a MAP column.
+ *
+ * <p>Dense keys (above {@code denseKeyMinFillRate} or explicitly listed in {@code denseKeys}) are


How do you know if a key is dense or sparse? is it using static list in Config ?

it's determined at the segment seal time based on the fill rate.
A key can be dense in one segment and sparse in another unless specifically added as denseKeys in the config. In that case, it will always be dense regardless of fill rate.

Jackie-Jiang · 2026-05-06T21:47:04Z

   * Get the Data Source representation of a single key within this map column.
+   * Only call after confirming the key exists via {@link #containsKey(String)}.
   */
  DataSource getKeyDataSource(String key);


Not introduced in this PR, but let's rename it to:

Suggested change

DataSource getKeyDataSource(String key);

@Nullable

DataSource getDataSource(String key);

It is very confusing now because the data source is for value, not key.
Suggest letting it return @Nullable to represent key not exist

Agreed on the naming — getDataSource(key) is cleaner. I'll do the full getKeyXXX → getXXX rename across the SPI in a separate refactoring PR so it's one clean sweep rather than scattered across the stack. Will raise a new PR for this.

For @nullable: I'd prefer to keep the current non-null contract backed by NullDataSource. The callers in ProjectionBlock and ItemTransformFunction dereference the result immediately without a null-check — NullDataSource lets them do that safely by returning the column's default value for every doc.

If we switch to @nullable, those two callers need null-guards, and so does any future caller. NullDataSource gives the same semantic (absent key → null/default for all rows) without pushing null-handling into every call site.
Happy to revisit this if you strognly feel we should add @nullable.

Let me know your thoughts.

Updated the API names here - #18437

Jackie-Jiang · 2026-05-06T21:48:53Z

+   * <p>The default implementation delegates to {@link #getKeyDataSources()}, which may be
+   * expensive for large key sets. Implementations should override for O(1) performance.
+   */
+  default boolean containsKey(String key) {


(optional) This is probably not needed if we make getDataSource return @Nullable

replied above.

For https://github.com/nullable: I'd prefer to keep the current non-null contract backed by NullDataSource. The callers in ProjectionBlock and ItemTransformFunction dereference the result immediately without a null-check — NullDataSource lets them do that safely by returning the column's default value for every doc.

If we switch to https://github.com/nullable, those two callers need null-guards, and so does any future caller. NullDataSource gives the same semantic (absent key → null/default for all rows) without pushing null-handling into every call site.
Happy to revisit this if you strognly feel we should add https://github.com/nullable.

Let me know your thoughts.

Jackie-Jiang · 2026-05-06T22:12:43Z

  public static final String TEXT_ID = "text_index";
  public static final String H3_ID = "h3_index";
  public static final String VECTOR_ID = "vector_index";
+  public static final String COLUMNAR_MAP_ID = "columnar_map";


Should we just call it MAP? Do you foresee other map types to be added in the future that doesn't go under this?

There's already a MAP concept in the codebase. The existing MapIndexReader / MapDataSource / MapColumnPreIndexStatsCollector family represents the non-columnar MAP storage path. Naming the new index type MAP would collide with that established concept and make it unclear which path a given piece of code refers to.
Naming it columnar_map also calls out how it's stored clearly.

- getMaterializedColumnMetadata replaces getVirtualColumnMetadata - getValueType replaces getKeyValueType - isMaterializedMapColumn replaces isMapVirtualColumn; _isMapVirtualColumn removed - IS_MAP_MATERIALIZED_COLUMN replaces IS_MAP_VIRTUAL_COLUMN (string: mapMaterializedColumn) - FieldConfig.IndexType enum Javadoc - ColumnarMapIndexConfig dense/sparse + maxDenseKeys Javadoc Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…e and ColumnarMapDataSource containsKey() is allowed to return true conservatively when key presence cannot be determined without a full scan (e.g. when a sparse column exists). Document this in the interface Javadoc so callers know to handle absent keys even when containsKey returns true. Also document that getKeyDataSources() returns dense keys only. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…ssion test - MapDataSource.containsKey() Javadoc: replace {@link NullDataSource} (a pinot-segment-local class) with prose to fix illegal cross-module reference from pinot-segment-spi. - SegmentV1V2ToV3FormatConverterTest: add testVirtualColumnsNotDoubleWrittenDuringV3Conversion that builds a V1 MAP segment via the full SegmentIndexCreationDriverImpl pipeline (virtual columns now in COMPLEX_COLUMNS), converts to V3, and asserts each virtual column's forward_index.startOffset appears exactly once in index_map — catching any regression in the dedup logic that guards against double-write when virtual columns are present in both getAllColumns() and getVirtualColumnsFromMetadata(). Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Clarify that the DataSource returned for an absent key returns the column default value for forward-index reads and marks all rows as null via the null-value bitmap. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- getMaterializedColumnMetadata replaces getVirtualColumnMetadata - getValueType replaces getKeyValueType - isMaterializedMapColumn replaces isMapVirtualColumn; _isMapVirtualColumn field removed (derived as parentMapColumn != null) - IS_MAP_MATERIALIZED_COLUMN replaces IS_MAP_VIRTUAL_COLUMN (on-disk key: mapMaterializedColumn) - FieldConfig.IndexType: add one-line Javadoc to all 11 enum constants - ColumnarMapIndexConfig: document dense/sparse key selection criteria and top-N-by-fill-rate semantics for maxDenseKeys cutoff Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

tarun11Mavani added 5 commits April 29, 2026 08:44

refactor(map): delete NullDataSource from pinot-segment-spi

d4b23e3

feat(map): add containsKey default method to MapDataSource SPI

d3a31e5

refactor(map): rename IS_VIRTUAL_COLUMN -> IS_MAP_VIRTUAL_COLUMN and …

041abe2

…isVirtualColumn -> isMapVirtualColumn

fix(map): update ColumnMetadataImplTest to use renamed isMapVirtualCo…

3ef6d5e

…lumn and setMapVirtualColumn

refactor(map): rename maxKeys -> maxDenseKeys for clarity in Columnar…

16e05bd

…MapIndexConfig and usages

xiangfu0 reviewed Apr 30, 2026

View reviewed changes

Comment thread pinot-spi/src/main/java/org/apache/pinot/spi/data/ComplexFieldSpec.java Outdated

Comment thread pinot-spi/src/main/java/org/apache/pinot/spi/data/Schema.java Outdated

raghavyadav01 reviewed May 6, 2026

View reviewed changes

raghavyadav01 approved these changes May 6, 2026

View reviewed changes

Jackie-Jiang added the feature New functionality label May 6, 2026

Jackie-Jiang reviewed May 6, 2026

View reviewed changes

tarun11Mavani and others added 3 commits May 7, 2026 07:29

tarun11Mavani force-pushed the columnar-map-v2-spi branch from 75f588a to b098f20 Compare May 7, 2026 07:31

tarun11Mavani force-pushed the columnar-map-v2-spi branch from b098f20 to 59a62b9 Compare May 7, 2026 07:41

tarun11Mavani force-pushed the columnar-map-v2-spi branch from 59a62b9 to f12d50d Compare May 7, 2026 07:43

tarun11Mavani force-pushed the columnar-map-v2-spi branch from f12d50d to a0e8451 Compare May 7, 2026 09:00

	DataSource getKeyDataSource(String key);
	@Nullable
	DataSource getDataSource(String key);

Conversation

tarun11Mavani commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

pinot-spi changes

pinot-segment-spi changes

Follow-up PRs

Test plan

Uh oh!

codecov-commenter commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

tarun11Mavani commented Apr 30, 2026

Uh oh!

xiangfu0 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

tarun11Mavani commented Apr 29, 2026 •

edited

Loading

`pinot-spi` changes

`pinot-segment-spi` changes

codecov-commenter commented Apr 29, 2026 •

edited

Loading