diff --git a/build-with-pinot/indexing/dictionary-index.md b/build-with-pinot/indexing/dictionary-index.md
index 421b76a4..20986efb 100644
--- a/build-with-pinot/indexing/dictionary-index.md
+++ b/build-with-pinot/indexing/dictionary-index.md
@@ -20,10 +20,10 @@ In Pinot, dictionaries serve as both an index and actual encoding. Consequently,
 | ------------------------------------------- | ------------------------- | ------------------------------------------------------------------- |
 | [forward](forward-index.md) | | Implementation depends on whether the dictionary is enabled or not. |
 | [range](range-index.md) | | Implementation depends on whether the dictionary is enabled or not. |
-| [inverted](inverted-index.md) | | Requires the dictionary index to be enabled. |
+| [inverted](inverted-index.md) | | Uses dictionary IDs. Pinot can materialize a standalone dictionary for RAW columns when the index is enabled. |
 | [json](json-index.md) | when `optimizeDictionary` | Disables dictionary. |
 | [text](text-search-support.md) | when `optimizeDictionary` | Disables dictionary. |
-| FST | | Requires dictionary. |
+| FST | | Uses dictionary values. Pinot can materialize a standalone dictionary for RAW STRING columns when FST is enabled. |
 | [H3 (or geospatial)](geospatial-support.md) | | Incompatible with dictionary. |
 
 ## Configuration
@@ -70,6 +70,10 @@ Alternatively, the `encodingType` property can be changed. For example:
 
 You may choose the option you prefer, but it's essential to maintain consistency, as Pinot will reject table configurations where the same column and index are defined in different locations.
 
+Even when a column keeps a RAW forward index, Pinot may still materialize a standalone dictionary when another enabled
+index needs dictionary IDs or dictionary values. This lets a RAW column back features such as bitmap inverted indexes
+or FST/IFST without changing the forward-index encoding.
+
 ### Heuristically enable dictionaries
 
 Most of the time the domain expert that creates the table knows whether a dictionary will be useful or not. For example, a column with random values or public IPs will probably have a large cardinality, so they can be immediately be targeted as raw encoded while columns like employee ids will have a small cardinality and therefore can be easily be recognized as good dictionary candidates. But sometimes the decision may not be clear. To help in these situations, Pinot can be configured to heuristically create the dictionary depending on the actual values and a relation factor.
diff --git a/build-with-pinot/indexing/forward-index.md b/build-with-pinot/indexing/forward-index.md
index 6344a571..a24da008 100644
--- a/build-with-pinot/indexing/forward-index.md
+++ b/build-with-pinot/indexing/forward-index.md
@@ -116,7 +116,10 @@ The raw value forward index stores actual values instead of IDs. This means that
 
 As shown in the diagram below, dictionary encoding can lead to numerous random memory accesses for dictionary lookups. In contrast, the raw value forward index allows for sequential value scanning, which can enhance query performance when applied appropriately.
 
-Note: Raw value forward index currently does not support inverted index (all others JSON/TEXT/Range/etc are supported). Also, since reading a value from this index requires reading the entire chunk in memory and decompressing, it is not suitable for heavy random reads.
+Note: A RAW forward index can still be paired with secondary indexes that need dictionary IDs. When you enable a
+dictionary-backed index such as bitmap inverted index or FST/IFST on a RAW column, Pinot keeps the forward index RAW
+and materializes a standalone dictionary for the secondary index. Since reading a value from this index requires
+reading the entire chunk in memory and decompressing, it is not suitable for heavy random reads.
 
 **Sorted raw columns:** As of Pinot 1.3.0, raw columns can now be configured as sorted columns without forcing an inverted index. Previously, configuring a column as both sorted and no-dictionary would cause Pinot to force-add an inverted index, which negated the storage benefits of raw encoding. Now, you can have a time-sorted raw column (e.g., a timestamp column) without dictionary encoding or inverted index, allowing for efficient storage while maintaining sort order metadata.
 
diff --git a/build-with-pinot/indexing/fst-index.md b/build-with-pinot/indexing/fst-index.md
index 5571e155..22fdcb0e 100644
--- a/build-with-pinot/indexing/fst-index.md
+++ b/build-with-pinot/indexing/fst-index.md
@@ -1,6 +1,7 @@
 # FST index
 
-The FST (Finite State Transducer) index accelerates regex queries on dictionary-encoded STRING columns. It reduces the on-disk index size by 4-6x compared to scanning the full dictionary.
+The FST (Finite State Transducer) index accelerates regex queries on STRING columns by building over dictionary
+values. It reduces the on-disk index size by 4-6x compared to scanning the full dictionary.
 
 ## When to use
 
@@ -10,13 +11,14 @@ Use an FST index when your queries use `LIKE` or `REGEXP_LIKE` predicates on str
 
 - STRING columns only
 - Must be single-valued
-- Must be dictionary-encoded
+- Must have dictionary values available. This can come from a dictionary-encoded forward index, or Pinot can
+  materialize a standalone dictionary while keeping the forward index RAW.
 
 ## Limitations
 
 - Only supports regex queries (`LIKE` and `REGEXP_LIKE` predicates).
 - Only supported on stored or completed segments (not consuming segments in real-time tables).
-- Only supported on dictionary-encoded columns.
+- Only supported on columns with dictionary values available.
 - Works best for prefix queries. Suffix-only or infix-only patterns may not benefit as much.
 
 {% hint style="info" %}
@@ -27,7 +29,7 @@ For more information on FST construction, see the [Lucene FST documentation](htt
 
 ## Configuration
 
-To enable the FST index on a dictionary-encoded column:
+To enable the FST index on a column:
 
 {% code title="Recommended: fieldConfigList" %}
 ```json
@@ -43,7 +45,9 @@ To enable the FST index on a dictionary-encoded column:
 ```
 {% endcode %}
 
-The FST index generates one index file (`.lucene.fst`). If an inverted index is also enabled on the column, FST can take advantage of it for faster lookups.
+The FST index generates one index file (`.lucene.fst`). If you keep the forward index RAW, Pinot materializes a
+standalone dictionary for the FST automatically. If an inverted index is also enabled on the column, FST can take
+advantage of it for faster lookups.
 
 ## Query examples
 
@@ -77,7 +81,7 @@ The case-insensitive FST index (IFST) provides the same functionality as the sta
 
 - Supports case-insensitive regex queries.
 - Only supported on stored or completed segments (not consuming segments).
-- Only supported on dictionary-encoded STRING columns.
+- Only supported on STRING columns with dictionary values available.
 - Works best for prefix queries with case-insensitive matching.
 
 ### Configuration
diff --git a/build-with-pinot/indexing/inverted-index.md b/build-with-pinot/indexing/inverted-index.md
index ef709b79..2af091cb 100644
--- a/build-with-pinot/indexing/inverted-index.md
+++ b/build-with-pinot/indexing/inverted-index.md
@@ -42,6 +42,10 @@ The recommended way to enable a bitmap inverted index:
 ```
 {% endcode %}
 
+If the column uses a RAW forward index, you do not need to add a separate dictionary configuration just to make the
+bitmap inverted index work. Pinot keeps the forward index RAW and materializes a standalone dictionary for the
+inverted index automatically.
+
 Older configuration
@@ -112,7 +116,8 @@ LIMIT 10
 
 ## Limitations
 
-- Bitmap inverted indexes require [dictionary encoding](dictionary-index.md) to be enabled on the column.
+- Bitmap inverted indexes require dictionary IDs, but Pinot can satisfy that either with a dictionary-encoded forward
+  index or with a standalone dictionary materialized for a RAW forward index.
 - Sorted inverted indexes (on dictionary-encoded columns) only work on columns whose data is physically sorted within each segment.
 - Sorted raw columns (no-dictionary) also support sort metadata without requiring an inverted index.
 - MAP columns are not supported.