
feat(writer): honor write.metadata.metrics.* properties — bound truncation missing in MinMaxColAggregator #2362

@laichunpongben

Description

Summary

The write path (including ParquetWriter::parquet_files_to_data_files and ParquetWriter::close) currently stores a column's full, untruncated minimum and maximum values in the manifest's lower_bounds / upper_bounds maps. The Iceberg spec, and the other official implementations (Java and pyiceberg), apply a configurable truncation step at manifest-serialization time, controlled by the write.metadata.metrics.* table properties, with truncate(16) as the default for string and binary columns.
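For example, a table could keep the default for most columns while opting a single column out of truncation (property keys per the spec; the column name here is illustrative):

write.metadata.metrics.default = truncate(16)
write.metadata.metrics.column.ticker = full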

iceberg-rust has no write-side metrics-mode framework at all: the constants, the MetricsMode parser, and the truncation helpers are all missing, and MinMaxColAggregator::produce() is unaware of table properties.

This is strictly a write-path gap. The read path already understands truncated bounds: inclusive_metrics_evaluator.rs and page_index_evaluator.rs both truncate filter datums when comparing against possibly-truncated manifest bounds, so reads against tables written by other implementations work correctly. Tables written by iceberg-rust are also read correctly by pyiceberg/Java, because untruncated bounds are more precise and still valid. So this is not a correctness regression — but it is a spec-conformance and manifest-size gap.

Reproduction (against a REST catalog + S3)

Using the dev/docker-compose.yaml fixture:

  1. Create a partitioned table with a string column carrying, say, 32-byte values.
  2. Write a Parquet file where the column's min/max is the same 32-byte string.
  3. Convert to DataFile via ParquetWriter::parquet_files_to_data_files.
  4. Inspect the resulting DataFile.lower_bounds / .upper_bounds for the string field:
    • iceberg-rust (current): both entries store all 32 bytes verbatim.
    • pyiceberg (reference): lower_bounds stores the first 16 bytes; upper_bounds stores the first 16 bytes with the final codepoint incremented (so the truncated value remains a valid upper bound).
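To make step 4 concrete, the standalone toy below (plain Rust, not iceberg-rust API) computes the bounds pyiceberg would emit for such a 32-byte ASCII value under truncate(16):

// Standalone illustration, not iceberg-rust API.
fn main() {
    let value = "ABCDEFGHIJKLMNOPQRSTUVWXYZ012345"; // 32 bytes
    let width = 16;

    // Lower bound: plain prefix truncation.
    let lower: String = value.chars().take(width).collect();

    // Upper bound: truncate, then increment the final codepoint so the
    // result still sorts at or above every string sharing the prefix.
    let mut chars: Vec<char> = lower.chars().collect();
    let last = chars.last_mut().unwrap();
    *last = char::from_u32(*last as u32 + 1).unwrap(); // 'P' -> 'Q'
    let upper: String = chars.into_iter().collect();

    assert_eq!(lower, "ABCDEFGHIJKLMNOP");
    assert_eq!(upper, "ABCDEFGHIJKLMNOQ");
}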

Reference implementation (pyiceberg, Apache-2.0)

Property constants (pyiceberg/table/__init__.py)

class TableProperties:
    DEFAULT_WRITE_METRICS_MODE = "write.metadata.metrics.default"
    DEFAULT_WRITE_METRICS_MODE_DEFAULT = "truncate(16)"
    METRICS_MODE_COLUMN_CONF_PREFIX = "write.metadata.metrics.column"

Mode parser (pyiceberg/io/pyarrow.py)

class MetricModeTypes(Enum):
    TRUNCATE = "truncate"
    NONE = "none"
    COUNTS = "counts"
    FULL = "full"

Upper-bound increment (pyiceberg/utils/truncate.py)

def truncate_upper_bound_text_string(value: str, trunc_length: int | None) -> str | None:
    result = value[:trunc_length]
    if result != value:
        chars = [*result]
        for i in range(-1, -len(result) - 1, -1):
            try:
                chars[i] = chr(ord(chars[i]) + 1)
                return "".join(chars)
            except ValueError:
                pass  # this codepoint was at max; try the previous position
        return None
    return result

The binary variant has the same structure, operating on bytes with a < 255 check in place of the chr()/ValueError probe.
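A direct Rust port of that binary variant could look like the following (a sketch with a hypothetical name, mirroring pyiceberg's behavior of leaving trailing bytes untouched once an earlier byte is incremented):

/// Sketch of a Rust port of truncate_upper_bound_binary_string.
/// Truncates to `width` bytes, then increments the last byte below
/// 0xFF so the result remains a valid upper bound; returns None if
/// every byte in the prefix is already 0xFF.
fn truncate_upper_bound_binary(value: &[u8], width: usize) -> Option<Vec<u8>> {
    if value.len() <= width {
        return Some(value.to_vec()); // short enough: store verbatim
    }
    let mut result = value[..width].to_vec();
    for i in (0..result.len()).rev() {
        if result[i] < 0xFF {
            result[i] += 1; // bytes after position i keep their value
            return Some(result);
        }
    }
    None // every byte is 0xFF: no valid truncated upper bound exists
}

For example, truncate_upper_bound_binary(b"ab\xff\xff\xff", 3) increments the second byte and returns Some(b"ac\xff".to_vec()).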

Per-column override

Each primitive field is visited; the effective mode is write.metadata.metrics.column.<name> if set, else the default. Non-string/binary columns with mode truncate auto-upgrade to full; nested columns downgrade to counts.
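A Rust counterpart of that resolution could look roughly like this (a sketch with hypothetical names; the proposal below suggests a match_metrics_mode returning the crate's Result, while this standalone version uses Option for brevity):

use std::collections::HashMap;

/// Hypothetical Rust counterpart of pyiceberg's MetricModeTypes.
#[derive(Clone, Copy, Debug, PartialEq)]
enum MetricsMode {
    None,
    Counts,
    Truncate(usize),
    Full,
}

/// Minimal parser sketch for "none" | "counts" | "full" | "truncate(N)".
fn parse_metrics_mode(s: &str) -> Option<MetricsMode> {
    match s.trim().to_ascii_lowercase().as_str() {
        "none" => Some(MetricsMode::None),
        "counts" => Some(MetricsMode::Counts),
        "full" => Some(MetricsMode::Full),
        other => other
            .strip_prefix("truncate(")
            .and_then(|rest| rest.strip_suffix(')'))
            .and_then(|n| n.parse().ok())
            .map(MetricsMode::Truncate),
    }
}

/// Resolve one column's effective mode: the per-column property if set,
/// else the default, plus the upgrade/downgrade rules described above.
fn effective_mode(
    properties: &HashMap<String, String>,
    column_name: &str,
    default_mode: MetricsMode,
    is_string_or_binary: bool,
    is_nested: bool,
) -> MetricsMode {
    if is_nested {
        return MetricsMode::Counts; // nested columns downgrade to counts
    }
    let key = format!("write.metadata.metrics.column.{column_name}");
    let mode = properties
        .get(&key)
        .and_then(|s| parse_metrics_mode(s))
        .unwrap_or(default_mode);
    match mode {
        // truncate(N) only makes sense for string/binary; others get full
        MetricsMode::Truncate(_) if !is_string_or_binary => MetricsMode::Full,
        m => m,
    }
}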

Proposed scope

Minimum viable change that the maintainers might accept in one PR:

  1. Constants — add to spec/table_properties.rs:

    • PROPERTY_WRITE_METRICS_MODE / ..._DEFAULT = "truncate(16)"
    • PROPERTY_METRICS_MODE_COLUMN_CONF_PREFIX
  2. Type — a MetricsMode enum + a match_metrics_mode(s: &str) -> Result<MetricsMode> that parses truncate(N), none, counts, full.

  3. Plan — a compute_metrics_plan(schema: &SchemaRef, properties: &HashMap<String, String>) -> HashMap<i32, MetricsMode> mirroring pyiceberg's compute_statistics_plan (pre-order visit with nested-downgrade and non-text-auto-upgrade).

  4. Threading — pass TableMetadata (or the precomputed plan) into parquet_to_data_file_builder so the aggregator can see it.

  5. Aggregator — MinMaxColAggregator gains a per-field mode, and its bound-producing methods apply truncate(N) for strings/binary as described.

  6. Helpers — port truncate_upper_bound_text_string / truncate_upper_bound_binary_string into a new spec::truncate_bound (or similar) module. Rust-native UTF-8: iterate by chars() rather than bytes for the string variant, matching pyiceberg's semantics (see the sketch after this list).

  7. Tests — unit coverage per mode (full, counts, none, truncate(N)) × per primitive type; integration test that round-trips a known string against pyiceberg's output (the docker-compose fixture makes this straightforward).
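To make items 1 and 6 concrete, a minimal sketch (constant names as proposed in item 1; the helper's name and final home are open questions):

pub const PROPERTY_WRITE_METRICS_MODE: &str = "write.metadata.metrics.default";
pub const PROPERTY_WRITE_METRICS_MODE_DEFAULT: &str = "truncate(16)";
pub const PROPERTY_METRICS_MODE_COLUMN_CONF_PREFIX: &str = "write.metadata.metrics.column";

/// Sketch of the char-based port of truncate_upper_bound_text_string.
/// Truncates to at most `width` chars (not bytes), then increments the
/// last incrementable char; returns None if no char can be incremented.
pub fn truncate_upper_bound_text(value: &str, width: usize) -> Option<String> {
    if value.chars().count() <= width {
        return Some(value.to_string()); // short enough: store verbatim
    }
    let mut chars: Vec<char> = value.chars().take(width).collect();
    for i in (0..chars.len()).rev() {
        // char::from_u32 returns None for the surrogate range as well as
        // past char::MAX; pyiceberg's chr() probe rejects only the latter,
        // a divergence confined to code points Rust's char cannot hold.
        if let Some(next) = char::from_u32(chars[i] as u32 + 1) {
            chars[i] = next;
            return Some(chars.into_iter().collect());
        }
    }
    None
}

On the reproduction string above, truncate_upper_bound_text("ABCDEFGHIJKLMNOPQRSTUVWXYZ012345", 16) returns Some("ABCDEFGHIJKLMNOQ".to_string()), matching pyiceberg's output.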

A non-negotiable constraint

partition_value_from_bounds (added in #1079, wired into the add-files flow by its follow-up) must continue to see untruncated bounds. Identity partitioning on a string column relies on min == max to detect single-partition files; if truncation were applied before partition inference, a file whose rows all carry, say, "AAPL-2026-01-15-200-C" would falsely fail with "more than one partition values", because the truncated-and-incremented upper bound differs from the truncated lower bound.

pyiceberg maintains this separation cleanly: StatsAggregator.current_min / current_max are the raw values (consumed by _partition_value), and truncation is applied only inside min_as_bytes() / max_as_bytes() on the serialization path. Any fix here should follow the same layering — bounds stay full-precision in the aggregator's public output; truncation happens at the DataFile-serialization boundary.
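A sketch of that layering in Rust terms (names hypothetical; truncate_upper_bound_text is the helper sketched above):

/// Hypothetical aggregator state, shown only to illustrate the layering.
struct MinMaxState {
    current_min: String, // raw, full precision: feeds partition inference
    current_max: String,
}

impl MinMaxState {
    /// Partition inference reads the raw values, so identity
    /// partitioning still sees min == max for single-value files.
    fn raw_bounds(&self) -> (&str, &str) {
        (&self.current_min, &self.current_max)
    }

    /// Truncation happens only here, at the DataFile-serialization
    /// boundary, mirroring pyiceberg's min_as_bytes()/max_as_bytes().
    fn serialized_bounds(&self, width: usize) -> (String, Option<String>) {
        let lower: String = self.current_min.chars().take(width).collect();
        let upper = truncate_upper_bound_text(&self.current_max, width);
        (lower, upper)
    }
}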


Happy to take this on if there is interest and no duplicate work underway.
