
feat(writer): honor write.metadata.metrics.* properties — bound truncation missing in MinMaxColAggregator #2362

@laichunpongben

Description

Summary

The write path (including ParquetWriter::parquet_files_to_data_files and ParquetWriter::close) currently stores a column's full, untruncated minimum and maximum values in the manifest's lower_bounds / upper_bounds maps. The Iceberg spec, and the other official implementations (Java and pyiceberg), apply a configurable truncation step at manifest-serialization time, controlled by the write.metadata.metrics.* table properties, with truncate(16) as the default for string and binary columns.
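For example, a table could keep the default for most columns while opting a single column out of truncation (property keys per the spec; the column name here is illustrative):

write.metadata.metrics.default = truncate(16)
write.metadata.metrics.column.ticker = full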

iceberg-rust has no write-side metrics-mode framework at all: the constants, the MetricsMode parser, and the truncation helpers are all missing, and MinMaxColAggregator::produce() is unaware of table properties.

This is strictly a write-path gap. The read path already understands truncated bounds: inclusive_metrics_evaluator.rs and page_index_evaluator.rs both truncate filter datums when comparing against possibly-truncated manifest bounds, so reads against tables written by other implementations work correctly. Tables written by iceberg-rust are also read correctly by pyiceberg/Java, because untruncated bounds are more precise and still valid. So this is not a correctness regression — but it is a spec-conformance and manifest-size gap.

Reproduction (against a REST catalog + S3)

Using the dev/docker-compose.yaml fixture:

  1. Create a partitioned table with a string column carrying, say, 32-byte values.
  2. Write a Parquet file where the column's min/max is the same 32-byte string.
  3. Convert to DataFile via ParquetWriter::parquet_files_to_data_files.
  4. Inspect the resulting DataFile.lower_bounds / .upper_bounds for the string field:
    • iceberg-rust (current): both entries store all 32 bytes verbatim.
    • pyiceberg (reference): lower_bounds stores the first 16 bytes; upper_bounds stores the first 16 bytes with the final codepoint incremented (so the truncated value remains a valid upper bound).
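To make step 4 concrete, the standalone toy below (plain Rust, not iceberg-rust API) computes the bounds pyiceberg would emit for such a 32-byte ASCII value under truncate(16):

// Standalone illustration, not iceberg-rust API.
fn main() {
    let value = "ABCDEFGHIJKLMNOPQRSTUVWXYZ012345"; // 32 bytes
    let width = 16;

    // Lower bound: plain prefix truncation.
    let lower: String = value.chars().take(width).collect();

    // Upper bound: truncate, then increment the final codepoint so the
    // result still sorts at or above every string sharing the prefix.
    let mut chars: Vec<char> = lower.chars().collect();
    let last = chars.last_mut().unwrap();
    *last = char::from_u32(*last as u32 + 1).unwrap(); // 'P' -> 'Q'
    let upper: String = chars.into_iter().collect();

    assert_eq!(lower, "ABCDEFGHIJKLMNOP");
    assert_eq!(upper, "ABCDEFGHIJKLMNOQ");
}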

Reference implementation (pyiceberg, Apache-2.0)

Property constants (pyiceberg/table/__init__.py)

class TableProperties:
    DEFAULT_WRITE_METRICS_MODE = "write.metadata.metrics.default"
    DEFAULT_WRITE_METRICS_MODE_DEFAULT = "truncate(16)"
    METRICS_MODE_COLUMN_CONF_PREFIX = "write.metadata.metrics.column"

Mode parser (pyiceberg/io/pyarrow.py)

class MetricModeTypes(Enum):
    TRUNCATE = "truncate"
    NONE = "none"
    COUNTS = "counts"
    FULL = "full"

Upper-bound increment (pyiceberg/utils/truncate.py)

def truncate_upper_bound_text_string(value: str, trunc_length: int | None) -> str | None:
    result = value[:trunc_length]
    if result != value:
        chars = [*result]
        for i in range(-1, -len(result) - 1, -1):
            try:
                chars[i] = chr(ord(chars[i]) + 1)
                return "".join(chars)
            except ValueError:
                pass  # this codepoint was at max; try the previous position
        return None
    return result

The binary variant has the same structure, operating on bytes with a < 255 check in place of the chr()/ValueError probe.
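A direct Rust port of that binary variant could look like the following (a sketch with a hypothetical name, mirroring pyiceberg's behavior of leaving trailing bytes untouched once an earlier byte is incremented):

/// Sketch of a Rust port of truncate_upper_bound_binary_string.
/// Truncates to `width` bytes, then increments the last byte below
/// 0xFF so the result remains a valid upper bound; returns None if
/// every byte in the prefix is already 0xFF.
fn truncate_upper_bound_binary(value: &[u8], width: usize) -> Option<Vec<u8>> {
    if value.len() <= width {
        return Some(value.to_vec()); // short enough: store verbatim
    }
    let mut result = value[..width].to_vec();
    for i in (0..result.len()).rev() {
        if result[i] < 0xFF {
            result[i] += 1; // bytes after position i keep their value
            return Some(result);
        }
    }
    None // every byte is 0xFF: no valid truncated upper bound exists
}

For example, truncate_upper_bound_binary(b"ab\xff\xff\xff", 3) increments the second byte and returns Some(b"ac\xff".to_vec()).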

Per-column override

Each primitive field is visited; the effective mode is write.metadata.metrics.column.<name> if set, else the default. Non-string/binary columns with mode truncate auto-upgrade to full; nested columns downgrade to counts.
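A Rust counterpart of that resolution could look roughly like this (a sketch with hypothetical names; the proposal below suggests a match_metrics_mode returning the crate's Result, while this standalone version uses Option for brevity):

use std::collections::HashMap;

/// Hypothetical Rust counterpart of pyiceberg's MetricModeTypes.
#[derive(Clone, Copy, Debug, PartialEq)]
enum MetricsMode {
    None,
    Counts,
    Truncate(usize),
    Full,
}

/// Minimal parser sketch for "none" | "counts" | "full" | "truncate(N)".
fn parse_metrics_mode(s: &str) -> Option<MetricsMode> {
    match s.trim().to_ascii_lowercase().as_str() {
        "none" => Some(MetricsMode::None),
        "counts" => Some(MetricsMode::Counts),
        "full" => Some(MetricsMode::Full),
        other => other
            .strip_prefix("truncate(")
            .and_then(|rest| rest.strip_suffix(')'))
            .and_then(|n| n.parse().ok())
            .map(MetricsMode::Truncate),
    }
}

/// Resolve one column's effective mode: the per-column property if set,
/// else the default, plus the upgrade/downgrade rules described above.
fn effective_mode(
    properties: &HashMap<String, String>,
    column_name: &str,
    default_mode: MetricsMode,
    is_string_or_binary: bool,
    is_nested: bool,
) -> MetricsMode {
    if is_nested {
        return MetricsMode::Counts; // nested columns downgrade to counts
    }
    let key = format!("write.metadata.metrics.column.{column_name}");
    let mode = properties
        .get(&key)
        .and_then(|s| parse_metrics_mode(s))
        .unwrap_or(default_mode);
    match mode {
        // truncate(N) only makes sense for string/binary; others get full
        MetricsMode::Truncate(_) if !is_string_or_binary => MetricsMode::Full,
        m => m,
    }
}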

Proposed scope

Minimum viable change that the maintainers might accept in one PR:

  1. Constants — add to spec/table_properties.rs:

    • PROPERTY_WRITE_METRICS_MODE / ..._DEFAULT = "truncate(16)"
    • PROPERTY_METRICS_MODE_COLUMN_CONF_PREFIX
  2. Type — a MetricsMode enum + a match_metrics_mode(s: &str) -> Result<MetricsMode> that parses truncate(N), none, counts, full.

  3. Plan — a compute_metrics_plan(schema: &SchemaRef, properties: &HashMap<String, String>) -> HashMap<i32, MetricsMode> mirroring pyiceberg's compute_statistics_plan (pre-order visit with nested-downgrade and non-text-auto-upgrade).

  4. Threading — pass TableMetadata (or the precomputed plan) into parquet_to_data_file_builder so the aggregator can see it.

  5. Aggregator — MinMaxColAggregator gains a per-field mode, and its bound-producing methods apply truncate(N) for strings/binary as described.

  6. Helpers — port truncate_upper_bound_text_string / truncate_upper_bound_binary_string into a new spec::truncate_bound (or similar) module. Rust-native UTF-8: iterate by chars() rather than bytes for the string variant, matching pyiceberg's semantics (see the sketch after this list).

  7. Tests — unit coverage per mode (full, counts, none, truncate(N)) × per primitive type; integration test that round-trips a known string against pyiceberg's output (the docker-compose fixture makes this straightforward).
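To make items 1 and 6 concrete, a minimal sketch (constant names as proposed in item 1; the helper's name and final home are open questions):

pub const PROPERTY_WRITE_METRICS_MODE: &str = "write.metadata.metrics.default";
pub const PROPERTY_WRITE_METRICS_MODE_DEFAULT: &str = "truncate(16)";
pub const PROPERTY_METRICS_MODE_COLUMN_CONF_PREFIX: &str = "write.metadata.metrics.column";

/// Sketch of the char-based port of truncate_upper_bound_text_string.
/// Truncates to at most `width` chars (not bytes), then increments the
/// last incrementable char; returns None if no char can be incremented.
pub fn truncate_upper_bound_text(value: &str, width: usize) -> Option<String> {
    if value.chars().count() <= width {
        return Some(value.to_string()); // short enough: store verbatim
    }
    let mut chars: Vec<char> = value.chars().take(width).collect();
    for i in (0..chars.len()).rev() {
        // char::from_u32 returns None for the surrogate range as well as
        // past char::MAX; pyiceberg's chr() probe rejects only the latter,
        // a divergence confined to code points Rust's char cannot hold.
        if let Some(next) = char::from_u32(chars[i] as u32 + 1) {
            chars[i] = next;
            return Some(chars.into_iter().collect());
        }
    }
    None
}

On the reproduction string above, truncate_upper_bound_text("ABCDEFGHIJKLMNOPQRSTUVWXYZ012345", 16) returns Some("ABCDEFGHIJKLMNOQ".to_string()), matching pyiceberg's output.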

A non-negotiable constraint

partition_value_from_bounds (added in #1079, wired into the add-files flow by its follow-up) must continue to see untruncated bounds. Identity partitioning on a string column relies on min == max to detect single-partition files; if truncation were applied before partition inference, a file whose rows all carry, say, "AAPL-2026-01-15-200-C" would falsely fail with "more than one partition values", because the truncated-and-incremented upper bound differs from the truncated lower bound.

pyiceberg maintains this separation cleanly: StatsAggregator.current_min / current_max are the raw values (consumed by _partition_value), and truncation is applied only inside min_as_bytes() / max_as_bytes() on the serialization path. Any fix here should follow the same layering — bounds stay full-precision in the aggregator's public output; truncation happens at the DataFile-serialization boundary.
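A sketch of that layering in Rust terms (names hypothetical; truncate_upper_bound_text is the helper sketched above):

/// Hypothetical aggregator state, shown only to illustrate the layering.
struct MinMaxState {
    current_min: String, // raw, full precision: feeds partition inference
    current_max: String,
}

impl MinMaxState {
    /// Partition inference reads the raw values, so identity
    /// partitioning still sees min == max for single-value files.
    fn raw_bounds(&self) -> (&str, &str) {
        (&self.current_min, &self.current_max)
    }

    /// Truncation happens only here, at the DataFile-serialization
    /// boundary, mirroring pyiceberg's min_as_bytes()/max_as_bytes().
    fn serialized_bounds(&self, width: usize) -> (String, Option<String>) {
        let lower: String = self.current_min.chars().take(width).collect();
        let upper = truncate_upper_bound_text(&self.current_max, width);
        (lower, upper)
    }
}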


Happy to take this on if there is interest and no duplicate work underway.
