Summary
The write path (including ParquetWriter::parquet_files_to_data_files and ParquetWriter::close) currently stores the full, untruncated values of a column's minimum and maximum in the manifest's lower_bounds / upper_bounds maps. The Iceberg spec, and every other official implementation (Java, pyiceberg), applies a configurable truncation step at manifest serialization time, controlled by the write.metadata.metrics.* table properties, with truncate(16) as the default for string and binary columns.
iceberg-rust has no write-side metrics-mode framework at all: neither the constants, the MetricsMode parser, nor the truncation helpers exist, and MinMaxColAggregator::produce() is not aware of table properties.
This is strictly a write-path gap. The read path already understands truncated bounds: inclusive_metrics_evaluator.rs and page_index_evaluator.rs both truncate filter datums when comparing against possibly-truncated manifest bounds, so reads against tables written by other implementations work correctly. Tables written by iceberg-rust are also read correctly by pyiceberg/Java, because untruncated bounds are more precise and still valid. So this is not a correctness regression — but it is a spec-conformance and manifest-size gap.
Reproduction (against a REST catalog + S3)
Using the dev/docker-compose.yaml fixture:
- Create a partitioned table with a string column carrying, say, 32-byte values.
- Write a Parquet file where the column's min/max is the same 32-byte string.
- Convert to DataFile via ParquetWriter::parquet_files_to_data_files.
- Inspect the resulting DataFile.lower_bounds / .upper_bounds for the string field:
  - iceberg-rust (current): both entries store all 32 bytes verbatim.
  - pyiceberg (reference): lower_bounds stores the first 16 bytes; upper_bounds stores the first 16 bytes with the final codepoint incremented (so the truncated value remains a valid upper bound).
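For concreteness, here is the expected transformation hand-worked in Rust on a made-up 32-byte ASCII value (the string is illustrative, not taken from any fixture):

```rust
fn main() {
    let value = "abcdefghijklmnopqrstuvwxyz012345"; // 32 ASCII bytes, min == max
    // Lower bound under truncate(16): the plain 16-byte prefix.
    let lower = &value[..16];
    assert_eq!(lower, "abcdefghijklmnop");
    // Upper bound: the same prefix with its final codepoint incremented,
    // so it still sorts at or above every value sharing that prefix.
    let mut chars: Vec<char> = lower.chars().collect();
    let last = chars.pop().unwrap();
    chars.push(char::from_u32(last as u32 + 1).unwrap());
    let upper: String = chars.into_iter().collect();
    assert_eq!(upper, "abcdefghijklmnoq");
}
```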
Reference implementation (pyiceberg, Apache-2.0)
Property constants (pyiceberg/table/__init__.py)
```python
class TableProperties:
    DEFAULT_WRITE_METRICS_MODE = "write.metadata.metrics.default"
    DEFAULT_WRITE_METRICS_MODE_DEFAULT = "truncate(16)"
    METRICS_MODE_COLUMN_CONF_PREFIX = "write.metadata.metrics.column"
```
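The iceberg-rust counterparts could be plain constants in spec/table_properties.rs; a minimal sketch, using the names from the proposed scope below (none of these exist yet):

```rust
// Proposed constants mirroring pyiceberg's TableProperties (hypothetical names).
pub const PROPERTY_WRITE_METRICS_MODE: &str = "write.metadata.metrics.default";
pub const PROPERTY_WRITE_METRICS_MODE_DEFAULT: &str = "truncate(16)";
pub const PROPERTY_METRICS_MODE_COLUMN_CONF_PREFIX: &str = "write.metadata.metrics.column";
```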
Mode parser (pyiceberg/io/pyarrow.py)
```python
from enum import Enum

class MetricModeTypes(Enum):
    TRUNCATE = "truncate"
    NONE = "none"
    COUNTS = "counts"
    FULL = "full"
```
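A Rust equivalent could carry the truncate width as an enum payload; a sketch of the type and parser proposed below (a real version would use iceberg's Result/Error types rather than String errors):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MetricsMode {
    None,
    Counts,
    Truncate(usize),
    Full,
}

/// Parses "none" | "counts" | "full" | "truncate(N)", case-insensitively.
pub fn match_metrics_mode(s: &str) -> Result<MetricsMode, String> {
    match s.trim().to_ascii_lowercase().as_str() {
        "none" => Ok(MetricsMode::None),
        "counts" => Ok(MetricsMode::Counts),
        "full" => Ok(MetricsMode::Full),
        other => other
            .strip_prefix("truncate(")
            .and_then(|rest| rest.strip_suffix(')'))
            .and_then(|n| n.parse::<usize>().ok())
            .filter(|n| *n > 0) // truncate(0) is meaningless
            .map(MetricsMode::Truncate)
            .ok_or_else(|| format!("invalid metrics mode: {other}")),
    }
}
```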
Upper-bound increment (pyiceberg/utils/truncate.py)
```python
def truncate_upper_bound_text_string(value: str, trunc_length: int | None) -> str | None:
    result = value[:trunc_length]
    if result != value:
        chars = [*result]
        for i in range(-1, -len(result) - 1, -1):
            try:
                chars[i] = chr(ord(chars[i]) + 1)
                return "".join(chars)
            except ValueError:
                pass  # this codepoint was at max; try the previous position
        return None
    return result
```
The binary variant has the same structure, operating on bytes with a < 255 check (a 0xFF byte cannot be incremented, so the carry moves one position to the left).
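A Rust port of both helpers for the proposed spec::truncate_bound module might look like the sketch below. One deliberate simplification: instead of keeping the characters after the carry position (pyiceberg's approach), it drops them, which yields a shorter but still valid upper bound:

```rust
/// Truncates `value` to at most `len` chars; if anything was cut off,
/// increments the last incrementable codepoint and drops everything
/// after it. Returns None when no codepoint can be incremented.
pub fn truncate_upper_bound_text(value: &str, len: usize) -> Option<String> {
    let mut chars: Vec<char> = value.chars().take(len).collect();
    if chars.len() == value.chars().count() {
        return Some(value.to_string()); // nothing was truncated
    }
    for i in (0..chars.len()).rev() {
        // from_u32 rejects surrogates and values past U+10FFFF, in which
        // case the carry moves one position to the left.
        if let Some(next) = char::from_u32(chars[i] as u32 + 1) {
            chars[i] = next;
            chars.truncate(i + 1);
            return Some(chars.into_iter().collect());
        }
    }
    None
}

/// Byte-wise variant for binary bounds: increment the last byte below 0xFF.
pub fn truncate_upper_bound_binary(value: &[u8], len: usize) -> Option<Vec<u8>> {
    if value.len() <= len {
        return Some(value.to_vec());
    }
    let mut bytes = value[..len].to_vec();
    for i in (0..bytes.len()).rev() {
        if bytes[i] < u8::MAX {
            bytes[i] += 1;
            bytes.truncate(i + 1);
            return Some(bytes);
        }
    }
    None
}
```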
Per-column override
Each primitive field is visited; the effective mode is write.metadata.metrics.column.<name> if set, else the default. Non-string/binary columns with mode truncate auto-upgrade to full; nested columns downgrade to counts.
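A sketch of that resolution in Rust, reusing the MetricsMode and match_metrics_mode sketches above; the helper and its two boolean parameters are hypothetical stand-ins for a proper schema visitor:

```rust
use std::collections::HashMap;

// Column override > table default > built-in truncate(16).
fn effective_mode(
    props: &HashMap<String, String>,
    column_name: &str,
    is_string_or_binary: bool,
    is_nested: bool,
) -> MetricsMode {
    let default = props
        .get("write.metadata.metrics.default")
        .and_then(|s| match_metrics_mode(s).ok())
        .unwrap_or(MetricsMode::Truncate(16));
    let mut mode = props
        .get(&format!("write.metadata.metrics.column.{column_name}"))
        .and_then(|s| match_metrics_mode(s).ok())
        .unwrap_or(default);
    // truncate(N) is meaningless for non-text/binary types: upgrade to full.
    if matches!(mode, MetricsMode::Truncate(_)) && !is_string_or_binary {
        mode = MetricsMode::Full;
    }
    // Nested fields only carry counts.
    if is_nested {
        mode = MetricsMode::Counts;
    }
    mode
}
```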
Proposed scope
Minimum viable change that the maintainers might accept in one PR:
- Constants — add to spec/table_properties.rs: PROPERTY_WRITE_METRICS_MODE / ..._DEFAULT = "truncate(16)" and PROPERTY_METRICS_MODE_COLUMN_CONF_PREFIX.
- Type — a MetricsMode enum plus match_metrics_mode(s: &str) -> Result<MetricsMode> that parses truncate(N), none, counts, full.
- Plan — a compute_metrics_plan(schema: &SchemaRef, properties: &HashMap<String, String>) -> HashMap<i32, MetricsMode> mirroring pyiceberg's compute_statistics_plan (pre-order visit with nested-downgrade and non-text auto-upgrade).
- Threading — pass TableMetadata (or the precomputed plan) into parquet_to_data_file_builder so the aggregator can see it.
- Aggregator — MinMaxColAggregator gains a per-field mode, and its bound-producing methods apply truncate(N) for strings/binary as described.
- Helpers — port truncate_upper_bound_text_string / truncate_upper_bound_binary_string into a new spec::truncate_bound (or similar) module. Rust-native UTF-8: iterate by chars() rather than bytes for the string variant, matching pyiceberg's semantics.
- Tests — unit coverage per mode (full, counts, none, truncate(N)) × per primitive type, plus an integration test that round-trips a known string against pyiceberg's output (the docker-compose fixture makes this straightforward); a starting sketch follows this list.
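As a starting point, a table-driven unit test against the helper sketches above, with expected values hand-computed:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn truncated_upper_bounds() {
        // truncate(4) on "iceberg": prefix "iceb", last char incremented.
        assert_eq!(truncate_upper_bound_text("iceberg", 4), Some("icec".into()));
        // Already short enough: returned unchanged.
        assert_eq!(truncate_upper_bound_text("ice", 4), Some("ice".into()));
        // Binary with a 0xFF tail: the carry moves one byte to the left.
        assert_eq!(
            truncate_upper_bound_binary(&[0x01, 0xFF, 0xFF], 2),
            Some(vec![0x02])
        );
        // Every byte already at 0xFF: no valid upper bound exists.
        assert_eq!(truncate_upper_bound_binary(&[0xFF, 0xFF, 0xFF], 2), None);
    }
}
```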
A non-negotiable constraint
partition_value_from_bounds (added in #1079, wired into the add-files flow by its follow-up) must continue to see untruncated bounds. Identity partitioning on a string column relies on min == max to detect single-partition files; if truncation were applied before partition inference, a file whose rows all carry, say, "AAPL-2026-01-15-200-C" would falsely fail with "more than one partition values", because the truncated-and-incremented upper bound no longer equals the truncated lower bound.
pyiceberg maintains this separation cleanly: StatsAggregator.current_min / current_max are the raw values (consumed by _partition_value), and truncation is applied only inside min_as_bytes() / max_as_bytes() on the serialization path. Any fix here should follow the same layering — bounds stay full-precision in the aggregator's public output; truncation happens at the DataFile-serialization boundary.
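In Rust terms, the same layering could be a single choke point between the aggregator's raw output and the DataFile builder. A sketch reusing MetricsMode and truncate_upper_bound_text from above, with Datum simplified to String for brevity (all names hypothetical):

```rust
use std::collections::HashMap;

// Raw bounds stay full-precision for partition inference; truncation is
// applied only when materializing the DataFile's serialized bounds.
struct RawBounds {
    upper: HashMap<i32, String>, // field id -> full-precision max
}

fn serialized_upper_bounds(
    raw: &RawBounds,
    plan: &HashMap<i32, MetricsMode>,
) -> HashMap<i32, String> {
    raw.upper
        .iter()
        .filter_map(|(id, v)| {
            match plan.get(id).copied().unwrap_or(MetricsMode::Truncate(16)) {
                MetricsMode::None | MetricsMode::Counts => None, // no bounds at all
                MetricsMode::Full => Some((*id, v.clone())),
                MetricsMode::Truncate(n) => {
                    truncate_upper_bound_text(v, n).map(|t| (*id, t))
                }
            }
        })
        .collect()
}
```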
References
- pyiceberg/io/pyarrow.py (StatsAggregator, compute_statistics_plan) and pyiceberg/utils/truncate.py
- expr/visitors/inclusive_metrics_evaluator.rs, expr/visitors/page_index_evaluator.rs

Happy to take this on if there is interest and no duplicate work underway.