SreeramGarlapati opened a new pull request, #2544:
URL: https://github.com/apache/iceberg-rust/pull/2544

   ## Summary
   
   Apply Iceberg's default 16-unit bound truncation to manifest lower/upper 
bounds for `STRING` (codepoint-based), `BINARY`, and `FIXED` (byte-based) when 
`MinMaxColAggregator` collects per-row-group parquet statistics in 
`parquet_writer.rs`.
   
   Mirrors Java's `org.apache.iceberg.util.UnicodeUtil#truncateStringMin/Max` 
and `BinaryUtil#truncateBinaryMin/Max`, called from `ParquetUtil#updateMin/Max`.
   
   ## Motivation
   
   Without this, long string/binary values produced manifest bounds that 
exceeded the conventional 16-unit budget and didn't agree with bounds 
Java/Spark would have written for the same data. In a two-writer setup where 
Java/Spark performs DDL/compaction on tables that iceberg-rust appends to, 
bounds disagreement breaks scan-time min/max pruning correctness for downstream 
readers.
   
   This is the only `OPEN` rust-side gap from an internal Spark↔Rust interop 
matrix that exercises every V2 schema-evolution + concurrent-writer scenario; 
the rest are either already upstream, cosmetic, or Spark-side.
   
   ## Changes
   
   - New helpers in `crates/iceberg/src/writer/file_writer/parquet_writer.rs`:
     - `truncate_string_min(s, n)` / `truncate_string_max(s, n)` — 
codepoint-based, mirrors `UnicodeUtil`
     - `truncate_binary_min(b, n)` / `truncate_binary_max(b, n)` — byte-based, 
mirrors `BinaryUtil`
     - `truncate_lower_bound` / `truncate_upper_bound` dispatchers over 
`PrimitiveType` × `Datum`
     - `ICEBERG_BOUND_TRUNCATE_LENGTH = 16`
   - `MinMaxColAggregator::update`: route lower/upper datums through the 
truncators for `String`, `Binary`, `Fixed(_)`. Other types pass through 
unchanged. When `truncate_*_max` returns `None` (every position at max), the 
upper bound is dropped — matches Java semantics; lower bound is still recorded.
   - Type-mismatch error semantics preserved: gating on 
`min_bytes_opt().is_some()` / `max_bytes_opt().is_some()` so an unmappable 
`(PrimitiveType, Statistics)` pair still surfaces an error if a stat is 
actually present.
   
   ## Upper-bound algorithm
   
   Take the 16-unit prefix, increment the last unit; on overflow drop that 
position and try the previous one; if every position in the prefix is at max, 
return `None` and drop the upper bound. Lower bound is a simple prefix.
   
   For `STRING`, the increment walks past UTF-16 surrogates (U+D800–U+DFFF) 
because Rust's `char::from_u32` rejects them. Java's 
`Character.isValidCodePoint` accepts surrogates, but skipping them on the Rust 
side preserves monotonic ordering for valid UTF-8 strings.
   
   ## API surface
   
   None — all new items are private to the file. No public API changes.
   
   ## Test coverage (18 new tests)
   
   Helper-level (13):
   - `test_truncate_string_min_short_input_unchanged`
   - `test_truncate_string_min_long_input_truncates_codepoints`
   - `test_truncate_string_max_short_input_unchanged`
   - `test_truncate_string_max_long_input_increments_last_codepoint`
   - `test_truncate_string_max_overflow_drops_position`
   - `test_truncate_string_max_skips_utf16_surrogates`
   - `test_truncate_string_max_all_max_returns_none`
   - `test_truncate_binary_min_short_input_unchanged`
   - `test_truncate_binary_min_long_input_truncates`
   - `test_truncate_binary_max_short_input_unchanged`
   - `test_truncate_binary_max_long_input_increments_last_byte`
   - `test_truncate_binary_max_drops_trailing_0xff`
   - `test_truncate_binary_max_all_ff_returns_none`
   
   Aggregator (4):
   - `test_min_max_aggregator_keeps_inexact_string_stats`
   - `test_min_max_aggregator_truncates_long_string_bounds`
   - `test_min_max_aggregator_truncates_long_binary_bounds`
   - `test_min_max_aggregator_drops_only_upper_when_unboundable`
   
   End-to-end (1):
   - `test_parquet_writer_truncates_long_string_bounds` — writes long-string 
rows through `ParquetWriter` and asserts `data_file.lower_bounds()` / 
`upper_bounds()` match the Java-equivalent 16-codepoint truncation.
   
   `cargo test -p iceberg --lib` → 1314 pass. `cargo clippy -p iceberg --lib 
--tests` clean. `cargo fmt` applied.
   
   ## Out of scope
   
   - Full `MetricsConfig` wiring (per-column truncate-length, 
full-column-disable, count-only modes). This PR only changes the default 
16-unit truncation; `MetricsConfig` plumbing can land independently.
   - Decimal / fixed-with-large-precision bound truncation (Java truncates 
these as fixed bytes; rust-side semantics for non-string/binary primitives are 
unchanged here).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to