rraulinio opened a new issue, #1192:
URL: https://github.com/apache/iceberg-go/issues/1192

   ### Apache Iceberg version
   
   main (development)
   
   ### Please describe the bug 🐞
   
   ## Problem
   
   `iceberg-go` can incorrectly prune files for filters like:
   
   ```sql
   WHERE name NOT STARTS WITH 'abc'
   ```
   
   when the table is partitioned by a string truncate transform whose width is 
shorter than the filter prefix, for example:
   
   ```text
   truncate[2](name)
   ```
   
   In that case, a data file with partition value `"ab"` may contain rows like:
   
   ```text
   name = "ab"
   ```
   
   That row **does** satisfy:
   
   ```sql
   name NOT STARTS WITH 'abc'
   ```
   
   because `"ab"` does not start with `"abc"`.
   
   However, the current projection logic can truncate the filter literal 
`"abc"` to `"ab"` and reason from the partition value alone. That can make the 
planner conclude that partition `"ab"` cannot match the filter, so the file is 
skipped. This is a false negative and can produce incomplete query results.
   
   ## Why This Is Unsafe
   
   For `truncate[W](name)`, the partition value preserves only the first `W` 
code points of `name`. If the `NOT STARTS WITH` prefix is longer than `W`, the 
partition value does not contain enough information to prove that every row in 
the file fails the filter.
   
   Concrete counterexample:
   
   ```text
   partition transform: truncate[2](name)
   partition value:     "ab"
   row value:           "ab"
   filter:              name NOT STARTS WITH "abc"
   ```
   
   The row should be returned, so pruning the file is incorrect.
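   
   The counterexample can be reproduced with a small standalone Go sketch. 
It does not use iceberg-go APIs: `truncateString`, `rowMatches`, and 
`naivePartitionMayMatch` are hypothetical helpers, and the last one is only an 
assumed model of the unsafe reasoning described above, not the actual code.
   
   ```go
   package main

   import "strings"

   // truncateString keeps at most width Unicode code points of s,
   // modelling the string semantics of truncate[W] described in this issue.
   func truncateString(s string, width int) string {
   	r := []rune(s)
   	if len(r) <= width {
   		return s
   	}
   	return string(r[:width])
   }

   // rowMatches evaluates `name NOT STARTS WITH prefix` against a real row value.
   func rowMatches(name, prefix string) bool {
   	return !strings.HasPrefix(name, prefix)
   }

   // naivePartitionMayMatch is a hypothetical model of the unsafe logic:
   // truncate the filter literal to the transform width, then apply
   // NOT STARTS WITH to the partition value. false means "prune the file".
   func naivePartitionMayMatch(partitionValue, prefix string, width int) bool {
   	return !strings.HasPrefix(partitionValue, truncateString(prefix, width))
   }
   ```
   
   With partition value `truncateString("ab", 2)`, `rowMatches("ab", "abc")` 
is true while `naivePartitionMayMatch("ab", "abc", 2)` is false, which is the 
disagreement that causes the file to be skipped.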
   
   ## Expected Behavior
   
   For `NOT STARTS WITH` over a truncated string partition:
   
   - If the filter prefix length is less than or equal to the truncate width, 
the predicate can be projected safely, because the partition value still 
contains every code point the prefix is compared against.
   - If the filter prefix is longer than the truncate width, the projection 
should return no partition predicate and avoid pruning based on that transform.
   
   That means the planner should read the candidate file and let row-level 
filtering decide.
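   
   A minimal sketch of that rule, assuming a simplified predicate 
representation: `partitionPredicate` and `projectNotStartsWith` are 
hypothetical names for illustration and are not iceberg-go types or functions.
   
   ```go
   package main

   import "unicode/utf8"

   // partitionPredicate is a hypothetical stand-in for a predicate projected
   // onto the partition field; it is not an iceberg-go type.
   type partitionPredicate struct {
   	Op      string // only "NOT_STARTS_WITH" is produced in this sketch
   	Literal string
   }

   // projectNotStartsWith projects `name NOT STARTS WITH prefix` onto a
   // truncate[width](name) partition field. ok is false when no safe
   // projection exists, in which case the caller must not prune.
   func projectNotStartsWith(prefix string, width int) (pred partitionPredicate, ok bool) {
   	// Lengths are compared in code points, matching string truncate semantics.
   	if utf8.RuneCountInString(prefix) > width {
   		return partitionPredicate{}, false
   	}
   	return partitionPredicate{Op: "NOT_STARTS_WITH", Literal: prefix}, true
   }
   ```
   
   When the prefix length equals the width, `NOT STARTS WITH` on the 
partition value is equivalent to an inequality, since the partition value has 
at most `width` code points.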
   
   ## Reference Behavior
   
   This matches the behavior in other Iceberg implementations:
   
   - Java Iceberg's `TruncateString.project` returns no projection for 
`NOT_STARTS_WITH` when the literal is longer than the truncate width.
   - iceberg-rust has an explicit truncate projection test where `NOT STARTS 
WITH "abcdefg"` over `truncate[5]` returns `None`.
   
   This is also consistent with the Iceberg spec's truncate semantics: string 
`truncate[L]` preserves only the first `L` code points, so longer-prefix `NOT 
STARTS WITH` predicates cannot be proven from the partition value alone.
   
   References:
   
   - Iceberg spec, truncate transform details: 
https://iceberg.apache.org/spec/#truncate-transform-details
   - Java Iceberg `Truncate.java`: 
https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/transforms/Truncate.java
   - iceberg-rust `truncate.rs`: 
https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/transform/truncate.rs
   
   ## Proposed Fix
   
   Update the truncate string projection logic so `NOT STARTS WITH` only 
produces a partition predicate when the filter prefix length is less than or 
equal to the truncate width.
   
   If the filter prefix is longer than the truncate width, return no projection 
instead of projecting an unsafe predicate.
   
   Add regression coverage for a case like:
   
   ```text
   partition: truncate[2](name)
   filter:    name NOT STARTS WITH "abc"
   value:     "ab"
   ```
   
   The test should fail before the fix because the file is incorrectly pruned, 
and pass after the fix because pruning is disabled for that unsafe case.
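   
   One possible shape for that coverage is a soundness check: for every row 
value that satisfies the filter, the partition decision must not prune the 
row's partition. The sketch below is standalone and hypothetical 
(`truncateString`, `mayMatch`, and `wronglyPruned` are illustrative names, not 
iceberg-go APIs); a real regression test would go through the library's own 
projection and scan planning code instead.
   
   ```go
   package main

   import (
   	"strings"
   	"unicode/utf8"
   )

   // truncateString keeps at most width code points of s.
   func truncateString(s string, width int) string {
   	r := []rune(s)
   	if len(r) <= width {
   		return s
   	}
   	return string(r[:width])
   }

   // mayMatch reports whether a file with the given partition value must be
   // read for `name NOT STARTS WITH prefix` under truncate[width](name).
   func mayMatch(partitionValue, prefix string, width int) bool {
   	if utf8.RuneCountInString(prefix) > width {
   		return true // no safe projection: leave it to row-level filtering
   	}
   	return !strings.HasPrefix(partitionValue, prefix)
   }

   // wronglyPruned returns the row values that satisfy the filter but whose
   // partition would be pruned. A sound projection returns none.
   func wronglyPruned(rows []string, prefix string, width int) []string {
   	var bad []string
   	for _, name := range rows {
   		matches := !strings.HasPrefix(name, prefix)
   		if matches && !mayMatch(truncateString(name, width), prefix, width) {
   			bad = append(bad, name)
   		}
   	}
   	return bad
   }
   ```
   
   Calling `wronglyPruned` with rows such as `"ab"`, `"abc"`, and `"xyz"`, 
prefix `"abc"`, and width `2` should return an empty slice under the proposed 
rule.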
   
   

