rraulinio opened a new issue, #1192: URL: https://github.com/apache/iceberg-go/issues/1192
### Apache Iceberg version

main (development)

### Please describe the bug 🐞

## Problem

`iceberg-go` can incorrectly prune files for filters like:

```sql
WHERE name NOT STARTS WITH 'abc'
```

when the table is partitioned by a shorter string truncate transform, for example:

```text
truncate[2](name)
```

In that case, a data file with partition value `"ab"` may contain rows like:

```text
name = "ab"
```

That row **does** satisfy:

```sql
name NOT STARTS WITH 'abc'
```

because `"ab"` does not start with `"abc"`.

However, the current projection logic can truncate the filter literal `"abc"` to `"ab"` and reason from the partition value alone. That can make the planner conclude that partition `"ab"` cannot match the filter, so the file is skipped. This is a false negative and can produce incomplete query results.

## Why This Is Unsafe

For `truncate[W](name)`, the partition value only preserves the first `W` characters of `name`. If the `NOT STARTS WITH` prefix is longer than `W`, the partition value does not contain enough information to prove that every row in the file fails the filter.

Concrete counterexample:

```text
partition transform: truncate[2](name)
partition value: "ab"
row value: "ab"
filter: name NOT STARTS WITH "abc"
```

The row should be returned, so pruning the file is incorrect.

## Expected Behavior

For `NOT STARTS WITH` over a truncated string partition:

- If the filter prefix length is less than or equal to the truncate width, projection can be safe.
- If the filter prefix is longer than the truncate width, the projection should return no partition predicate and avoid pruning based on that transform.

That means the planner should read the candidate file and let row-level filtering decide.

## Reference Behavior

This matches the behavior in other Iceberg implementations:

- Java Iceberg's `TruncateString.project` returns no projection for `NOT_STARTS_WITH` when the literal is longer than the truncate width.
- iceberg-rust has an explicit truncate projection test where `NOT STARTS WITH "abcdefg"` over `truncate[5]` returns `None`.

This is also consistent with the Iceberg spec's truncate semantics: string `truncate[L]` preserves only the first `L` code points, so longer-prefix `NOT STARTS WITH` predicates cannot be proven from the partition value alone.

References:

- Iceberg spec, truncate transform details: https://iceberg.apache.org/spec/#truncate-transform-details
- Java Iceberg `Truncate.java`: https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/transforms/Truncate.java
- iceberg-rust `truncate.rs`: https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/transform/truncate.rs

## Proposed Fix

Update the truncate string projection logic so `NOT STARTS WITH` only produces a partition predicate when the filter prefix length is less than or equal to the truncate width.

If the filter prefix is longer than the truncate width, return no projection instead of projecting an unsafe predicate.

Add regression coverage for a case like:

```text
partition: truncate[2](name)
filter: name NOT STARTS WITH "abc"
value: "ab"
```

The test should fail before the fix because the file is incorrectly pruned, and pass after the fix because pruning is disabled for that unsafe case.
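
To illustrate the width guard described in this issue, here is a minimal, self-contained Go sketch. It does not use the real `iceberg-go` API: `truncateString`, `partitionPredicate`, `projectNotStartsWith`, and `mayContainMatches` are hypothetical names introduced only for this example, and a `nil` predicate stands in for "no projection".

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// truncateString keeps the first width code points of s, following the
// string truncate semantics described in the Iceberg spec.
func truncateString(s string, width int) string {
	n := 0
	for i := range s {
		if n == width {
			return s[:i]
		}
		n++
	}
	return s
}

// partitionPredicate is a hypothetical stand-in for a projected partition
// filter. It reports whether a partition value may hold matching rows.
type partitionPredicate func(partitionValue string) bool

// projectNotStartsWith sketches the proposed guard (hypothetical helper).
// It returns nil, meaning "no projection", when the prefix has more code
// points than the truncate width.
func projectNotStartsWith(prefix string, width int) partitionPredicate {
	if utf8.RuneCountInString(prefix) > width {
		return nil
	}
	// With prefix length <= width, the partition value starts with the
	// prefix exactly when every row value in that partition does.
	return func(pv string) bool { return !strings.HasPrefix(pv, prefix) }
}

// mayContainMatches reports whether a file must be read. A nil predicate
// disables pruning, so the file is always kept.
func mayContainMatches(pred partitionPredicate, pv string) bool {
	return pred == nil || pred(pv)
}

func main() {
	width := 2
	pv := truncateString("ab", width) // partition value for the row "ab"

	// Prefix longer than the width: no projection, so the file is read.
	fmt.Println(mayContainMatches(projectNotStartsWith("abc", width), pv))
	// Prefix within the width: every row in "ab" starts with "a", so the
	// file can be pruned safely.
	fmt.Println(mayContainMatches(projectNotStartsWith("a", width), pv))
}
```

In this sketch the counterexample from the issue (`truncate[2]`, partition value `"ab"`, filter prefix `"abc"`) keeps the file, while a prefix that fits within the truncate width can still prune.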
