szehon-ho opened a new issue, #6670: URL: https://github.com/apache/iceberg/issues/6670
### Apache Iceberg version 1.1.0 (latest release) ### Query engine Spark ### Please describe the bug 🐞 This is a serious random bug I debugged together with @dramaticlly, it will make metadata deletes silently fail in real environments. **Setup** We have a table that is partitioned by two date integers, a and b. They are both in same format yyyymmdd, and refer to relatively current dates, so the values are not very far from each other. We find that sometimes doing a `delete from table where date <= a` fails silently to delete files of partition a. **Problem** The problem is computing the Iceberg residual, which is the remaining data-filter that is pushed down to the files, after the partition filter has been evaluated. ManifestFilterManager has a residual cache by partition, which we use to find the residual for each file under consideration of being deleted. All files of a partition should use the same residual. Ref: https://github.com/apache/iceberg/blob/1.0.x/core/src/main/java/org/apache/iceberg/ManifestFilterManager.java#L511 But in this case, a hashcode collission causes a file of partition b having residual of AlwaysFalse() for query ```date <= a``` be used when a file of partition a is being evaluated (it should have evaluated to AlwaysTrue() for query ```date <= a```. Hence the later code skips evaluation of the file and does not proceed to delete it. **Analysis** We saw that partition with different but similar range a, b values in this use-case often collide in the hashcode. We saw it multiple times and were able to come up with an example: a = 20220728, b = 20220719 a = 20220726, b = 20220801 The problem is the StructWrapper's hash (which is the JavaHash: Ref: https://github.com/apache/iceberg/blob/1.0.x/api/src/main/java/org/apache/iceberg/types/JavaHashes.java#L83 It makes a random base depending on len of the struct, but in this case these partition type all have the same len, so there's effectively no randomization. The only randomization then is, per value, multiplying result by 41 and then adding the next value, ie, ``` result = 41 * result + hash(a) result = 41 * result + hash(b) ``` but in this case Java int hash of an integer is just itself and so because: ``` 41*(20220728) + 41*(20220719) = 41*(20220726) + 41*(20220801) ``` we the hash collission. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org