raunaqmorarka opened a new pull request, #14757: URL: https://github.com/apache/iceberg/pull/14757
The existing de-duplication of sanitized values, implemented in https://github.com/apache/iceberg/pull/5908, works well only for numeric values, because those are sanitized to a limited set of placeholders such as 2-digit-int, 3-digit-int, etc. String values, however, are sanitized to hashes (e.g. (hash-1b409883), (hash-53cd6d46), (hash-24add70a), (hash-7df3cf93), ...), which are almost always distinct, so this logic does not help abbreviate a long IN list. For example:

```
2025-11-28T11:33:04.374Z INFO iceberg-split-source-iceberg_ws-161 org.apache.iceberg.SnapshotScan Scanning table "iceberg_tpcds_sf1000_parquet_part".item snapshot 7925025686344816596 created at 2024-03-20T21:48:28.148+00:00 with filter i_item_id IN ((hash-7f083fe9), (hash-77356dd5), (hash-66cc3748), (hash-6e6e32bb), (hash-75659594), (hash-0f678143), (hash-5f354c1a), (hash-222697cd), (hash-78bd038f), (hash-38a9fbe6), (hash-644a0e1a), (hash-251adfed), (hash-3d078191), (hash-6c4c1e30), (hash-208a10e3), (hash-4c5e8a00), (hash-1ff5355c), (hash-14a83bbd), (hash-1c4e0323), (hash-1214f434), (hash-4a8467a4), (hash-507532dd), (hash-1a58ff72), (hash-22d10a3d), (hash-20d9d51d), (hash-29aff673), (hash-1fd20691), (hash-66ae0d2f), (hash-017b5f86), (hash-2ec71033), (hash-33a6bbde), (hash-154aa905), (hash-589aca81), (hash-50aea8a5), (hash-63e84313), (hash-3ff2d50b), (hash-44e1a213), (hash-394ffd0f), (hash-01d10a12), (hash-1b5196c4), (hash-2b5376e2), (hash-227a51cb), (hash-495dfefe), (hash-6b56bcbe), (hash-4727d59f), (hash-045ddf4f), (hash-3b26480b), (hash-539f93f6), (hash-073c0658), (hash-64a91d3f), (hash-3f4d7e7e), (hash-6dcc5f83), (hash-0967ab2b), (hash-26138fe1), (hash-407fabe0), (hash-392e5e45), (hash-66033407), (hash-35f522fd), (hash-3a31debe), (hash-5568d99a), (hash-621452f6), (hash-25b0e48d), (hash-0c307c95), (hash-74b248fc), (hash-118e66bb), (hash-4e4da5cc), (hash-43f3d80c), (hash-5c0f0895), (hash-2379640e), (hash-1cccfc11), (hash-35a1145c), (hash-1cf230cb), (hash-6f622b17), (hash-7ed174d9), (hash-1efda34c), (hash-3d29a275), (hash-388775c2), (hash-0ce5c90a)......
```

The logic is now simplified to always abbreviate when the number of distinct sanitized values exceeds LONG_IN_PREDICATE_ABBREVIATION_THRESHOLD.
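
The rule stated in this description (abbreviate whenever the number of distinct sanitized values exceeds the threshold) might look roughly like the sketch below. This is an illustration only, not the code from the PR: the class `InListAbbreviation`, the method `abbreviate`, the threshold value of 10, the choice to keep the first `threshold` distinct values, and the wording of the summary entry are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; this is NOT Iceberg's actual implementation.
final class InListAbbreviation {
  // Assumed value; the real constant's value is not stated in this message.
  static final int LONG_IN_PREDICATE_ABBREVIATION_THRESHOLD = 10;

  // Returns the input unchanged when the distinct sanitized values fit within
  // the threshold; otherwise keeps the first `threshold` distinct values and
  // appends one summary entry (the summary format is an assumption).
  static List<String> abbreviate(List<String> sanitizedValues, int threshold) {
    // LinkedHashSet removes duplicates while preserving first-seen order.
    Set<String> distinct = new LinkedHashSet<>(sanitizedValues);
    if (distinct.size() <= threshold) {
      return sanitizedValues;
    }

    List<String> result = new ArrayList<>(threshold + 1);
    for (String value : distinct) {
      if (result.size() == threshold) {
        break;
      }
      result.add(value);
    }
    result.add(
        String.format(
            "... (%d values hidden, %d in total)",
            sanitizedValues.size() - threshold, sanitizedValues.size()));
    return result;
  }

  public static void main(String[] args) {
    // Build 15 distinct placeholder hashes, similar in shape to the log above.
    List<String> values = new ArrayList<>();
    for (int i = 0; i < 15; i++) {
      values.add(String.format("(hash-%08x)", i));
    }
    List<String> shown = abbreviate(values, LONG_IN_PREDICATE_ABBREVIATION_THRESHOLD);
    System.out.println("i_item_id IN (" + String.join(", ", shown) + ")");
  }
}
```

With a rule of this shape, numeric placeholders such as 2-digit-int collapse to a handful of distinct values and are never truncated, while a long list of distinct hashes is cut off at the threshold regardless of how few duplicates it contains.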
