raunaqmorarka opened a new pull request, #14757:
URL: https://github.com/apache/iceberg/pull/14757

   The existing logic for de-duplicating sanitized values, implemented in 
https://github.com/apache/iceberg/pull/5908, works well only for numeric values, 
because those are sanitized to a limited set of values such as 2-digit-int, 
3-digit-int, etc.
   String values, however, are sanitized to hashes (e.g. (hash-1b409883), 
(hash-53cd6d46), (hash-24add70a), (hash-7df3cf93), ...), which are typically 
all distinct, so this logic does not help abbreviate a long IN list.
   
   e.g.:
   ```
   2025-11-28T11:33:04.374Z     INFO    iceberg-split-source-iceberg_ws-161     
org.apache.iceberg.SnapshotScan Scanning table 
"iceberg_tpcds_sf1000_parquet_part".item snapshot 7925025686344816596 created 
at 2024-03-20T21:48:28.148+00:00 with filter i_item_id IN ((hash-7f083fe9), 
(hash-77356dd5), (hash-66cc3748), (hash-6e6e32bb), (hash-75659594), 
(hash-0f678143), (hash-5f354c1a), (hash-222697cd), (hash-78bd038f), 
(hash-38a9fbe6), (hash-644a0e1a), (hash-251adfed), (hash-3d078191), 
(hash-6c4c1e30), (hash-208a10e3), (hash-4c5e8a00), (hash-1ff5355c), 
(hash-14a83bbd), (hash-1c4e0323), (hash-1214f434), (hash-4a8467a4), 
(hash-507532dd), (hash-1a58ff72), (hash-22d10a3d), (hash-20d9d51d), 
(hash-29aff673), (hash-1fd20691), (hash-66ae0d2f), (hash-017b5f86), 
(hash-2ec71033), (hash-33a6bbde), (hash-154aa905), (hash-589aca81), 
(hash-50aea8a5), (hash-63e84313), (hash-3ff2d50b), (hash-44e1a213), 
(hash-394ffd0f), (hash-01d10a12), (hash-1b5196c4), (hash-2b5376e2), 
(hash-227a51cb), (hash-495dfefe), (hash-6b56bcbe), 
(hash-4727d59f), (hash-045ddf4f), (hash-3b26480b), (hash-539f93f6), 
(hash-073c0658), (hash-64a91d3f), (hash-3f4d7e7e), (hash-6dcc5f83), 
(hash-0967ab2b), (hash-26138fe1), (hash-407fabe0), (hash-392e5e45), 
(hash-66033407), (hash-35f522fd), (hash-3a31debe), (hash-5568d99a), 
(hash-621452f6), (hash-25b0e48d), (hash-0c307c95), (hash-74b248fc), 
(hash-118e66bb), (hash-4e4da5cc), (hash-43f3d80c), (hash-5c0f0895), 
(hash-2379640e), (hash-1cccfc11), (hash-35a1145c), (hash-1cf230cb), 
(hash-6f622b17), (hash-7ed174d9), (hash-1efda34c), (hash-3d29a275), 
(hash-388775c2), (hash-0ce5c90a)......
   ```
   
   This logic is now simplified to always abbreviate when the number of 
distinct sanitized values exceeds LONG_IN_PREDICATE_ABBREVIATION_THRESHOLD.
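
   As a rough illustration of the rule described above, here is a minimal, 
self-contained sketch. It is not the code from this PR: the class name 
`InListAbbreviation`, the method `abbreviate`, the threshold value of 10, the 
choice to keep the first `threshold` distinct values, and the wording of the 
summary entry are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not the implementation in this PR.
final class InListAbbreviation {
  // Assumed value for this sketch; the real constant is defined in Iceberg.
  static final int LONG_IN_PREDICATE_ABBREVIATION_THRESHOLD = 10;

  private InListAbbreviation() {}

  // Returns the distinct sanitized values, cut off after `threshold` entries
  // and followed by one summary entry when there are more distinct values.
  static List<String> abbreviate(List<String> sanitizedValues, int threshold) {
    // De-duplicate while keeping first-seen order.
    Set<String> distinct = new LinkedHashSet<>(sanitizedValues);
    if (distinct.size() <= threshold) {
      return new ArrayList<>(distinct);
    }

    List<String> result = new ArrayList<>(threshold + 1);
    for (String value : distinct) {
      if (result.size() == threshold) {
        break;
      }
      result.add(value);
    }
    // Summary wording is an assumption of this sketch.
    result.add(
        String.format("... (%d distinct values hidden)", distinct.size() - threshold));
    return result;
  }
}
```

   With this shape, a list of mostly unique hashes is cut off at the threshold 
regardless of how few duplicates it contains, which is the case the earlier 
de-duplication-only logic did not handle.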


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

