bk-mz commented on issue #9833: URL: https://github.com/apache/iceberg/issues/9833#issuecomment-1989424746
>do the files have incorrect partition tuples (the values actually used for filtering)? how do I check this? Something is odd, because I see bloated small-file partitions, i.e. each batch I run 3 consecutive compactions: ``` CALL glue.system.rewrite_position_delete_files(table => 'table', where => "data_load_ts between TIMESTAMP '2024-02-10 20:39:47.669' and TIMESTAMP '2024-03-11 19:39:47.669'", options => map('partial-progress.enabled', 'true', 'min-file-size-bytes', '26843545', 'max-file-size-bytes', '134217728', 'min-input-files', '10', 'max-concurrent-file-group-rewrites', '500')) +----------------------------+------------------------+---------------------+-----------------+ |rewritten_delete_files_count|added_delete_files_count|rewritten_bytes_count|added_bytes_count| +----------------------------+------------------------+---------------------+-----------------+ |693 |689 |10353631 |10322951 | +----------------------------+------------------------+---------------------+-----------------+ CALL glue.system.rewrite_data_files(table => 'table', where => "data_load_ts between TIMESTAMP '2024-03-04 20:41:16.346' and TIMESTAMP '2024-03-11 19:41:16.346'", options => map('partial-progress.enabled', 'true', 'min-file-size-bytes', '53687091', 'max-file-size-bytes', '268435456', 'min-input-files', '20', 'max-concurrent-file-group-rewrites', '500')) +--------------------------+----------------------+---------------------+-----------------------+ |rewritten_data_files_count|added_data_files_count|rewritten_bytes_count|failed_data_files_count| +--------------------------+----------------------+---------------------+-----------------------+ |60 |6 |131249114 |0 | +--------------------------+----------------------+---------------------+-----------------------+ CALL glue.system.rewrite_data_files(table => 'table', where => "data_load_ts <= TIMESTAMP '2024-03-04 20:41:42.204'", options => map('partial-progress.enabled', 'true', 'min-file-size-bytes', '53687091', 'max-file-size-bytes', '268435456', 'min-input-files', '20', 'max-concurrent-file-group-rewrites', '1000')) +--------------------------+----------------------+---------------------+-----------------------+ |rewritten_data_files_count|added_data_files_count|rewritten_bytes_count|failed_data_files_count| +--------------------------+----------------------+---------------------+-----------------------+ |20 |2 |655906 |0 | +--------------------------+----------------------+---------------------+-----------------------+ ``` (where clause changes per each batch). Then, this is what I see in the compactions logs: ```sql with data as ( select committed_at, snapshot_id, summary.`changed-partition-count` as changed_partition_count, if(summary.`added-data-files` is null, "compact_delete_files", "compact_base_files") as op, if(summary.`added-data-files` is null, summary.`added-position-delete-files`, summary.`added-data-files`) as added_files, if(summary.`added-data-files` is null, summary.`removed-position-delete-files`, summary.`deleted-data-files`) as removed_files from messaging_data_platform_pii_sandbox.mdp_temp_inflight_pffm_iceberg2.snapshots where operation = "replace") select committed_at, snapshot_id, changed_partition_count, op, concat(removed_files, "->", added_files) as change, removed_files / added_files as compact_ratio from data limit 100; ``` Results: ``` 2024-03-10 20:36:07.15 8797169961871736418 23 compact_delete_files 165->157 1.0509554140127388 2024-03-10 20:36:08.41 7845087824966754379 23 compact_delete_files 263->249 1.0562248995983936 2024-03-10 20:36:09.532 7903647840988554268 23 compact_delete_files 634->616 1.0292207792207793 2024-03-10 20:36:10.764 483626480370270807 23 compact_delete_files 835->814 1.0257985257985258 2024-03-10 20:36:12.728 5477859694032525431 23 compact_delete_files 441->426 1.0352112676056338 2024-03-10 20:36:16.581 1255334267473732600 23 compact_delete_files 803->784 1.024234693877551 2024-03-10 20:36:18.947 7227297553321373728 23 compact_delete_files 505->478 1.0564853556485356 2024-03-10 20:36:21.98 1281338329182940375 23 compact_delete_files 582->566 1.028268551236749 2024-03-10 20:36:23.08 6410347697455606449 23 compact_delete_files 353->317 1.113564668769716 2024-03-10 20:36:24.808 7588875997900067709 20 compact_delete_files 599->588 1.0187074829931972 2024-03-10 20:40:05.974 198476720206951232 2 compact_base_files 40->3 13.333333333333334 2024-03-10 20:40:08.515 7210798375839628837 2 compact_base_files 42->4 10.5 2024-03-10 20:40:10.876 5692616137794405663 2 compact_base_files 41->4 10.25 2024-03-10 20:40:13.437 6290596725370482099 2 compact_base_files 40->4 10.0 2024-03-10 20:40:16.507 217541509135133596 2 compact_base_files 42->4 10.5 2024-03-10 20:40:20.011 4802230468835188293 1 compact_base_files 20->2 10.0 2024-03-10 20:41:04.273 6246580026217410227 23 compact_delete_files 166->155 1.070967741935484 2024-03-10 20:41:05.719 3297695269624318303 23 compact_delete_files 264->239 1.104602510460251 2024-03-10 20:41:06.785 5847626578154331062 23 compact_delete_files 666->654 1.018348623853211 2024-03-10 20:41:08.07 4267888321478088912 23 compact_delete_files 804->778 1.0334190231362468 2024-03-10 20:41:10.013 8135762954708433780 23 compact_delete_files 437->420 1.0404761904761906 2024-03-10 20:41:14.315 2664190367967793873 23 compact_delete_files 845->828 1.0205314009661837 2024-03-10 20:41:16.688 3946211637382110477 23 compact_delete_files 365->343 1.064139941690962 2024-03-10 20:41:18.796 4932807097664769635 23 compact_delete_files 546->511 1.0684931506849316 2024-03-10 20:41:19.909 6618458420791632675 23 compact_delete_files 448->439 1.020501138952164 2024-03-10 20:41:21.741 6703590499156726364 19 compact_delete_files 609->585 1.041025641025641 2024-03-10 20:45:05.725 9121859527157436009 2 compact_base_files 40->3 13.333333333333334 2024-03-10 20:45:08.159 4157493376091722749 2 compact_base_files 40->4 10.0 2024-03-10 20:45:10.364 7307518553901315194 2 compact_base_files 40->4 10.0 2024-03-10 20:45:12.53 7391382820317167039 2 compact_base_files 40->4 10.0 2024-03-10 20:45:15.294 1634346428437060541 2 compact_base_files 40->4 10.0 2024-03-10 20:45:18.664 6471891107683048438 1 compact_base_files 20->2 10.0 2024-03-10 20:45:23.872 2705895529971003001 1 compact_base_files 20->2 10.0 2024-03-10 20:46:09.038 3459810320855697041 23 compact_delete_files 182->170 1.0705882352941176 2024-03-10 20:46:10.458 2661568600570547866 23 compact_delete_files 454->430 1.0558139534883721 2024-03-10 20:46:11.621 1341779779088511826 23 compact_delete_files 577->564 1.0230496453900708 2024-03-10 20:46:12.949 1453266135319995705 23 compact_delete_files 859->835 1.02874251497006 2024-03-10 20:46:14.829 4995825204442784024 23 compact_delete_files 277->259 1.0694980694980696 2024-03-10 20:46:19.456 6932435756418096621 23 compact_delete_files 815->786 1.0368956743002544 2024-03-10 20:46:21.472 2430383133564740551 23 compact_delete_files 337->325 1.0369230769230768 2024-03-10 20:46:23.548 3455842334691225549 23 compact_delete_files 545->528 1.0321969696969697 2024-03-10 20:46:24.884 6197136702470865431 23 compact_delete_files 417->405 1.0296296296296297 2024-03-10 20:46:26.57 3018837951760552054 19 compact_delete_files 658->639 1.029733959311424 2024-03-10 20:50:05.474 5737241757182222621 2 compact_base_files 40->4 10.0 ``` `compact_delete_files` actually do nothing, i.e. rewrite data into itself, or even produce more files than needed. Are there any obvious places to check for inconsistency in setup or setting? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org