aokolnychyi commented on PR #2276: URL: https://github.com/apache/iceberg/pull/2276#issuecomment-1294031962
Let's think through the algorithm given what we discussed. The main problem is that we don't know which spec IDs are affected by a scan until we plan files. I think the following would work.

```
- Call planFiles, materialize the iterable of files to be scanned, and store the files in a list.
- While materializing the iterable, keep track of all seen spec IDs.
- Build an intersection of all partition types for the specs covered by this scan (we know the spec IDs from the previous step).
- This intersection is the set of partition columns by which we can guarantee the data is clustered. That’s why they should be our keys in KeyGroupedPartitioning.
- Determine which partition columns from the intersection of all partition types can’t be combined (using the source columns provided by the user).
- Create a projection that selects the values of the non-combinable partition columns from a file's partition tuple.
- Build a map where the keys are tuples of projected non-combinable partition columns and the values are lists of files.
- Sort each file list by spec ID and partition so that we combine splits for the same partition.
- Call the bin-packing code on each list of files independently.
- Union the results.
```

What do you think, @sunchao @rdblue?
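
To make the steps above easier to discuss, here is a rough Python sketch of the flow. It is only an illustration under stated assumptions, not the proposed Iceberg implementation: `DataFile`, `bin_pack`, and `plan_groups` are hypothetical names, partition fields are matched by name rather than by field ID and type, and a partition column is treated as non-combinable when its source column is one of the user-provided columns (my reading of that step). Partition values are assumed to be comparable within a spec.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a planned file; not an Iceberg class."""
    path: str
    spec_id: int
    partition: tuple  # ((field_name, value), ...) in spec order
    size: int


def bin_pack(files, target_size):
    """Greedy in-order packing: close a bin once adding a file would exceed the target."""
    bins, current, current_size = [], [], 0
    for f in files:
        if current and current_size + f.size > target_size:
            bins.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += f.size
    if current:
        bins.append(current)
    return bins


def plan_groups(planned_files, source_columns, user_columns, target_size):
    """Return (grouping key field names, list of (key, files) tasks).

    source_columns: partition field name -> source column name (assumed input)
    user_columns:   source columns provided by the user (assumed input)
    """
    # Materialize the iterable and record the fields of every spec ID seen.
    files, fields_by_spec = [], {}
    for f in planned_files:
        files.append(f)
        fields_by_spec.setdefault(f.spec_id, [name for name, _ in f.partition])
    if not files:
        return [], []

    # Intersect partition fields across all seen specs (order of the first spec).
    specs = list(fields_by_spec.values())
    common = [n for n in specs[0] if all(n in s for s in specs[1:])]

    # Assumption: non-combinable means the source column was provided by the user.
    non_combinable = [n for n in common if source_columns.get(n) in user_columns]

    # Project each file's partition tuple onto the non-combinable fields and group.
    groups = defaultdict(list)
    for f in files:
        values = dict(f.partition)
        groups[tuple(values[n] for n in non_combinable)].append(f)

    # Sort each group by spec ID and partition, bin-pack it alone, then union.
    tasks = []
    for key, group in groups.items():
        ordered = sorted(group, key=lambda f: (f.spec_id, f.partition))
        tasks.extend((key, packed) for packed in bin_pack(ordered, target_size))
    return common, tasks
```

Because each group is packed on its own, a task never mixes files with different values for the non-combinable columns, while files that differ only in the remaining columns can still share a task.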