aokolnychyi commented on PR #2276:
URL: https://github.com/apache/iceberg/pull/2276#issuecomment-1294031962

   Let's think through the algorithm given what we discussed. The main problem 
is that we don't know which spec IDs are affected by a scan until we plan files. 
I think the following would work.
   
   ```
   - Call planFiles, materialize the iterable of files to be scanned and store the files in a list.
   - While materializing the iterable, keep track of all seen spec IDs.
   - Build an intersection of all partition types for the specs covered by this scan (we know spec IDs from the previous step).
       - This intersection is a set of partition columns which we can guarantee the data is clustered by. That’s why they should be our keys in KeyGroupedPartitioning.
   - Determine which partition columns from the intersection of all partition types can’t be combined (using the source columns provided by the user).
   - Create a projection that would select values for non-combinable partition columns from a file partition tuple.
   - Build a map where keys are tuples with projected non-combinable partition columns and values are lists of files.
   - Sort each file list by spec ID and partition so that we combine splits for the same partition.
   - Call bin-packing code on each list of files independently.
   - Union the results.
   ```
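
   To make the steps above concrete, here is a rough Python sketch of the flow. It is only an illustration under stated assumptions, not Iceberg code: `FileTask`, `bin_pack`, and `plan_task_groups` are hypothetical names, each spec is modeled as a dict from partition field name to source column, partition values within a group are assumed to be mutually comparable, and a simple order-preserving packer stands in for the real bin-packing code. It also assumes a partition column is non-combinable when its source column is among the columns provided by the user.

   ```python
   from collections import defaultdict
   from dataclasses import dataclass


   @dataclass
   class FileTask:
       """Hypothetical stand-in for a planned file scan task."""
       path: str
       spec_id: int
       partition: dict  # partition field name -> value
       size: int


   def bin_pack(files, target_size):
       """Order-preserving packer; a placeholder for the real bin-packing code."""
       bins, current, current_size = [], [], 0
       for f in files:
           if current and current_size + f.size > target_size:
               bins.append(current)
               current, current_size = [], 0
           current.append(f)
           current_size += f.size
       if current:
           bins.append(current)
       return bins


   def plan_task_groups(planned_files, specs, user_source_columns, target_size):
       """specs: spec ID -> {partition field name: source column} (assumed shape)."""
       # Materialize the iterable and track the spec IDs seen along the way.
       files = list(planned_files)
       seen_spec_ids = {f.spec_id for f in files}
       if not files:
           return [], []

       # Intersect the partition fields of every spec covered by the scan.
       # These are the candidate keys for the reported partitioning.
       common = set.intersection(*(set(specs[i]) for i in seen_spec_ids))
       grouping_keys = [n for n in specs[min(seen_spec_ids)] if n in common]

       # Assumption: a field cannot be combined if the user listed its source column.
       first_spec = specs[min(seen_spec_ids)]
       key_fields = [n for n in grouping_keys if first_spec[n] in user_source_columns]

       # Project the non-combinable values out of each partition tuple and group.
       groups = defaultdict(list)
       for f in files:
           groups[tuple(f.partition[n] for n in key_fields)].append(f)

       # Sort each group by spec ID and partition, pack it on its own, then union.
       result = []
       for key, group in groups.items():
           group.sort(key=lambda f: (f.spec_id, sorted(f.partition.items())))
           for packed in bin_pack(group, target_size):
               result.append((key, packed))
       return grouping_keys, result
   ```

   Because each group is packed independently, no resulting task group mixes files with different values for the non-combinable columns, which is the property the reported keys rely on.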
   
   What do you think, @sunchao @rdblue?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

