ConeyLiu commented on PR #1427:
URL: https://github.com/apache/iceberg-python/pull/1427#issuecomment-2560708280

   > I like that idea. Could you elaborate on that?
   
   Yes, this is what I implemented in our internal repo:
   ```python
   import itertools
   from typing import Iterable, List
   
   
   def plan_scan_tasks(
       files: Iterable[FileScanTask], split_size: int, loop_back: int, open_file_cost: int
   ) -> List[CombinedFileScanTask]:
       """Plan balanced combined tasks for this scan by splitting large and combining small tasks.
   
       Returns:
           List of CombinedFileScanTasks
       """
   
       def split(task: FileScanTask) -> List[FileScanTask]:
           data_file = task.file
           if not data_file.file_format.is_splittable() or not data_file.split_offsets:
               return [task]
   
           split_offsets = data_file.split_offsets
           if not all(split_offsets[i] < split_offsets[i + 1] for i in range(len(split_offsets) - 1)):
               # Split offsets must be strictly ascending; otherwise keep the task unsplit.
               return [task]
   
           all_tasks = []
           for i in range(len(split_offsets) - 1):
               all_tasks.append(
                   FileScanTask(data_file, task.delete_files, split_offsets[i], split_offsets[i + 1] - split_offsets[i])
               )
   
           all_tasks.append(
               FileScanTask(data_file, task.delete_files, split_offsets[-1], data_file.file_size_in_bytes - split_offsets[-1])
           )
   
           return all_tasks
   
       def weight_func(task: FileScanTask) -> int:
           return max(task.size_in_bytes(), (1 + len(task.delete_files)) * open_file_cost)
   
       split_file_tasks = list(itertools.chain.from_iterable(map(split, files)))
       packing_iterator = PackingIterator(split_file_tasks, split_size, loop_back, weight_func, False)
   
       return list(map(_merge_split_task, packing_iterator))
   
   ```
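
To make the split and weighting steps easier to try in isolation, here is a minimal self-contained sketch. It assumes a hypothetical `Task` dataclass standing in for `FileScanTask`, and the `pack` helper is a simplified sequential packer written only for illustration; it is not `PackingIterator` and has no lookback.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for FileScanTask: a byte range of one data file."""

    path: str
    start: int
    length: int
    num_delete_files: int = 0


def split_by_offsets(path: str, file_size: int, split_offsets: Sequence[int]) -> List[Task]:
    """Return one task per offset range, or a single whole-file task if offsets are unusable."""
    strictly_ascending = all(a < b for a, b in zip(split_offsets, split_offsets[1:]))
    if not split_offsets or not strictly_ascending:
        return [Task(path, 0, file_size)]
    # Each range ends where the next offset begins; the last range ends at end of file.
    ends = list(split_offsets[1:]) + [file_size]
    return [Task(path, start, end - start) for start, end in zip(split_offsets, ends)]


def weight(task: Task, open_file_cost: int) -> int:
    """Charge at least open_file_cost per opened file (the data file plus its delete files)."""
    return max(task.length, (1 + task.num_delete_files) * open_file_cost)


def pack(tasks: List[Task], target: int, open_file_cost: int) -> List[List[Task]]:
    """Simplified packing: close the current bin when the next task would overflow it."""
    bins: List[List[Task]] = []
    current: List[Task] = []
    current_weight = 0
    for task in tasks:
        w = weight(task, open_file_cost)
        if current and current_weight + w > target:
            bins.append(current)
            current, current_weight = [], 0
        current.append(task)
        current_weight += w
    if current:
        bins.append(current)
    return bins


# One file with three split offsets, plus a small file with no offsets.
tasks = split_by_offsets("a.parquet", 100, [4, 40, 70]) + split_by_offsets("b.parquet", 10, [])
bins = pack(tasks, target=64, open_file_cost=16)
```

The small file weighs 16 rather than 10 here because `open_file_cost` acts as a floor, which keeps a bin from filling up with a very large number of tiny files.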

