ConeyLiu commented on PR #1427: URL: https://github.com/apache/iceberg-python/pull/1427#issuecomment-2560708280
> I like that idea. Could you elaborate on that?

Yes, the following is what I implemented in the internal repo:

```python
import itertools
from typing import Iterable, List

# FileScanTask, CombinedFileScanTask, PackingIterator and _merge_split_task
# are defined elsewhere in the codebase and are not shown here.


def plan_scan_tasks(
    files: Iterable[FileScanTask], split_size: int, loop_back: int, open_file_cost: int
) -> List[CombinedFileScanTask]:
    """Plan balanced combined tasks for this scan by splitting large and combining small tasks.

    Returns:
        List of CombinedFileScanTasks
    """

    def split(task: FileScanTask) -> List[FileScanTask]:
        data_file = task.file
        if not data_file.file_format.is_splittable() or not data_file.split_offsets:
            return [task]

        split_offsets = data_file.split_offsets
        if not all(split_offsets[i] < split_offsets[i + 1] for i in range(len(split_offsets) - 1)):
            # split offsets must be strictly ascending; otherwise keep the file as one task
            return [task]

        all_tasks = []
        # One task per pair of adjacent offsets.
        for i in range(len(split_offsets) - 1):
            all_tasks.append(
                FileScanTask(data_file, task.delete_files, split_offsets[i], split_offsets[i + 1] - split_offsets[i])
            )

        # The last task runs from the final offset to the end of the file.
        all_tasks.append(
            FileScanTask(data_file, task.delete_files, split_offsets[-1], data_file.file_size_in_bytes - split_offsets[-1])
        )

        return all_tasks

    def weight_func(task: FileScanTask) -> int:
        # A task never weighs less than the cost of opening its data file and delete files.
        return max(task.size_in_bytes(), (1 + len(task.delete_files)) * open_file_cost)

    split_file_tasks = list(itertools.chain.from_iterable(map(split, files)))
    packing_iterator = PackingIterator(split_file_tasks, split_size, loop_back, weight_func, False)
    return list(map(_merge_split_task, packing_iterator))
```
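To make the packing step easier to follow, here is a minimal sketch of how weighted first-fit packing with a lookback window might behave. It is not pyiceberg's `PackingIterator`: the function `pack_by_weight`, the sample sizes, and the simplified weight `max(size, open_file_cost)` (which ignores delete files) are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def pack_by_weight(
    items: Iterable[T],
    target_weight: int,
    lookback: int,
    weight_func: Callable[[T], int],
) -> Iterator[List[T]]:
    """Hypothetical first-fit packer that keeps at most `lookback` bins open."""
    open_bins: List[List[T]] = []
    bin_weights: List[int] = []
    for item in items:
        weight = weight_func(item)
        # Put the item into the first open bin that still has room for it.
        for i, current in enumerate(bin_weights):
            if current + weight <= target_weight:
                open_bins[i].append(item)
                bin_weights[i] += weight
                break
        else:
            # No open bin fits, so start a new one.
            open_bins.append([item])
            bin_weights.append(weight)
            # If too many bins are open, emit the oldest one.
            if len(open_bins) > lookback:
                bin_weights.pop(0)
                yield open_bins.pop(0)
    # Flush whatever is still open once the input is exhausted.
    yield from open_bins


# Hypothetical split sizes in arbitrary units; small splits are weighted up
# to the cost of opening a file, as in weight_func above.
open_file_cost = 4
sizes = [1, 2, 10, 3, 9]
groups = list(
    pack_by_weight(
        sizes,
        target_weight=12,
        lookback=2,
        weight_func=lambda size: max(size, open_file_cost),
    )
)
# groups == [[1, 2, 3], [10], [9]]
```

The three small splits end up in one group because each counts as `open_file_cost`, while the two large splits each get a group of their own.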