Re: [PR] [WIP] Spark: Make Spark readers function asynchronously for many small files [issue #15287] [iceberg]

via GitHub Tue, 03 Mar 2026 17:14:04 -0800


RussellSpitzer commented on PR #15341:
URL: https://github.com/apache/iceberg/pull/15341#issuecomment-3994612354


   This is really great initial testing. I've only taken a quick look, but I 
would recommend trying ParallelIterable for parallelism rather than 
the current implementation. You could use it in conjunction with 
ThreadPools.getWorkerPool as a default (which auto-scales with the reported CPU 
stats), although I think leaving it configurable is also interesting and good 
for testing.
   
   ```java
   Iterable<Iterable<T>> taskIterables = tasks.stream()
       .map(task -> (Iterable<T>) () -> open(task))
       .collect(toList());
   ParallelIterable<T> parallel =
       new ParallelIterable<>(taskIterables, ThreadPools.getWorkerPool());
   ```
   
   Could you try that out with your same benchmark? I know you are using only 
10 files, but I'd be really interested in the scale of improvement all the way 
up to 10 threads (one per file). My hunch is that we can get at least an order 
of magnitude.
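
For anyone who wants to get a feel for the thread-count sweep before wiring up ParallelIterable, here is a rough plain-JDK sketch of that kind of experiment. It does not use Iceberg at all: `ParallelReadSketch`, `openFile`, and `readAll` are hypothetical names, the per-file latency is simulated with a sleep, and results are concatenated in file order purely so the output is easy to check. None of this is meant to mirror how ParallelIterable behaves internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Plain-JDK sketch; it does not use Iceberg's ParallelIterable or ThreadPools.
class ParallelReadSketch {

  // Hypothetical stand-in for open(task): sleeps to mimic per-file latency,
  // then returns that file's rows.
  static List<Integer> openFile(int fileId, int rowsPerFile, long latencyMs)
      throws InterruptedException {
    Thread.sleep(latencyMs);
    List<Integer> rows = new ArrayList<>();
    for (int i = 0; i < rowsPerFile; i++) {
      rows.add(fileId * rowsPerFile + i);
    }
    return rows;
  }

  // Reads every file on a fixed pool of `threads` workers and
  // concatenates the per-file results in file order.
  static List<Integer> readAll(int files, int rowsPerFile, long latencyMs, int threads)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<List<Integer>>> futures = new ArrayList<>();
      for (int f = 0; f < files; f++) {
        final int fileId = f;
        futures.add(pool.submit(() -> openFile(fileId, rowsPerFile, latencyMs)));
      }
      List<Integer> all = new ArrayList<>();
      for (Future<List<Integer>> future : futures) {
        all.addAll(future.get());
      }
      return all;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    int files = 10;
    // Sweep the pool size from sequential up to one thread per file.
    for (int threads : new int[] {1, 2, 5, 10}) {
      long start = System.nanoTime();
      List<Integer> rows = readAll(files, 5, 50, threads);
      long elapsedMs = (System.nanoTime() - start) / 1_000_000;
      System.out.printf("threads=%d rows=%d elapsed=%dms%n", threads, rows.size(), elapsedMs);
    }
  }
}
```

With a latency-bound stand-in like this, elapsed time should shrink roughly in proportion to the pool size until it reaches one thread per file. Real readers also spend CPU on decoding, so the actual benchmark numbers may look quite different.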
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

