RussellSpitzer commented on PR #15341:
URL: https://github.com/apache/iceberg/pull/15341#issuecomment-3994612354
This is really great initial testing. I've only taken a quick look, but I
would recommend trying ParallelIterable for parallelism rather than
the current implementation. You could use it in conjunction with
ThreadPools.getWorkerPool as a default (which scales with the number of CPUs
reported), although I think leaving it configurable is also interesting and
good for testing.
```java
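// One lazily opened Iterable per task; open(task) is assumed to return an Iterator<T>.
Iterable<Iterable<T>> taskIterables = tasks.stream()
    .map(task -> (Iterable<T>) () -> open(task))
    .collect(toList());
// The shared worker pool drives the per-task iterables concurrently.
ParallelIterable<T> parallel =
    new ParallelIterable<>(taskIterables, ThreadPools.getWorkerPool());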
```
Could you try that out with the same benchmark? I know you are using only
10 files, but I'd be really interested in the scale of improvement all the way
up to 10 threads (one per file). My hunch is that we can get at least an order
of magnitude.
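
For the sweep itself, something like the following could be a starting point. It is only a rough sketch: it uses plain java.util.concurrent as a stand-in for ParallelIterable, in-memory lists stand in for the 10 files, and the class and method names (ThreadSweepSketch, readAll) are made up for illustration and do not come from the PR. The timings it prints are not meaningful for real I/O.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for the benchmark; none of these names come from the PR.
class ThreadSweepSketch {

  // Reads every "file" on a fixed pool of the given size and returns all rows
  // in file order (results are collected in submission order).
  static List<Integer> readAll(List<List<Integer>> files, int threads)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<List<Integer>>> futures = new ArrayList<>();
      for (List<Integer> file : files) {
        // One task per file; copying the list simulates reading it.
        Callable<List<Integer>> task = () -> new ArrayList<>(file);
        futures.add(pool.submit(task));
      }
      List<Integer> rows = new ArrayList<>();
      for (Future<List<Integer>> future : futures) {
        rows.addAll(future.get());
      }
      return rows;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    // Ten simulated files of 1000 rows each.
    List<List<Integer>> files = new ArrayList<>();
    for (int f = 0; f < 10; f++) {
      List<Integer> rows = new ArrayList<>();
      for (int r = 0; r < 1000; r++) {
        rows.add(f * 1000 + r);
      }
      files.add(rows);
    }
    // Sweep the pool size from 1 thread up to one thread per file.
    for (int threads = 1; threads <= 10; threads++) {
      long start = System.nanoTime();
      int count = readAll(files, threads).size();
      long micros = (System.nanoTime() - start) / 1_000;
      System.out.println(threads + " threads: " + count + " rows in " + micros + " us");
    }
  }
}
```

In the real benchmark the pool size would be the configurable knob, with the shared worker pool as the default.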