Dandandan commented on issue #1617:
URL: 
https://github.com/apache/datafusion-ballista/issues/1617#issuecomment-4351554380

   Perhaps you could identify the slow parts, to improve performance with a 
large number of small files?
   
   Some options:
   
   * Planning performance: perhaps there are hot paths in planning (which 
your log points to).
   
   * Scheduler -> executor communication
   
   As far as I remember, Ballista doesn't prune plans yet (e.g. to only include 
the file paths for the partitions a task actually uses), so it might serialize 
and send plans with the full 200K files over and over to all workers.
   
   * Combine small files in scan: when possible, combine multiple files in one 
partition to reduce the per-partition planning/scheduling/communication 
overhead.
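
   A rough sketch of the last two ideas, in Python for brevity. This is not 
Ballista's or DataFusion's API: `FileEntry`, `group_files` and 
`files_for_task` are hypothetical names, and the greedy size-based packing is 
just one possible strategy. It packs small files into groups of roughly 
`target_bytes`, then hands a task only the paths of its own group instead of 
the full file list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    # Hypothetical stand-in for a file listed by the scan.
    path: str
    size_bytes: int


def group_files(files, target_bytes):
    """Greedily pack files into groups of at most roughly target_bytes."""
    groups = []
    current = []
    current_size = 0
    # Largest first, so a big file does not land on an almost-full group.
    for f in sorted(files, key=lambda f: f.size_bytes, reverse=True):
        if current and current_size + f.size_bytes > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += f.size_bytes
    if current:
        groups.append(current)
    return groups


def files_for_task(groups, partition):
    """Return only the paths one task needs, not the whole file list."""
    return [f.path for f in groups[partition]]


# Ten small files of 10 bytes each, packed into groups of up to 30 bytes.
files = [FileEntry(f"f{i}.parquet", 10) for i in range(10)]
groups = group_files(files, target_bytes=30)
last_task_paths = files_for_task(groups, len(groups) - 1)
```

   With one partition per group rather than per file, the number of tasks 
drops, and each serialized task would carry a handful of paths rather than 
all of them.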
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

