Dandandan commented on issue #1617: URL: https://github.com/apache/datafusion-ballista/issues/1617#issuecomment-4351554380
Perhaps you can find out which parts are slow, to make performance with a high number of small files better? Some options:

* Planning performance: perhaps there are hot paths in planning (which your log shows).
* Scheduler -> executor communication: as far as I remember, Ballista doesn't prune plans yet (e.g. to include only the file paths for the partitions a task actually uses), so it might serialize and send plans with the full 200K files over and over to all workers.
* Combine small files in the scan: when possible, combine multiple files into one partition to reduce the per-partition planning/scheduling/communication overhead.
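
To make the "combine small files" option more concrete, here is a rough sketch of what I mean. It is not Ballista or DataFusion code: `FileMeta`, `group_small_files` and `target_bytes` are made-up names for illustration, and it assumes file sizes are already known from the listing. It packs files into groups with a simple first-fit-decreasing pass, so that each group (one scan partition) holds at most roughly `target_bytes`:

```rust
/// Hypothetical file descriptor for illustration; not a DataFusion/Ballista type.
#[derive(Debug, Clone, PartialEq)]
pub struct FileMeta {
    pub path: String,
    pub size_bytes: u64,
}

/// Packs files into groups of at most `target_bytes` each, using
/// first-fit-decreasing. A file larger than `target_bytes` ends up
/// alone in its own group.
pub fn group_small_files(mut files: Vec<FileMeta>, target_bytes: u64) -> Vec<Vec<FileMeta>> {
    // Place the largest files first so the small ones fill the remaining gaps.
    files.sort_by(|a, b| b.size_bytes.cmp(&a.size_bytes));

    // Each entry tracks (bytes used so far, files in the group).
    let mut groups: Vec<(u64, Vec<FileMeta>)> = Vec::new();

    for file in files {
        // Find the first existing group that still has room for this file.
        let slot = groups
            .iter()
            .position(|(used, _)| *used + file.size_bytes <= target_bytes);

        match slot {
            Some(i) => {
                groups[i].0 += file.size_bytes;
                groups[i].1.push(file);
            }
            // No group has room: open a new one.
            None => groups.push((file.size_bytes, vec![file])),
        }
    }

    groups.into_iter().map(|(_, members)| members).collect()
}
```

With 200K small files and a target of, say, 128 MiB per group, the number of partitions (and so the number of tasks to plan, schedule and ship to executors) would be determined by total data size rather than by file count. The scan over `groups` is quadratic in the worst case, so a real implementation would probably want something smarter for very large listings.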
