[I] Small files problem. [datafusion-ballista]

via GitHub Tue, 28 Apr 2026 20:38:25 -0700


xunxunmimi5577 opened a new issue, #1617:
URL: https://github.com/apache/datafusion-ballista/issues/1617


   I observed that an excessive number of small files significantly increases 
physical plan generation time and memory consumption.
   
   Here are my test results:
   
   I tested with a 100GB TPC-DS dataset in which the store_returns table is 4GB 
and split across more than 200,000 files. I loaded the data locally, registered 
it as a table using the register_parquet API, and executed:
   
   ```
   SELECT count(*) FROM store_returns;
   ```
   The query took 81.39 seconds, with the log showing:
   ```
   Planned job QiZU0fh in 81.057936897s
   ```
   After merging the files in the store_returns directory down to 4,032 files, 
the same query took only 16.84 seconds, with the log showing:
   ```
   Planned job 7534D1z in 16.758905506s
   ```
   In both runs, planning accounts for almost all of the elapsed time, so the 
performance difference comes primarily from the physical plan generation phase.
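
As a quick check of the numbers above, here is a minimal Python sketch that computes how much of each run was spent in planning. It uses only the figures quoted in this report. The file count of 200,000 for the first run is an assumption (the report says "over 200,000"), so the per-file figure for that run is an upper bound rather than a measurement.

```python
# Figures copied from the report above; nothing here is newly measured.
runs = {
    # "files" is a lower bound here: the table had *over* 200,000 files.
    "unmerged (200,000+ files)": {
        "files": 200_000, "total_s": 81.39, "planned_s": 81.057936897,
    },
    "merged (4,032 files)": {
        "files": 4_032, "total_s": 16.84, "planned_s": 16.758905506,
    },
}


def planning_share(run):
    """Fraction of the query's wall-clock time reported by 'Planned job'."""
    return run["planned_s"] / run["total_s"]


def planning_ms_per_file(run):
    """Average planning time per input file, in milliseconds."""
    return run["planned_s"] * 1000 / run["files"]


for name, run in runs.items():
    print(
        f"{name}: planning share {planning_share(run):.1%}, "
        f"{planning_ms_per_file(run):.2f} ms of planning per file"
    )
```

Planning takes more than 99% of the elapsed time in both runs. The average planning cost per file is higher after merging, which suggests that planning time does not scale purely linearly with file count in these two runs.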
   
   Additionally, when executing TPC-DS Q1 with too many small files, I observed 
the scheduler memory continuously increasing, eventually causing the process to 
crash. After merging the small files, the query executed successfully in just 
over 70 seconds.
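
For anyone trying to reproduce the merged layout: this report does not say how the files were merged, so the following is only an illustrative sketch of one way to choose which small files go into each merged output. `plan_merge_batches`, the file names, and the target size are all hypothetical; the sketch groups files greedily by size and does not read or write Parquet.

```python
def plan_merge_batches(file_sizes, target_bytes):
    """Greedily group consecutive files so each batch stays near target_bytes.

    file_sizes: iterable of (name, size_in_bytes) pairs.
    Returns a list of batches, each a list of file names to merge together.
    """
    batches, current, current_bytes = [], [], 0
    for name, size in file_sizes:
        # Close the current batch if adding this file would exceed the target.
        if current and current_bytes + size > target_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


# Hypothetical input: ten 20 KB files, merged toward ~64 KB outputs.
files = [(f"part-{i:05d}.parquet", 20_000) for i in range(10)]
batches = plan_merge_batches(files, target_bytes=64_000)
print([len(batch) for batch in batches])
```

A file larger than the target simply ends up in a batch of its own, so nothing is dropped.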


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

