Re: [I] DataFusion 54: distributed tasks queued after the first wave run ~8x slower (parquet decode) [datafusion-ballista]

via GitHub Sat, 27 Jun 2026 09:50:04 -0700


andygrove commented on issue #1907:
URL: 
https://github.com/apache/datafusion-ballista/issues/1907#issuecomment-4819329415


   **Update — root cause is data amplification, not slow decode.**
   
   Measuring the partial-aggregate input rows (`reduction_factor` metric, which 
is reliable, unlike the merged scan `output_rows`):
   
   | q1 partitions | slots | input rows to partial aggregate |
   |---:|---:|---:|
   | 8 | 8 (all wave 1) | **59.1 M** (≈ table, correct) |
   | 9 | 8 (8 wave-1 + 1 wave-2) | **118.3 M** |
   
   Going from 8→9 partitions added **one** task that runs in the second wave, 
and total processed rows jumped by ~59 M — i.e. that **single second-wave task 
scanned the entire table (~59 M rows) instead of its ~1/9 file-range 
partition.**
   
   So: **tasks that run after the first wave scan the whole dataset instead of 
their assigned file range.** Key properties:
   - It is **wave-/slot-dependent, not partition-index-dependent**: with 
`-c16`, all 16 partitions of `q1 --partitions 16` run in one wave and the query 
is fast (0.3 s, ~59 M rows total); with `-c8` the same plan runs in two waves 
and the second-wave tasks each re-scan the whole table (4.2 s). Same serialized 
plan, same file groups — only the number of concurrent slots (waves) differs.
   - It **resets per query** (the first wave is always correct/fast), so it is 
tied to per-stage execution state, not process-global accumulation.
   - CPU profile of a single slow task is dominated by aggregate accumulate + 
group-by-on-string-columns + snappy decode — consistent with simply processing 
~8× the rows.
   
   This explains the earlier symptoms (≈N× scan inefficiency, disk-full on join 
queries from shuffling N× data). It looks like later-wave tasks fail to 
restrict the parquet scan to their `file_group`'s byte range and read all row 
groups — possibly related to the DF54 parquet footer/metadata handling (we hit 
metadata-cache changes upgrading Comet). Repro: `tpch benchmark ballista 
--query 1 --partitions 9 --iterations 1 -c 
datafusion.optimizer.prefer_hash_join=false` on an 8-slot executor.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] DataFusion 54: distributed tasks queued after the first wave run ~8x slower (parquet decode) [datafusion-ballista]

Reply via email to