andygrove opened a new pull request, #3947:
URL: https://github.com/apache/datafusion-comet/pull/3947

   ## Which issue does this PR close?
   
   Closes #3817.
   
   ## Rationale for this change
   
   When Comet uses `native_datafusion` scan mode, DataFusion's built-in 
`prune_by_range` uses a different algorithm than Spark/parquet-mr to assign row 
groups to file splits:
   
   - **Spark/parquet-mr/parquet-rs**: Uses the **midpoint** of a row group 
(`start_offset + compressed_size / 2`) to determine ownership. A row group 
belongs to a split if its midpoint falls within `[split_start, split_end)`.
   - **DataFusion**: Uses the **start offset** 
(`column(0).dictionary_page_offset` or `data_page_offset`). A row group belongs 
to a split if its start offset falls within the range.
   
   When these algorithms disagree (e.g., a row group starts before a split 
boundary but its midpoint falls after it), some tasks end up reading too many 
row groups while others read none. This wastes cluster parallelism: in the 
reported case, 600 out of 1800 tasks were idle.
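   The difference can be sketched as follows. This is an illustrative example, not the PR's actual code; the function and parameter names are hypothetical:

   ```rust
   /// Midpoint of a row group, as computed by Spark/parquet-mr/parquet-rs.
   /// (Illustrative sketch; names are hypothetical.)
   fn midpoint(start_offset: i64, compressed_size: i64) -> i64 {
       start_offset + compressed_size / 2
   }

   /// Spark-style ownership: a row group belongs to a split if its
   /// midpoint falls within the half-open range [split_start, split_end).
   fn owns_row_group(split_start: i64, split_end: i64, rg_start: i64, rg_size: i64) -> bool {
       let mid = midpoint(rg_start, rg_size);
       split_start <= mid && mid < split_end
   }
   ```

   For example, a row group starting at offset 90 with compressed size 40 has midpoint 110: start-offset assignment puts it in a split covering `[0, 100)`, while midpoint assignment puts it in `[100, 200)` — hence two engines reading the same file can disagree about which task owns it.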
   
   ## What changes are included in this PR?
   
   Two new functions in `native/core/src/parquet/parquet_exec.rs`:
   
   - `get_row_group_midpoint(rg)` — Computes the midpoint offset of a row group 
using the same algorithm as Spark/parquet-mr and parquet-rs.
   - `apply_midpoint_row_group_pruning(file_groups, store)` — For each 
`PartitionedFile` with a byte range, reads the Parquet footer, computes which 
row groups have their midpoint within the range, creates a `ParquetAccessPlan` 
with those row groups, and removes the byte range. This causes DataFusion to 
use the explicit access plan and skip its built-in `prune_by_range`.
   
   The function is called from `init_datasource_exec` and returns early if no 
files have byte ranges, so there is no overhead for non-split files.
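   The core selection logic can be sketched like this (an illustrative reduction, assuming each row group's start offset and compressed size have already been read from the Parquet footer; the function name is hypothetical):

   ```rust
   /// Given each row group's (start_offset, compressed_size), return the
   /// indices of the row groups whose midpoint falls within the file split's
   /// half-open byte range [range_start, range_end). These indices would then
   /// seed a ParquetAccessPlan. (Illustrative sketch, not the PR's code.)
   fn select_row_groups(
       row_groups: &[(i64, i64)],
       range_start: i64,
       range_end: i64,
   ) -> Vec<usize> {
       row_groups
           .iter()
           .enumerate()
           .filter(|(_, &(start, size))| {
               let mid = start + size / 2;
               range_start <= mid && mid < range_end
           })
           .map(|(i, _)| i)
           .collect()
   }
   ```

   Because every row group's midpoint lands in exactly one split's range, each row group is assigned to exactly one task, matching Spark's behavior.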
   
   Note: this is a Comet-side workaround. The upstream fix would be to change 
DataFusion's `prune_by_range` to use midpoint-based assignment.
   
   ## How are these changes tested?
   
   This needs testing with splittable Parquet files on a cluster (e.g. HDFS) 
where files are large enough to be split across multiple tasks. The issue could 
not be reproduced locally with the local filesystem. Existing test suites 
verify there is no regression for the common case where files are not split.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

