mbutrovich commented on issue #3510: URL: https://github.com/apache/datafusion-comet/issues/3510#issuecomment-4313762170
More thoughts as I chip away at AQE and non-AQE DPP:

# CometNativeScanExec Design Debt

## makeCopy loses expression transformations

`CometNativeScanExec.apply` uses `scanExec.makeCopy(newArgs)` to create a `FileSourceScanExec` with `CometParquetFileFormat` swapped in. `makeCopy` rebuilds the node from the original case class fields via reflection, which loses any expression transformations that Spark rules applied to the plan tree node (e.g., `ReusedSubqueryExec` inserted by `ReuseAdaptiveSubquery`). `copy()` can't be used instead because some Spark distributions modify the `FileSourceScanExec` constructor, breaking `copy()` at runtime ([#190](https://github.com/apache/datafusion-comet/issues/190)). `CometReuseSubquery` exists as a workaround: it re-applies subquery deduplication after `makeCopy` has lost it.

## CometParquetFileFormat wrapper

The only reason `CometParquetFileFormat` exists is to make Spark's file scan infrastructure create `CometParquetPartitionReaderFactory` instead of the vanilla `ParquetPartitionReaderFactory`. If native scans fully bypassed Spark's file reading path (using DataFusion's ParquetExec directly for all I/O), both this wrapper and the `makeCopy` call would be unnecessary.

## CometScanExec wrapping

`CometNativeScanExec` holds `@transient scan: CometScanExec` for lazy access to file partitions via `scan.getFilePartitions()`. This handles bucketing, DPP partition selection, and other V1 scan infrastructure that Spark manages. Decoupling from `CometScanExec` would let `CometScanRule` create `CometNativeScanExec` directly, without the intermediate wrapper.
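To make the `makeCopy` failure mode concrete, here is a minimal self-contained toy (a hypothetical `ScanNode`, not Spark's real `TreeNode`/`FileSourceScanExec` classes): it mimics the reflection-based rebuild and shows how rebuilding from the original node's constructor fields silently discards a rewrite that a rule applied after construction.

```scala
// Hypothetical toy, not Spark code: illustrates why a reflection-based
// makeCopy drops rewrites applied to a node after it was constructed.
case class ScanNode(format: String, filters: Seq[String]) {
  // Mimics TreeNode.makeCopy: locate the primary constructor via
  // reflection and instantiate it with caller-supplied args.
  def makeCopy(newArgs: Array[AnyRef]): ScanNode = {
    val ctor = getClass.getConstructors.head
    ctor.newInstance(newArgs: _*).asInstanceOf[ScanNode]
  }
}

object MakeCopyDemo {
  def main(args: Array[String]): Unit = {
    val original = ScanNode("parquet", Seq("subquery#1"))

    // A rule (think ReuseAdaptiveSubquery) rewrites an expression.
    val transformed = original.copy(filters = Seq("ReusedSubquery(subquery#1)"))

    // Swapping the format by rebuilding from the ORIGINAL node's fields
    // discards the rewrite -- the analogue of what happens when
    // CometNativeScanExec.apply rebuilds with the file format swapped in.
    val swapped = original.makeCopy(Array("comet-parquet", original.filters))
    assert(swapped.format == "comet-parquet")
    assert(swapped.filters == Seq("subquery#1")) // ReusedSubquery is gone

    // Rebuilding from the TRANSFORMED node's fields would keep it.
    val kept = transformed.makeCopy(Array("comet-parquet", transformed.filters))
    assert(kept.filters == Seq("ReusedSubquery(subquery#1)"))
  }
}
```

The toy also shows why the fix is not simply "call `copy()` instead": `copy()` is compiled against one specific constructor signature, which is exactly what breaks on Spark distributions that alter the `FileSourceScanExec` constructor.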
## What fixing these enables

Eliminating these layers would:

- Let `CometScanRule` preserve expression transformations from Spark's rules, potentially removing the need for `CometReuseSubquery`
- Simplify the scan creation path (no `makeCopy`, no `CometParquetFileFormat`, no `CometScanExec` intermediary)
- Allow collapsing `CometScanRule` and `CometExecRule`, as discussed in [#3510](https://github.com/apache/datafusion-comet/issues/3510)
- Support stronger Spark 4.0 integration (e.g., scalar subquery pushdown without execution-time protobuf patching)
