mbutrovich commented on issue #3510: URL: https://github.com/apache/datafusion-comet/issues/3510#issuecomment-4313762170
More thoughts as I chip away at AQE and non-AQE DPP:

# CometNativeScanExec Design Debt

## makeCopy loses expression transformations

`CometNativeScanExec.apply` uses `scanExec.makeCopy(newArgs)` to create a `FileSourceScanExec` with `CometParquetFileFormat` swapped in. `makeCopy` rebuilds the node from the original case class fields via reflection, which loses any expression transformations that Spark rules applied to the plan tree node (e.g., `ReusedSubqueryExec` inserted by `ReuseAdaptiveSubquery`). `copy()` can't be used instead because some Spark distributions modify the `FileSourceScanExec` constructor, breaking `copy()` at runtime ([#190](https://github.com/apache/datafusion-comet/issues/190)). `CometReuseSubquery` exists as a workaround: it re-applies subquery deduplication after `makeCopy` has lost it.

## CometParquetFileFormat wrapper

The only reason `CometParquetFileFormat` exists is to make Spark's file scan infrastructure create `CometParquetPartitionReaderFactory` instead of the vanilla `ParquetPartitionReaderFactory`. If native scans fully bypassed Spark's file reading path (using DataFusion's ParquetExec directly for all I/O), both this wrapper and the `makeCopy` call would be unnecessary.

## CometScanExec wrapping

`CometNativeScanExec` holds `@transient scan: CometScanExec` for lazy access to file partitions via `scan.getFilePartitions()`. This handles bucketing, DPP partition selection, and other V1 scan infrastructure that Spark manages. Decoupling from `CometScanExec` would let `CometScanRule` create `CometNativeScanExec` directly, without the intermediate wrapper.
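To make the `makeCopy` failure mode concrete, here is a minimal self-contained toy (a hypothetical `ScanNode`, not Spark's real `TreeNode`/`FileSourceScanExec` classes): it mimics the reflection-based rebuild and shows how rebuilding from the original node's constructor fields silently discards a rewrite that a rule applied after construction.

```scala
// Hypothetical toy, not Spark code: illustrates why a reflection-based
// makeCopy drops rewrites applied to a node after it was constructed.
case class ScanNode(format: String, filters: Seq[String]) {
  // Mimics TreeNode.makeCopy: locate the primary constructor via
  // reflection and instantiate it with caller-supplied args.
  def makeCopy(newArgs: Array[AnyRef]): ScanNode = {
    val ctor = getClass.getConstructors.head
    ctor.newInstance(newArgs: _*).asInstanceOf[ScanNode]
  }
}

object MakeCopyDemo {
  def main(args: Array[String]): Unit = {
    val original = ScanNode("parquet", Seq("subquery#1"))

    // A rule (think ReuseAdaptiveSubquery) rewrites an expression.
    val transformed = original.copy(filters = Seq("ReusedSubquery(subquery#1)"))

    // Swapping the format by rebuilding from the ORIGINAL node's fields
    // discards the rewrite -- the analogue of what happens when
    // CometNativeScanExec.apply rebuilds with the file format swapped in.
    val swapped = original.makeCopy(Array("comet-parquet", original.filters))
    assert(swapped.format == "comet-parquet")
    assert(swapped.filters == Seq("subquery#1")) // ReusedSubquery is gone

    // Rebuilding from the TRANSFORMED node's fields would keep it.
    val kept = transformed.makeCopy(Array("comet-parquet", transformed.filters))
    assert(kept.filters == Seq("ReusedSubquery(subquery#1)"))
  }
}
```

The toy also shows why the fix is not simply "call `copy()` instead": `copy()` is compiled against one specific constructor signature, which is exactly what breaks on Spark distributions that alter the `FileSourceScanExec` constructor.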
## What fixing these enables

Eliminating these layers would:

- Let `CometScanRule` preserve expression transformations from Spark's rules, potentially removing the need for `CometReuseSubquery`
- Simplify the scan creation path (no `makeCopy`, no `CometParquetFileFormat`, no `CometScanExec` intermediary)
- Allow collapsing `CometScanRule` and `CometExecRule`, as discussed in [#3510](https://github.com/apache/datafusion-comet/issues/3510)
- Support stronger Spark 4.0 integration (e.g., scalar subquery pushdown without execution-time protobuf patching)
