schenksj commented on PR #3932: URL: https://github.com/apache/datafusion-comet/pull/3932#issuecomment-4357393646
### Triage update Two more commits since the [last status post](https://github.com/apache/datafusion-comet/pull/3932#issuecomment-4276098364): - `da19481a` — `fix(delta-regression): stop overriding dvFileNamePrefix in Delta test harness`. Our diff was setting `spark.databricks.delta.testOnly.dvFileNamePrefix=""` next to the data-file prefix override. The data-file override is correct (sidesteps `%`-char round-trip in our path parsing), but the DV override broke any test that reads a pre-existing DV fixture: Delta's reader reconstructs the DV path as `<prefix>deletion_vector_<uuid>.bin` and the on-disk fixture has the full prefix baked into the filename, so with prefix="" the lookup misses. Removing just the DV override clears `CheckpointsSuite "checkpoint with DVs"` (1 → 0) and the entire `DeletionVectorsSuite` (~30 → 0; both `Check no resource leak when DV files are missing` and the rest pass). - Validated: `CheckpointsSuite` 33/33, `DeletionVectorsSuite` 29/29. - `bb35abf0` — `feat(delta): file-splitting via byte-range tasks in CometDeltaNativeScan`. Adds Spark-equivalent file splitting (`FILES_MAX_PARTITION_BYTES` / `files.openCostInBytes` / `files.minPartitionNum`) to the native Delta scan path: new optional `byte_range_start` / `byte_range_end` on `DeltaScanTask`, Scala-side computation of `maxSplitBytes` mirroring `FilePartition.maxSplitBytes`, native side honours the range via `PartitionedFile::new_with_range`. DV row indexes are duplicated across chunks (correct since they're absolute file offsets and the scan filters on absolute index). - Doesn't fix the 15 `DeletionVectorsWithPredicatePushdownSuite` split-count assertions on its own — those scans bypass `CometDeltaNativeScan.convert` via the existing `PreparedDeltaFileIndex` DV fallback. Debug logs in both `splitTasks` and `CometScanExec.createFilePartitionsForNonBucketedScan` confirmed that neither runs for the failing tests, so the splitting fix is latent infrastructure for now (it'll start helping once the DV-fallback gate is loosened or replaced by a DV-aware kernel scan). - `DeletionVectorsSuite` regressed from a sanity run: still 29/29 with the splitting commit in place. ### Net regression delta vs. the original 139 - Original full run (before any fix in this stretch): **139** failed. - After the OptimizeGeneratedColumnSuite test-helper fix (`25085889`): 175 → 139. - After `da19481a` (DV-prefix revert): 139 → ~**94** by name-cluster math (full DeletionVectorsSuite + checkpoint-with-DVs cleared, plus 6 `Scan with predicates - no deletes` PredicatePushdown variants). - After `bb35abf0` (split infra): no new pass yet — kernel-path-only impact. A fresh full regression run is the right way to confirm the 94 number and pick up any cross-cluster knock-ons. Holding off on kicking that until the next batch of cluster work lands so we get one re-baseline rather than several. ### Remaining clusters Same shape as the prior triage table, minus what's now cleared: | Cluster | Approx count | Likely root cause | | --- | --- | --- | | DV-bypass file-splitting (PredicatePushdown `partitions.size === 2`) | 15 | DV-fallback path produces 1 partition for a 4MB file with `FILES_MAX_PARTITION_BYTES=2MB`. Investigation needed: scan goes through neither `CometDeltaNativeScan.convert` nor `CometScanExec.createFilePartitionsForNonBucketedScan` — likely a `CometDeltaNativeScanExec` instance constructed via a different rule path, or AQE/cache reuse from the SHALLOW CLONE. | | AQE `setLogicalLinkForNewQueryStage` | ~30 | Comet replaces a scan in joins/windows under AQE; the new query stage assertion fires. | | Scan metric on Comet exec | ~10 | Tests read `numFiles` off the wrapped `FileSourceScanExec`, but the metric lives on the Comet exec only. | | DV metadata column injection (`useMetadataRowIndex`) | ~20 | Native delta-compat path needs to surface `_metadata.row_index`. | | Streaming row production (`q.recentProgress.numInputRows == 0`) | ~5 | DeltaSourceSuiteBase.CheckProgress wiring. | | `DeltaErrors` docs link | ~3 | HTTP 301 from docs.delta.io — environmental. | | Misc / not-yet-clustered | balance | ### Next step Pick one of: (a) DV-metadata-column cluster (highest unique count, well-localised in the `native_delta_compat` path), (b) AQE-link cluster (highest count overall, but harder to localise), or (c) re-run full regression now to get a fresh authoritative count and re-triage from there. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
