schenksj commented on PR #3932:
URL: 
https://github.com/apache/datafusion-comet/pull/3932#issuecomment-4357393646

   ### Triage update
   
   Two more commits since the [last status 
post](https://github.com/apache/datafusion-comet/pull/3932#issuecomment-4276098364):
   
   - `da19481a` — `fix(delta-regression): stop overriding dvFileNamePrefix in 
Delta test harness`. Our diff was setting 
`spark.databricks.delta.testOnly.dvFileNamePrefix=""` next to the data-file 
prefix override. The data-file override is correct (sidesteps `%`-char 
round-trip in our path parsing), but the DV override broke any test that reads 
a pre-existing DV fixture: Delta's reader reconstructs the DV path as 
`<prefix>deletion_vector_<uuid>.bin` and the on-disk fixture has the full 
prefix baked into the filename, so with prefix="" the lookup misses. Removing 
just the DV override clears `CheckpointsSuite "checkpoint with DVs"` (1 → 0) 
and the entire `DeletionVectorsSuite` (~30 → 0; both `Check no resource leak 
when DV files are missing` and the rest pass).
     - Validated: `CheckpointsSuite` 33/33, `DeletionVectorsSuite` 29/29.
   - `bb35abf0` — `feat(delta): file-splitting via byte-range tasks in 
CometDeltaNativeScan`. Adds Spark-equivalent file splitting 
(`FILES_MAX_PARTITION_BYTES` / `files.openCostInBytes` / 
`files.minPartitionNum`) to the native Delta scan path: new optional 
`byte_range_start` / `byte_range_end` on `DeltaScanTask`, Scala-side 
computation of `maxSplitBytes` mirroring `FilePartition.maxSplitBytes`, native 
side honours the range via `PartitionedFile::new_with_range`. DV row indexes 
are duplicated across chunks (correct since they're absolute file offsets and 
the scan filters on absolute index).
     - Doesn't fix the 15 `DeletionVectorsWithPredicatePushdownSuite` 
split-count assertions on its own — those scans bypass 
`CometDeltaNativeScan.convert` via the existing `PreparedDeltaFileIndex` DV 
fallback. Debug logs in both `splitTasks` and 
`CometScanExec.createFilePartitionsForNonBucketedScan` confirmed that neither 
runs for the failing tests, so the splitting fix is latent infrastructure for 
now (it'll start helping once the DV-fallback gate is loosened or replaced by a 
DV-aware kernel scan).
     - `DeletionVectorsSuite` regressed from a sanity run: still 29/29 with the 
splitting commit in place.
   
   ### Net regression delta vs. the original 139
   
   - Original full run (before any fix in this stretch): **139** failed.
   - After the OptimizeGeneratedColumnSuite test-helper fix (`25085889`): 175 → 
139.
   - After `da19481a` (DV-prefix revert): 139 → ~**94** by name-cluster math 
(full DeletionVectorsSuite + checkpoint-with-DVs cleared, plus 6 `Scan with 
predicates - no deletes` PredicatePushdown variants).
   - After `bb35abf0` (split infra): no new pass yet — kernel-path-only impact.
   
   A fresh full regression run is the right way to confirm the 94 number and 
pick up any cross-cluster knock-ons. Holding off on kicking that until the next 
batch of cluster work lands so we get one re-baseline rather than several.
   
   ### Remaining clusters
   
   Same shape as the prior triage table, minus what's now cleared:
   
   | Cluster | Approx count | Likely root cause |
   | --- | --- | --- |
   | DV-bypass file-splitting (PredicatePushdown `partitions.size === 2`) | 15 
| DV-fallback path produces 1 partition for a 4MB file with 
`FILES_MAX_PARTITION_BYTES=2MB`. Investigation needed: scan goes through 
neither `CometDeltaNativeScan.convert` nor 
`CometScanExec.createFilePartitionsForNonBucketedScan` — likely a 
`CometDeltaNativeScanExec` instance constructed via a different rule path, or 
AQE/cache reuse from the SHALLOW CLONE. |
   | AQE `setLogicalLinkForNewQueryStage` | ~30 | Comet replaces a scan in 
joins/windows under AQE; the new query stage assertion fires. |
   | Scan metric on Comet exec | ~10 | Tests read `numFiles` off the wrapped 
`FileSourceScanExec`, but the metric lives on the Comet exec only. |
   | DV metadata column injection (`useMetadataRowIndex`) | ~20 | Native 
delta-compat path needs to surface `_metadata.row_index`. |
   | Streaming row production (`q.recentProgress.numInputRows == 0`) | ~5 | 
DeltaSourceSuiteBase.CheckProgress wiring. |
   | `DeltaErrors` docs link | ~3 | HTTP 301 from docs.delta.io — 
environmental. |
   | Misc / not-yet-clustered | balance |
   
   ### Next step
   
   Pick one of: (a) DV-metadata-column cluster (highest unique count, 
well-localised in the `native_delta_compat` path), (b) AQE-link cluster 
(highest count overall, but harder to localise), or (c) re-run full regression 
now to get a fresh authoritative count and re-triage from there.
   
   Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to