andygrove opened a new issue, #4098:
URL: https://github.com/apache/datafusion-comet/issues/4098

   Tracking issue for the four remaining clusters of test failures on Spark 4.1 
(4.1.1) once the profile, shims, diff, and SQL-test workflow entry are in 
place. Context PRs: #4093 (Spark 4.1.1 enablement) and #4097 (spark-4.1 profile 
+ shims prep, no tests).
   
   ## Status
   
   - [ ] **OneRowRelationExec not transformed by Comet** (~30 sql-file 
expression tests)
   - [ ] **Native parquet reader: user-defined struct schema mismatch** (2 
tests, Linux + macOS)
   - [ ] **Bloom filter result mismatch** (2 tests)
   - [ ] **`bytesRead` task metric off by 6 to 14 times** (3 tests)
   
   Two earlier clusters are already cleared on the branch (commit `5a60be22d`):
   
   - The `String` overload of `CometNativeWriteExec.newTaskTempFile` became a throwing stub in 4.1; switching to the `FileNameSpec` overload cleared 17 parquet-write failures.
   - The `remainder function` test expected `[DIVIDE_BY_ZERO]`, but Spark 4.1 introduced `[REMAINDER_BY_ZERO]`; the expected message is now branched on `isSpark41Plus`.
   
   ---
   
   ## 1. `OneRowRelationExec` not transformed by Comet
   
   **Where:** ~30 failures in `Spark 4.1, JDK 17/auto [expressions]`, all 
`sql-file:` tests like `expressions/cast/cast.sql`, `expressions/datetime/*`, 
`expressions/struct/create_named_struct.sql`, etc.
   
   **Symptom:**
   
   ```
   Expected only Comet native operators, but found Project.
   plan: Project
   +-  Scan OneRowRelation [COMET: Scan OneRowRelation is not supported]
   ```
   
   **Root cause:** Spark 4.1 added a new `OneRowRelationExec` physical leaf and 
stopped folding `SELECT cast(literal)` queries down to `LocalRelation` via 
`ConvertToLocalRelation`. In 4.0 those queries became `LocalTableScanExec`, 
which Comet has a wrapper for (`CometLocalTableScanExec`). In 4.1 they stay as 
`Project + OneRowRelationExec`, and Comet's `CometExecRule` falls back to Spark 
for the whole subtree.
   
   **Fix options (decision needed):**
   - (a) Add `CometOneRowRelationExec`, analogous to `CometLocalTableScanExec`. 
The real fix, but the biggest in scope: it needs a Rust-side serde for an 
empty-row scan.
   - (b) Pre-rewrite `Project + OneRowRelationExec` into `LocalTableScanExec` 
with a single empty row in a Comet planner rule.
   - (c) Test-only allowlist (masks fallback, not recommended).
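
   Option (b) can be sketched as a bottom-up plan rewrite. The classes below are hypothetical Python stand-ins for the Spark physical operators, purely to show the shape of the transformation; they are not Comet's actual planner rule:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-ins for Spark physical plan nodes (names borrowed
# from Spark for readability; this is not Spark code).
@dataclass
class PlanNode:
    children: List["PlanNode"]

@dataclass
class OneRowRelationExec(PlanNode):
    pass

@dataclass
class ProjectExec(PlanNode):
    expressions: List[str]

@dataclass
class LocalTableScanExec(PlanNode):
    rows: List[tuple]

def rewrite_one_row_relation(plan: PlanNode) -> PlanNode:
    """Sketch of option (b): replace the OneRowRelationExec input of a
    Project with a LocalTableScanExec over a single empty row, which
    Comet already knows how to wrap (CometLocalTableScanExec)."""
    # Recurse first so nested occurrences are rewritten bottom-up.
    plan.children = [rewrite_one_row_relation(c) for c in plan.children]
    if (isinstance(plan, ProjectExec)
            and len(plan.children) == 1
            and isinstance(plan.children[0], OneRowRelationExec)):
        # Keep the Project on top; only its input changes.
        plan.children = [LocalTableScanExec(children=[], rows=[()])]
    return plan
```

   The point of the sketch is that the Project itself is untouched; only the leaf is swapped, so the existing `CometLocalTableScanExec` path handles the rest.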
   
   ---
   
   ## 2. Native parquet reader: user-defined struct schema mismatch
   
   **Where:** `native reader - select struct field with user defined schema - 
native_datafusion` and `- native_iceberg_compat` in both `Spark 4.1, JDK 
17/auto [parquet]` and `macos-14/Spark 4.1, JDK 17, Scala 2.13 [parquet]`.
   
   **Symptom:** `Results do not match for query`, schema is `c0: 
struct<y:int,x:string>` over a parquet relation. Comet's native reader returns 
different rows than Spark.
   
   **Suspected root cause:** Spark 4.1 changed how user-supplied struct schemas 
are reconciled with on-disk Parquet field order, or field pruning behaves 
differently. Compare Spark 4.0 vs 4.1 planning output for this query and check 
whether user-schema field-name-vs-position behavior changed in 
`ParquetReadSupport` or `ParquetSchemaConverter`.
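
   The suspected name-vs-position change can be illustrated with a toy model (not Spark's actual reconciliation code): if the on-disk struct stores fields as `(x, y)` and the user schema asks for `struct<y:int,x:string>`, by-name and positional reconciliation return different rows:

```python
# Toy model of reconciling a user-supplied struct schema with on-disk
# Parquet field order; illustrative only, not ParquetReadSupport.

# On-disk struct field order: (x: string, y: int)
disk_fields = ["x", "y"]
disk_row = ("a", 1)

# User-supplied schema asks for struct<y:int, x:string>.
user_fields = ["y", "x"]

def reconcile_by_name(user_fields, disk_fields, row):
    """Match user fields to disk fields by name, ignoring position."""
    index = {name: i for i, name in enumerate(disk_fields)}
    return tuple(row[index[f]] for f in user_fields)

def reconcile_by_position(user_fields, disk_fields, row):
    """Take fields positionally, ignoring names entirely."""
    return tuple(row[i] for i in range(len(user_fields)))

by_name = reconcile_by_name(user_fields, disk_fields, disk_row)          # (1, 'a')
by_position = reconcile_by_position(user_fields, disk_fields, disk_row)  # ('a', 1)
```

   If Spark 4.1 moved from one convention to the other anywhere on this path while Comet's native reader kept the old one, the rows diverge exactly as the test reports.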
   
   ---
   
   ## 3. Bloom filter result mismatch
   
   **Where:** `test BloomFilterMightContain from random input` and 
`bloom_filter_agg` in `Spark 4.1, JDK 17/auto [exec]`.
   
   **Symptom:** Comet and Spark produce different `might_contain` results for 
the same input.
   
   **Suspected root cause:** Spark 4.1 likely changed the bloom filter binary 
layout, hash seed, or default false-positive probability. Diff 
`BloomFilterImpl` / `BloomFilterAggregate` between 4.0 and 4.1, then mirror in 
Comet's bloom filter code in `native/spark-expr`.
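
   Any of those three changes would produce this failure mode. A toy bloom filter (not Spark's `BloomFilterImpl` layout) shows how a hash-seed mismatch alone makes two readers of the same bit array disagree on membership:

```python
import hashlib

class ToyBloomFilter:
    """Minimal bloom filter; illustrative only, not Spark's BloomFilterImpl."""

    def __init__(self, num_bits=1024, num_hashes=3, seed=0):
        self.bits = bytearray(num_bits // 8)
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.seed = seed

    def _positions(self, item: bytes):
        # Derive k bit positions from (seed, hash index, item).
        for i in range(self.num_hashes):
            h = hashlib.sha256(
                self.seed.to_bytes(4, "big") + i.to_bytes(4, "big") + item
            ).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def put(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# A filter built with seed=0 but probed with seed=42 logic gives wrong
# answers: an inserted item reports "not contained".
writer = ToyBloomFilter(seed=0)
writer.put(b"spark-4.1")

reader = ToyBloomFilter(seed=42)
reader.bits = writer.bits  # same bit array, different hash seed
```

   The same argument applies to a changed binary layout (positions land in the wrong words) or a changed default FPP (different `num_bits`/`num_hashes`), which is why diffing 4.0 vs 4.1 first is the cheapest way to narrow it down.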
   
   ---
   
   ## 4. `bytesRead` task metric off by 6 to 14 times
   
   **Where:** `native_datafusion scan reports task-level input metrics matching 
Spark`, `input metrics aggregate across multiple native scans in a join`, `... 
in a union` in `Spark 4.1, JDK 17/auto [exec]` (`CometTaskMetricsSuite`).
   
   **Symptom:**
   ```
   9.6 was greater than or equal to 0.7, but 9.6 was not less than or equal to 
1.3
   bytesRead ratio out of range: comet=90498, spark=9427, ratio=9.6
   ```
   
   The other two failures show similar ratios of 6.4 and 13.9.
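
   The failing assertion is a plain ratio-tolerance check; the bounds 0.7 and 1.3 come from the log above, while the helper name here is made up:

```python
# Ratio-tolerance check reconstructed from the failure message; the
# 0.7/1.3 bounds are from the test output, the function name is not
# the suite's actual helper.
def bytes_read_ratio_in_range(comet_bytes: int, spark_bytes: int,
                              lo: float = 0.7, hi: float = 1.3) -> bool:
    ratio = comet_bytes / spark_bytes
    return lo <= ratio <= hi

# Values from the reported failure: comet=90498, spark=9427.
print(round(90498 / 9427, 1))                  # 9.6
print(bytes_read_ratio_in_range(90498, 9427))  # False
```

   So the tests themselves are a thin tolerance band around 1.0; the 6x-14x ratios point at a real accounting difference, not test flakiness.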
   
   **Suspected root cause:** Spark 4.1 changed what `inputMetrics.bytesRead` 
accounts for, most likely now reports a smaller subset (e.g. only bytes 
actually read into row buffers, versus full Parquet footer plus row group). 
Compare `ParquetFileReader` / `PartitionedFile` accounting between 4.0 and 4.1 
and adjust Comet's metric source accordingly.

