andygrove opened a new issue, #4098: URL: https://github.com/apache/datafusion-comet/issues/4098
Tracking issue for the four remaining clusters of test failures on Spark 4.1 (4.1.1) once the profile, shims, diff, and SQL-test workflow entry are in place. Context PRs: #4093 (Spark 4.1.1 enablement) and #4097 (spark-4.1 profile + shims prep, no tests).

## Status

- [ ] **`OneRowRelationExec` not transformed by Comet** (~30 sql-file expression tests)
- [ ] **Native parquet reader: user-defined struct schema mismatch** (2 tests, Linux + macOS)
- [ ] **Bloom filter result mismatch** (2 tests)
- [ ] **`bytesRead` task metric off by 6 to 14 times** (3 tests)

Two earlier clusters are already cleared on the branch (commit `5a60be22d`):

- The `String` overload of `CometNativeWriteExec.newTaskTempFile` became abstract-throwing in 4.1; switched to the `FileNameSpec` overload. Cleared 17 parquet-write failures.
- The `remainder function` test expected `[DIVIDE_BY_ZERO]`; Spark 4.1 introduced `[REMAINDER_BY_ZERO]`. Branched the expected message on `isSpark41Plus`.

---

## 1. `OneRowRelationExec` not transformed by Comet

**Where:** ~30 failures in `Spark 4.1, JDK 17/auto [expressions]`, all `sql-file:` tests like `expressions/cast/cast.sql`, `expressions/datetime/*`, `expressions/struct/create_named_struct.sql`, etc.

**Symptom:**

```
Expected only Comet native operators, but found Project.
plan:
Project
+- Scan OneRowRelation [COMET: Scan OneRowRelation is not supported]
```

**Root cause:** Spark 4.1 added a new `OneRowRelationExec` physical leaf and stopped folding `SELECT cast(literal)` queries down to `LocalRelation` via `ConvertToLocalRelation`. In 4.0 those queries became `LocalTableScanExec`, which Comet wraps with `CometLocalTableScanExec`. In 4.1 they stay as `Project + OneRowRelationExec`, and Comet's `CometExecRule` makes the whole subtree fall back to Spark.

**Fix options (decision needed):**

- (a) Add a `CometOneRowRelationExec` analogous to `CometLocalTableScanExec`. The real fix, but the biggest in scope: it needs a Rust-side serde for an empty-row scan.
- (b) Pre-rewrite `Project + OneRowRelationExec` into a `LocalTableScanExec` with a single empty row in a Comet planner rule.
- (c) Test-only allowlist (masks fallback, not recommended).

---

## 2. Native parquet reader: user-defined struct schema mismatch

**Where:** `native reader - select struct field with user defined schema - native_datafusion` and `- native_iceberg_compat` in both `Spark 4.1, JDK 17/auto [parquet]` and `macos-14/Spark 4.1, JDK 17, Scala 2.13 [parquet]`.

**Symptom:** `Results do not match for query`; the schema is `c0: struct<y:int,x:string>` over a parquet relation. Comet's native reader returns different rows than Spark.

**Suspected root cause:** Spark 4.1 changed how user-supplied struct schemas are reconciled with on-disk Parquet field order, or field pruning behaves differently. Compare Spark 4.0 vs 4.1 planning output for this query and check whether user-schema field-name-vs-position behavior changed in `ParquetReadSupport` or `ParquetSchemaConverter`.

---

## 3. Bloom filter result mismatch

**Where:** `test BloomFilterMightContain from random input` and `bloom_filter_agg` in `Spark 4.1, JDK 17/auto [exec]`.

**Symptom:** Comet and Spark produce different `might_contain` results for the same input.

**Suspected root cause:** Spark 4.1 likely changed the bloom filter binary layout, hash seed, or default false-positive probability. Diff `BloomFilterImpl` / `BloomFilterAggregate` between 4.0 and 4.1, then mirror the change in Comet's bloom filter code in `native/spark-expr`.

---

## 4. `bytesRead` task metric off by 6 to 14 times

**Where:** `native_datafusion scan reports task-level input metrics matching Spark`, `input metrics aggregate across multiple native scans in a join`, and `... in a union` in `Spark 4.1, JDK 17/auto [exec]` (`CometTaskMetricsSuite`).
**Symptom:**

```
9.6 was greater than or equal to 0.7, but 9.6 was not less than or equal to 1.3
bytesRead ratio out of range: comet=90498, spark=9427, ratio=9.6
```

The two other failures show similar ratios of 6.4 and 13.9.

**Suspected root cause:** Spark 4.1 changed what `inputMetrics.bytesRead` accounts for, and most likely now reports a smaller subset (e.g. only the bytes actually read into row buffers, versus the full Parquet footer plus row group). Compare `ParquetFileReader` / `PartitionedFile` accounting between 4.0 and 4.1 and adjust Comet's metric source accordingly.
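For reference, the failing tolerance check can be reduced to a few lines. This is a hedged sketch reconstructed from the assertion message above, not from `CometTaskMetricsSuite` itself: `bytes_read_ratio` and `within_tolerance` are illustrative names, and the 0.7–1.3 window is taken from the log.

```rust
// Minimal sketch of the ratio assertion in the failure message above.
// Names are hypothetical; the bounds (0.7..=1.3) come from the log line
// "9.6 was greater than or equal to 0.7, but ... not less than or equal to 1.3".
fn bytes_read_ratio(comet_bytes: u64, spark_bytes: u64) -> f64 {
    // Round to one decimal place, matching how the suite reports "ratio=9.6".
    (comet_bytes as f64 / spark_bytes as f64 * 10.0).round() / 10.0
}

fn within_tolerance(ratio: f64) -> bool {
    (0.7..=1.3).contains(&ratio)
}

fn main() {
    // The failing case from the log: comet=90498, spark=9427.
    let r = bytes_read_ratio(90_498, 9_427);
    println!("ratio={r}, in_range={}", within_tolerance(r)); // ratio=9.6, in_range=false
}
```

A 9.6x ratio against a 0.7–1.3 tolerance window is far too large to be noise, which is consistent with an accounting change in what 4.1 counts toward `bytesRead` rather than a flaky measurement.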
