andygrove opened a new issue, #4180:
URL: https://github.com/apache/datafusion-comet/issues/4180
### What is the problem the feature request solves?
**Yes, this is all AI generated ... I asked Claude to find all the Spark
configs that we should ensure Comet honors.**
Spark exposes hundreds of configuration options, many of which alter query
semantics (result correctness, not just performance). Comet's test matrix
almost exclusively runs with default configs, which means non-default settings
may silently produce results that diverge from Spark.
A user running Spark with, e.g., `spark.sql.legacy.timeParserPolicy=LEGACY`
or `spark.sql.parquet.datetimeRebaseModeInRead=LEGACY` has no guarantee that
Comet will produce the same results as vanilla Spark. In the worst case,
results are silently wrong with no warning.
Today, Comet reads and respects only a small subset of semantic configs:
| Config | Handling |
|---|---|
| `spark.sql.ansi.enabled` | Native (cast eval mode) |
| `spark.sql.session.timeZone` | Native (Parquet + timestamp expressions) |
| `spark.sql.parquet.int96AsTimestamp` | Native |
| `spark.sql.parquet.int96TimestampConversion` | Disables Comet entirely when `true` |
| `spark.sql.caseSensitive` | Native (Parquet) |
| `spark.sql.parquet.filterPushdown` (+ subflags) | Native |
| `spark.sql.optimizer.nestedSchemaPruning.enabled` | Native |
| `spark.sql.decimalOperations.allowPrecisionLoss` | Native |
| Non-default string collations (Spark 4.0+) | Selective operator fallback |
Everything else is effectively "hope the default works for you."
## Describe the potential solution
Audit all semantic-affecting Spark SQL configs and, for each one, pick one
of three states:
1. **Respected natively** — Comet reads the config and produces
Spark-compatible results. Covered by tests that run with non-default values.
2. **Fallback trigger** — Non-default value disables Comet for the affected
query (or entirely, like `int96TimestampConversion`).
3. **Known-safe to ignore** — Config doesn't change results in code paths
Comet runs.
Each config needs an explicit decision, a test, and documentation. Below is
a starting inventory organized by risk.
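To make the audit outcome concrete, the per-config decision could be recorded in a small machine-checkable table. A minimal sketch in plain Java (a dependency-free illustration, not Comet's actual code; the two entries shown are illustrative only):

```java
import java.util.Map;

class ConfigAudit {
    // The three states described above.
    enum Decision { RESPECTED_NATIVELY, FALLBACK_TRIGGER, KNOWN_SAFE_TO_IGNORE }

    // Illustrative entries only; the real inventory comes out of the audit.
    static final Map<String, Decision> AUDIT = Map.of(
        "spark.sql.ansi.enabled", Decision.RESPECTED_NATIVELY,
        "spark.sql.parquet.int96TimestampConversion", Decision.FALLBACK_TRIGGER
    );
}
```

Having the decision in data rather than prose would let a test assert that every config in the inventory has been classified.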
### Tier 1 — Correctness bombs (silent wrong results)
These most likely produce divergent results today:
- [ ] `spark.sql.parquet.datetimeRebaseModeInRead` (EXCEPTION / LEGACY /
CORRECTED) — Julian↔Gregorian rebase for pre-1582 dates. Comet has
`spark.comet.exceptionOnDatetimeRebase` but it's unclear whether all three
modes are honored end-to-end.
- [ ] `spark.sql.parquet.int96RebaseModeInRead` — same as above for INT96.
- [ ] `spark.sql.parquet.datetimeRebaseModeInWrite` /
`int96RebaseModeInWrite` — write path.
- [ ] `spark.sql.legacy.timeParserPolicy` (CORRECTED / LEGACY / EXCEPTION) —
switches the underlying parser (SimpleDateFormat vs DateTimeFormatter) for
`to_date`, `to_timestamp`, `unix_timestamp`, and CSV/JSON date parsing.
- [ ] `spark.sql.parquet.binaryAsString` — changes returned schema (binary
columns read as string).
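To see why `timeParserPolicy` sits in this tier: the LEGACY path goes through `java.text.SimpleDateFormat`, which is lenient and silently ignores trailing input, while the CORRECTED path uses `java.time.DateTimeFormatter`, which rejects it. A plain-JVM sketch of the divergence, with no Spark involved (illustrative only):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

class TimeParserDemo {
    // LEGACY-style: SimpleDateFormat stops once the pattern is satisfied
    // and silently ignores any trailing text.
    static String legacyParse(String s) {
        try {
            SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
            return fmt.format(fmt.parse(s));
        } catch (ParseException e) {
            return null; // parse failure
        }
    }

    // CORRECTED-style: DateTimeFormatter fails on unparsed trailing text.
    static String correctedParse(String s) {
        try {
            return LocalDate.parse(s, DateTimeFormatter.ofPattern("yyyy-MM-dd")).toString();
        } catch (DateTimeParseException e) {
            return null; // parse failure
        }
    }
}
```

`legacyParse("2020-01-27T20:06:11")` yields `2020-01-27` while `correctedParse` rejects the same input — exactly the kind of silent divergence this tier is about if the engines disagree on which path to take.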
### Tier 2 — Function / expression-level semantics
- [ ] `spark.sql.legacy.sizeOfNull` — `size(null)` returns `-1` vs `null`.
- [ ] `spark.sql.legacy.allowNegativeScaleOfDecimal` — decimal parsing and
arithmetic.
- [ ] `spark.sql.legacy.typeCoercion.datetimeToString.enabled` — implicit
date↔string coercion in concat/coalesce.
- [ ] `spark.sql.legacy.allowCastNumericToTimestamp` — `CAST(bigint AS
timestamp)` permitted or not.
- [ ] `spark.sql.legacy.followThreeValuedLogicInArrayExists` —
`exists(array, pred)` null handling.
- [ ] `spark.sql.function.concatBinaryAsString` — `concat(binary, binary)`
return type.
- [ ] `spark.sql.function.eltOutputAsString` — same for `elt()`.
- [ ] `spark.sql.legacy.integerGroupingIdEnabled` — `grouping_id()` return
type.
- [ ] `spark.sql.legacy.json.allowEmptyString.enabled` — JSON parsing
behavior.
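For the first item above, the two behaviors can be mimicked without Spark. A toy plain-Java sketch of `size(NULL)` under each setting (illustrative only, not Spark or Comet code):

```java
import java.util.List;

class SizeOfNullDemo {
    // Toy mimic of size() semantics: with spark.sql.legacy.sizeOfNull=true,
    // size(NULL) returns -1; otherwise it returns NULL.
    static Integer sizeOf(List<?> arr, boolean legacySizeOfNull) {
        if (arr != null) return arr.size();
        return legacySizeOfNull ? -1 : null;
    }
}
```

A parity test for this config reduces to checking that Comet returns the same of `-1` / `NULL` as Spark for the session's setting.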
### Tier 3 — Schema / type behavior
- [ ] `spark.sql.legacy.charVarcharAsString` — CHAR/VARCHAR
padding/truncation.
- [ ] `spark.sql.readSideCharPadding` — CHAR(n) read-side padding.
- [ ] `spark.sql.legacy.createEmptyCollectionUsingStringType` — empty
`array()` / `map()` element type.
### Tier 4 — Already read, but verify coverage
Confirm that these are honored in all Comet code paths, not just the ones
where they were originally wired up:
- [ ] `spark.sql.ansi.enabled` — beyond cast, does ANSI propagate to
division-by-zero, integer overflow in `+`/`-`/`*`, array index out-of-bounds,
etc.?
- [ ] `spark.sql.caseSensitive` — confirmed for Parquet scan; what
about nested struct field access in expressions?
- [ ] `spark.sql.session.timeZone` — confirmed for Parquet and casts; verify
for all temporal expressions (`date_format`, `from_utc_timestamp`,
`to_utc_timestamp`, `date_trunc`, etc.).
- [ ] `spark.sql.decimalOperations.allowPrecisionLoss` — confirmed for the
serde path; verify arithmetic kernels match.
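As a concrete reading of the first item: non-ANSI integer arithmetic wraps on overflow, while ANSI mode raises an error. A plain-JVM sketch of the two behaviors (Java for a dependency-free example; `Math.addExact` stands in for ANSI's overflow check here, it is not Spark's implementation):

```java
class AnsiOverflowDemo {
    // Default (non-ANSI) behavior: wraps like ordinary Java `+`.
    static int addDefault(int a, int b) { return a + b; }

    // ANSI-style behavior: Math.addExact throws ArithmeticException
    // on overflow instead of wrapping.
    static int addAnsi(int a, int b) { return Math.addExact(a, b); }
}
```

The audit question is whether Comet's native `+`/`-`/`*` kernels pick the matching behavior for the session's `spark.sql.ansi.enabled` value, not just `CAST`.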
### Suggested process per config
For each item:
1. Identify the Spark code paths that branch on the config.
2. Check whether Comet's corresponding code path reads the config.
3. Add a test that runs with the non-default value and asserts
Spark-vs-Comet parity (or explicit fallback).
4. Decide: native-respect, fallback-trigger, or known-safe-to-ignore.
Document the decision.
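Step 3 could take the shape of a Comet-style parity test. An untested sketch — it assumes Scala helpers along the lines of Comet's existing test suite (`withSQLConf` and a Spark-vs-Comet answer check), so treat the helper and table names as illustrative rather than the exact API:

```scala
// Illustrative sketch only: run the query under the non-default value and
// require either Spark-identical results or an explicit fallback.
test("to_date parity under spark.sql.legacy.timeParserPolicy=LEGACY") {
  withSQLConf("spark.sql.legacy.timeParserPolicy" -> "LEGACY") {
    checkSparkAnswer("SELECT to_date(c0, 'yyyy-MM-dd') FROM tbl")
  }
}
```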
### Optional: config whitelist gate
Once the audit is done, we could add an opt-in mode (e.g.,
`spark.comet.strictConfigCompatibility=true`) that falls back to Spark whenever
a semantic config is set to a non-default value not on the whitelist. This
would give users a way to trade coverage for correctness guarantees without
forcing the tradeoff on everyone.
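A hypothetical plain-Java sketch of that gate — the config names, default values, and whitelist contents below are illustrative only, not an actual Comet API:

```java
import java.util.Map;
import java.util.Set;

class StrictConfigGate {
    // Illustrative audited configs and their assumed default values.
    static final Map<String, String> AUDITED_DEFAULTS = Map.of(
        "spark.sql.legacy.timeParserPolicy", "EXCEPTION",
        "spark.sql.parquet.datetimeRebaseModeInRead", "EXCEPTION"
    );

    // Configs the audit has cleared as safe at any value.
    static final Set<String> WHITELIST = Set.of("spark.sql.session.timeZone");

    // Fall back to Spark when any audited config is set to a non-default
    // value that is not on the whitelist.
    static boolean shouldFallback(Map<String, String> sessionConf) {
        return sessionConf.entrySet().stream().anyMatch(e -> {
            String def = AUDITED_DEFAULTS.get(e.getKey());
            return def != null
                && !def.equals(e.getValue())
                && !WHITELIST.contains(e.getKey());
        });
    }
}
```

The design keeps the conservative check cheap (a map lookup per set config) and makes the whitelist the single artifact the audit has to maintain.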
## Additional context
Motivation: the current "default configs only" testing posture is a
significant compatibility risk as Comet sees broader adoption. Users routinely
tune `legacy.timeParserPolicy`, rebase modes, and ANSI behavior in production —
these are not edge cases.