andygrove opened a new issue, #4180:
URL: https://github.com/apache/datafusion-comet/issues/4180
### What is the problem the feature request solves?
**Yes, this is all AI generated ... I asked Claude to find all the Spark
configs that we should ensure Comet honors.**
Spark exposes hundreds of configuration options, many of which alter query
semantics (result correctness, not just performance). Comet's test matrix
almost exclusively runs with default configs, which means non-default settings
may silently produce results that diverge from Spark.
A user running Spark with, e.g., `spark.sql.legacy.timeParserPolicy=LEGACY`
or `spark.sql.parquet.datetimeRebaseModeInRead=LEGACY` has no guarantee that
Comet will produce the same results as vanilla Spark. In the worst case,
results are silently wrong with no warning.
Today, Comet reads and respects only a small subset of semantic configs:
| Config | Handling |
|---|---|
| `spark.sql.ansi.enabled` | Native (cast eval mode) |
| `spark.sql.session.timeZone` | Native (Parquet + timestamp expressions) |
| `spark.sql.parquet.int96AsTimestamp` | Native |
| `spark.sql.parquet.int96TimestampConversion` | Disables Comet entirely when `true` |
| `spark.sql.caseSensitive` | Native (Parquet) |
| `spark.sql.parquet.filterPushdown` (+ subflags) | Native |
| `spark.sql.optimizer.nestedSchemaPruning.enabled` | Native |
| `spark.sql.decimalOperations.allowPrecisionLoss` | Native |
| Non-default string collations (Spark 4.0+) | Selective operator fallback |
Everything else is effectively "hope the default works for you."
## Describe the potential solution
Audit all semantic-affecting Spark SQL configs and, for each one, pick one
of three states:
1. **Respected natively** — Comet reads the config and produces
Spark-compatible results. Covered by tests that run with non-default values.
2. **Fallback trigger** — Non-default value disables Comet for the affected
query (or entirely, like `int96TimestampConversion`).
3. **Known-safe to ignore** — Config doesn't change results in code paths
Comet runs.
Each config needs an explicit decision, a test, and documentation. Below is
a starting inventory organized by risk.
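To make the audit outcome concrete, the per-config decision could be recorded in a small machine-checkable table. A minimal sketch in plain Java (a dependency-free illustration, not Comet's actual code; the two entries shown are illustrative only):

```java
import java.util.Map;

class ConfigAudit {
    // The three states described above.
    enum Decision { RESPECTED_NATIVELY, FALLBACK_TRIGGER, KNOWN_SAFE_TO_IGNORE }

    // Illustrative entries only; the real inventory comes out of the audit.
    static final Map<String, Decision> AUDIT = Map.of(
        "spark.sql.ansi.enabled", Decision.RESPECTED_NATIVELY,
        "spark.sql.parquet.int96TimestampConversion", Decision.FALLBACK_TRIGGER
    );
}
```

Having the decision in data rather than prose would let a test assert that every config in the inventory has been classified.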
### Tier 1 — Correctness bombs (silent wrong results)
These most likely produce divergent results today:
- [ ] `spark.sql.parquet.datetimeRebaseModeInRead` (EXCEPTION / LEGACY /
CORRECTED) — Julian↔Gregorian rebase for pre-1582 dates. Comet has
`spark.comet.exceptionOnDatetimeRebase` but it's unclear whether all three
modes are honored end-to-end.
- [ ] `spark.sql.parquet.int96RebaseModeInRead` — same as above for INT96.
- [ ] `spark.sql.parquet.datetimeRebaseModeInWrite` /
`int96RebaseModeInWrite` — write path.
- [ ] `spark.sql.legacy.timeParserPolicy` (CORRECTED / LEGACY / EXCEPTION) —
switches the underlying parser (SimpleDateFormat vs DateTimeFormatter) for
`to_date`, `to_timestamp`, `unix_timestamp`, and CSV/JSON date parsing.
- [ ] `spark.sql.parquet.binaryAsString` — changes returned schema (binary
columns read as string).
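To see why `timeParserPolicy` sits in this tier: the LEGACY path goes through `java.text.SimpleDateFormat`, which is lenient and silently ignores trailing input, while the CORRECTED path uses `java.time.DateTimeFormatter`, which rejects it. A plain-JVM sketch of the divergence, with no Spark involved (illustrative only):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

class TimeParserDemo {
    // LEGACY-style: SimpleDateFormat stops once the pattern is satisfied
    // and silently ignores any trailing text.
    static String legacyParse(String s) {
        try {
            SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
            return fmt.format(fmt.parse(s));
        } catch (ParseException e) {
            return null; // parse failure
        }
    }

    // CORRECTED-style: DateTimeFormatter fails on unparsed trailing text.
    static String correctedParse(String s) {
        try {
            return LocalDate.parse(s, DateTimeFormatter.ofPattern("yyyy-MM-dd")).toString();
        } catch (DateTimeParseException e) {
            return null; // parse failure
        }
    }
}
```

`legacyParse("2020-01-27T20:06:11")` yields `2020-01-27` while `correctedParse` rejects the same input — exactly the kind of silent divergence this tier is about if the engines disagree on which path to take.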
### Tier 2 — Function / expression-level semantics
- [ ] `spark.sql.legacy.sizeOfNull` — `size(null)` returns `-1` vs `null`.
- [ ] `spark.sql.legacy.allowNegativeScaleOfDecimal` — decimal parsing and
arithmetic.
- [ ] `spark.sql.legacy.typeCoercion.datetimeToString.enabled` — implicit
date↔string coercion in concat/coalesce.
- [ ] `spark.sql.legacy.allowCastNumericToTimestamp` — `CAST(bigint AS
timestamp)` permitted or not.
- [ ] `spark.sql.legacy.followThreeValuedLogicInArrayExists` —
`exists(array, pred)` null handling.
- [ ] `spark.sql.function.concatBinaryAsString` — `concat(binary, binary)`
return type.
- [ ] `spark.sql.function.eltOutputAsString` — same for `elt()`.
- [ ] `spark.sql.legacy.integerGroupingIdEnabled` — `grouping_id()` return
type.
- [ ] `spark.sql.legacy.json.allowEmptyString.enabled` — JSON parsing
behavior.
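For the first item above, the two behaviors can be mimicked without Spark. A toy plain-Java sketch of `size(NULL)` under each setting (illustrative only, not Spark or Comet code):

```java
import java.util.List;

class SizeOfNullDemo {
    // Toy mimic of size() semantics: with spark.sql.legacy.sizeOfNull=true,
    // size(NULL) returns -1; otherwise it returns NULL.
    static Integer sizeOf(List<?> arr, boolean legacySizeOfNull) {
        if (arr != null) return arr.size();
        return legacySizeOfNull ? -1 : null;
    }
}
```

A parity test for this config reduces to checking that Comet returns the same of `-1` / `NULL` as Spark for the session's setting.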
### Tier 3 — Schema / type behavior
- [ ] `spark.sql.legacy.charVarcharAsString` — CHAR/VARCHAR
padding/truncation.
- [ ] `spark.sql.readSideCharPadding` — CHAR(n) read-side padding.
- [ ] `spark.sql.legacy.createEmptyCollectionUsingStringType` — empty
`array()` / `map()` element type.
### Tier 4 — Already read, but verify coverage
Confirm that these are honored in all Comet code paths, not just the ones
where they were originally wired up:
- [ ] `spark.sql.ansi.enabled` — beyond cast, does ANSI propagate to
division-by-zero, integer overflow in `+`/`-`/`*`, array index out-of-bounds,
etc.?
- [ ] `spark.sql.caseSensitive` — confirmed for Parquet scan; what
about nested struct field access in expressions?
- [ ] `spark.sql.session.timeZone` — confirmed for Parquet and casts; verify
for all temporal expressions (`date_format`, `from_utc_timestamp`,
`to_utc_timestamp`, `date_trunc`, etc.).
- [ ] `spark.sql.decimalOperations.allowPrecisionLoss` — confirmed for the
serde path; verify arithmetic kernels match.
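As a concrete reading of the first item: non-ANSI integer arithmetic wraps on overflow, while ANSI mode raises an error. A plain-JVM sketch of the two behaviors (Java for a dependency-free example; `Math.addExact` stands in for ANSI's overflow check here, it is not Spark's implementation):

```java
class AnsiOverflowDemo {
    // Default (non-ANSI) behavior: wraps like ordinary Java `+`.
    static int addDefault(int a, int b) { return a + b; }

    // ANSI-style behavior: Math.addExact throws ArithmeticException
    // on overflow instead of wrapping.
    static int addAnsi(int a, int b) { return Math.addExact(a, b); }
}
```

The audit question is whether Comet's native `+`/`-`/`*` kernels pick the matching behavior for the session's `spark.sql.ansi.enabled` value, not just `CAST`.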
### Suggested process per config
For each item:
1. Identify the Spark code paths that branch on the config.
2. Check whether Comet's corresponding code path reads the config.
3. Add a test that runs with the non-default value and asserts
Spark-vs-Comet parity (or explicit fallback).
4. Decide: native-respect, fallback-trigger, or known-safe-to-ignore.
Document the decision.
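Step 3 could take the shape of a Comet-style parity test. An untested sketch — it assumes Scala helpers along the lines of Comet's existing test suite (`withSQLConf` and a Spark-vs-Comet answer check), so treat the helper and table names as illustrative rather than the exact API:

```scala
// Illustrative sketch only: run the query under the non-default value and
// require either Spark-identical results or an explicit fallback.
test("to_date parity under spark.sql.legacy.timeParserPolicy=LEGACY") {
  withSQLConf("spark.sql.legacy.timeParserPolicy" -> "LEGACY") {
    checkSparkAnswer("SELECT to_date(c0, 'yyyy-MM-dd') FROM tbl")
  }
}
```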
### Optional: config whitelist gate
Once the audit is done, we could add an opt-in mode (e.g.,
`spark.comet.strictConfigCompatibility=true`) that falls back to Spark whenever
a semantic config is set to a non-default value not on the whitelist. This
would give users a way to trade coverage for correctness guarantees without
forcing the tradeoff on everyone.
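A hypothetical plain-Java sketch of that gate — the config names, default values, and whitelist contents below are illustrative only, not an actual Comet API:

```java
import java.util.Map;
import java.util.Set;

class StrictConfigGate {
    // Illustrative audited configs and their assumed default values.
    static final Map<String, String> AUDITED_DEFAULTS = Map.of(
        "spark.sql.legacy.timeParserPolicy", "EXCEPTION",
        "spark.sql.parquet.datetimeRebaseModeInRead", "EXCEPTION"
    );

    // Configs the audit has cleared as safe at any value.
    static final Set<String> WHITELIST = Set.of("spark.sql.session.timeZone");

    // Fall back to Spark when any audited config is set to a non-default
    // value that is not on the whitelist.
    static boolean shouldFallback(Map<String, String> sessionConf) {
        return sessionConf.entrySet().stream().anyMatch(e -> {
            String def = AUDITED_DEFAULTS.get(e.getKey());
            return def != null
                && !def.equals(e.getValue())
                && !WHITELIST.contains(e.getKey());
        });
    }
}
```

The design keeps the conservative check cheap (a map lookup per set config) and makes the whitelist the single artifact the audit has to maintain.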
## Additional context
Motivation: the current "default configs only" testing posture is a
significant compatibility risk as Comet sees broader adoption. Users routinely
tune `legacy.timeParserPolicy`, rebase modes, and ANSI behavior in production —
these are not edge cases.