andygrove opened a new pull request, #21508: URL: https://github.com/apache/datafusion/pull/21508
## Which issue does this PR close?

- Closes #17045.

## Rationale for this change

The datafusion-spark `.slt` test files contain hardcoded expected values that were originally derived from PySpark but have never been systematically validated against actual Spark. This creates risk of drift as Spark evolves and baked-in errors from the original porting process. This PR adds tooling to detect such issues.

## What changes are included in this PR?

A Python validation script (`datafusion/spark/scripts/validate_slt.py`) that:

1. **Parses `.slt` files**: extracts query blocks, expected results, error queries, and DDL statements
2. **Translates DataFusion SQL to PySpark SQL**: converts `::TYPE` casts to `CAST()`, `arrow_cast()` to Spark types, `make_array()` to `array()`. Skips untranslatable constructs (arrow_typeof, spark_cast, Dictionary/Utf8View types, ANSI mode blocks)
3. **Runs queries in PySpark**: executes translated queries against a local SparkSession and formats results to match `.slt` conventions
4. **Compares results**: reports mismatches between PySpark output and hardcoded expected values

Usage:

```bash
# Validate all .slt files
python datafusion/spark/scripts/validate_slt.py

# Validate specific file or subdirectory
python datafusion/spark/scripts/validate_slt.py --path math/abs.slt
python datafusion/spark/scripts/validate_slt.py --path string/ --verbose
```

Initial validation results across 239 files (1919 queries): 1157 passed, 364 failed, 398 skipped. The failures surface genuine findings including timestamp formatting differences, Float32 precision issues, function naming mismatches, and interval handling gaps.

## Are these changes tested?

The script is a validation/developer tool (not library code). It was tested by running it against the full set of 239 `.slt` files across all function categories. Results were verified manually against known Spark behavior for representative queries (abs, ascii, hex, array_contains, date_trunc, etc.).
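To give reviewers a feel for the translation step (item 2 in the list of changes), here is a minimal sketch of how simple `::TYPE` casts and `make_array()` calls could be rewritten for Spark. It is not the code in `validate_slt.py`: the names `TYPE_MAP`, `SKIP_TOKENS`, `CAST_RE`, and `translate` are hypothetical, the type mapping is an assumed subset, and the regex only handles casts on a bare number, quoted string, or identifier (no chained or parenthesized casts, and no `::` inside string literals).

```python
import re
from typing import Optional

# Assumed subset of a DataFusion-to-Spark type mapping (illustrative only).
TYPE_MAP = {
    "INT": "INT",
    "BIGINT": "BIGINT",
    "FLOAT": "FLOAT",
    "DOUBLE": "DOUBLE",
    "STRING": "STRING",
    "VARCHAR": "STRING",
}

# Constructs the PR description lists as untranslatable (matched case-insensitively).
SKIP_TOKENS = ("arrow_typeof", "spark_cast", "dictionary", "utf8view")

# A simple operand (number, quoted string, or identifier) followed by ::TYPE.
CAST_RE = re.compile(r"(\d+(?:\.\d+)?|'[^']*'|\w+)::(\w+)")


def translate(sql: str) -> Optional[str]:
    """Return a Spark SQL version of `sql`, or None if it should be skipped."""
    lowered = sql.lower()
    if any(token in lowered for token in SKIP_TOKENS):
        return None

    # Skip the query if any cast targets a type outside the assumed mapping.
    for match in CAST_RE.finditer(sql):
        if match.group(2).upper() not in TYPE_MAP:
            return None

    # Rewrite `expr::TYPE` as `CAST(expr AS TYPE)`.
    sql = CAST_RE.sub(
        lambda m: f"CAST({m.group(1)} AS {TYPE_MAP[m.group(2).upper()]})",
        sql,
    )
    # Rewrite DataFusion's make_array(...) as Spark's array(...).
    return re.sub(r"\bmake_array\s*\(", "array(", sql, flags=re.IGNORECASE)


if __name__ == "__main__":
    for query in ("SELECT abs(-1::INT)", "SELECT make_array(1, 2)", "SELECT arrow_typeof(1)"):
        print(query, "->", translate(query))
```

Returning `None` for skipped queries keeps "untranslatable" distinct from "translated but mismatched", which is how the skipped count can be reported separately from the failures.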
## Are there any user-facing changes?

No. This is a developer tool for validating test correctness.
