[PR] feat: add PySpark validation script for datafusion-spark .slt tests [datafusion]

via GitHub Thu, 09 Apr 2026 07:03:43 -0700


andygrove opened a new pull request, #21508:
URL: https://github.com/apache/datafusion/pull/21508


   ## Which issue does this PR close?
   
   - Closes #17045.
   
   ## Rationale for this change
   
   The datafusion-spark `.slt` test files contain hardcoded expected values 
that were originally derived from PySpark but have never been systematically 
validated against actual Spark. This creates two risks: drift as Spark evolves, 
and errors baked in during the original porting process. This PR adds tooling 
to detect both.
   
   ## What changes are included in this PR?
   
   A Python validation script (`datafusion/spark/scripts/validate_slt.py`) that:
   
   1. **Parses `.slt` files** — extracts query blocks, expected results, error 
queries, and DDL statements
   2. **Translates DataFusion SQL to PySpark SQL** — converts `::TYPE` casts to 
`CAST()`, `arrow_cast()` to Spark types, and `make_array()` to `array()`. Skips 
untranslatable constructs (`arrow_typeof`, `spark_cast`, Dictionary/Utf8View 
types, ANSI mode blocks)
   3. **Runs queries in PySpark** — executes translated queries against a local 
SparkSession and formats results to match `.slt` conventions
   4. **Compares results** — reports mismatches between PySpark output and 
hardcoded expected values
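
   As a rough illustration of steps 1 and 2, the sketch below parses `query` 
records and rewrites `::TYPE` casts and `make_array()`. It is not the code in 
`validate_slt.py`: the names `QueryBlock`, `parse_query_blocks`, `translate`, 
and `is_translatable` are hypothetical, the cast regex only handles simple 
operands, and type names are passed through unchanged.

```python
import re
from dataclasses import dataclass, field


@dataclass
class QueryBlock:
    """One `query` record from an .slt file (hypothetical structure)."""
    sql: str
    expected: list = field(default_factory=list)


def parse_query_blocks(text):
    """Collect SQL up to `----`, then expected rows up to the next blank line."""
    blocks = []
    lines = iter(text.splitlines())
    for line in lines:
        # Only plain `query <types>` headers; `query error` and statements are skipped here.
        if not line.startswith("query ") or line.startswith("query error"):
            continue
        sql, expected, in_results = [], [], False
        for body in lines:  # shares the iterator, so the outer loop resumes after this record
            if not body.strip():
                break
            if body.strip() == "----":
                in_results = True
            elif in_results:
                expected.append(body)
            else:
                sql.append(body)
        blocks.append(QueryBlock(" ".join(sql).rstrip(";"), expected))
    return blocks


# Operand is a parenthesised group without nesting, a quoted string, or a bare token.
CAST_RE = re.compile(r"(\([^()]*\)|'[^']*'|[\w.]+)::(\w+)")

# Constructs the PR description lists as untranslatable (substring check is an assumption).
UNSUPPORTED = ("arrow_typeof", "spark_cast", "Dictionary", "Utf8View")


def is_translatable(sql):
    return not any(token in sql for token in UNSUPPORTED)


def translate(sql):
    """Rewrite `expr::TYPE` to `CAST(expr AS TYPE)` and `make_array(` to `array(`."""
    sql = CAST_RE.sub(r"CAST(\1 AS \2)", sql)
    return re.sub(r"\bmake_array\(", "array(", sql)


SAMPLE = """\
query I
SELECT abs((-1)::INT);
----
1

query T
SELECT arrow_typeof(1);
----
Int64
"""

blocks = parse_query_blocks(SAMPLE)
translated = [translate(b.sql) for b in blocks if is_translatable(b.sql)]
```

   A real translator would also need to map DataFusion type names to Spark 
ones and handle nested expressions, which a single regex cannot do reliably.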
   
   Usage:
   ```bash
   # Validate all .slt files
   python datafusion/spark/scripts/validate_slt.py
   
   # Validate specific file or subdirectory
   python datafusion/spark/scripts/validate_slt.py --path math/abs.slt
   python datafusion/spark/scripts/validate_slt.py --path string/ --verbose
   ```
   
   Initial validation results across 239 files (1919 queries): 1157 passed, 364 
failed, 398 skipped. The failures surface genuine findings including timestamp 
formatting differences, Float32 precision issues, function naming mismatches, 
and interval handling gaps.
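
   The passed/failed/skipped tallies above could come from a comparison step 
along the following lines. This is a sketch under assumptions, not the 
script's actual logic: `format_value`, `format_rows`, and `classify` are 
hypothetical names, rows are plain Python tuples rather than PySpark `Row` 
objects, and the `NULL` / `(empty)` / lowercase boolean renderings are assumed 
to be the `.slt` conventions in use.

```python
from collections import Counter


def format_value(value):
    """Render one cell the way the expected output is assumed to be written."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "true" if value else "false"
    if value == "":
        return "(empty)"
    return str(value)


def format_rows(rows):
    """Join each row's cells with single spaces, one string per row."""
    return [" ".join(format_value(v) for v in row) for row in rows]


def classify(actual_rows, expected):
    """Return 'skipped' when no query was run, else compare formatted output."""
    if actual_rows is None:
        return "skipped"
    return "passed" if format_rows(actual_rows) == expected else "failed"


# Hypothetical cases: (rows returned by Spark or None if skipped, expected .slt rows).
cases = [
    ([(1,)], ["1"]),
    ([(2.0,)], ["2"]),  # formatting difference counts as a failure
    (None, ["Int64"]),  # untranslatable query, never executed
]
tally = Counter(classify(actual, expected) for actual, expected in cases)
```

   An exact string comparison like this is strict by design, so formatting 
differences are reported as failures in the same way as wrong values.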
   
   ## Are these changes tested?
   
   The script is a validation/developer tool (not library code). It was tested 
by running it against the full set of 239 `.slt` files across all function 
categories. Results were verified manually against known Spark behavior for 
representative queries (abs, ascii, hex, array_contains, date_trunc, etc.).
   
   ## Are there any user-facing changes?
   
   No. This is a developer tool for validating test correctness.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

