Re: [PR] feat: add PySpark validation script for datafusion-spark .slt tests [datafusion]

via GitHub Fri, 10 Apr 2026 14:55:22 -0700


shehabgamin commented on code in PR #21508:
URL: https://github.com/apache/datafusion/pull/21508#discussion_r3066833812



##########
datafusion/spark/scripts/known-failures.txt:
##########
@@ -0,0 +1,74 @@
+# Known failures for PySpark SLT validation.
+# Each line is a .slt file path (relative to spark test dir) to skip.
+# Blank lines and lines starting with # are ignored.
+
+# format_string %t specifiers pass microseconds where Java expects milliseconds
+# https://github.com/apache/datafusion/issues/21515
+string/format_string.slt
+
+# format_string %f/%e/%g/%a not supported with Decimal types in Spark
+# format_string %c requires INT not BIGINT in Spark
+# (also covered by #21515)
+
+# substring handles large negative start positions differently from Spark
+# https://github.com/apache/datafusion/issues/21510
+string/substring.slt
+
+# array_repeat incorrectly returns NULL when element is NULL
+# https://github.com/apache/datafusion/issues/21512
+array/array_repeat.slt
+
+# mod/pmod returns NaN instead of NULL for float division by zero
+# https://github.com/apache/datafusion/issues/21514
+math/mod.slt
+
+# String literal escape sequences (\t, \n) not interpreted like Spark
+# https://github.com/apache/datafusion/issues/21516
+string/soundex.slt
+
+# array_contains rejects NULL typed arguments in Spark
+array/array_contains.slt
+
+# Spark's shuffle() only takes 1 argument, not 2 (no seed parameter)
+array/shuffle.slt
+
+# map(array, array) creates single-entry map in Spark, not map_from_arrays
+# Wrong test: uses map() where map_from_arrays() was intended
+collection/size.slt
+
+# date_add/date_sub overflow: Spark errors, DataFusion wraps
+# CAST(date AS INT) not supported in Spark
+datetime/date_add.slt
+datetime/date_sub.slt
+
+# Interval formatting differs between DataFusion and Spark
+datetime/make_dt_interval.slt
+datetime/make_interval.slt
+
+# TIME data type is Spark 4.0 only

Review Comment:
   Related to 
https://github.com/apache/datafusion/pull/21508#discussion_r3066824998



##########
.github/workflows/spark-pyspark-validation.yml:
##########
@@ -0,0 +1,63 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Spark PySpark Validation
+
+on:
+  push:
+    branches-ignore:
+      - 'gh-readonly-queue/**'
+    paths:
+      - 'datafusion/spark/**'
+      - 'datafusion/sqllogictest/test_files/spark/**'
+      - '.github/workflows/spark-pyspark-validation.yml'
+  pull_request:
+    paths:
+      - 'datafusion/spark/**'
+      - 'datafusion/sqllogictest/test_files/spark/**'
+      - '.github/workflows/spark-pyspark-validation.yml'
+
+permissions:
+  contents: read
+
+concurrency:
+  group: ${{ github.repository }}-${{ github.head_ref || github.sha }}-${{ 
github.workflow }}
+  cancel-in-progress: true
+
+jobs:
+  pyspark-validation:
+    runs-on: ubuntu-latest
+    name: Validate .slt tests against PySpark
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.11'
+
+      - name: Set up Java
+        uses: actions/setup-java@v4
+        with:
+          distribution: 'temurin'
+          java-version: '17'
+
+      - name: Install PySpark
+        run: pip install pyspark==3.5.5

Review Comment:
   We should validate with the latest version of Spark (4.1.x). Any test case 
that passes on Spark 3.5 will also pass on later versions.



##########
datafusion/spark/scripts/validate_slt.py:
##########
@@ -0,0 +1,1210 @@
+#!/usr/bin/env python3
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""
+Validate hardcoded expected values in .slt (sqllogictest) test files
+by running the same queries against PySpark and comparing results.
+
+Usage:
+    python validate_slt.py                          # Run all .slt files
+    python validate_slt.py --path math/abs.slt      # Single file
+    python validate_slt.py --path string/           # All files in subdirectory
+    python validate_slt.py --verbose                 # Show details
+    python validate_slt.py --show-skipped            # Show skipped queries
+"""
+
+import argparse
+import math
+import os
+import re
+import sys
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Optional
+
+# ---------------------------------------------------------------------------
+# Arrow type -> Spark type mapping
+# ---------------------------------------------------------------------------
+ARROW_TO_SPARK_TYPE = {
+    "Int8": "TINYINT",
+    "Int16": "SMALLINT",
+    "Int32": "INT",
+    "Int64": "BIGINT",
+    "UInt8": "SMALLINT",
+    "UInt16": "INT",
+    "UInt32": "BIGINT",
+    "UInt64": "BIGINT",

Review Comment:
   I don't think functions should be returning UInt64 values, but if they are 
and that's acceptable, then down below we can also add `Utf8View`, `LargeUtf8`, 
`BinaryView`, `LargeBinary`, etc...



##########
datafusion/spark/scripts/validate_slt.py:
##########
@@ -0,0 +1,1210 @@
+#!/usr/bin/env python3
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""
+Validate hardcoded expected values in .slt (sqllogictest) test files
+by running the same queries against PySpark and comparing results.
+
+Usage:
+    python validate_slt.py                          # Run all .slt files
+    python validate_slt.py --path math/abs.slt      # Single file
+    python validate_slt.py --path string/           # All files in subdirectory
+    python validate_slt.py --verbose                 # Show details
+    python validate_slt.py --show-skipped            # Show skipped queries
+"""
+
+import argparse
+import math
+import os
+import re
+import sys
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Optional
+
+# ---------------------------------------------------------------------------
+# Arrow type -> Spark type mapping
+# ---------------------------------------------------------------------------
+ARROW_TO_SPARK_TYPE = {
+    "Int8": "TINYINT",
+    "Int16": "SMALLINT",
+    "Int32": "INT",
+    "Int64": "BIGINT",
+    "UInt8": "SMALLINT",
+    "UInt16": "INT",
+    "UInt32": "BIGINT",
+    "UInt64": "BIGINT",
+    "Float16": "FLOAT",
+    "Float32": "FLOAT",
+    "Float64": "DOUBLE",
+    "Utf8": "STRING",
+    "Boolean": "BOOLEAN",
+    "Binary": "BINARY",
+    "Date32": "DATE",
+    "Date64": "DATE",
+}
+
+# DataFusion cast type -> Spark type mapping
+DF_TO_SPARK_CAST_TYPE = {
+    "TINYINT": "TINYINT",
+    "SMALLINT": "SMALLINT",
+    "INT": "INT",
+    "INTEGER": "INT",
+    "BIGINT": "BIGINT",
+    "FLOAT": "FLOAT",
+    "REAL": "FLOAT",
+    "DOUBLE": "DOUBLE",
+    "STRING": "STRING",
+    "VARCHAR": "STRING",
+    "TEXT": "STRING",
+    "BOOLEAN": "BOOLEAN",
+    "BINARY": "BINARY",
+    "DATE": "DATE",
+    "TIMESTAMP": "TIMESTAMP",
+    # PostgreSQL-style aliases used in some .slt files
+    "FLOAT8": "DOUBLE",
+    "FLOAT4": "FLOAT",
+    "INT8": "BIGINT",
+    "INT4": "INT",
+    "INT2": "SMALLINT",
+    "BYTEA": "BINARY",
+}
+
+# Unsupported Arrow types for Spark
+UNSUPPORTED_ARROW_TYPES = {
+    "Utf8View",
+    "LargeUtf8",
+    "LargeBinary",
+    "BinaryView",
+}
+
+# ---------------------------------------------------------------------------
+# SLT record types
+# ---------------------------------------------------------------------------
+
+
+@dataclass
+class QueryRecord:
+    """A 'query <TYPE_CODES> [rowsort]' block."""
+
+    type_codes: str
+    sql: str
+    expected: list[str]
+    rowsort: bool
+    line_number: int
+    in_ansi_block: bool = False
+
+
+@dataclass
+class ErrorRecord:
+    """A 'query error <pattern>' or 'statement error <pattern>' block."""
+
+    pattern: str
+    sql: str
+    line_number: int
+    kind: str = "query"  # "query" or "statement"
+    in_ansi_block: bool = False
+
+
+@dataclass
+class StatementRecord:
+    """A 'statement ok' block (DDL/config)."""
+
+    sql: str
+    line_number: int
+    in_ansi_block: bool = False
+
+
+# ---------------------------------------------------------------------------
+# 1. SLT Parser
+# ---------------------------------------------------------------------------
+
+
+def parse_slt(filepath: str) -> list:
+    """Parse an .slt file into a list of records."""
+    with open(filepath) as f:
+        lines = f.readlines()
+
+    records = []
+    i = 0
+    in_ansi_mode = False
+
+    while i < len(lines):
+        line = lines[i].rstrip("\n")
+
+        # Skip blank lines and comments
+        if not line.strip() or line.strip().startswith("#"):
+            i += 1
+            continue
+
+        # query error <pattern>
+        m = re.match(r"^query\s+error\s+(.*)", line)
+        if m:
+            pattern = m.group(1).strip()
+            line_num = i + 1
+            i += 1
+            sql_lines = []
+            while i < len(lines) and lines[i].strip() and not 
lines[i].strip().startswith("#"):
+                stripped = lines[i].rstrip("\n")
+                if (
+                    re.match(r"^query\s", stripped)
+                    or re.match(r"^statement\s", stripped)
+                ):
+                    break
+                sql_lines.append(stripped)
+                i += 1
+            records.append(
+                ErrorRecord(
+                    pattern=pattern,
+                    sql="\n".join(sql_lines),
+                    line_number=line_num,
+                    kind="query",
+                    in_ansi_block=in_ansi_mode,
+                )
+            )
+            continue
+
+        # statement error <pattern>
+        m = re.match(r"^statement\s+error\s*(.*)", line)
+        if m:
+            pattern = m.group(1).strip()
+            line_num = i + 1
+            i += 1
+            sql_lines = []
+            while i < len(lines) and lines[i].strip() and not 
lines[i].strip().startswith("#"):
+                stripped = lines[i].rstrip("\n")
+                if (
+                    re.match(r"^query\s", stripped)
+                    or re.match(r"^statement\s", stripped)
+                ):
+                    break
+                sql_lines.append(stripped)
+                i += 1
+            records.append(
+                ErrorRecord(
+                    pattern=pattern,
+                    sql="\n".join(sql_lines),
+                    line_number=line_num,
+                    kind="statement",
+                    in_ansi_block=in_ansi_mode,
+                )
+            )
+            continue
+
+        # statement ok
+        m = re.match(r"^statement\s+ok\s*$", line)
+        if m:
+            line_num = i + 1
+            i += 1
+            sql_lines = []
+            while i < len(lines) and lines[i].strip() and not 
lines[i].strip().startswith("#"):
+                stripped = lines[i].rstrip("\n")
+                if (
+                    re.match(r"^query\s", stripped)
+                    or re.match(r"^statement\s", stripped)
+                ):
+                    break
+                sql_lines.append(stripped)
+                i += 1
+            sql = "\n".join(sql_lines)
+
+            # Track ANSI mode from statements
+            if re.search(
+                r"set\s+datafusion\.execution\.enable_ansi_mode\s*=\s*true",
+                sql,
+                re.IGNORECASE,
+            ):
+                in_ansi_mode = True
+            elif re.search(
+                r"set\s+datafusion\.execution\.enable_ansi_mode\s*=\s*false",
+                sql,
+                re.IGNORECASE,
+            ):
+                in_ansi_mode = False
+
+            records.append(
+                StatementRecord(
+                    sql=sql, line_number=line_num, in_ansi_block=in_ansi_mode
+                )
+            )
+            continue
+
+        # query <TYPE_CODES> [rowsort]
+        m = re.match(r"^query\s+(\S+)(\s+rowsort)?\s*$", line)
+        if m:
+            type_codes = m.group(1)
+            rowsort = m.group(2) is not None
+            line_num = i + 1
+            i += 1
+
+            # Collect SQL lines until ----
+            sql_lines = []
+            while i < len(lines) and lines[i].rstrip("\n") != "----":
+                sql_lines.append(lines[i].rstrip("\n"))
+                i += 1
+
+            # Skip the ---- separator
+            if i < len(lines) and lines[i].rstrip("\n") == "----":
+                i += 1
+
+            # Collect expected result lines until blank line or next record.
+            # Note: do NOT treat # as a comment here — result values can
+            # start with # (e.g., soundex('#') -> '#').
+            expected = []
+            while i < len(lines):
+                result_line = lines[i].rstrip("\n")
+                if result_line == "":
+                    i += 1
+                    break
+                if re.match(r"^(query|statement)\s", result_line):
+                    break
+                # A ## comment line in the results section signals end of 
results
+                if result_line.startswith("##"):
+                    break
+                expected.append(result_line)
+                i += 1
+
+            records.append(
+                QueryRecord(
+                    type_codes=type_codes,
+                    sql="\n".join(sql_lines),
+                    expected=expected,
+                    rowsort=rowsort,
+                    line_number=line_num,
+                    in_ansi_block=in_ansi_mode,
+                )
+            )
+            continue
+
+        # Unknown line, skip
+        i += 1
+
+    return records
+
+
+# ---------------------------------------------------------------------------
+# 2. SQL Translator (DataFusion -> PySpark)

Review Comment:
   Thoughts on using something like SQLGlot?



##########
datafusion/spark/scripts/validate_slt.py:
##########
@@ -0,0 +1,1210 @@
+#!/usr/bin/env python3
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""
+Validate hardcoded expected values in .slt (sqllogictest) test files
+by running the same queries against PySpark and comparing results.
+
+Usage:
+    python validate_slt.py                          # Run all .slt files
+    python validate_slt.py --path math/abs.slt      # Single file
+    python validate_slt.py --path string/           # All files in subdirectory
+    python validate_slt.py --verbose                 # Show details
+    python validate_slt.py --show-skipped            # Show skipped queries
+"""
+
+import argparse
+import math
+import os
+import re
+import sys
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Optional
+
+# ---------------------------------------------------------------------------
+# Arrow type -> Spark type mapping
+# ---------------------------------------------------------------------------
+ARROW_TO_SPARK_TYPE = {
+    "Int8": "TINYINT",
+    "Int16": "SMALLINT",
+    "Int32": "INT",
+    "Int64": "BIGINT",
+    "UInt8": "SMALLINT",
+    "UInt16": "INT",
+    "UInt32": "BIGINT",
+    "UInt64": "BIGINT",
+    "Float16": "FLOAT",
+    "Float32": "FLOAT",
+    "Float64": "DOUBLE",
+    "Utf8": "STRING",
+    "Boolean": "BOOLEAN",
+    "Binary": "BINARY",
+    "Date32": "DATE",
+    "Date64": "DATE",
+}
+
+# DataFusion cast type -> Spark type mapping
+DF_TO_SPARK_CAST_TYPE = {
+    "TINYINT": "TINYINT",
+    "SMALLINT": "SMALLINT",
+    "INT": "INT",
+    "INTEGER": "INT",
+    "BIGINT": "BIGINT",
+    "FLOAT": "FLOAT",
+    "REAL": "FLOAT",
+    "DOUBLE": "DOUBLE",
+    "STRING": "STRING",
+    "VARCHAR": "STRING",
+    "TEXT": "STRING",
+    "BOOLEAN": "BOOLEAN",
+    "BINARY": "BINARY",
+    "DATE": "DATE",
+    "TIMESTAMP": "TIMESTAMP",
+    # PostgreSQL-style aliases used in some .slt files
+    "FLOAT8": "DOUBLE",
+    "FLOAT4": "FLOAT",
+    "INT8": "BIGINT",
+    "INT4": "INT",
+    "INT2": "SMALLINT",
+    "BYTEA": "BINARY",
+}
+
+# Unsupported Arrow types for Spark
+UNSUPPORTED_ARROW_TYPES = {

Review Comment:
   Related 
https://github.com/apache/datafusion/pull/21508#discussion_r3066865558



##########
datafusion/spark/scripts/validate_slt.py:
##########
@@ -0,0 +1,1210 @@
+#!/usr/bin/env python3
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""
+Validate hardcoded expected values in .slt (sqllogictest) test files
+by running the same queries against PySpark and comparing results.
+
+Usage:
+    python validate_slt.py                          # Run all .slt files
+    python validate_slt.py --path math/abs.slt      # Single file
+    python validate_slt.py --path string/           # All files in subdirectory
+    python validate_slt.py --verbose                 # Show details
+    python validate_slt.py --show-skipped            # Show skipped queries
+"""
+
+import argparse
+import math
+import os
+import re
+import sys
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Optional
+
+# ---------------------------------------------------------------------------
+# Arrow type -> Spark type mapping
+# ---------------------------------------------------------------------------
+ARROW_TO_SPARK_TYPE = {
+    "Int8": "TINYINT",
+    "Int16": "SMALLINT",
+    "Int32": "INT",
+    "Int64": "BIGINT",
+    "UInt8": "SMALLINT",
+    "UInt16": "INT",
+    "UInt32": "BIGINT",
+    "UInt64": "BIGINT",
+    "Float16": "FLOAT",
+    "Float32": "FLOAT",
+    "Float64": "DOUBLE",
+    "Utf8": "STRING",
+    "Boolean": "BOOLEAN",
+    "Binary": "BINARY",
+    "Date32": "DATE",
+    "Date64": "DATE",

Review Comment:
   Does this map cleanly?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: add PySpark validation script for datafusion-spark .slt tests [datafusion]

Reply via email to