This is an automated email from the ASF dual-hosted git repository.
dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new f041118ff1f5 [SPARK-53861][PYTHON][INFRA] Factor out streaming tests from `pyspark-sql` and `pyspark-connect`
f041118ff1f5 is described below
commit f041118ff1f5764bc4abf43b99de836417b1892f
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Thu Oct 9 22:27:20 2025 -0700
[SPARK-53861][PYTHON][INFRA] Factor out streaming tests from `pyspark-sql` and `pyspark-connect`
### What changes were proposed in this pull request?
Factor out streaming tests from `pyspark-sql` and `pyspark-connect`
### Why are the changes needed?
`pyspark-sql` and `pyspark-connect` are still prone to timeouts, even after we increased `timeout-minutes` to `150`; see https://github.com/apache/spark/actions/runs/18389137953/workflow
<img width="1080" height="106" alt="image"
src="https://github.com/user-attachments/assets/1a4ecabe-2a4c-4306-af71-fbaf4ee27b42"
/>
The streaming tests are large, so this change moves them to dedicated test modules to speed up CI:
```
Starting test(python3.11): pyspark.sql.tests.pandas.test_pandas_transform_with_state (temp output: /Users/runner/work/spark/spark/python/target/92efa305-098c-4839-8bb4-d13c9b60a405/python3.11__pyspark.sql.tests.pandas.test_pandas_transform_with_state__o7ragjh2.log)
Finished test(python3.11): pyspark.sql.tests.pandas.test_pandas_transform_with_state (1509s) ... 2 tests were skipped
Starting test(python3.11): pyspark.sql.tests.pandas.test_pandas_transform_with_state_checkpoint_v2 (temp output: /Users/runner/work/spark/spark/python/target/929d4ad3-5518-4011-85a9-b14355974ead/python3.11__pyspark.sql.tests.pandas.test_pandas_transform_with_state_checkpoint_v2__m41avtp4.log)
Finished test(python3.11): pyspark.sql.tests.pandas.test_pandas_transform_with_state_checkpoint_v2 (1537s) ... 2 tests were skipped
```
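These two suites alone account for 1509s + 1537s ≈ 51 minutes of sequential runtime, a large share of the 150-minute job budget by themselves. (For local runs, something like `python/run-tests --modules=pyspark-structured-streaming` should target just the new module, assuming the usual PySpark dev setup.)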
### Does this PR introduce _any_ user-facing change?
No, infra-only.
### How was this patch tested?
After this change, in the PR builder:
<img width="1157" height="128" alt="image"
src="https://github.com/user-attachments/assets/02b78417-7dd5-4cde-b835-ec53624a9307"
/>
<img width="536" height="99" alt="image"
src="https://github.com/user-attachments/assets/643269ff-2ddb-4b4c-aaf6-f930681114e1"
/>
<img width="993" height="104" alt="image"
src="https://github.com/user-attachments/assets/bdf9c3d2-2ed5-4952-8fba-675742b6b37d"
/>
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #52564 from zhengruifeng/infra_ss_module.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
---
.github/workflows/build_and_test.yml | 3 ++
.github/workflows/python_hosted_runner_test.yml | 2 +
dev/sparktestsupport/modules.py | 69 ++++++++++++++++++-------
dev/sparktestsupport/utils.py | 11 ++--
4 files changed, 62 insertions(+), 23 deletions(-)
diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index 40f2a353d3ce..4bd4bc17d078 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -515,6 +515,8 @@ jobs:
pyspark-core, pyspark-errors, pyspark-streaming, pyspark-logger
- >-
pyspark-mllib, pyspark-ml, pyspark-ml-connect, pyspark-pipelines
+ - >-
+ pyspark-structured-streaming, pyspark-structured-streaming-connect
- >-
pyspark-connect
- >-
@@ -532,6 +534,7 @@ jobs:
- modules: ${{ fromJson(needs.precondition.outputs.required).pyspark != 'true' && 'pyspark-sql, pyspark-resource, pyspark-testing' }}
- modules: ${{ fromJson(needs.precondition.outputs.required).pyspark != 'true' && 'pyspark-core, pyspark-errors, pyspark-streaming, pyspark-logger' }}
- modules: ${{ fromJson(needs.precondition.outputs.required).pyspark != 'true' && 'pyspark-mllib, pyspark-ml, pyspark-ml-connect' }}
+ - modules: ${{ fromJson(needs.precondition.outputs.required).pyspark != 'true' && 'pyspark-structured-streaming, pyspark-structured-streaming-connect' }}
- modules: ${{ fromJson(needs.precondition.outputs.required).pyspark != 'true' && 'pyspark-connect' }}
# Always run if pyspark-pandas == 'true', even infra-image is skip (such as non-master job)
# In practice, the build will run in individual PR, but not against the individual commit
diff --git a/.github/workflows/python_hosted_runner_test.yml b/.github/workflows/python_hosted_runner_test.yml
index 36b086b79e4a..89e0a9435319 100644
--- a/.github/workflows/python_hosted_runner_test.yml
+++ b/.github/workflows/python_hosted_runner_test.yml
@@ -75,6 +75,8 @@ jobs:
pyspark-core, pyspark-errors, pyspark-streaming
- >-
pyspark-mllib, pyspark-ml, pyspark-ml-connect
+ - >-
+ pyspark-structured-streaming, pyspark-structured-streaming-connect
- >-
pyspark-connect
- >-
diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index 2397bc9c7962..8aab600071dc 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -512,9 +512,6 @@ pyspark_sql = Module(
"pyspark.sql.functions.partitioning",
"pyspark.sql.merge",
"pyspark.sql.readwriter",
- "pyspark.sql.streaming.query",
- "pyspark.sql.streaming.readwriter",
- "pyspark.sql.streaming.listener",
"pyspark.sql.udf",
"pyspark.sql.udtf",
"pyspark.sql.avro.functions",
@@ -562,10 +559,7 @@ pyspark_sql = Module(
"pyspark.sql.tests.arrow.test_arrow_udtf",
"pyspark.sql.tests.pandas.test_pandas_cogrouped_map",
"pyspark.sql.tests.pandas.test_pandas_grouped_map",
- "pyspark.sql.tests.pandas.test_pandas_grouped_map_with_state",
"pyspark.sql.tests.pandas.test_pandas_map",
- "pyspark.sql.tests.pandas.test_pandas_transform_with_state",
- "pyspark.sql.tests.pandas.test_pandas_transform_with_state_checkpoint_v2",
"pyspark.sql.tests.pandas.test_pandas_udf",
"pyspark.sql.tests.pandas.test_pandas_udf_grouped_agg",
"pyspark.sql.tests.pandas.test_pandas_udf_scalar",
@@ -575,14 +569,9 @@ pyspark_sql = Module(
"pyspark.sql.tests.pandas.test_pandas_sqlmetrics",
"pyspark.sql.tests.pandas.test_converter",
"pyspark.sql.tests.test_python_datasource",
- "pyspark.sql.tests.test_python_streaming_datasource",
"pyspark.sql.tests.test_readwriter",
"pyspark.sql.tests.test_serde",
"pyspark.sql.tests.test_session",
- "pyspark.sql.tests.streaming.test_streaming",
- "pyspark.sql.tests.streaming.test_streaming_foreach",
- "pyspark.sql.tests.streaming.test_streaming_foreach_batch",
- "pyspark.sql.tests.streaming.test_streaming_listener",
"pyspark.sql.tests.test_subquery",
"pyspark.sql.tests.test_types",
"pyspark.sql.tests.test_udf",
@@ -650,6 +639,34 @@ pyspark_streaming = Module(
)
+pyspark_structured_streaming = Module(
+ name="pyspark-structured-streaming",
+ dependencies=[pyspark_core, pyspark_streaming, pyspark_sql],
+ source_file_regexes=[
+ "python/pyspark/sql/streaming",
+ "python/pyspark/sql/pandas",
+ "python/pyspark/sql/worker",
+ ],
+ python_test_goals=[
+ # doctests
+ "pyspark.sql.streaming.query",
+ "pyspark.sql.streaming.readwriter",
+ "pyspark.sql.streaming.listener",
+ # unittests
+ "pyspark.sql.tests.test_python_streaming_datasource",
+ "pyspark.sql.tests.streaming.test_streaming",
+ "pyspark.sql.tests.streaming.test_streaming_foreach",
+ "pyspark.sql.tests.streaming.test_streaming_foreach_batch",
+ "pyspark.sql.tests.streaming.test_streaming_listener",
+ "pyspark.sql.tests.pandas.test_pandas_grouped_map_with_state",
+ "pyspark.sql.tests.pandas.test_pandas_transform_with_state",
+ "pyspark.sql.tests.pandas.test_pandas_transform_with_state_checkpoint_v2",
+ ],
+ excluded_python_implementations=[
+ "PyPy" # Skip these tests under PyPy since they require numpy and it isn't available there
+ ],
+)
+
pyspark_mllib = Module(
name="pyspark-mllib",
dependencies=[pyspark_core, pyspark_streaming, pyspark_sql, mllib],
@@ -1114,7 +1131,6 @@ pyspark_connect = Module(
"pyspark.sql.tests.connect.test_parity_udtf",
"pyspark.sql.tests.connect.test_parity_tvf",
"pyspark.sql.tests.connect.test_parity_python_datasource",
- "pyspark.sql.tests.connect.test_parity_python_streaming_datasource",
"pyspark.sql.tests.connect.test_parity_frame_plot",
"pyspark.sql.tests.connect.test_parity_frame_plot_plotly",
"pyspark.sql.tests.connect.test_utils",
@@ -1122,11 +1138,6 @@ pyspark_connect = Module(
"pyspark.sql.tests.connect.client.test_artifact_localcluster",
"pyspark.sql.tests.connect.client.test_client",
"pyspark.sql.tests.connect.client.test_reattach",
- "pyspark.sql.tests.connect.streaming.test_parity_streaming",
- "pyspark.sql.tests.connect.streaming.test_parity_listener",
- "pyspark.sql.tests.connect.streaming.test_parity_foreach",
- "pyspark.sql.tests.connect.streaming.test_parity_foreach_batch",
- "pyspark.sql.tests.connect.streaming.test_parity_transform_with_state_pyspark",
"pyspark.sql.tests.connect.test_resources",
"pyspark.sql.tests.connect.shell.test_progress",
"pyspark.sql.tests.connect.test_df_debug",
@@ -1142,13 +1153,11 @@ pyspark_connect = Module(
"pyspark.sql.tests.connect.arrow.test_parity_arrow_udtf",
"pyspark.sql.tests.connect.pandas.test_parity_pandas_map",
"pyspark.sql.tests.connect.pandas.test_parity_pandas_grouped_map",
- "pyspark.sql.tests.connect.pandas.test_parity_pandas_grouped_map_with_state",
"pyspark.sql.tests.connect.pandas.test_parity_pandas_cogrouped_map",
"pyspark.sql.tests.connect.pandas.test_parity_pandas_udf",
"pyspark.sql.tests.connect.pandas.test_parity_pandas_udf_scalar",
"pyspark.sql.tests.connect.pandas.test_parity_pandas_udf_grouped_agg",
"pyspark.sql.tests.connect.pandas.test_parity_pandas_udf_window",
- "pyspark.sql.tests.connect.pandas.test_parity_pandas_transform_with_state",
],
excluded_python_implementations=[
"PyPy" # Skip these tests under PyPy since they require numpy,
pandas, and pyarrow and
@@ -1156,6 +1165,28 @@ pyspark_connect = Module(
],
)
+pyspark_structured_streaming_connect = Module(
+ name="pyspark-structured-streaming-connect",
+ dependencies=[pyspark_connect, pyspark_structured_streaming],
+ source_file_regexes=[
+ "python/pyspark/sql/connect",
+ ],
+ python_test_goals=[
+ # unittests
+ "pyspark.sql.tests.connect.test_parity_python_streaming_datasource",
+ "pyspark.sql.tests.connect.streaming.test_parity_streaming",
+ "pyspark.sql.tests.connect.streaming.test_parity_listener",
+ "pyspark.sql.tests.connect.streaming.test_parity_foreach",
+ "pyspark.sql.tests.connect.streaming.test_parity_foreach_batch",
+ "pyspark.sql.tests.connect.streaming.test_parity_transform_with_state_pyspark",
+ "pyspark.sql.tests.connect.pandas.test_parity_pandas_grouped_map_with_state",
+ "pyspark.sql.tests.connect.pandas.test_parity_pandas_transform_with_state",
+ ],
+ excluded_python_implementations=[
+ "PyPy" # Skip these tests under PyPy since they require numpy and it isn't available there
+ ],
+)
+
pyspark_ml_connect = Module(
name="pyspark-ml-connect",
diff --git a/dev/sparktestsupport/utils.py b/dev/sparktestsupport/utils.py
index 810185980f1a..0fd27f074ebb 100755
--- a/dev/sparktestsupport/utils.py
+++ b/dev/sparktestsupport/utils.py
@@ -117,16 +117,18 @@ def determine_modules_to_test(changed_modules, deduplicated=True):
['avro', 'connect', 'docker-integration-tests', 'examples', 'hive', 'hive-thriftserver',
'mllib', 'protobuf', 'pyspark-connect', 'pyspark-ml', 'pyspark-ml-connect', 'pyspark-mllib',
'pyspark-pandas', 'pyspark-pandas-connect', 'pyspark-pandas-slow',
- 'pyspark-pandas-slow-connect', 'pyspark-pipelines', 'pyspark-sql', 'pyspark-testing', 'repl',
- 'sparkr', 'sql', 'sql-kafka-0-10']
+ 'pyspark-pandas-slow-connect', 'pyspark-pipelines', 'pyspark-sql',
+ 'pyspark-structured-streaming', 'pyspark-structured-streaming-connect',
+ 'pyspark-testing', 'repl', 'sparkr', 'sql', 'sql-kafka-0-10']
>>> sorted([x.name for x in determine_modules_to_test(
... [modules.sparkr, modules.sql], deduplicated=False)])
... # doctest: +NORMALIZE_WHITESPACE
['avro', 'connect', 'docker-integration-tests', 'examples', 'hive', 'hive-thriftserver',
'mllib', 'protobuf', 'pyspark-connect', 'pyspark-ml', 'pyspark-ml-connect', 'pyspark-mllib',
'pyspark-pandas', 'pyspark-pandas-connect', 'pyspark-pandas-slow',
- 'pyspark-pandas-slow-connect', 'pyspark-pipelines', 'pyspark-sql', 'pyspark-testing', 'repl',
- 'sparkr', 'sql', 'sql-kafka-0-10']
+ 'pyspark-pandas-slow-connect', 'pyspark-pipelines', 'pyspark-sql',
+ 'pyspark-structured-streaming', 'pyspark-structured-streaming-connect',
+ 'pyspark-testing', 'repl', 'sparkr', 'sql', 'sql-kafka-0-10']
>>> sorted([x.name for x in determine_modules_to_test(
... [modules.sql, modules.core], deduplicated=False)])
... # doctest: +NORMALIZE_WHITESPACE
@@ -135,6 +137,7 @@ def determine_modules_to_test(changed_modules, deduplicated=True):
'pyspark-core', 'pyspark-ml', 'pyspark-ml-connect', 'pyspark-mllib', 'pyspark-pandas',
'pyspark-pandas-connect', 'pyspark-pandas-slow', 'pyspark-pandas-slow-connect',
'pyspark-pipelines', 'pyspark-resource', 'pyspark-sql', 'pyspark-streaming',
+ 'pyspark-structured-streaming', 'pyspark-structured-streaming-connect',
'pyspark-testing', 'repl', 'root', 'sparkr', 'sql', 'sql-kafka-0-10', 'streaming',
'streaming-kafka-0-10', 'streaming-kinesis-asl']
"""
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]