This is an automated email from the ASF dual-hosted git repository.
dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new f041118ff1f5 [SPARK-53861][PYTHON][INFRA] Factor out streaming tests from `pyspark-sql` and `pyspark-connect`
f041118ff1f5 is described below
commit f041118ff1f5764bc4abf43b99de836417b1892f
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Thu Oct 9 22:27:20 2025 -0700
[SPARK-53861][PYTHON][INFRA] Factor out streaming tests from `pyspark-sql` and `pyspark-connect`
### What changes were proposed in this pull request?
Factor out streaming tests from `pyspark-sql` and `pyspark-connect`
### Why are the changes needed?
`pyspark-sql` and `pyspark-connect` are still prone to timeouts, even after we increased `timeout-minutes` to `150`; see https://github.com/apache/spark/actions/runs/18389137953/workflow
<img width="1080" height="106" alt="image"
src="https://github.com/user-attachments/assets/1a4ecabe-2a4c-4306-af71-fbaf4ee27b42"
/>
The streaming tests are large, so this change moves them to dedicated test modules to speed up CI:
```
Starting test(python3.11): pyspark.sql.tests.pandas.test_pandas_transform_with_state (temp output: /Users/runner/work/spark/spark/python/target/92efa305-098c-4839-8bb4-d13c9b60a405/python3.11__pyspark.sql.tests.pandas.test_pandas_transform_with_state__o7ragjh2.log)
Finished test(python3.11): pyspark.sql.tests.pandas.test_pandas_transform_with_state (1509s) ... 2 tests were skipped
Starting test(python3.11): pyspark.sql.tests.pandas.test_pandas_transform_with_state_checkpoint_v2 (temp output: /Users/runner/work/spark/spark/python/target/929d4ad3-5518-4011-85a9-b14355974ead/python3.11__pyspark.sql.tests.pandas.test_pandas_transform_with_state_checkpoint_v2__m41avtp4.log)
Finished test(python3.11): pyspark.sql.tests.pandas.test_pandas_transform_with_state_checkpoint_v2 (1537s) ... 2 tests were skipped
```
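These two suites alone account for 1509s + 1537s ≈ 51 minutes of sequential runtime, a large share of the 150-minute job budget by themselves. (For local runs, something like `python/run-tests --modules=pyspark-structured-streaming` should target just the new module, assuming the usual PySpark dev setup.)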
### Does this PR introduce _any_ user-facing change?
No, infra-only.
### How was this patch tested?
After this change, in the PR builder:
<img width="1157" height="128" alt="image"
src="https://github.com/user-attachments/assets/02b78417-7dd5-4cde-b835-ec53624a9307"
/>
<img width="536" height="99" alt="image"
src="https://github.com/user-attachments/assets/643269ff-2ddb-4b4c-aaf6-f930681114e1"
/>
<img width="993" height="104" alt="image"
src="https://github.com/user-attachments/assets/bdf9c3d2-2ed5-4952-8fba-675742b6b37d"
/>
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #52564 from zhengruifeng/infra_ss_module.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
---
.github/workflows/build_and_test.yml | 3 ++
.github/workflows/python_hosted_runner_test.yml | 2 +
dev/sparktestsupport/modules.py | 69 ++++++++++++++++++-------
dev/sparktestsupport/utils.py | 11 ++--
4 files changed, 62 insertions(+), 23 deletions(-)
diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index 40f2a353d3ce..4bd4bc17d078 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -515,6 +515,8 @@ jobs:
pyspark-core, pyspark-errors, pyspark-streaming, pyspark-logger
- >-
pyspark-mllib, pyspark-ml, pyspark-ml-connect, pyspark-pipelines
+ - >-
+ pyspark-structured-streaming, pyspark-structured-streaming-connect
- >-
pyspark-connect
- >-
@@ -532,6 +534,7 @@ jobs:
- modules: ${{ fromJson(needs.precondition.outputs.required).pyspark != 'true' && 'pyspark-sql, pyspark-resource, pyspark-testing' }}
- modules: ${{ fromJson(needs.precondition.outputs.required).pyspark != 'true' && 'pyspark-core, pyspark-errors, pyspark-streaming, pyspark-logger' }}
- modules: ${{ fromJson(needs.precondition.outputs.required).pyspark != 'true' && 'pyspark-mllib, pyspark-ml, pyspark-ml-connect' }}
+ - modules: ${{ fromJson(needs.precondition.outputs.required).pyspark != 'true' && 'pyspark-structured-streaming, pyspark-structured-streaming-connect' }}
- modules: ${{ fromJson(needs.precondition.outputs.required).pyspark != 'true' && 'pyspark-connect' }}
# Always run if pyspark-pandas == 'true', even infra-image is skip (such as non-master job)
# In practice, the build will run in individual PR, but not against the individual commit
diff --git a/.github/workflows/python_hosted_runner_test.yml b/.github/workflows/python_hosted_runner_test.yml
index 36b086b79e4a..89e0a9435319 100644
--- a/.github/workflows/python_hosted_runner_test.yml
+++ b/.github/workflows/python_hosted_runner_test.yml
@@ -75,6 +75,8 @@ jobs:
pyspark-core, pyspark-errors, pyspark-streaming
- >-
pyspark-mllib, pyspark-ml, pyspark-ml-connect
+ - >-
+ pyspark-structured-streaming, pyspark-structured-streaming-connect
- >-
pyspark-connect
- >-
diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index 2397bc9c7962..8aab600071dc 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -512,9 +512,6 @@ pyspark_sql = Module(
"pyspark.sql.functions.partitioning",
"pyspark.sql.merge",
"pyspark.sql.readwriter",
- "pyspark.sql.streaming.query",
- "pyspark.sql.streaming.readwriter",
- "pyspark.sql.streaming.listener",
"pyspark.sql.udf",
"pyspark.sql.udtf",
"pyspark.sql.avro.functions",
@@ -562,10 +559,7 @@ pyspark_sql = Module(
"pyspark.sql.tests.arrow.test_arrow_udtf",
"pyspark.sql.tests.pandas.test_pandas_cogrouped_map",
"pyspark.sql.tests.pandas.test_pandas_grouped_map",
- "pyspark.sql.tests.pandas.test_pandas_grouped_map_with_state",
"pyspark.sql.tests.pandas.test_pandas_map",
- "pyspark.sql.tests.pandas.test_pandas_transform_with_state",
- "pyspark.sql.tests.pandas.test_pandas_transform_with_state_checkpoint_v2",
"pyspark.sql.tests.pandas.test_pandas_udf",
"pyspark.sql.tests.pandas.test_pandas_udf_grouped_agg",
"pyspark.sql.tests.pandas.test_pandas_udf_scalar",
@@ -575,14 +569,9 @@ pyspark_sql = Module(
"pyspark.sql.tests.pandas.test_pandas_sqlmetrics",
"pyspark.sql.tests.pandas.test_converter",
"pyspark.sql.tests.test_python_datasource",
- "pyspark.sql.tests.test_python_streaming_datasource",
"pyspark.sql.tests.test_readwriter",
"pyspark.sql.tests.test_serde",
"pyspark.sql.tests.test_session",
- "pyspark.sql.tests.streaming.test_streaming",
- "pyspark.sql.tests.streaming.test_streaming_foreach",
- "pyspark.sql.tests.streaming.test_streaming_foreach_batch",
- "pyspark.sql.tests.streaming.test_streaming_listener",
"pyspark.sql.tests.test_subquery",
"pyspark.sql.tests.test_types",
"pyspark.sql.tests.test_udf",
@@ -650,6 +639,34 @@ pyspark_streaming = Module(
)
+pyspark_structured_streaming = Module(
+ name="pyspark-structured-streaming",
+ dependencies=[pyspark_core, pyspark_streaming, pyspark_sql],
+ source_file_regexes=[
+ "python/pyspark/sql/streaming",
+ "python/pyspark/sql/pandas",
+ "python/pyspark/sql/worker",
+ ],
+ python_test_goals=[
+ # doctests
+ "pyspark.sql.streaming.query",
+ "pyspark.sql.streaming.readwriter",
+ "pyspark.sql.streaming.listener",
+ # unittests
+ "pyspark.sql.tests.test_python_streaming_datasource",
+ "pyspark.sql.tests.streaming.test_streaming",
+ "pyspark.sql.tests.streaming.test_streaming_foreach",
+ "pyspark.sql.tests.streaming.test_streaming_foreach_batch",
+ "pyspark.sql.tests.streaming.test_streaming_listener",
+ "pyspark.sql.tests.pandas.test_pandas_grouped_map_with_state",
+ "pyspark.sql.tests.pandas.test_pandas_transform_with_state",
+ "pyspark.sql.tests.pandas.test_pandas_transform_with_state_checkpoint_v2",
+ ],
+ excluded_python_implementations=[
+ "PyPy" # Skip these tests under PyPy since they require numpy and it isn't available there
+ ],
+)
+
pyspark_mllib = Module(
name="pyspark-mllib",
dependencies=[pyspark_core, pyspark_streaming, pyspark_sql, mllib],
@@ -1114,7 +1131,6 @@ pyspark_connect = Module(
"pyspark.sql.tests.connect.test_parity_udtf",
"pyspark.sql.tests.connect.test_parity_tvf",
"pyspark.sql.tests.connect.test_parity_python_datasource",
- "pyspark.sql.tests.connect.test_parity_python_streaming_datasource",
"pyspark.sql.tests.connect.test_parity_frame_plot",
"pyspark.sql.tests.connect.test_parity_frame_plot_plotly",
"pyspark.sql.tests.connect.test_utils",
@@ -1122,11 +1138,6 @@ pyspark_connect = Module(
"pyspark.sql.tests.connect.client.test_artifact_localcluster",
"pyspark.sql.tests.connect.client.test_client",
"pyspark.sql.tests.connect.client.test_reattach",
- "pyspark.sql.tests.connect.streaming.test_parity_streaming",
- "pyspark.sql.tests.connect.streaming.test_parity_listener",
- "pyspark.sql.tests.connect.streaming.test_parity_foreach",
- "pyspark.sql.tests.connect.streaming.test_parity_foreach_batch",
- "pyspark.sql.tests.connect.streaming.test_parity_transform_with_state_pyspark",
"pyspark.sql.tests.connect.test_resources",
"pyspark.sql.tests.connect.shell.test_progress",
"pyspark.sql.tests.connect.test_df_debug",
@@ -1142,13 +1153,11 @@ pyspark_connect = Module(
"pyspark.sql.tests.connect.arrow.test_parity_arrow_udtf",
"pyspark.sql.tests.connect.pandas.test_parity_pandas_map",
"pyspark.sql.tests.connect.pandas.test_parity_pandas_grouped_map",
- "pyspark.sql.tests.connect.pandas.test_parity_pandas_grouped_map_with_state",
"pyspark.sql.tests.connect.pandas.test_parity_pandas_cogrouped_map",
"pyspark.sql.tests.connect.pandas.test_parity_pandas_udf",
"pyspark.sql.tests.connect.pandas.test_parity_pandas_udf_scalar",
"pyspark.sql.tests.connect.pandas.test_parity_pandas_udf_grouped_agg",
"pyspark.sql.tests.connect.pandas.test_parity_pandas_udf_window",
- "pyspark.sql.tests.connect.pandas.test_parity_pandas_transform_with_state",
],
excluded_python_implementations=[
"PyPy" # Skip these tests under PyPy since they require numpy,
pandas, and pyarrow and
@@ -1156,6 +1165,28 @@ pyspark_connect = Module(
],
)
+pyspark_structured_streaming_connect = Module(
+ name="pyspark-structured-streaming-connect",
+ dependencies=[pyspark_connect, pyspark_structured_streaming],
+ source_file_regexes=[
+ "python/pyspark/sql/connect",
+ ],
+ python_test_goals=[
+ # unittests
+ "pyspark.sql.tests.connect.test_parity_python_streaming_datasource",
+ "pyspark.sql.tests.connect.streaming.test_parity_streaming",
+ "pyspark.sql.tests.connect.streaming.test_parity_listener",
+ "pyspark.sql.tests.connect.streaming.test_parity_foreach",
+ "pyspark.sql.tests.connect.streaming.test_parity_foreach_batch",
+ "pyspark.sql.tests.connect.streaming.test_parity_transform_with_state_pyspark",
+ "pyspark.sql.tests.connect.pandas.test_parity_pandas_grouped_map_with_state",
+ "pyspark.sql.tests.connect.pandas.test_parity_pandas_transform_with_state",
+ ],
+ excluded_python_implementations=[
+ "PyPy" # Skip these tests under PyPy since they require numpy and it isn't available there
+ ],
+)
+
pyspark_ml_connect = Module(
name="pyspark-ml-connect",
diff --git a/dev/sparktestsupport/utils.py b/dev/sparktestsupport/utils.py
index 810185980f1a..0fd27f074ebb 100755
--- a/dev/sparktestsupport/utils.py
+++ b/dev/sparktestsupport/utils.py
@@ -117,16 +117,18 @@ def determine_modules_to_test(changed_modules, deduplicated=True):
['avro', 'connect', 'docker-integration-tests', 'examples', 'hive', 'hive-thriftserver',
'mllib', 'protobuf', 'pyspark-connect', 'pyspark-ml', 'pyspark-ml-connect', 'pyspark-mllib',
'pyspark-pandas', 'pyspark-pandas-connect', 'pyspark-pandas-slow',
- 'pyspark-pandas-slow-connect', 'pyspark-pipelines', 'pyspark-sql', 'pyspark-testing', 'repl',
- 'sparkr', 'sql', 'sql-kafka-0-10']
+ 'pyspark-pandas-slow-connect', 'pyspark-pipelines', 'pyspark-sql',
+ 'pyspark-structured-streaming', 'pyspark-structured-streaming-connect',
+ 'pyspark-testing', 'repl', 'sparkr', 'sql', 'sql-kafka-0-10']
>>> sorted([x.name for x in determine_modules_to_test(
... [modules.sparkr, modules.sql], deduplicated=False)])
... # doctest: +NORMALIZE_WHITESPACE
['avro', 'connect', 'docker-integration-tests', 'examples', 'hive', 'hive-thriftserver',
'mllib', 'protobuf', 'pyspark-connect', 'pyspark-ml', 'pyspark-ml-connect', 'pyspark-mllib',
'pyspark-pandas', 'pyspark-pandas-connect', 'pyspark-pandas-slow',
- 'pyspark-pandas-slow-connect', 'pyspark-pipelines', 'pyspark-sql', 'pyspark-testing', 'repl',
- 'sparkr', 'sql', 'sql-kafka-0-10']
+ 'pyspark-pandas-slow-connect', 'pyspark-pipelines', 'pyspark-sql',
+ 'pyspark-structured-streaming', 'pyspark-structured-streaming-connect',
+ 'pyspark-testing', 'repl', 'sparkr', 'sql', 'sql-kafka-0-10']
>>> sorted([x.name for x in determine_modules_to_test(
... [modules.sql, modules.core], deduplicated=False)])
... # doctest: +NORMALIZE_WHITESPACE
@@ -135,6 +137,7 @@ def determine_modules_to_test(changed_modules, deduplicated=True):
'pyspark-core', 'pyspark-ml', 'pyspark-ml-connect', 'pyspark-mllib', 'pyspark-pandas',
'pyspark-pandas-connect', 'pyspark-pandas-slow', 'pyspark-pandas-slow-connect',
'pyspark-pipelines', 'pyspark-resource', 'pyspark-sql', 'pyspark-streaming',
+ 'pyspark-structured-streaming', 'pyspark-structured-streaming-connect',
'pyspark-testing', 'repl', 'root', 'sparkr', 'sql', 'sql-kafka-0-10', 'streaming',
'streaming-kafka-0-10', 'streaming-kinesis-asl']
"""
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]