This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new daf260f74e12 [SPARK-47831][PS][CONNECT][TESTS] Run Pandas API on Spark for pyspark-connect package
daf260f74e12 is described below
commit daf260f74e12fc5e9fad6091f6230e71a9e6c9c1
Author: Hyukjin Kwon <[email protected]>
AuthorDate: Fri Apr 12 18:39:22 2024 +0900
[SPARK-47831][PS][CONNECT][TESTS] Run Pandas API on Spark for pyspark-connect package
### What changes were proposed in this pull request?
This PR proposes to extend the `pyspark-connect` scheduled job to run the Pandas API on Spark tests as well.
### Why are the changes needed?
To make sure the pure Python library works with Pandas API on Spark.
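For illustration only (not part of this commit), a minimal sketch of what the new job exercises: Pandas API on Spark running against a Spark Connect server from the pure-Python `pyspark-connect` package, with no local JVM. The `sc://localhost` address matches the server started in the workflow below.

```python
# Minimal sketch (not part of this commit): Pandas API on Spark over
# Spark Connect, using only the pure-Python pyspark-connect package.
# Assumes a Spark Connect server is running at sc://localhost.
import pyspark.pandas as ps
from pyspark.sql import SparkSession

# Connect to the running server; no local JVM is started.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

psdf = ps.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
print(psdf.describe())
```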
### Does this PR introduce _any_ user-facing change?
No, test-only.
### How was this patch tested?
https://github.com/HyukjinKwon/spark/actions/runs/8659133747/job/23744381515
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #46001 from HyukjinKwon/test-ps-scheduledjob.
Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
.github/workflows/build_python_connect.yml | 12 +++++++---
python/packaging/connect/setup.py | 26 ++++++++++++++++++++++
.../tests/connect/test_parity_memory_profiler.py | 3 +++
3 files changed, 38 insertions(+), 3 deletions(-)
diff --git a/.github/workflows/build_python_connect.yml b/.github/workflows/build_python_connect.yml
index 8deee026131e..6bd1b4526b0d 100644
--- a/.github/workflows/build_python_connect.yml
+++ b/.github/workflows/build_python_connect.yml
@@ -72,18 +72,24 @@ jobs:
python packaging/connect/setup.py sdist
cd dist
pip install pyspark-connect-*.tar.gz
- pip install scikit-learn torch torchvision torcheval
+ pip install 'six==1.16.0' 'pandas<=2.2.2' scipy 'plotly>=4.8' 'mlflow>=2.8.1' coverage matplotlib openpyxl 'memory-profiler>=0.61.0' 'scikit-learn>=1.3.2' torch torchvision torcheval deepspeed unittest-xml-reporting
- name: Run tests
env:
- SPARK_CONNECT_TESTING_REMOTE: sc://localhost
SPARK_TESTING: 1
+ SPARK_CONNECT_TESTING_REMOTE: sc://localhost
run: |
+ # Make less noisy
+ cp conf/log4j2.properties.template conf/log4j2.properties
+ sed -i 's/rootLogger.level = info/rootLogger.level = warn/g' conf/log4j2.properties
# Start a Spark Connect server
- ./sbin/start-connect-server.sh --jars `find connector/connect/server/target -name spark-connect*SNAPSHOT.jar`
+ ./sbin/start-connect-server.sh --driver-java-options "-Dlog4j.configurationFile=file:$GITHUB_WORKSPACE/conf/log4j2.properties" --jars `find connector/connect/server/target -name spark-connect*SNAPSHOT.jar`
# Remove Py4J and PySpark zipped library to make sure there is no JVM connection
rm python/lib/*
rm -r python/pyspark
+ # Several tests related to catalog require running them sequentially, e.g., writing a table in a listener.
./python/run-tests --parallelism=1 --python-executables=python3 --modules pyspark-connect,pyspark-ml-connect
+ # None of the tests in Pandas API on Spark depend on each other, so run them in parallel
+ ./python/run-tests --parallelism=4 --python-executables=python3 --modules pyspark-pandas-connect-part0,pyspark-pandas-connect-part1,pyspark-pandas-connect-part2,pyspark-pandas-connect-part3
- name: Upload test results to report
if: always()
uses: actions/upload-artifact@v4
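As a side note on the "no JVM connection" step above: once the zipped Py4J/PySpark libraries are removed, only the pure-Python distribution remains importable. A hypothetical sanity check (not part of the workflow) could use `is_remote_only`, the same helper the test change below imports:

```python
# Hypothetical check (not in this commit): with only pyspark-connect
# installed, is_remote_only() reports that no JVM-backed PySpark exists.
from pyspark.util import is_remote_only

assert is_remote_only(), "expected a remote-only (pure Python) installation"
```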
diff --git a/python/packaging/connect/setup.py b/python/packaging/connect/setup.py
index 419ed36b4236..fe1e7486faa9 100755
--- a/python/packaging/connect/setup.py
+++ b/python/packaging/connect/setup.py
@@ -78,6 +78,32 @@ if "SPARK_TESTING" in os.environ:
"pyspark.sql.tests.pandas",
"pyspark.sql.tests.streaming",
"pyspark.ml.tests.connect",
+ "pyspark.pandas.tests",
+ "pyspark.pandas.tests.computation",
+ "pyspark.pandas.tests.data_type_ops",
+ "pyspark.pandas.tests.diff_frames_ops",
+ "pyspark.pandas.tests.frame",
+ "pyspark.pandas.tests.groupby",
+ "pyspark.pandas.tests.indexes",
+ "pyspark.pandas.tests.io",
+ "pyspark.pandas.tests.plot",
+ "pyspark.pandas.tests.resample",
+ "pyspark.pandas.tests.reshape",
+ "pyspark.pandas.tests.series",
+ "pyspark.pandas.tests.window",
+ "pyspark.pandas.tests.connect",
+ "pyspark.pandas.tests.connect.computation",
+ "pyspark.pandas.tests.connect.data_type_ops",
+ "pyspark.pandas.tests.connect.diff_frames_ops",
+ "pyspark.pandas.tests.connect.frame",
+ "pyspark.pandas.tests.connect.groupby",
+ "pyspark.pandas.tests.connect.indexes",
+ "pyspark.pandas.tests.connect.io",
+ "pyspark.pandas.tests.connect.plot",
+ "pyspark.pandas.tests.connect.resample",
+ "pyspark.pandas.tests.connect.reshape",
+ "pyspark.pandas.tests.connect.series",
+ "pyspark.pandas.tests.connect.window",
]
try:
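The setup.py change follows the file's existing pattern: test packages are bundled into the sdist only when `SPARK_TESTING` is set, as in the hunk header above. A simplified sketch of that gating (package list abbreviated for illustration):

```python
# Simplified sketch of the gating pattern in python/packaging/connect/setup.py:
# test packages are only included when SPARK_TESTING is set in the environment.
import os

packages = ["pyspark", "pyspark.sql"]  # abbreviated; the real list is longer
if "SPARK_TESTING" in os.environ:
    packages += [
        "pyspark.pandas.tests",
        "pyspark.pandas.tests.connect",
    ]
```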
diff --git a/python/pyspark/sql/tests/connect/test_parity_memory_profiler.py b/python/pyspark/sql/tests/connect/test_parity_memory_profiler.py
index 513e49a144e5..f95e0bfbf8d6 100644
--- a/python/pyspark/sql/tests/connect/test_parity_memory_profiler.py
+++ b/python/pyspark/sql/tests/connect/test_parity_memory_profiler.py
@@ -18,10 +18,13 @@ import inspect
import os
import unittest
+from pyspark.util import is_remote_only
from pyspark.tests.test_memory_profiler import MemoryProfiler2TestsMixin, _do_computation
from pyspark.testing.connectutils import ReusedConnectTestCase
+# TODO(SPARK-47830): Re-enable MemoryProfilerParityTests for pyspark-connect
[email protected](is_remote_only(), "Skipped for now")
class MemoryProfilerParityTests(MemoryProfiler2TestsMixin, ReusedConnectTestCase):
def setUp(self) -> None:
super().setUp()
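The guard added above skips the parity suite whenever only the pure-Python package is installed, pending SPARK-47830. A minimal, self-contained sketch of the same pattern (the test class and method here are hypothetical):

```python
# Minimal sketch of the skip guard used above: the suite is skipped when
# only the pure-Python pyspark-connect package is installed.
import unittest
from pyspark.util import is_remote_only

@unittest.skipIf(is_remote_only(), "Skipped for now")
class ExampleParityTests(unittest.TestCase):  # hypothetical test class
    def test_something(self) -> None:
        self.assertTrue(True)

if __name__ == "__main__":
    unittest.main()
```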