This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new a9128a08c82d [SPARK-51876][PYTHON][TESTS] Add configuration for `log4j.configurationFile` in the PySpark submission args of `run_individual_python_test`
a9128a08c82d is described below

commit a9128a08c82d1fa847c231d655c1ff58baf022fd
Author: yangjie01 <yangji...@baidu.com>
AuthorDate: Wed Apr 23 19:18:26 2025 +0800

[SPARK-51876][PYTHON][TESTS] Add configuration for `log4j.configurationFile` in the PySpark submission args of `run_individual_python_test`

### What changes were proposed in this pull request?

This PR adds a `log4j.configurationFile` setting to the PySpark submission args built by the `run_individual_python_test` function, so that the Java processes started for the submitted PySpark jobs use a reasonable logging level.

### Why are the changes needed?

This prevents the Java processes behind the PySpark jobs submitted by `run_individual_python_test` from picking up an unexpected logging level, and avoids wasting disk space in GitHub Actions tasks.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- Pass GitHub Actions
- Local test:

```
python/run-tests --testnames pyspark.sql.tests.pandas.test_pandas_transform_with_state --python-executables=python3.11
```

**Before**

The Java process corresponding to the PySpark job unexpectedly uses the log4j test configuration from the Scala test directory. The submission command for the Java process looks like this:

```
/Users/yangjie01/Tools/zulu17/bin/java -cp hive-jackson/*:/Users/yangjie01/spark/conf/:...:/Users/yangjie01/spark/common/kvstore/target/scala-2.13/test-classes/:/Users/yangjie01/spark/common/network-common/target/scala-2.13/test-classes/:/Users/yangjie01/spark/common/network-shuffle/target/scala-2.13/test-classes/:...:/Users/yangjie01/spark/assembly/target/scala-2.13/jars/* -Xmx1g -Djava.io.tmpdir=/Users/yangjie01/spark/python/target/7836a5a1-a39e-4b31-8ab1-f14b901668a4 -Xss4M -XX:+Ig [...]
```

The Java process launched by this PySpark job uses the log4j configuration located in the `common/kvstore/target/scala-2.13/test-classes` directory. The `rootLogger.level` in that configuration file is set to `debug`, which is why a large amount of DEBUG logging appears in `unit-tests.log`.

https://github.com/apache/spark/blob/master/common/kvstore/src/test/resources/log4j2.properties

```
25/04/23 17:02:04.688 main DEBUG MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with annotation org.apache.hadoop.metrics2.annotation.Metric(sampleName="Ops", always=false, valueName="Time", about="", interval=10, type=DEFAULT, value={"GetGroups"})
25/04/23 17:02:04.694 main DEBUG MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with annotation org.apache.hadoop.metrics2.annotation.Metric(sampleName="Ops", always=false, valueName="Time", about="", interval=10, type=DEFAULT, value={"Rate of failed kerberos logins and latency (milliseconds)"})
25/04/23 17:02:04.694 main DEBUG MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with annotation org.apache.hadoop.metrics2.annotation.Metric(sampleName="Ops", always=false, valueName="Time", about="", interval=10, type=DEFAULT, value={"Rate of successful kerberos logins and latency (milliseconds)"})
25/04/23 17:02:04.694 main DEBUG MutableMetricsFactory: field private org.apache.hadoop.metrics2.lib.MutableGaugeInt org.apache.hadoop.security.UserGroupInformation$UgiMetrics.renewalFailures with annotation org.apache.hadoop.metrics2.annotation.Metric(sampleName="Ops", always=false, valueName="Time", about="", interval=10, type=DEFAULT, value={"Renewal failures since last successful login"})
25/04/23 17:02:04.695 main DEBUG MutableMetricsFactory: field private
org.apache.hadoop.metrics2.lib.MutableGaugeLong org.apache.hadoop.security.UserGroupInformation$UgiMetrics.renewalFailuresTotal with annotation org.apache.hadoop.metrics2.annotation.Metric(sampleName="Ops", always=false, valueName="Time", about="", interval=10, type=DEFAULT, value={"Renewal failures since startup"})
25/04/23 17:02:04.696 main DEBUG MetricsSystemImpl: UgiMetrics, User and group related metrics
25/04/23 17:02:04.744 main DEBUG ShutdownHookManager: Adding shutdown hook
25/04/23 17:02:04.758 main DEBUG Shell: Failed to detect a valid hadoop home directory
...
25/04/23 17:02:04.763 main DEBUG Shell: setsid is not available on this machine. So not using it.
25/04/23 17:02:04.763 main DEBUG Shell: setsid exited with exit code 0
25/04/23 17:02:04.808 main DEBUG PythonGatewayServer: Started PythonGatewayServer on localhost/127.0.0.1 with port 50689
25/04/23 17:02:04.921 Thread-2 INFO SparkContext: Running Spark version 4.1.0-SNAPSHOT
25/04/23 17:02:04.922 Thread-2 INFO SparkContext: OS info Mac OS X, 15.4, aarch64
25/04/23 17:02:04.922 Thread-2 INFO SparkContext: Java version 17.0.14
25/04/23 17:02:04.929 Thread-2 DEBUG SecurityUtil: Setting hadoop.security.token.service.use_ip to true
25/04/23 17:02:04.955 Thread-2 DEBUG Groups: Creating new Groups object
25/04/23 17:02:04.955 Thread-2 DEBUG NativeCodeLoader: Trying to load the custom-built native-hadoop library...
...
```

**After**

The Java process corresponding to the PySpark job now uses the log4j configuration specified by `-Dlog4j.configurationFile`:

```
/Users/yangjie01/Tools/zulu17/bin/java -cp hive-jackson/*:/Users/yangjie01/spark/conf/:...:/Users/yangjie01/spark/common/kvstore/target/scala-2.13/test-classes/:/Users/yangjie01/spark/common/network-common/target/scala-2.13/test-classes/:...:/Users/yangjie01/spark/core/target/jars/*:/Users/yangjie01/spark/mllib/target/jars/*:/Users/yangjie01/spark/assembly/target/scala-2.13/jars/slf4j-api-2.0.17.jar:/Users/yangjie01/spark/assembly/target/scala-2.13/jars/* -Xmx1g -Djava.io.tmpdir=/User [...]
```

In `spark.*.extraJavaOptions`, the setting `-Dlog4j.configurationFile=/Users/yangjie01/spark/python/test_support/log4j2.properties` is now present. At the same time, there are no longer DEBUG-level logs in `unit-tests.log`:

```
25/04/23 17:08:23.584 Thread-2 INFO SparkContext: Running Spark version 4.1.0-SNAPSHOT
25/04/23 17:08:23.588 Thread-2 INFO SparkContext: OS info Mac OS X, 15.4, aarch64
25/04/23 17:08:23.588 Thread-2 INFO SparkContext: Java version 17.0.14
25/04/23 17:08:23.625 Thread-2 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/04/23 17:08:23.655 Thread-2 INFO ResourceUtils: ==============================================================
25/04/23 17:08:23.655 Thread-2 INFO ResourceUtils: No custom resources configured for spark.driver.
25/04/23 17:08:23.656 Thread-2 INFO ResourceUtils: ==============================================================
25/04/23 17:08:23.656 Thread-2 INFO SparkContext: Submitted application: ReusedSQLTestCase
25/04/23 17:08:23.669 Thread-2 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
25/04/23 17:08:23.670 Thread-2 INFO ResourceProfile: Limiting resource is cpu
25/04/23 17:08:23.671 Thread-2 INFO ResourceProfileManager: Added ResourceProfile id: 0
25/04/23 17:08:23.698 Thread-2 INFO SecurityManager: Changing view acls to: yangjie01
25/04/23 17:08:23.699 Thread-2 INFO SecurityManager: Changing modify acls to: yangjie01
25/04/23 17:08:23.699 Thread-2 INFO SecurityManager: Changing view acls groups to: yangjie01
25/04/23 17:08:23.699 Thread-2 INFO SecurityManager: Changing modify acls groups to: yangjie01
25/04/23 17:08:23.700 Thread-2 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: yangjie01 groups with view permissions: EMPTY; users with modify permissions: yangjie01; groups with modify permissions: EMPTY; RPC SSL disabled
25/04/23 17:08:23.823 Thread-2 INFO Utils: Successfully started service 'sparkDriver' on port 51764.
25/04/23 17:08:23.840 Thread-2 INFO SparkEnv: Registering MapOutputTracker
25/04/23 17:08:23.846 Thread-2 INFO SparkEnv: Registering BlockManagerMaster
25/04/23 17:08:23.855 Thread-2 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
25/04/23 17:08:23.856 Thread-2 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
25/04/23 17:08:23.857 Thread-2 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
25/04/23 17:08:23.871 Thread-2 INFO DiskBlockManager: Created local directory at /Users/yangjie01/SourceCode/git/spark-sbt/python/target/37c5fe4f-2abd-4591-90bf-01b745c1af4c/blockmgr-790a3b12-ed2c-4c57-bfd1-eb88ac3d4ebf
...
```

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #50675 from LuciferYang/pyspark-log4j.

Authored-by: yangjie01 <yangji...@baidu.com>
Signed-off-by: yangjie01 <yangji...@baidu.com>
---
 python/run-tests.py                   |  7 +++++--
 python/test_support/log4j2.properties | 31 +++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/python/run-tests.py b/python/run-tests.py
index 8752f264cd75..091fcfe73ac1 100755
--- a/python/run-tests.py
+++ b/python/run-tests.py
@@ -118,8 +118,11 @@ def run_individual_python_test(target_dir, test_name, pyspark_python, keep_test_
     metastore_dir = os.path.join(metastore_dir, str(uuid.uuid4()))
     os.mkdir(metastore_dir)

-    # Also override the JVM's temp directory by setting driver and executor options.
-    java_options = "-Djava.io.tmpdir={0}".format(tmp_dir)
+    # Also override the JVM's temp directory and log4j conf by setting driver and executor options.
+    log4j2_path = os.path.join(SPARK_HOME, "python/test_support/log4j2.properties")
+    java_options = "-Djava.io.tmpdir={0} -Dlog4j.configurationFile={1}".format(
+        tmp_dir, log4j2_path
+    )
     java_options = java_options + " -Xss4M"
     spark_args = [
         "--conf", "spark.driver.extraJavaOptions='{0}'".format(java_options),
diff --git a/python/test_support/log4j2.properties b/python/test_support/log4j2.properties
new file mode 100644
index 000000000000..6629658c1d28
--- /dev/null
+++ b/python/test_support/log4j2.properties
@@ -0,0 +1,31 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Set everything to be logged to the file target/unit-tests.log
+rootLogger.level = info
+rootLogger.appenderRef.file.ref = File
+
+appender.file.type = File
+appender.file.name = File
+appender.file.fileName = target/unit-tests.log
+appender.file.append = true
+appender.file.layout.type = PatternLayout
+appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n%ex
+
+# Silence verbose logs from 3rd-party libraries.
+logger.netty.name = io.netty
+logger.netty.level = info
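For illustration, the option string assembled by the patched `run_individual_python_test` can be sketched in isolation. This is a minimal stand-alone reproduction of the diff's logic, not the function itself: the `SPARK_HOME` and `tmp_dir` values below are hypothetical placeholders (in `run-tests.py`, `SPARK_HOME` is a module-level constant and `tmp_dir` is a per-test temp directory).

```python
import os

# Hypothetical stand-ins for values that run-tests.py computes at runtime.
SPARK_HOME = "/Users/yangjie01/spark"
tmp_dir = "/tmp/pyspark-test"

# Mirror the patched logic: point the launched JVM at the new test
# log4j2 config in addition to overriding its temp directory.
log4j2_path = os.path.join(SPARK_HOME, "python/test_support/log4j2.properties")
java_options = "-Djava.io.tmpdir={0} -Dlog4j.configurationFile={1}".format(
    tmp_dir, log4j2_path
)
java_options = java_options + " -Xss4M"

print(java_options)
```

The resulting string is passed via `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions`, so Log4j 2 resolves `log4j.configurationFile` explicitly instead of scanning the classpath, where the kvstore test-classes config would otherwise win.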