This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch branch-4.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-4.0 by this push:
     new a547d4d08491 [SPARK-51876][PYTHON][TESTS] Add configuration for 
`log4j.configurationFile` in the PySpark submission args of 
`run_individual_python_test`
a547d4d08491 is described below

commit a547d4d08491f9bc371db575ed3c4d7c02d43845
Author: yangjie01 <yangji...@baidu.com>
AuthorDate: Wed Apr 23 19:18:26 2025 +0800

    [SPARK-51876][PYTHON][TESTS] Add configuration for 
`log4j.configurationFile` in the PySpark submission args of 
`run_individual_python_test`
    
    ### What changes were proposed in this pull request?
    This PR adds the configuration for `log4j.configurationFile` in the PySpark 
submission args of the `run_individual_python_test` function. This allows the 
Java processes initiated by the submitted PySpark jobs to use a reasonable 
logging level.
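
    In outline, the patch extends the submission-time java options roughly as sketched below. This is a minimal, self-contained sketch: `SPARK_HOME` and `tmp_dir` are stand-ins for the values computed inside `run_individual_python_test`, not the real variables, and only the driver `--conf` line is shown.

```python
import os

# Stand-ins for values computed inside run_individual_python_test().
SPARK_HOME = "/path/to/spark"
tmp_dir = "/tmp/pyspark-test"

# Point the JVM at a dedicated log4j2 config so the Java processes behind
# submitted PySpark jobs do not pick up a test-classes configuration
# (e.g. one with rootLogger.level = debug) from the classpath.
log4j2_path = os.path.join(SPARK_HOME, "python/test_support/log4j2.properties")
java_options = "-Djava.io.tmpdir={0} -Dlog4j.configurationFile={1}".format(
    tmp_dir, log4j2_path
)
java_options = java_options + " -Xss4M"

spark_args = [
    "--conf", "spark.driver.extraJavaOptions='{0}'".format(java_options),
]
print(spark_args[1])
```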
    
    ### Why are the changes needed?
    Prevent the Java processes corresponding to PySpark jobs submitted by `run_individual_python_test` from using unexpected logging levels, and avoid wasting disk space in GitHub Actions tasks.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    - Pass GitHub Actions
    - Tested locally:
    
    Run:
    
    ```
    python/run-tests --testnames 
pyspark.sql.tests.pandas.test_pandas_transform_with_state 
--python-executables=python3.11
    ```
    
    **Before**
    
    The Java process corresponding to the PySpark job unexpectedly uses the 
log4j test configuration in the Scala test directory.
    
    We can see the submission command for the Java process as follows:
    
    ```
    /Users/yangjie01/Tools/zulu17/bin/java -cp 
hive-jackson/*:/Users/yangjie01/spark/conf/:...:/Users/yangjie01/spark/common/kvstore/target/scala-2.13/test-classes/:/Users/yangjie01/spark/common/network-common/target/scala-2.13/test-classes/:/Users/yangjie01/spark/common/network-shuffle/target/scala-2.13/test-classes/:...:/Users/yangjie01/spark/assembly/target/scala-2.13/jars/*
 -Xmx1g 
-Djava.io.tmpdir=/Users/yangjie01/spark/python/target/7836a5a1-a39e-4b31-8ab1-f14b901668a4
 -Xss4M -XX:+Ig [...]
    ```
    
    The Java process launched by this PySpark job uses the log4j configuration located in the `common/kvstore/target/scala-2.13/test-classes` directory. The `rootLogger.level` in that configuration file is set to `debug`, which is why a large number of DEBUG logs appear in `unit-tests.log`.
    
    
https://github.com/apache/spark/blob/master/common/kvstore/src/test/resources/log4j2.properties
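
    For reference, the relevant setting in that test configuration (as described above) is the root logger level:

```properties
rootLogger.level = debug
```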
    
    ```
    25/04/23 17:02:04.688 main DEBUG MutableMetricsFactory: field 
org.apache.hadoop.metrics2.lib.MutableRate 
org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with 
annotation org.apache.hadoop.metrics2.annotation.Metric(sampleName="Ops", 
always=false, valueName="Time", about="", interval=10, type=DEFAULT, 
value={"GetGroups"})
    25/04/23 17:02:04.694 main DEBUG MutableMetricsFactory: field 
org.apache.hadoop.metrics2.lib.MutableRate 
org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with 
annotation org.apache.hadoop.metrics2.annotation.Metric(sampleName="Ops", 
always=false, valueName="Time", about="", interval=10, type=DEFAULT, 
value={"Rate of failed kerberos logins and latency (milliseconds)"})
    25/04/23 17:02:04.694 main DEBUG MutableMetricsFactory: field 
org.apache.hadoop.metrics2.lib.MutableRate 
org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with 
annotation org.apache.hadoop.metrics2.annotation.Metric(sampleName="Ops", 
always=false, valueName="Time", about="", interval=10, type=DEFAULT, 
value={"Rate of successful kerberos logins and latency (milliseconds)"})
    25/04/23 17:02:04.694 main DEBUG MutableMetricsFactory: field private 
org.apache.hadoop.metrics2.lib.MutableGaugeInt 
org.apache.hadoop.security.UserGroupInformation$UgiMetrics.renewalFailures with 
annotation org.apache.hadoop.metrics2.annotation.Metric(sampleName="Ops", 
always=false, valueName="Time", about="", interval=10, type=DEFAULT, 
value={"Renewal failures since last successful login"})
    25/04/23 17:02:04.695 main DEBUG MutableMetricsFactory: field private 
org.apache.hadoop.metrics2.lib.MutableGaugeLong 
org.apache.hadoop.security.UserGroupInformation$UgiMetrics.renewalFailuresTotal 
with annotation org.apache.hadoop.metrics2.annotation.Metric(sampleName="Ops", 
always=false, valueName="Time", about="", interval=10, type=DEFAULT, 
value={"Renewal failures since startup"})
    25/04/23 17:02:04.696 main DEBUG MetricsSystemImpl: UgiMetrics, User and 
group related metrics
    25/04/23 17:02:04.744 main DEBUG ShutdownHookManager: Adding shutdown hook
    25/04/23 17:02:04.758 main DEBUG Shell: Failed to detect a valid hadoop 
home directory
    ...
    25/04/23 17:02:04.763 main DEBUG Shell: setsid is not available on this 
machine. So not using it.
    25/04/23 17:02:04.763 main DEBUG Shell: setsid exited with exit code 0
    25/04/23 17:02:04.808 main DEBUG PythonGatewayServer: Started 
PythonGatewayServer on localhost/127.0.0.1 with port 50689
    25/04/23 17:02:04.921 Thread-2 INFO SparkContext: Running Spark version 
4.1.0-SNAPSHOT
    25/04/23 17:02:04.922 Thread-2 INFO SparkContext: OS info Mac OS X, 15.4, 
aarch64
    25/04/23 17:02:04.922 Thread-2 INFO SparkContext: Java version 17.0.14
    25/04/23 17:02:04.929 Thread-2 DEBUG SecurityUtil: Setting 
hadoop.security.token.service.use_ip to true
    25/04/23 17:02:04.955 Thread-2 DEBUG Groups:  Creating new Groups object
    25/04/23 17:02:04.955 Thread-2 DEBUG NativeCodeLoader: Trying to load the 
custom-built native-hadoop library...
    ...
    ```
    
    **After**
    
    The Java process corresponding to the PySpark job will use the log4j 
configuration specified by `-Dlog4j.configurationFile`:
    
    ```
    /Users/yangjie01/Tools/zulu17/bin/java -cp 
hive-jackson/*:/Users/yangjie01/spark/conf/:...:/Users/yangjie01/spark/common/kvstore/target/scala-2.13/test-classes/:/Users/yangjie01/spark/common/network-common/target/scala-2.13/test-classes/:...:/Users/yangjie01/spark/core/target/jars/*:/Users/yangjie01/spark/mllib/target/jars/*:/Users/yangjie01/spark/assembly/target/scala-2.13/jars/slf4j-api-2.0.17.jar:/Users/yangjie01/spark/assembly/target/scala-2.13/jars/*
 -Xmx1g -Djava.io.tmpdir=/User [...]
    ```
    
    From `spark.*.extraJavaOptions`, we can see the configuration 
`-Dlog4j.configurationFile=/Users/yangjie01/spark/python/test_support/log4j2.properties`.
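
    One way to double-check which configuration file the JVM was handed is to parse the option string. The helper below is purely illustrative and not part of the patch:

```python
def extract_log4j_config(java_options):
    """Return the value of -Dlog4j.configurationFile, if present."""
    prefix = "-Dlog4j.configurationFile="
    for token in java_options.split():
        if token.startswith(prefix):
            return token[len(prefix):]
    return None

# Example option string shaped like the one built by run-tests.py.
opts = (
    "-Djava.io.tmpdir=/tmp/pyspark-test "
    "-Dlog4j.configurationFile=/spark/python/test_support/log4j2.properties "
    "-Xss4M"
)
print(extract_log4j_config(opts))
```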
    
    At the same time, there are no longer DEBUG-level logs in `unit-tests.log`:
    
    ```
    25/04/23 17:08:23.584 Thread-2 INFO SparkContext: Running Spark version 
4.1.0-SNAPSHOT
    25/04/23 17:08:23.588 Thread-2 INFO SparkContext: OS info Mac OS X, 15.4, 
aarch64
    25/04/23 17:08:23.588 Thread-2 INFO SparkContext: Java version 17.0.14
    25/04/23 17:08:23.625 Thread-2 WARN NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
    25/04/23 17:08:23.655 Thread-2 INFO ResourceUtils: 
==============================================================
    25/04/23 17:08:23.655 Thread-2 INFO ResourceUtils: No custom resources 
configured for spark.driver.
    25/04/23 17:08:23.656 Thread-2 INFO ResourceUtils: 
==============================================================
    25/04/23 17:08:23.656 Thread-2 INFO SparkContext: Submitted application: 
ReusedSQLTestCase
    25/04/23 17:08:23.669 Thread-2 INFO ResourceProfile: Default 
ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 
1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: 
, offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: 
Map(cpus -> name: cpus, amount: 1.0)
    25/04/23 17:08:23.670 Thread-2 INFO ResourceProfile: Limiting resource is 
cpu
    25/04/23 17:08:23.671 Thread-2 INFO ResourceProfileManager: Added 
ResourceProfile id: 0
    25/04/23 17:08:23.698 Thread-2 INFO SecurityManager: Changing view acls to: 
yangjie01
    25/04/23 17:08:23.699 Thread-2 INFO SecurityManager: Changing modify acls 
to: yangjie01
    25/04/23 17:08:23.699 Thread-2 INFO SecurityManager: Changing view acls 
groups to: yangjie01
    25/04/23 17:08:23.699 Thread-2 INFO SecurityManager: Changing modify acls 
groups to: yangjie01
    25/04/23 17:08:23.700 Thread-2 INFO SecurityManager: SecurityManager: 
authentication disabled; ui acls disabled; users with view permissions: 
yangjie01 groups with view permissions: EMPTY; users with modify permissions: 
yangjie01; groups with modify permissions: EMPTY; RPC SSL disabled
    25/04/23 17:08:23.823 Thread-2 INFO Utils: Successfully started service 
'sparkDriver' on port 51764.
    25/04/23 17:08:23.840 Thread-2 INFO SparkEnv: Registering MapOutputTracker
    25/04/23 17:08:23.846 Thread-2 INFO SparkEnv: Registering BlockManagerMaster
    25/04/23 17:08:23.855 Thread-2 INFO BlockManagerMasterEndpoint: Using 
org.apache.spark.storage.DefaultTopologyMapper for getting topology information
    25/04/23 17:08:23.856 Thread-2 INFO BlockManagerMasterEndpoint: 
BlockManagerMasterEndpoint up
    25/04/23 17:08:23.857 Thread-2 INFO SparkEnv: Registering 
BlockManagerMasterHeartbeat
    25/04/23 17:08:23.871 Thread-2 INFO DiskBlockManager: Created local 
directory at 
/Users/yangjie01/SourceCode/git/spark-sbt/python/target/37c5fe4f-2abd-4591-90bf-01b745c1af4c/blockmgr-790a3b12-ed2c-4c57-bfd1-eb88ac3d4ebf
    ...
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes #50675 from LuciferYang/pyspark-log4j.
    
    Authored-by: yangjie01 <yangji...@baidu.com>
    Signed-off-by: yangjie01 <yangji...@baidu.com>
    (cherry picked from commit a9128a08c82d1fa847c231d655c1ff58baf022fd)
    Signed-off-by: yangjie01 <yangji...@baidu.com>
---
 python/run-tests.py                   |  7 +++++--
 python/test_support/log4j2.properties | 31 +++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/python/run-tests.py b/python/run-tests.py
index 64ac48e210db..d020e6d38108 100755
--- a/python/run-tests.py
+++ b/python/run-tests.py
@@ -117,8 +117,11 @@ def run_individual_python_test(target_dir, test_name, 
pyspark_python, keep_test_
         metastore_dir = os.path.join(metastore_dir, str(uuid.uuid4()))
     os.mkdir(metastore_dir)
 
-    # Also override the JVM's temp directory by setting driver and executor 
options.
-    java_options = "-Djava.io.tmpdir={0}".format(tmp_dir)
+    # Also override the JVM's temp directory and log4j conf by setting driver 
and executor options.
+    log4j2_path = os.path.join(SPARK_HOME, 
"python/test_support/log4j2.properties")
+    java_options = "-Djava.io.tmpdir={0} -Dlog4j.configurationFile={1}".format(
+        tmp_dir, log4j2_path
+    )
     java_options = java_options + " -Xss4M"
     spark_args = [
         "--conf", "spark.driver.extraJavaOptions='{0}'".format(java_options),
diff --git a/python/test_support/log4j2.properties 
b/python/test_support/log4j2.properties
new file mode 100644
index 000000000000..6629658c1d28
--- /dev/null
+++ b/python/test_support/log4j2.properties
@@ -0,0 +1,31 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Set everything to be logged to the file target/unit-tests.log
+rootLogger.level = info
+rootLogger.appenderRef.file.ref = File
+
+appender.file.type = File
+appender.file.name = File
+appender.file.fileName = target/unit-tests.log
+appender.file.append = true
+appender.file.layout.type = PatternLayout
+appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n%ex
+
+# Silence verbose logs from 3rd-party libraries.
+logger.netty.name = io.netty
+logger.netty.level = info

