(spark) branch master updated: [SPARK-54147][SQL] Set OMP_NUM_THREADS to spark.task.cpus by default in BaseScriptTransformationExec

yumwang Wed, 05 Nov 2025 01:43:25 -0800

This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new 20af57c1340a [SPARK-54147][SQL] Set OMP_NUM_THREADS to spark.task.cpus by default in BaseScriptTransformationExec
20af57c1340a is described below

commit 20af57c1340ac31797e4b5078f5c465d45d4813d
Author: TongWei1105 <[email protected]>
AuthorDate: Wed Nov 5 17:42:29 2025 +0800

    [SPARK-54147][SQL] Set OMP_NUM_THREADS to spark.task.cpus by default in BaseScriptTransformationExec
    
    ### What changes were proposed in this pull request?
    
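    Set the OMP_NUM_THREADS environment variable to the value of spark.task.cpus by default for the script process started by BaseScriptTransformationExec, unless OMP_NUM_THREADS is already set.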
    
    ### Why are the changes needed?
    
    When the TRANSFORM clause is used to invoke a Python script that relies on packages such as PyTorch or NumPy, those libraries by default start a number of intra-op threads equal to the number of CPU cores available on the node. This can overload the CPU when several tasks run on the same node.
    ```
    ADD ARCHIVE s3://example-bucket/udf/emotion/emotion_predict.zip;
    ADD ARCHIVE s3://example-bucket/udf/emotion/python_env.zip;
    
    INSERT OVERWRITE TABLE demo_db.text_emotion_result PARTITION (dt = 'XXX')
    SELECT
        TRANSFORM(
            id,
            title,
            content
        )
        USING './python_env.zip/python_env/bin/python emotion_predict.zip/emotion_predict/predict.py'
        AS (id, title, content, emotion_label, emotion_score)
    FROM (
        SELECT /*+ REPARTITION(1000) */
            id, title, content
        FROM demo_db.text_input_data
        WHERE dt = 'XXX'
    ) src;
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manually.
    
    Closes #52850 from TongWei1105/SPARK-54147.
    
    Authored-by: TongWei1105 <[email protected]>
    Signed-off-by: Yuming Wang <[email protected]>
---
 .../org/apache/spark/sql/execution/BaseScriptTransformationExec.scala | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala
index bfd813ad5ef1..7450032aa8a1 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala
@@ -84,6 +84,10 @@ trait BaseScriptTransformationExec extends UnaryExecNode {
     val path = System.getenv("PATH") + File.pathSeparator +
       SparkFiles.getRootDirectory()
     builder.environment().put("PATH", path)
+    // if OMP_NUM_THREADS is not explicitly set, override it with the value of "spark.task.cpus"
+    if (System.getenv("OMP_NUM_THREADS") == null) {
+      builder.environment().put("OMP_NUM_THREADS", conf.getConfString("spark.task.cpus", "1"))
+    }
 
     val proc = builder.start()
     val inputStream = proc.getInputStream

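The effect of the four added lines can be tried outside Spark. The following is only a minimal standalone sketch, not the commit's code: the object name `OmpDefaultSketch`, the helper `resolveOmpThreads`, and the hard-coded `taskCpus` value are hypothetical, and the child command assumes a POSIX `sh` is available. In the patch itself the value comes from `conf.getConfString("spark.task.cpus", "1")`.

```scala
object OmpDefaultSketch {
  // Hypothetical helper (not in the commit): an inherited OMP_NUM_THREADS wins,
  // otherwise fall back to the number of CPUs assigned to the task.
  def resolveOmpThreads(inherited: Option[String], taskCpus: String): String =
    inherited.getOrElse(taskCpus)

  def main(args: Array[String]): Unit = {
    // Assumption: stands in for conf.getConfString("spark.task.cpus", "1").
    val taskCpus = "1"

    // Assumes `sh` is on the PATH; the child only prints the variable it sees.
    val builder = new ProcessBuilder("sh", "-c", "echo OMP_NUM_THREADS=$OMP_NUM_THREADS")

    // Option(null) is None, so an unset variable falls through to taskCpus.
    val inherited = Option(System.getenv("OMP_NUM_THREADS"))
    builder.environment().put("OMP_NUM_THREADS", resolveOmpThreads(inherited, taskCpus))

    builder.inheritIO()
    val exitCode = builder.start().waitFor()
    println(s"child exited with $exitCode")
  }
}
```

Running it with and without `OMP_NUM_THREADS` exported in the parent shell shows both branches: an exported value is passed through unchanged, while an unset variable is replaced by the fallback.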
