This is an automated email from the ASF dual-hosted git repository.
yumwang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 20af57c1340a [SPARK-54147][SQL] Set OMP_NUM_THREADS to spark.task.cpus
by default in BaseScriptTransformationExec
20af57c1340a is described below
commit 20af57c1340ac31797e4b5078f5c465d45d4813d
Author: TongWei1105 <[email protected]>
AuthorDate: Wed Nov 5 17:42:29 2025 +0800
[SPARK-54147][SQL] Set OMP_NUM_THREADS to spark.task.cpus by default in
BaseScriptTransformationExec
### What changes were proposed in this pull request?
Set OMP_NUM_THREADS to spark.task.cpus by default in
BaseScriptTransformationExec
### Why are the changes needed?
When we use the TRANSFORM function to invoke a Python script,the Python
script uses packages such as PyTorch or NumPy. Since these libraries, by
default, start a number of intra-op threads equal to the number of available
CPU cores on the node, this can lead to CPU overload.
```
ADD ARCHIVE s3://example-bucket/udf/emotion/emotion_predict.zip;
ADD ARCHIVE s3://example-bucket/udf/emotion/python_env.zip;
INSERT OVERWRITE TABLE demo_db.text_emotion_result PARTITION (dt = 'XXX')
SELECT
TRANSFORM(
id,
title,
content
)
USING './python_env.zip/python_env/bin/python
emotion_predict.zip/emotion_predict/predict.py'
AS (id, title, content, emotion_label, emotion_score)
FROM (
SELECT /*+ REPARTITION(1000) */
id, title, content
FROM demo_db.text_input_data
WHERE dt = 'XXX'
) src;
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually.
Closes #52850 from TongWei1105/SPARK-54147.
Authored-by: TongWei1105 <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
---
.../org/apache/spark/sql/execution/BaseScriptTransformationExec.scala | 4 ++++
1 file changed, 4 insertions(+)
diff --git
a/sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala
b/sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala
index bfd813ad5ef1..7450032aa8a1 100644
---
a/sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala
+++
b/sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala
@@ -84,6 +84,10 @@ trait BaseScriptTransformationExec extends UnaryExecNode {
val path = System.getenv("PATH") + File.pathSeparator +
SparkFiles.getRootDirectory()
builder.environment().put("PATH", path)
+ // if OMP_NUM_THREADS is not explicitly set, override it with the value of
"spark.task.cpus"
+ if (System.getenv("OMP_NUM_THREADS") == null) {
+ builder.environment().put("OMP_NUM_THREADS",
conf.getConfString("spark.task.cpus", "1"))
+ }
val proc = builder.start()
val inputStream = proc.getInputStream
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]