This is an automated email from the ASF dual-hosted git repository.
dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new d5c8a3180cfb [SPARK-54784][ML][DOCS] Document the security policy on ml models
d5c8a3180cfb is described below
commit d5c8a3180cfbf83ae0b5d0e1e78f52f631898736
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Thu Feb 12 11:14:23 2026 -0800
[SPARK-54784][ML][DOCS] Document the security policy on ml models
### What changes were proposed in this pull request?
Document the security policy on ml models
### Why are the changes needed?
For security purposes.
### Does this PR introduce _any_ user-facing change?
Yes, this is a doc-only change.
### How was this patch tested?
Manually checked.
<img width="1038" height="1206" alt="image" src="https://github.com/user-attachments/assets/f47b89dc-b8c4-4ff8-93b2-38070544aa4d" />
### Was this patch authored or co-authored using generative AI tooling?
Yes, with ChatGPT.
Closes #54246 from zhengruifeng/doc_ml_security.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
---
docs/_includes/nav-left-wrapper-ml.html | 1 +
docs/ml-security.md | 74 +++++++++++++++++++++++++++++++++
2 files changed, 75 insertions(+)
diff --git a/docs/_includes/nav-left-wrapper-ml.html b/docs/_includes/nav-left-wrapper-ml.html
index 00ac6cc0dbc7..9240cd6ac545 100644
--- a/docs/_includes/nav-left-wrapper-ml.html
+++ b/docs/_includes/nav-left-wrapper-ml.html
@@ -4,5 +4,6 @@
{% include nav-left.html nav=include.nav-ml %}
<h3><a href="mllib-guide.html">MLlib: RDD-based API Guide</a></h3>
{% include nav-left.html nav=include.nav-mllib %}
+ <h3><a href="ml-security.html">ML Model Security</a></h3>
</div>
</div>
\ No newline at end of file
diff --git a/docs/ml-security.md b/docs/ml-security.md
new file mode 100644
index 000000000000..e295f83ebf37
--- /dev/null
+++ b/docs/ml-security.md
@@ -0,0 +1,74 @@
+---
+layout: global
+title: "ML Model Security"
+displayTitle: "Spark ML Model Security"
+license: |
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+---
+
+# Overview
+
+In Apache Spark, loading a machine learning (ML) model is fundamentally equivalent to loading and executing code.
+Spark ML models often contain serialized objects, transformation logic, and execution graphs that are evaluated by the Spark runtime
+during model loading and inference.
+This principle is not unique to Spark; it applies equally to scikit-learn, PyTorch, TensorFlow, and other modern ML ecosystems.
+As a result, loading a model from an untrusted source introduces the same security risks as executing untrusted software.
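+
+As a concrete illustration, the minimal PySpark sketch below (the model path and sample DataFrame are hypothetical) shows that restoring a saved pipeline
+is a single call that instantiates live transformer objects, not a passive read of numeric weights.
+
+```python
+from pyspark.sql import SparkSession
+from pyspark.ml import PipelineModel
+
+spark = SparkSession.builder.getOrCreate()
+
+# This one call deserializes the pipeline's metadata and parameters and
+# instantiates every stage as an executable object -- it is the point at
+# which the model source must already be trusted.
+model = PipelineModel.load("/models/downloaded/candidate")  # hypothetical path
+
+df = spark.createDataFrame([(1.0,)], ["feature"])
+model.transform(df)  # stage logic executes inside the Spark runtime
+```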
+
+# Why Loading an ML Model Is Equivalent to Loading Code
+
+Spark ML frameworks serialize not only data (such as weights and parameters) but also executable structures and behaviors.
+Because of this, model loading is not merely data parsing. It involves interpreting and executing instructions, which means a malicious model can:
+
+* Execute arbitrary commands
+* Access or exfiltrate data
+* Modify system state
+* Install backdoors or malware
+
+In practice, loading a model from an untrusted source is equivalent to running a program downloaded from the internet.
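+
+The generic Python sketch below (not Spark-specific; the class and message are purely illustrative) shows why deserialization can mean execution:
+a serialized object can instruct the loader to call an arbitrary function the moment it is restored.
+
+```python
+import pickle
+
+class NotJustData:
+    def __reduce__(self):
+        # On unpickling, pickle will call print(...); a malicious payload
+        # could substitute any callable here, such as a shell command.
+        return (print, ("code executed during deserialization",))
+
+payload = pickle.dumps(NotJustData())  # the "model file"
+pickle.loads(payload)                  # merely loading it runs the call above
+```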
+
+# Security Implications
+
+Because Spark ML models can embed executable logic, loading untrusted models can lead to:
+
+* Remote code execution (RCE)
+* Data exfiltration from Spark jobs
+* Compromise of cluster nodes
+* Privilege escalation within Spark environments
+* Supply-chain attacks through model distribution
+
+These risks are amplified in distributed environments, where a malicious model may execute across multiple cluster nodes.
+
+# Responsibility of End Users
+
+Because loading ML models is equivalent to loading executable code, the responsibility for security ultimately lies with the end user or deploying organization.
+End users are responsible for ensuring that ML models are subject to the same security assessment, validation, and operational controls as any third-party software.
+This includes:
+
+* Verifying the source and authenticity of the model
+* Ensuring integrity and provenance
+* Applying organizational security policies
+* Performing risk assessments before deployment
+
+Frameworks and libraries can provide safeguards, but they cannot guarantee security when loading arbitrary third-party models.
+
+# Best Practices
+
+* Load models only from trusted and verified sources
+* Validate cryptographic hashes or digital signatures (see the sketch after this list)
+* Execute models in isolated environments
+* Restrict filesystem, network, and credential access
+* Keep Spark, ML libraries, and dependencies fully patched
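+
+A minimal sketch of the checksum step above, assuming the publisher distributes a trusted SHA-256 digest out of band
+(the directory path and digest value are hypothetical):
+
+```python
+import hashlib
+from pathlib import Path
+
+def sha256_of_dir(model_dir: str) -> str:
+    """Hash all files under a saved model directory in a stable order."""
+    digest = hashlib.sha256()
+    for path in sorted(Path(model_dir).rglob("*")):
+        if path.is_file():
+            digest.update(path.relative_to(model_dir).as_posix().encode())
+            digest.update(path.read_bytes())
+    return digest.hexdigest()
+
+EXPECTED = "<published-sha256-digest>"  # hypothetical value from the model publisher
+actual = sha256_of_dir("/models/downloaded/candidate")
+if actual != EXPECTED:
+    raise ValueError("Model checksum mismatch -- refusing to load it")
+# Only proceed to load the model (e.g. PipelineModel.load) after the digest matches.
+```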
+
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]