This is an automated email from the ASF dual-hosted git repository.

pan3793 pushed a commit to branch branch-4.x
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-4.x by this push:
     new a612450ad9ab [SPARK-57130][BUILD] `make-distribution.sh` copies only git-tracked files for python
a612450ad9ab is described below

commit a612450ad9abb092bff242ba3eb70da4636fac26
Author: Cheng Pan <[email protected]>
AuthorDate: Sat May 30 19:19:05 2026 +0800

    [SPARK-57130][BUILD] `make-distribution.sh` copies only git-tracked files for python
    
    ### What changes were proposed in this pull request?
    
    `make-distribution.sh` now copies only git-tracked files from the `python` folder when the `git` and `cpio` commands are available and the script runs inside a git repository. Otherwise it falls back to the raw `cp -r`.
    
    ### Why are the changes needed?
    
    I have found that `make-distribution.sh` sometimes produces an unreasonably large tarball because it copies the entire `python` folder to the `dist` directory, including any generated files it contains, e.g., compiled PySpark docs.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Run `dev/make-distribution.sh` manually.
    
    Also tested the performance of the new command. On macOS, `cpio` is slower than raw `cp` (about 1.48 s versus 0.73 s total), but good enough.
    ```
    $ time git ls-files -z "$PWD/python" | cpio -0pdm "target"
    42452 blocks
    git ls-files -z "$PWD/python"  0.01s user 0.01s system 76% cpu 0.027 total
    cpio -0pdm "target"  0.05s user 1.10s system 77% cpu 1.480 total
    $ rm -rf target/python
    $ time cp -r "$PWD/python" "target"
    cp -r "$PWD/python" "target"  0.02s user 0.56s system 78% cpu 0.731 total
    ```
    
    On Linux, the run below shows `cpio` taking about 1.26 s versus 0.81 s for raw `cp`, so `cpio` is slower there too.
    ```
    $ time git ls-files -z "$PWD/python" | cpio -0pdm "target"
    46385 blocks
    git ls-files -z "$PWD/python"  0.01s user 0.01s system 81% cpu 0.022 total
    cpio -0pdm "target"  0.05s user 1.02s system 84% cpu 1.260 total
    $ rm -rf target/python
    $ time cp -r "$PWD/python" "target"
    cp -r "$PWD/python" "target"  0.02s user 0.57s system 73% cpu 0.807 total
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    Generated-by: DeepSeek V4 Pro.
    
    Closes #56186 from pan3793/SPARK-57130.
    
    Authored-by: Cheng Pan <[email protected]>
    Signed-off-by: Cheng Pan <[email protected]>
    (cherry picked from commit 319dc6e05f1c2774142bbc4dadb5f1389cadd2b0)
    Signed-off-by: Cheng Pan <[email protected]>
---
 dev/make-distribution.sh | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/dev/make-distribution.sh b/dev/make-distribution.sh
index e43e4afbd0e2..7cd9ea7889e8 100755
--- a/dev/make-distribution.sh
+++ b/dev/make-distribution.sh
@@ -294,7 +294,11 @@ mkdir "$DISTDIR/conf"
 cp "$SPARK_HOME"/conf/*.template "$DISTDIR/conf"
 cp "$SPARK_HOME/README.md" "$DISTDIR"
 cp -r "$SPARK_HOME/bin" "$DISTDIR"
-cp -r "$SPARK_HOME/python" "$DISTDIR"
+if command -v git && command -v cpio && git rev-parse --git-dir 2>/dev/null; then
+  git ls-files -z "$SPARK_HOME/python" | cpio -0pdm "$DISTDIR"
+else
+  cp -r "$SPARK_HOME/python" "$DISTDIR"
+fi
 
 # Remove the python distribution from dist/ if we built it
 if [ "$MAKE_PIP" == "true" ]; then

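The guarded copy in the hunk above can be sketched as a standalone function, which may help when trying the behaviour outside the build script. This is a minimal sketch under stated assumptions, not the committed code: the function name `copy_tracked` and its two arguments are hypothetical, and redirecting the output of the probe commands is a choice made here that the committed line does not make.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of make-distribution.sh.
# Copies directory "$1" into the existing directory "$2", keeping only
# git-tracked files when git and cpio are usable, else copying everything.
copy_tracked() {
  local src="$1" dest="$2"
  if command -v git >/dev/null 2>&1 \
      && command -v cpio >/dev/null 2>&1 \
      && git rev-parse --git-dir >/dev/null 2>&1; then
    # -z / -0: NUL-delimited names, safe for unusual file names.
    # -p: copy-pass mode, -d: create directories as needed, -m: keep mtimes.
    git ls-files -z -- "$src" | cpio -0pdm "$dest"
  else
    # Fallback: plain recursive copy, tracked or not.
    cp -r "$src" "$dest"
  fi
}
```

`git ls-files` prints paths relative to the current directory, so the sketch is meant to be called from the directory that contains the source folder, for example `cd "$SPARK_HOME" && copy_tracked python "$DISTDIR"`. Files that are tracked in the index are copied; untracked build output is skipped only on the `git`/`cpio` path.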

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
