Copilot commented on code in PR #2907:
URL: https://github.com/apache/sedona/pull/2907#discussion_r3197032646


##########
docker/sedona-docker.dockerfile.dockerignore:
##########
@@ -7,3 +7,16 @@
 !docs/usecases/**
 !python/**
 !spark-shaded/**
+
+# Re-exclude Python build artifacts and pyc caches so a tree that has
+# had `python -m build` run against it does not balloon the COPY layers.
+# Note: `**/target/` is intentionally NOT excluded here because the
+# build.sh `latest` mode runs `mvn clean install -DskipTests` and then
+# install-sedona.sh inside the image copies the freshly-built shaded
+# JAR from `${SEDONA_HOME}/spark-shaded/target/`. Stale Maven `target/`
+# directories from a prior local build are cleaned up by build.sh
+# (non-latest branch) instead — see docker/build.sh.
+python/build/
+python/dist/
+**/*.egg-info/
+**/__pycache__/

Review Comment:
   This ignorefile intentionally keeps Maven `target/` in the build context 
(see note), which means a manual `docker build -f 
docker/sedona-docker.dockerfile .` for a *published* `sedona_version` (i.e., 
not using `docker/build.sh`) can still accidentally COPY a stale 
`spark-shaded/target/` into the image and bloat the resulting layers. Consider 
excluding `spark-shaded/target/` (or `**/target/`) and then explicitly 
re-including only `spark-shaded/target/sedona-spark-shaded-*.jar` for the 
`sedona_version=latest` path so both workflows are protected.
   



##########
docker/sedona-docker.dockerfile:
##########
@@ -43,14 +43,14 @@ ENV PYTHONPATH=${SPARK_HOME}/python
 # Set up OS libraries and PySpark
 RUN apt-get update
 RUN apt-get install -y openjdk-17-jdk-headless curl python3-pip maven
-RUN pip3 install pipenv --break-system-packages
+RUN pip3 install --no-cache-dir pipenv --break-system-packages
 COPY ./docker/install-spark.sh ${SEDONA_HOME}/docker/
 RUN chmod +x ${SEDONA_HOME}/docker/install-spark.sh
 RUN ${SEDONA_HOME}/docker/install-spark.sh ${spark_version} 
${hadoop_s3_version} ${aws_sdk_version}
 
 # Install Python dependencies
 COPY docker/requirements.txt /opt/requirements.txt
-RUN pip3 install -r /opt/requirements.txt --break-system-packages
+RUN pip3 install --no-cache-dir -r /opt/requirements.txt 
--break-system-packages

Review Comment:
   `--no-cache-dir` is added for the pipenv and requirements installs, but 
later the Dockerfile runs `install-sedona.sh`, which performs additional `pip3 
install ...` steps without `--no-cache-dir`. That can still repopulate 
`/root/.cache/pip` in a later layer, partially negating the size win. Consider 
setting `ENV PIP_NO_CACHE_DIR=1` (or updating the script’s pip invocations) so 
all pip installs during the build are cache-free.



##########
docker/build.sh:
##########
@@ -81,6 +81,17 @@ if [ "$SEDONA_VERSION" = "latest" ]; then
 
     # The compilation must take place outside Docker to avoid unnecessary 
maven packages
     mvn clean install -DskipTests -Dspark="${SEDONA_SPARK_VERSION}" 
-Dscala=2.13
+else
+    # When building against a published Sedona version, install-sedona.sh
+    # downloads the shaded JAR from Maven Central inside the container and
+    # never reads spark-shaded/target/. Any stale Maven artifacts in the
+    # local tree would still be pulled into the build context by the
+    # `COPY ./spark-shaded/` step, ship in the published manifest, and add
+    # to every pull's download size — even though the dockerfile's trailing
+    # `RUN rm -rf ${SEDONA_HOME}` deletes them from the runtime filesystem.
+    # apache/sedona:1.9.0 hit this regression and shipped 1.1 GB heavier than
+    # 1.8.1 (4.03 GB vs 2.97 GB compressed) for exactly this reason.
+    rm -rf spark-shaded/target python/build python/dist
 fi

Review Comment:
   This new cleanup deletes relative paths (`rm -rf spark-shaded/target ...`) 
without first ensuring the script is running from the repo root. If 
`docker/build.sh` is invoked from another working directory that happens to 
contain a `spark-shaded/target` path, it could delete unrelated files; if 
invoked from a different dir without that path, it won’t actually prevent the 
Docker context from including stale artifacts. Consider `cd`ing to the repo 
root at the top of the script (e.g., via the script’s directory or `git 
rev-parse --show-toplevel`) before running `rm` and `docker buildx build`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to