jiayuasu opened a new pull request, #2907: URL: https://github.com/apache/sedona/pull/2907

## Did you read the Contributor Guide?

- Yes, I have read the [Contributor Rules](https://sedona.apache.org/latest/community/rule/) and [Contributor Development Guide](https://sedona.apache.org/latest/community/develop/)

## Is this PR related to a ticket?

- No: this is a CI / release-tooling update. The PR name follows the format `[CI] my subject`.

## What changes were proposed in this PR?

Two related fixes that together restore the `apache/sedona` docker image download size to its pre-1.9.0 footprint.

### Background: what regressed

Per the Docker Hub tags API, `apache/sedona:1.9.0` ships at **4.03 GB** compressed (per-arch, both amd64 and arm64) while `apache/sedona:1.8.1` is **2.97 GB**. A layer-by-layer comparison of the published manifests (37 layers per arch) shows all but one layer are identical to within a few MB:

| Layer | 1.8.1 | 1.9.0 | Δ (MB) |
|---|---|---|---|
| L08 (`install-spark.sh`) | 1241 MB | 1239 MB | ~0 |
| L10 (`pip install -r requirements.txt`) | 715 MB | 728 MB | +13 |
| **L12 (`COPY ./spark-shaded/`)** | **0.7 MB** | **1104 MB** | **+1103** |
| L18 (`install-zeppelin.sh`) | 504 MB | 504 MB | 0 |

Layer 12 is `COPY ./spark-shaded/ ${SEDONA_HOME}/spark-shaded/` (line 57 of `docker/sedona-docker.dockerfile`). The 1.9.0 release was published from a tree that had Maven build outputs in `spark-shaded/target/` (likely from a prior `mvn package`); the existing `.dockerignore` allow-list `!spark-shaded/**` re-included everything under that directory, so the COPY swept in ~1.1 GB of JARs and test classes.

The dockerfile's trailing `RUN rm -rf ${SEDONA_HOME}` *does* delete the content from the running container's filesystem, so `du` inside the container looks normal, but it cannot shrink the prior layer that already committed those bytes. The wasted ~1 GB stays in the published manifest and inflates every pull's download size.
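
The layer arithmetic above can be checked with a minimal Python sketch. It uses only the four layer sizes quoted in the table (the other 33 layers are omitted, so the totals are partial), and every name in it is illustrative rather than part of the Sedona tooling. Modelling the trailing `RUN rm -rf` as an extra layer of size 0 is an assumption made here for illustration; the point is that a pull downloads the sum of all layer blobs, so a later deletion cannot subtract from an earlier layer.

```python
# Compressed layer sizes in MB, copied from the comparison table.
# Only the four listed layers are included; the rest are omitted.
LAYERS_1_8_1 = {"L08": 1241.0, "L10": 715.0, "L12": 0.7, "L18": 504.0}
LAYERS_1_9_0 = {"L08": 1239.0, "L10": 728.0, "L12": 1104.0, "L18": 504.0}


def layer_deltas(old, new):
    """Return the per-layer size change (new - old) in MB."""
    return {name: round(new[name] - old[name], 1) for name in old}


def download_size(layers):
    """A pull fetches every layer blob, so the size is a plain sum."""
    return round(sum(layers.values()), 1)


deltas = layer_deltas(LAYERS_1_8_1, LAYERS_1_9_0)
largest = max(deltas, key=deltas.get)
growth = round(download_size(LAYERS_1_9_0) - download_size(LAYERS_1_8_1), 1)

# Assumption: the cleanup step is modelled as one more layer of size 0.
# It hides files in the final filesystem but removes nothing from L12.
with_cleanup = dict(LAYERS_1_9_0, cleanup=0.0)

print("per-layer deltas (MB):", deltas)
print("largest regression:", largest)
print("growth across listed layers (MB):", growth)
print("size unchanged by cleanup layer:",
      download_size(with_cleanup) == download_size(LAYERS_1_9_0))
```

Under these assumptions the sketch attributes almost all of the growth across the listed layers to L12, which matches the table's conclusion that the `COPY ./spark-shaded/` layer is the regression.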

### Fix 1: `docker/sedona-docker.dockerfile.dockerignore`

Re-exclude Maven and Python build outputs after the allow-list (last-match-wins). Even on a release machine that has stale `target/` directories, the Docker build context will not include them.

```diff
 *
 !docker/**
 !zeppelin/**
 !docs/usecases/**
 !python/**
 !spark-shaded/**
+
+# Re-exclude Maven build outputs and Python build artifacts so a tree
+# that has had `mvn package` or `python -m build` run against it does
+# not balloon the COPY layers ...
+**/target/
+python/build/
+python/dist/
+**/*.egg-info/
+**/__pycache__/
```

### Fix 2: `docker/sedona-docker.dockerfile`

Pass `--no-cache-dir` to both `pip3 install` invocations. Without it, pip leaves ~439 MB of wheel downloads under `/root/.cache/pip` in the requirements.txt install layer (measured with `du -sh /root/.cache` inside the running 1.9.0 image). `--no-cache-dir` skips that write and brings the pip layer down to roughly its installed-package size, with no runtime impact.

## How was this patch tested?

1. **Probe build proves the deny rule actually fires.** Synthesized a 200 MB fake JAR at `spark-shaded/target/fake-jar.jar`, then built a one-liner `FROM alpine; COPY ./spark-shaded/ /shaded/` Dockerfile twice, once with the new `.dockerignore` and once with the old:

   | Variant | `/shaded` size in resulting image |
   |---|---|
   | With `**/target/` deny rule | **24 KB** (pom.xml + .gitignore only) |
   | Without the deny rule | **200 MB** (fake jar leaked in) |

2. **Full image rebuild confirms `--no-cache-dir` shrinks the pip layer locally.** The local tree has no `target/` (clean checkout), so the deny rule is a no-op for our local build; the win comes purely from `--no-cache-dir`:

   | Image | Total size | `pip install -r` layer |
   |---|---|---|
   | `sedona:dev` (master @ HEAD) | 5.06 GB | 1.48 GB |
   | `sedona:trim` (this PR) | **4.81 GB** | **1.24 GB** |

3. **Existing Docker-build CI matrix exercises the change.** The path filter (`docker/**`) widened in #2889 means this PR triggers `docker-build.yml`, which runs `./docker/build.sh ... local ...` and `docker/test-notebooks.sh` against the resulting image, so the existing 6-notebook test suite verifies the new dockerfile end-to-end.

## Expected impact on the next 1.9.0 re-publish

Combining both fixes on a release-machine tree should drop the published image from 4.03 GB → ~3.0 GB compressed (roughly the 1.8.1 baseline). Recipe:

```bash
git clean -fdX -- spark-shaded/ python/   # belt-and-suspenders, in case the new .dockerignore misses anything
./docker/build.sh 4.0.1 1.9.0 release 33.5
```

`release` mode does `--platform linux/amd64,linux/arm64 --output type=registry`, so it pushes both arches to Docker Hub directly.

## Did this PR include necessary documentation updates?

- No public API changes.
- No documentation updates needed; the `.dockerignore` comment block explains the regression and the rationale for any future contributor running `git blame`.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
