jiayuasu opened a new pull request, #2907:
URL: https://github.com/apache/sedona/pull/2907

   ## Did you read the Contributor Guide?
   
   - Yes, I have read the [Contributor 
Rules](https://sedona.apache.org/latest/community/rule/) and [Contributor 
Development Guide](https://sedona.apache.org/latest/community/develop/)
   
   ## Is this PR related to a ticket?
   
   - No: this is a CI / release-tooling update. The PR name follows the format 
`[CI] my subject`.
   
   ## What changes were proposed in this PR?
   
   Two related fixes that together restore the `apache/sedona` docker image 
download size to its pre-1.9.0 footprint.
   
   ### Background — what regressed
   
   Per the Docker Hub tags API, `apache/sedona:1.9.0` ships at **4.03 GB** 
compressed (per-arch, both amd64 and arm64) while `apache/sedona:1.8.1` is 
**2.97 GB**. A layer-by-layer comparison of the published manifests (37 layers 
per arch) shows that all but one layer match to within about 13 MB:
   
   | Layer | 1.8.1 | 1.9.0 | Δ (MB) |
   |---|---|---|---|
   | L08 (`install-spark.sh`) | 1241 MB | 1239 MB | ~0 |
   | L10 (`pip install -r requirements.txt`) | 715 MB | 728 MB | +13 |
   | **L12 (`COPY ./spark-shaded/`)** | **0.7 MB** | **1104 MB** | **+1103** |
   | L18 (`install-zeppelin.sh`) | 504 MB | 504 MB | 0 |
   
   Layer 12 is `COPY ./spark-shaded/ ${SEDONA_HOME}/spark-shaded/` (line 57 of 
`docker/sedona-docker.dockerfile`). The 1.9.0 release was published from a tree 
that had Maven build outputs in `spark-shaded/target/` (likely from a prior 
`mvn package`); the existing `.dockerignore` allow-list `!spark-shaded/**` 
re-included everything under that directory, so the COPY swept in ~1.1 GB of 
JARs and test classes.
   
   The dockerfile's trailing `RUN rm -rf ${SEDONA_HOME}` *does* delete the 
content from the running container's filesystem, so `du` inside the container 
looks normal — but it cannot shrink the prior layer that already committed 
those bytes. The wasted ~1 GB stays in the published manifest and inflates 
every pull's download size.
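
One way to see why the later `rm -rf` cannot help is to model layers as plain tar archives. The sketch below is Docker-free and purely illustrative: the directory names, the 5 MiB file, and the uncompressed tarballs are assumptions standing in for real image layers, and the `.wh.` prefix follows the OCI whiteout convention for recording a deletion in a later layer.

```shell
#!/bin/sh
# Illustration only: tarballs stand in for image layers. A later layer can
# record a deletion (a whiteout marker) but cannot rewrite an earlier layer.
set -eu

work=$(mktemp -d)
mkdir -p "$work/layer1/spark-shaded/target" "$work/layer2/spark-shaded"

# Layer 1: stands in for the COPY that swept in build outputs (5 MiB here).
dd if=/dev/zero of="$work/layer1/spark-shaded/target/fake.jar" \
  bs=1048576 count=5 2>/dev/null

# Layer 2: stands in for the later `rm -rf`. The deletion is recorded as an
# empty whiteout file named .wh.<deleted-name>.
: > "$work/layer2/spark-shaded/.wh.target"

tar -cf "$work/layer1.tar" -C "$work/layer1" .
tar -cf "$work/layer2.tar" -C "$work/layer2" .

layer1_bytes=$(wc -c < "$work/layer1.tar" | tr -d ' ')
layer2_bytes=$(wc -c < "$work/layer2.tar" | tr -d ' ')
total_bytes=$((layer1_bytes + layer2_bytes))

echo "layer1=$layer1_bytes layer2=$layer2_bytes total=$total_bytes"
rm -rf "$work"
```

What a pull has to transfer is the sum of the layers, so the first archive's bytes remain in the total even though the merged filesystem no longer shows the directory.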
   
   ### Fix 1 — `docker/sedona-docker.dockerfile.dockerignore`
   
   Re-exclude Maven and Python build outputs after the allow-list 
(last-match-wins). Even on a release machine that has stale `target/` 
directories, the Docker build context will not include them.
   
   ```diff
    *
    !docker/**
    !zeppelin/**
    !docs/usecases/**
    !python/**
    !spark-shaded/**
   +
   +# Re-exclude Maven build outputs and Python build artifacts so a tree
   +# that has had `mvn package` or `python -m build` run against it does
   +# not balloon the COPY layers ...
   +**/target/
   +python/build/
   +python/dist/
   +**/*.egg-info/
   +**/__pycache__/
   ```
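
As a complement to the deny rules, a release manager could scan the tree for stale outputs before building. The following is a hypothetical helper, not part of this PR: `list_stale_outputs` is an invented name, the demo tree is synthetic, and `find` name matching only approximates `.dockerignore` pattern semantics.

```shell
#!/bin/sh
# Hypothetical pre-release check: list directories that the new deny rules
# are meant to keep out of the Docker build context.
set -eu

list_stale_outputs() {
  # $1: repository root to scan (pass it without a trailing slash)
  find "$1" \
    \( -type d \( -name target -o -name '*.egg-info' -o -name __pycache__ \) \
       -o -path "$1/python/build" -o -path "$1/python/dist" \) \
    -prune -print | sort
}

# Self-contained demo on a throwaway tree (layout is an assumption).
demo=$(mktemp -d)
mkdir -p "$demo/spark-shaded/target" \
         "$demo/python/build" \
         "$demo/python/sedona/__pycache__" \
         "$demo/docker"
touch "$demo/spark-shaded/pom.xml"

stale=$(list_stale_outputs "$demo")
stale_count=$(printf '%s\n' "$stale" | grep -c .)
printf '%s\n' "$stale"
rm -rf "$demo"
```

An empty listing on the real checkout would suggest the deny rules have nothing to exclude, which is the clean-checkout case described in the testing notes.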
   
   ### Fix 2 — `docker/sedona-docker.dockerfile`
   
   Pass `--no-cache-dir` to both `pip3 install` invocations. Without it, pip 
leaves ~439 MB of wheel downloads under `/root/.cache/pip` in the 
requirements.txt install layer (measured with `du -sh /root/.cache` inside the 
running 1.9.0 image). `--no-cache-dir` skips that write and brings the pip layer 
down to roughly its installed-package size, with no runtime impact.
   
   ## How was this patch tested?
   
   1. **Probe build proves the deny rule actually fires.** Synthesized a 200 MB 
fake JAR at `spark-shaded/target/fake-jar.jar`, then built a two-line 
Dockerfile (`FROM alpine`, then `COPY ./spark-shaded/ /shaded/`) twice, once 
with the new `.dockerignore` and once with the old:
   
      | Variant | `/shaded` size in resulting image |
      |---|---|
      | With `**/target/` deny rule | **24 KB** (pom.xml + .gitignore only) |
      | Without the deny rule | **200 MB** (fake jar leaked in) |
   
   2. **Full image rebuild confirms `--no-cache-dir` shrinks the pip layer 
locally.** Local tree has no `target/` (clean checkout), so the deny rule is a 
no-op for our local build; the win comes purely from `--no-cache-dir`:
   
      | Image | Total size | `pip install -r` layer |
      |---|---|---|
      | `sedona:dev` (master @ HEAD) | 5.06 GB | 1.48 GB |
      | `sedona:trim` (this PR) | **4.81 GB** | **1.24 GB** |
   
   3. **Existing Docker-build CI matrix exercises the change.** The path filter 
(`docker/**`) widened in #2889 means this PR triggers `docker-build.yml`, which 
runs `./docker/build.sh ... local ...` and `docker/test-notebooks.sh` against 
the resulting image — so the existing 6-notebook test suite verifies the new 
dockerfile end-to-end.
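
The probe in step 1 could be reproduced roughly as follows. This is a sketch under assumptions, not the exact commands used for this PR: the temporary context, the placeholder `pom.xml`, the `probe:deny` / `probe:allow` tags, and the `RUN_DOCKER_PROBE` / `FAKE_JAR_MIB` variables are all hypothetical. It also uses a plain `.dockerignore` at the context root rather than the Dockerfile-specific ignore file the repository uses, and the Docker steps are opt-in because they need a running daemon.

```shell
#!/bin/sh
# Sketch of the probe build: a throwaway context with a fake JAR under
# spark-shaded/target/, built with and without the deny rule.
set -eu

FAKE_JAR_MIB=${FAKE_JAR_MIB:-200}   # probe size, in MiB
ctx=$(mktemp -d)
mkdir -p "$ctx/spark-shaded/target"
echo '<project/>' > "$ctx/spark-shaded/pom.xml"   # placeholder, not the real pom

# Synthesize the fake JAR from zero bytes.
dd if=/dev/zero of="$ctx/spark-shaded/target/fake-jar.jar" \
  bs=1048576 count="$FAKE_JAR_MIB" 2>/dev/null
fake_jar_bytes=$(wc -c < "$ctx/spark-shaded/target/fake-jar.jar" | tr -d ' ')

printf 'FROM alpine\nCOPY ./spark-shaded/ /shaded/\n' > "$ctx/Dockerfile"

probe() {
  # $1: image tag, $2: contents to write into .dockerignore
  printf '%s\n' "$2" > "$ctx/.dockerignore"
  docker build -q -t "$1" "$ctx" >/dev/null
  docker run --rm "$1" du -sh /shaded
}

# Opt-in: requires a running Docker daemon and pulls the alpine base image.
if [ "${RUN_DOCKER_PROBE:-0}" = 1 ]; then
  probe probe:deny  '**/target/'
  probe probe:allow '# no deny rule'
fi

echo "fake jar: $fake_jar_bytes bytes"
rm -rf "$ctx"
```

With `RUN_DOCKER_PROBE=1`, the two `du -sh /shaded` lines should differ by roughly the size of the fake JAR if the deny rule is taking effect.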
   
   ## Expected impact on the next 1.9.0 re-publish
   
   Combining both fixes on a release-machine tree should drop the published 
image from 4.03 GB → ~3.0 GB compressed (roughly the 1.8.1 baseline). Recipe:
   
   ```bash
   # belt-and-suspenders, in case the new .dockerignore misses anything
   git clean -fdX -- spark-shaded/ python/
   ./docker/build.sh 4.0.1 1.9.0 release 33.5
   ```
   
   `release` mode does `--platform linux/amd64,linux/arm64 --output 
type=registry`, so it pushes both arches to Docker Hub directly.
   
   ## Did this PR include necessary documentation updates?
   
   - No public API changes.
   - No documentation updates needed; the `.dockerignore` comment block 
explains the regression and the rationale for any future contributor running 
`git blame`.


