andygrove opened a new pull request, #4325:
URL: https://github.com/apache/datafusion-comet/pull/4325
## Which issue does this PR close?
Closes #.
## Rationale for this change
The two-module split between `comet-common` and `comet-spark` was originally
motivated by Arrow shading: `common` shaded Arrow into
`org.apache.comet.shaded.arrow.*` at its package phase, and `spark` depended on
`common` with Arrow excluded so it didn't see the unshaded names directly. The
split has become a recurring friction point for any code that needs both Arrow
vectors and Spark internals, since such code can only live in one module or the
other but not naturally in both. The recent date/time JVM UDF prototype work
hit this directly: a UDF that needs to call into Spark's `TimestampFormatter`
cannot live in `common` (no Spark dependency), while a UDF that needs to
operate on Arrow vectors cannot easily live in `spark`: its compiled
signatures reference the shaded Arrow types, and its Scala pickled-type
metadata points at the unshaded `org.apache.arrow.memory.RootAllocator`,
which the `spark` pom excludes from its classpath.
Folding `common` into `spark` removes the split and the friction. The Arrow
relocation is preserved by moving the shade-plugin config from `common/pom.xml`
into `spark`'s existing shade execution.
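For orientation, the moved relocation looks roughly like the sketch below, placed in `spark/pom.xml`'s existing `maven-shade-plugin` execution. This is an illustrative fragment, not the exact diff: the element names follow the standard shade-plugin schema, and the exclude patterns are approximations of the JNI-class exclusions described below.

```xml
<!-- Sketch of the Arrow relocation moved from common/pom.xml into spark's
     shade execution, alongside the existing protobuf and guava relocations.
     Exclude patterns are approximations; the PR diff is authoritative. -->
<relocation>
  <pattern>org.apache.arrow</pattern>
  <shadedPattern>org.apache.comet.shaded.arrow</shadedPattern>
  <excludes>
    <!-- JNI entry points must keep their original names so the native
         library can resolve them via JNI class lookup -->
    <exclude>org.apache.arrow.c.jni.*</exclude>
    <exclude>org.apache.arrow.c.ArrayStreamExporter$ExportedArrayStreamPrivateData</exclude>
  </excludes>
</relocation>
```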
## What changes are included in this PR?
Mechanical refactor; no functional code changes.
- All sources, tests, resources, and Spark-version shim directories
(`spark-3.4`, `spark-3.5`, `spark-3.x`, `spark-4.x`) moved from `common/` to
`spark/` via `git mv` so history follows.
- `common/pom.xml` deleted. Its responsibilities folded into `spark/pom.xml`:
- Compile-scope deps: `parquet-column`, `parquet-format-structures`,
`arrow-vector`, `scala-collection-compat`.
- `arrow-memory-unsafe` and `arrow-c-data` elevated from test-scope to
compile-scope (the test-scope workaround existed because Arrow shading happened
in a sibling module's package phase; with one module this is no longer needed).
- Arrow `<relocation>` block (and the JNI-class relocation exclusions for
`org.apache.arrow.c.jni.*` and
`ArrayStreamExporter$ExportedArrayStreamPrivateData`) added to spark's existing
shade execution alongside protobuf and guava relocations.
- Arrow `<include>` added to the shade `<artifactSet>`; `arrow-vector`
codegen filter and `arrow-git.properties` excludes added to the global filter.
- `git-commit-id-maven-plugin` (generates `comet-git-info.properties`)
moved to spark.
- Native library `<resources>` block (bundles `libcomet.{dylib,so}` under
`org/apache/comet/{platform}/{arch}` inside the jar) moved to spark.
- Parent `pom.xml`: `<module>common</module>` removed.
- `spark-integration/pom.xml`: dep on `comet-common` removed; an Arrow
`<exclusion>` added to its dep on `comet-spark` (since the JNI-related Arrow
classes are intentionally kept at `org.apache.arrow.*` inside the shaded
`comet-spark` jar and would otherwise duplicate against a transitive
`arrow-c-data`).
- Duplicate `log4j.properties` / `log4j2.properties` resource files
(identical content in common and spark) dropped from the merged tree.
- `Makefile`: `common/target/...` references changed to `spark/target/...`
(native-lib jar packaging); `./mvnw compile -pl common` in the test-rust target
changed to a full compile.
- `dev/release/build-release-comet.sh`: `common/target` path reference
updated to `spark/target`.
- `.github/workflows/spark_sql_test.yml` and `iceberg_spark_test.yml`: path
filters that referenced `common/src/...` and `common/pom.xml` removed (the
contents are now under `spark/`, which is already in the filter list).
## How are these changes tested?
Full build (`make`) passes against the default Spark 4.1 profile. Downstream
PRs that previously hit the cross-module friction (notably the date/time JVM
UDF DateFormat work) will be the most natural exercise of the new layout.
This PR was scaffolded with the project's brainstorming and writing-plans
skills.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]