andygrove opened a new pull request, #4325:
URL: https://github.com/apache/datafusion-comet/pull/4325

   ## Which issue does this PR close?
   
   Closes #.
   
   ## Rationale for this change
   
   The two-module split between `comet-common` and `comet-spark` was originally 
motivated by Arrow shading: `common` shaded Arrow into 
`org.apache.comet.shaded.arrow.*` at its package phase, and `spark` depended on 
`common` with Arrow excluded so it didn't see the unshaded names directly. The 
split has become a recurring friction point for any code that needs both Arrow 
vectors and Spark internals, since such code can only live in one module or the 
other but not naturally in both. The recent date/time JVM UDF prototype work 
hit this directly: a UDF that needs to call into Spark's `TimestampFormatter` 
cannot live in `common` (no Spark dep), but a UDF that needs to operate on 
Arrow vectors cannot easily live in `spark` (shaded Arrow type signatures plus 
Scala pickled-type metadata pointing at the unshaded 
`org.apache.arrow.memory.RootAllocator` that the `spark` pom excludes from its 
classpath).
   
   Folding `common` into `spark` removes the split and the friction. The Arrow 
relocation is preserved by moving the shade-plugin config from `common/pom.xml` 
into `spark`'s existing shade execution.
   
   ## What changes are included in this PR?
   
   Mechanical refactor; no functional code changes.
   
   - All sources, tests, resources, and Spark-version shim directories 
(`spark-3.4`, `spark-3.5`, `spark-3.x`, `spark-4.x`) moved from `common/` to 
`spark/` via `git mv` so history follows.
   - `common/pom.xml` deleted. Its responsibilities folded into `spark/pom.xml`:
     - Compile-scope deps: `parquet-column`, `parquet-format-structures`, 
`arrow-vector`, `scala-collection-compat`.
     - `arrow-memory-unsafe` and `arrow-c-data` elevated from test-scope to 
compile-scope (the test-scope workaround existed because Arrow shading happened 
in a sibling module's package phase; with one module this is no longer needed).
     - Arrow `<relocation>` block (and the JNI-class relocation exclusions for 
`org.apache.arrow.c.jni.*` and 
`ArrayStreamExporter$ExportedArrayStreamPrivateData`) added to spark's existing 
shade execution alongside protobuf and guava relocations.
     - Arrow `<include>` added to the shade `<artifactSet>`; `arrow-vector` 
codegen filter and `arrow-git.properties` excludes added to the global filter.
     - `git-commit-id-maven-plugin` (generates `comet-git-info.properties`) 
moved to spark.
     - Native library `<resources>` block (bundles `libcomet.{dylib,so}` under 
`org/apache/comet/{platform}/{arch}` inside the jar) moved to spark.
   - Parent `pom.xml`: `<module>common</module>` removed.
   - `spark-integration/pom.xml`: dep on `comet-common` removed; an Arrow 
`<exclusion>` added to its dep on `comet-spark` (since the JNI-related Arrow 
classes are intentionally kept at `org.apache.arrow.*` inside the shaded 
`comet-spark` jar and would otherwise duplicate against a transitive 
`arrow-c-data`).
   - Duplicate `log4j.properties` / `log4j2.properties` resource files 
(identical content in common and spark) dropped from the merged tree.
   - `Makefile`: `common/target/...` references changed to `spark/target/...` 
(native-lib jar packaging); `./mvnw compile -pl common` in the test-rust target 
changed to a full compile.
   - `dev/release/build-release-comet.sh`: `common/target` path reference 
updated to `spark/target`.
   - `.github/workflows/spark_sql_test.yml` and `iceberg_spark_test.yml`: path 
filters that referenced `common/src/...` and `common/pom.xml` removed (the 
contents are now under `spark/`, which is already in the filter list).
   
   ## How are these changes tested?
   
   Full build (`make`) passes against the default Spark 4.1 profile. Downstream 
PRs that previously hit the cross-module friction (notably the date/time JVM 
UDF DateFormat work) will be the most natural exercise of the new layout.
   
   This PR was scaffolded with the project's brainstorming and writing-plans 
skills.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to