mbutrovich opened a new pull request, #4345: URL: https://github.com/apache/datafusion-comet/pull/4345

## Which issue does this PR close?

Closes #.

## Rationale for this change

Peeled off from #4267 to keep that PR scoped to codegen. The cache-shape change is independent of any consumer and benefits the codegen dispatcher (#4267), the regex `CometUDF` (#4239), and the JSON `CometUDF` (#4305) equally.

The current process-wide `ConcurrentHashMap<String, CometUDF>` requires every `CometUDF` to be strictly stateless: one shared instance services all tasks. A thread-local cache would not help because Tokio work-stealing on the scan-free execution path can move a Spark task's future between workers across batches, losing per-batch state. Keying by Spark task attempt ID gives continuity within a task and isolation across tasks regardless of which worker is polling.

## What changes are included in this PR?

- `CometUdfBridge.INSTANCES` becomes `ConcurrentHashMap<Long, ConcurrentHashMap<String, CometUDF>>` keyed by `(taskAttemptId, className)`.
- A `TaskCompletionListener` registered on the first cache miss for a task evicts the per-task entry on task end.
- `NO_TASK_ID = -1L` sentinel covers calls without a `TaskContext` (unit tests, direct native driver runs); that bucket is not evicted because no task-completion event fires.
- `CometUDF` Scaladoc updates the contract to "may hold per-task state in fields" and documents the single-threaded-per-instance invariant (Spark runs one native future per partition, Tokio polls one future per worker at a time).
- Defensive assertions on `evaluate` preconditions, the post-install `TaskContext` invariant, and the cache-side invariants (single listener registration, non-null cache, reflective-instantiate success).

## How are these changes tested?

No new tests in this PR for the same reason as #4306: the Arrow shading boundary in `common/` blocks unit tests that subclass `CometUDF`. End-to-end coverage lands with each consumer (#4267, #4239, #4305) when it drives the bridge.
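
The cache shape described above can be illustrated with a minimal, self-contained sketch. This is not the code in `CometUdfBridge`: the names `PerTaskCacheSketch`, `FakeTask`, `getOrCreate`, and `hasBucket` are hypothetical, `FakeTask` is a stand-in for Spark's `TaskContext` and its completion listener, and cached values are plain `Object`s rather than `CometUDF` instances. It only shows the two-level map, the listener registered once on the first miss for a task, and the `NO_TASK_ID` bucket that is never evicted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

class PerTaskCacheSketch {
    // Sentinel for calls made without a task context (value taken from the PR text).
    static final long NO_TASK_ID = -1L;

    /** Hypothetical stand-in for Spark's TaskContext; not a Spark API. */
    static final class FakeTask {
        final long attemptId;
        private final List<Runnable> completionListeners = new ArrayList<>();

        FakeTask(long attemptId) { this.attemptId = attemptId; }

        void addCompletionListener(Runnable listener) { completionListeners.add(listener); }

        /** Simulates task end by running every registered listener. */
        void complete() { completionListeners.forEach(Runnable::run); }
    }

    // Outer key: task attempt ID. Inner key: UDF class name.
    private final ConcurrentHashMap<Long, ConcurrentHashMap<String, Object>> instances =
        new ConcurrentHashMap<>();

    /** Returns the instance cached for (task, className), creating it on a miss. */
    Object getOrCreate(FakeTask task, String className, Supplier<Object> factory) {
        long taskId = (task == null) ? NO_TASK_ID : task.attemptId;
        ConcurrentHashMap<String, Object> perTask = instances.computeIfAbsent(taskId, id -> {
            // The mapping function runs once per absent key, so the eviction
            // listener is registered once per task. The NO_TASK_ID bucket gets
            // no listener and therefore stays in the map.
            if (task != null) {
                task.addCompletionListener(() -> instances.remove(id));
            }
            return new ConcurrentHashMap<>();
        });
        return perTask.computeIfAbsent(className, name -> factory.get());
    }

    /** Reports whether a per-task bucket currently exists. */
    boolean hasBucket(long taskId) { return instances.containsKey(taskId); }

    public static void main(String[] args) {
        PerTaskCacheSketch cache = new PerTaskCacheSketch();
        FakeTask task = new FakeTask(7L);
        Object first = cache.getOrCreate(task, "ExampleUdf", Object::new);
        Object second = cache.getOrCreate(task, "ExampleUdf", Object::new);
        System.out.println("same instance within task: " + (first == second));
        task.complete();
        System.out.println("bucket present after task end: " + cache.hasBucket(7L));
    }
}
```

Because the outer key is the task attempt ID rather than the polling thread, the lookup returns the same instance no matter which worker calls `getOrCreate`, which is the property the rationale relies on.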
