andygrove opened a new issue, #4072:
URL: https://github.com/apache/datafusion-comet/issues/4072

   ## Describe the problem
   
   Profiling with async-profiler on native execution shows a significant amount 
of time spent in metric reporting. This happens on the per-batch hot path, so 
the cost is paid for every batch returned from DataFusion back to the JVM.
   
   ## Where the cost comes from
   
   `update_comet_metric` is called from `jni_api.rs` (see the sketch after this list):
   - Once for every batch returned to the JVM (`executePlan` fast path)
   - Periodically during `ScanExec` polling (every ~100 polls, gated by 
`metrics_update_interval`)
   - At stream end
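
   The cadence matters because the full serialize-and-upcall pipeline described below runs on each of these triggers. The following is a minimal, self-contained sketch of that cadence; the struct and method names (`MetricsReporter`, `flush`, etc.) are illustrative, not the actual `jni_api.rs` / `ScanExec` code, and only the once-per-batch and every-N-polls gating is meant to match the behaviour described above.

   ```rust
   // Hypothetical sketch of the reporting cadence; names are illustrative.
   struct MetricsReporter {
       poll_count: u64,
       // Conceptually `metrics_update_interval` (roughly every 100 polls).
       update_interval: u64,
   }

   impl MetricsReporter {
       fn new(update_interval: u64) -> Self {
           Self { poll_count: 0, update_interval }
       }

       /// Stand-in for `update_comet_metric`: the full walk + serialize + JNI upcall.
       fn flush(&mut self) {
           // walk the plan tree, encode protobuf, call into the JVM ...
       }

       /// Called once for every batch returned to the JVM (`executePlan` fast path).
       fn on_batch_returned(&mut self) {
           self.flush();
       }

       /// Called on every `ScanExec` poll; only flushes every `update_interval` polls.
       fn on_scan_poll(&mut self) {
           self.poll_count += 1;
           if self.poll_count % self.update_interval == 0 {
               self.flush();
           }
       }

       /// Called once when the stream ends.
       fn on_stream_end(&mut self) {
           self.flush();
       }
   }

   fn main() {
       let mut reporter = MetricsReporter::new(100);
       for _ in 0..1_000 {
           reporter.on_scan_poll(); // flushes 10 times
       }
       reporter.on_batch_returned();
       reporter.on_stream_end();
   }
   ```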
   
   For each call, the pipeline does the following:
   
   **Rust side** (`native/core/src/execution/metrics/utils.rs`), sketched below the steps:
   1. `to_native_metric_node` recursively walks the `SparkPlan` tree.
   2. Per node: allocates a fresh `HashMap<String, i64>`, inserts each metric 
with `name.to_string()` (a per-metric allocation), and pushes the children into 
a new `Vec`.
   3. Serializes the resulting `NativeMetricNode` protobuf via `prost`'s `encode_to_vec()`, which allocates a fresh `Vec<u8>`.
   4. `byte_array_from_slice` allocates a new Java `byte[]` and copies the 
protobuf bytes in.
   5. One JNI upcall to `set_all_from_bytes([B)V`.
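
   To make the allocation pattern concrete, here is a simplified, self-contained sketch of steps 1-2, with comments noting where steps 3-5 would plug in. `PlanNode` is a stand-in for `SparkPlan`, and this `NativeMetricNode` is a plain struct rather than the prost-generated protobuf type; only the per-cycle allocation behaviour is meant to match the description above.

   ```rust
   use std::collections::HashMap;

   // Simplified stand-ins: `PlanNode` plays the role of `SparkPlan`, and this
   // `NativeMetricNode` is a plain struct rather than the prost-generated type.
   struct PlanNode {
       metrics: Vec<(&'static str, i64)>,
       children: Vec<PlanNode>,
   }

   struct NativeMetricNode {
       metrics: HashMap<String, i64>,
       children: Vec<NativeMetricNode>,
   }

   // Mirrors the per-cycle cost described above: every call re-allocates a
   // HashMap per node, a String per metric name, and a Vec of children, even
   // though the tree shape and metric names never change for a given plan.
   fn to_native_metric_node(plan: &PlanNode) -> NativeMetricNode {
       let mut metrics = HashMap::new(); // fresh map per node, per update cycle
       for (name, value) in &plan.metrics {
           metrics.insert(name.to_string(), *value); // per-metric String allocation
       }
       let children: Vec<NativeMetricNode> = plan
           .children
           .iter()
           .map(to_native_metric_node) // recursive walk building a new Vec
           .collect();
       NativeMetricNode { metrics, children }
   }

   fn main() {
       let plan = PlanNode {
           metrics: vec![("output_rows", 8192), ("elapsed_compute", 1234)],
           children: vec![PlanNode {
               metrics: vec![("output_rows", 8192)],
               children: vec![],
           }],
       };
       let node = to_native_metric_node(&plan);
       // In the real pipeline the node is then encoded with prost's
       // `encode_to_vec()` (a fresh Vec<u8>), copied into a new Java byte[]
       // via `byte_array_from_slice`, and sent over JNI to `set_all_from_bytes`.
       assert_eq!(node.children.len(), 1);
       assert_eq!(node.metrics["output_rows"], 8192);
   }
   ```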
   
   **JVM side** (`CometMetricNode.scala`):
   
   6. `Metric.NativeMetricNode.parseFrom(bytes)`: full protobuf tree 
re-allocation.
   7. Recursive walk pairing the deserialized tree with the local 
`CometMetricNode` tree via `zip(children)`.
   8. Per metric: `metrics.get(metricName)` lookup on the node's `Map[String, 
SQLMetric]`, then `SQLMetric.set(v)`.
   
   So every update cycle includes a tree of per-node `HashMap` allocations, per-metric `String` allocations, protobuf serialization, a Java byte-array allocation and copy, protobuf deserialization into a fresh tree, and a per-metric hash-map lookup on the JVM side. None of this work carries any state across cycles, even though the tree shape and metric names are stable for the lifetime of the plan.
   
   ## Impact
   
   The overhead is visible in async-profiler flame graphs as a disproportionate share of native-thread CPU during query execution, especially for plans with many nodes and small-to-medium batch sizes, where update frequency is high relative to per-batch work.

