mbutrovich commented on PR #3172:
URL:
https://github.com/apache/datafusion-comet/pull/3172#issuecomment-4245078575
Claude summarized my notes for me. Hopefully it didn't transcribe anything
wrong or hallucinate :)
> ## PR #3172: `array_position`
>
> Good implementation overall. Spark compatibility looks correct: returns
`Int64`/`0` (not DataFusion's `UInt64`/`null`), NaN equality handled properly,
null array/element returns null, null values within arrays are skipped, 1-based
indexing is right. Benchmarks show 1.1-1.2X faster than Spark.
>
> ### NULL buffer handling
>
> All the typed paths (`position_primitive`, `position_float`,
`position_boolean`, `position_string`, `position_fallback`) build a
`Vec<Option<i64>>` and convert to `Int64Array::from()`. This constructs the
null buffer element-by-element.
>
> Recent DataFusion optimizations (e.g. apache/datafusion#21464,
apache/datafusion#21471, apache/datafusion#21482) show 3-37% improvement by
computing the null buffer upfront:
>
> 1. `NullBuffer::union(list_array.nulls(), element.nulls())` to get the
combined output nulls
> 2. Build `Vec<i64>` (not `Option<i64>`) for the values
> 3. Construct via `Int64Array::new(ScalarBuffer::from(values),
combined_nulls)`
>
> This avoids per-row `is_null()` checks in the hot loop and the Option
tracking overhead. Not a blocker given the speedup is already decent, but it's
a straightforward optimization that could widen the gap further.
>
> ### Signature
>
> `Signature::variadic_any` accepts any number of arguments of any type.
Since `array_position` always takes exactly 2,
`Signature::new(TypeSignature::Any(2), Volatility::Immutable)` would be more
precise. The serde already validates so this is just defense in depth.
>
> ### Unwrap on downcast
>
> `downcast_ref::<GenericListArray<O>>().unwrap()` in
`generic_array_position` could panic on invalid input. Should be unreachable
given the match guard, but `.ok_or_else(|| DataFusionError::Internal("expected
list array".into()))?` would be safer and consistent with patterns elsewhere.
>
> ### Test gap
>
> Consider adding a nested array test like `array_position(array(array(1,2),
array(3,4)), array(1,2))`. Spark's unit tests cover this case and it would
exercise the `position_fallback` code path.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]