mbutrovich commented on PR #3172:
URL: 
https://github.com/apache/datafusion-comet/pull/3172#issuecomment-4245078575

   Claude summarized my notes for me. Hopefully it didn't transcribe anything 
wrong or hallucinate :)
   
   > ## PR #3172: `array_position`
   > 
   > Good implementation overall. Spark compatibility looks correct: returns 
`Int64`/`0` (not DataFusion's `UInt64`/`null`), NaN equality handled properly, 
null array/element returns null, null values within arrays are skipped, 1-based 
indexing is right. Benchmarks show 1.1-1.2X faster than Spark.
   > 
   > ### NULL buffer handling
   > 
   > All the typed paths (`position_primitive`, `position_float`, 
`position_boolean`, `position_string`, `position_fallback`) build a 
`Vec<Option<i64>>` and convert to `Int64Array::from()`. This constructs the 
null buffer element-by-element.
   > 
   > Recent DataFusion optimizations (e.g. apache/datafusion#21464, 
apache/datafusion#21471, apache/datafusion#21482) show 3-37% improvement by 
computing the null buffer upfront:
   > 
   > 1. `NullBuffer::union(list_array.nulls(), element.nulls())` to get the 
combined output nulls
   > 2. Build `Vec<i64>` (not `Option<i64>`) for the values
   > 3. Construct via `Int64Array::new(ScalarBuffer::from(values), 
combined_nulls)`
   > 
   > This avoids per-row `is_null()` checks in the hot loop and the Option 
tracking overhead. Not a blocker given the speedup is already decent, but it's 
a straightforward optimization that could widen the gap further.
   > 
   > ### Signature
   > 
   > `Signature::variadic_any` accepts any number of arguments of any type. 
Since `array_position` always takes exactly 2, 
`Signature::new(TypeSignature::Any(2), Volatility::Immutable)` would be more 
precise. The serde already validates so this is just defense in depth.
   > 
   > ### Unwrap on downcast
   > 
   > `downcast_ref::<GenericListArray<O>>().unwrap()` in 
`generic_array_position` could panic on invalid input. Should be unreachable 
given the match guard, but `.ok_or_else(|| DataFusionError::Internal("expected 
list array".into()))?` would be safer and consistent with patterns elsewhere.
   > 
   > ### Test gap
   > 
   > Consider adding a nested array test like `array_position(array(array(1,2), 
array(3,4)), array(1,2))`. Spark's unit tests cover this case and it would 
exercise the `position_fallback` code path.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to