andygrove commented on PR #4672:
URL:
https://github.com/apache/datafusion-comet/pull/4672#issuecomment-4800350232
Thanks for the rework @sandugood, and for pulling the benchmark numbers and
the `cargo bench` invocation into the description. The primitive, date, and
timestamp paths are implemented the way #4626 was after. Reading the values as
a slice and building a `NullBuffer` from the flipped Spark bitmap is the right
shape, and the speedups are great.
One thing I want to call out as a nice catch: threading the timezone through
and calling `.with_timezone_opt(timezone)` is actually load-bearing, not just
defensive. `PrimitiveBuilder::append_array` asserts `data_type` equality
(`primitive_builder.rs:291`), so without it the timestamp path would panic
whenever the column carries a timezone. Good that you handled it.
A few items before this is ready:
**1. The boolean nullable path is UB, and this revision moved it the wrong
way.** This is item 1 from the earlier review. The new nullable branch does:
```rust
let slice = unsafe { std::slice::from_raw_parts(self.element_offset as
*const bool, num_elements) };
for (idx, &value) in slice.iter().enumerate() { ... }
```
Rust requires every `bool` to be exactly `0` or `1`. For null slots the JVM
byte is not guaranteed to be `0`/`1`, and `for (idx, &value)` materializes the
`bool` out of each byte even for slots that get skipped, which is UB. The
previous version of this branch read `*const u8` and did `*ptr != 0`, and the
non-nullable branch right below still does the safe thing:
```rust
let values = unsafe { std::slice::from_raw_parts(self.element_offset as
*const u8, num_elements) };
for &value in values { builder.append_value(value != 0); }
```
Could you mirror that `*const u8` pattern in the nullable branch
(`builder.append_value(value != 0)`)? Since Spark stores booleans as one byte
each rather than bit-packed, a per-element loop here is reasonable, so the `u8`
cast is really all that is needed. This is very likely the UB you hit in CI
earlier, since there is a `miri` workflow. Worth confirming against a nullable
boolean array with a null slot once the cast is in.
**2. Copy-paste leftover in `append_dates_to_builder`.** The `debug_assert!`
message reads `"append_timestamps: element_offset is null"` and the comment
above says `contiguous i64 data`, but this is the dates path. Both should say
`append_dates` and `i32`.
**3. A couple of targeted tests would help.** The bitmap flip and the
partial-byte handling are easy to get subtly wrong. Would you be open to adding
unit tests for nullable arrays where `num_elements` is not a multiple of 8 with
a null in the final byte (say 7, 9, 17), plus all-null and all-valid cases?
That locks in the trailing-bit behavior, which currently relies on the
`BooleanBuffer` length trimming the flipped padding bits.
**4. Formatting.** The `process_primitive_lists` macro in `row.rs` looks
like the indentation drifted. A quick `cargo fmt` should clean it up.
**5. Minor.** The byte reinterpretation of the null word assumes
little-endian bit ordering. That matches Arrow and every platform we target, so
it is fine, but a one-line comment next to the `null_words as *const u8` cast
noting the little-endian assumption would help the next reader.
The boolean path is the only blocker. Everything else is small. Once that is
a `u8` read I think this is in good shape.
*Disclosure: this review was assisted by an AI tool (Anthropic Claude via
Claude Code). I read the diff, the linked issue, and the prior review thread
before forming these comments, and I take responsibility for them.*
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]