[I] [C++][Acero] Swiss table still has risks of overflow [arrow]

via GitHub Tue, 11 Feb 2025 22:42:17 -0800


zanmato1984 opened a new issue, #45506:
URL: https://github.com/apache/arrow/issues/45506

### Describe the enhancement requested

After fixing #44513 and #45334, I kept looking for possible overflow risks
in our Swiss join implementation. And my finding follows.

### Background

In the Swiss table, a "block" consists of `8` keys (rows). When the number
of rows is large enough, a block occupies `40` bytes, aka. `num_block_bytes`:
`4` bytes for each key and one `8` bytes header. Blocks are stored continuously
in a buffer namely `uint8_t * blocks_`. So locating the address of
`block_id`-th block requires indexing like:
```
blocks_ + num_block_bytes * block_id
```

### Risks

The limit of number of rows in Swiss table is `2^32`. So we can have `2^32 /
8` blocks at most, therefore the `block_id` is normally represented using
`uint32_t`. The `num_block_bytes` is represented using regular `int`. If no
explicit type promotion is conducted, `num_block_bytes * block_id` will perform
32-bit multiplication and overflow may happen (`2^32 / 8 * 40` > `2^32`).

In our code base, there are places where such calculations are done with
promoting to 64-bit multiplication so overflow is avoided, to name a few:

https://github.com/apache/arrow/blob/e79d60d943700494143c2b1d4e52041a555c901d/cpp/src/arrow/compute/key_map_internal.cc#L262-L263

https://github.com/apache/arrow/blob/e79d60d943700494143c2b1d4e52041a555c901d/cpp/src/arrow/compute/key_map_internal.cc#L408-L409
However requiring such explicit type promotion is error-prone.

What may cause real trouble is where such calculations are still in 32-bit,
there is one:

https://github.com/apache/arrow/blob/e79d60d943700494143c2b1d4e52041a555c901d/cpp/src/arrow/compute/key_map_internal.cc#L226-L227

(I wish I could come up with a concrete test case that such overflow results
in wrong data - it's possible. But it's non-trivial and wouldn't be practical
to run in limited resources.)

Given that such code are either correct but error-prone, or possible for
real overflow, and once issues happen I can't imagine how painful the debugging
will be, we should refactor them in a more overflow-safe fashion.

### Component(s)

C++

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[I] [C++][Acero] Swiss table still has risks of overflow [arrow]

Reply via email to