zanmato1984 opened a new issue, #45506: URL: https://github.com/apache/arrow/issues/45506
### Describe the enhancement requested After fixing #44513 and #45334, I kept looking for possible overflow risks in our Swiss join implementation. And my finding follows. ### Background In the Swiss table, a "block" consists of `8` keys (rows). When the number of rows is large enough, a block occupies `40` bytes, aka. `num_block_bytes`: `4` bytes for each key and one `8` bytes header. Blocks are stored continuously in a buffer namely `uint8_t * blocks_`. So locating the address of `block_id`-th block requires indexing like: ``` blocks_ + num_block_bytes * block_id ``` ### Risks The limit of number of rows in Swiss table is `2^32`. So we can have `2^32 / 8` blocks at most, therefore the `block_id` is normally represented using `uint32_t`. The `num_block_bytes` is represented using regular `int`. If no explicit type promotion is conducted, `num_block_bytes * block_id` will perform 32-bit multiplication and overflow may happen (`2^32 / 8 * 40` > `2^32`). In our code base, there are places where such calculations are done with promoting to 64-bit multiplication so overflow is avoided, to name a few: https://github.com/apache/arrow/blob/e79d60d943700494143c2b1d4e52041a555c901d/cpp/src/arrow/compute/key_map_internal.cc#L262-L263 https://github.com/apache/arrow/blob/e79d60d943700494143c2b1d4e52041a555c901d/cpp/src/arrow/compute/key_map_internal.cc#L408-L409 However requiring such explicit type promotion is error-prone. What may cause real trouble is where such calculations are still in 32-bit, there is one: https://github.com/apache/arrow/blob/e79d60d943700494143c2b1d4e52041a555c901d/cpp/src/arrow/compute/key_map_internal.cc#L226-L227 (I wish I could come up with a concrete test case that such overflow results in wrong data - it's possible. But it's non-trivial and wouldn't be practical to run in limited resources.) Given that such code are either correct but error-prone, or possible for real overflow, and once issues happen I can't imagine how painful the debugging will be, we should refactor them in a more overflow-safe fashion. ### Component(s) C++ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org