zanmato1984 opened a new issue, #45506:
URL: https://github.com/apache/arrow/issues/45506

   ### Describe the enhancement requested
   
   After fixing #44513 and #45334, I kept looking for possible overflow risks 
in our Swiss join implementation. And my finding follows.
   
   ### Background
   
   In the Swiss table, a "block" consists of `8` keys (rows). When the number 
of rows is large enough, a block occupies `40` bytes, aka. `num_block_bytes`: 
`4` bytes for each key and one `8` bytes header. Blocks are stored continuously 
in a buffer namely `uint8_t * blocks_`. So locating the address of 
`block_id`-th block requires indexing like:
   ```
   blocks_ + num_block_bytes * block_id
   ```
   
   ### Risks
   
   The limit of number of rows in Swiss table is `2^32`. So we can have `2^32 / 
8` blocks at most, therefore the `block_id` is normally represented using 
`uint32_t`. The `num_block_bytes` is represented using regular `int`. If no 
explicit type promotion is conducted, `num_block_bytes * block_id` will perform 
32-bit multiplication and overflow may happen (`2^32 / 8 * 40` > `2^32`).
   
   In our code base, there are places where such calculations are done with 
promoting to 64-bit multiplication so overflow is avoided, to name a few:
   
https://github.com/apache/arrow/blob/e79d60d943700494143c2b1d4e52041a555c901d/cpp/src/arrow/compute/key_map_internal.cc#L262-L263
   
https://github.com/apache/arrow/blob/e79d60d943700494143c2b1d4e52041a555c901d/cpp/src/arrow/compute/key_map_internal.cc#L408-L409
   However requiring such explicit type promotion is error-prone.
   
   What may cause real trouble is where such calculations are still in 32-bit, 
there is one:
   
https://github.com/apache/arrow/blob/e79d60d943700494143c2b1d4e52041a555c901d/cpp/src/arrow/compute/key_map_internal.cc#L226-L227
   
   (I wish I could come up with a concrete test case that such overflow results 
in wrong data - it's possible. But it's non-trivial and wouldn't be practical 
to run in limited resources.)
   
   Given that such code are either correct but error-prone, or possible for 
real overflow, and once issues happen I can't imagine how painful the debugging 
will be, we should refactor them in a more overflow-safe fashion.
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to