dmitry-chirkov-dremio opened a new issue, #49614:
URL: https://github.com/apache/arrow/issues/49614

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   `arrow::util::base64_decode` (in `cpp/src/arrow/vendored/base64.cpp`) 
silently stops decoding when it encounters a character that is not in the 
base64 alphabet (`[A-Za-z0-9+/=]`). It returns a partial result with no 
indication that the input was malformed.
   
   For example, decoding `"aGVsbG8gd29ybGQh"` (the valid base64 encoding of 
`"hello world!"`) works correctly. But if someone passes the _raw_ string 
`"hello world!"` to `base64_decode`, it silently returns `"\x85\xe9\xa5"` - the 
partial decode of `"hell"` (4 valid base64 chars) - and discards everything 
from the space onward. No error, no empty result, just silent data corruption.
   
   The root cause is on this line of `base64.cpp`:
   ```
   while (in_len-- && (encoded_string[in_] != '=') && 
is_base64(encoded_string[in_])) {
   ```
   The is_base64() check causes the loop to terminate on any non-base64 
character, and the function then returns whatever partial result was 
accumulated.
   
   **Impact**
   Currently all internal callers pass machine-generated base64 (roundtripped 
through  base64_encode), so they are unaffected. However:
   - Gandiva exposes base64_decode as a user-facing SQL function 
(unbase64/from_base64) via gdv_fn_base64_decode_utf8. If a user passes 
malformed input, they get silent partial results rather than an error.
   - The silent-truncation behavior is a correctness hazard for any future 
caller that doesn't pre-validate its input.
   
   **Expected behavior**
   `base64_decode` should either:
   1. Return a Result<std::string> / Status and report an error on invalid 
characters, or
   2. At minimum, skip invalid characters (like many other base64 decoders do) 
rather than stopping entirely.
   Option (1) is the safer choice since partial silent decoding is almost never 
what a caller wants.
   
   **Reproducer test**
   The following test demonstrates the bug. It can be added to an existing test 
file or run standalone. It passes today (showing the current broken behavior), 
but the EXPECT_EQ on the last line is the wrong result - a correct 
implementation should either decode the full valid input or return an error.
   ```
   #include <gtest/gtest.h>
   #include "arrow/util/base64.h"
   
   TEST(Base64, DecodeStopsOnInvalidCharacter) {
     // "hello world!" base64-encodes to "aGVsbG8gd29ybGQh"
     // Roundtrip works fine:
     std::string original = "hello world!";
     std::string encoded = arrow::util::base64_encode(original);
     EXPECT_EQ(encoded, "aGVsbG8gd29ybGQh");
     EXPECT_EQ(arrow::util::base64_decode(encoded), original);
   
     // But feeding a non-base64 string silently truncates instead of erroring.
     // The space at index 5 causes decoding to stop after "hell" (4 valid 
base64 chars),
     // producing a 3-byte garbage result from partial decoding of "hell".
     std::string bad_input = "hello world!";
     std::string result = arrow::util::base64_decode(bad_input);
   
     // This is the CURRENT (buggy) behavior: partial decode of "hell" → 3 bytes
     EXPECT_EQ(result.size(), 3u);  // only decoded 4 chars ("hell") → 3 output 
bytes
     // "world!" was silently ignored
   
     // A correct implementation should EITHER:
     //   - return an error/status for invalid input, OR
     //   - return an empty string, OR
     //   - at minimum, not silently truncate
     // EXPECT_TRUE(arrow::util::base64_decode(bad_input).ok());  // if it 
returned Result<>
   }
   ```
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to