dmitry-chirkov-dremio opened a new issue, #49817:
URL: https://github.com/apache/arrow/issues/49817

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   `arrow::Decimal128::FromString` silently truncates (wraps mod 2¹²⁸) when the 
input string has more than 38 significant digits, producing a garbage unscaled 
value **and returning `Status::OK`**. The out-parameter `precision` does 
reflect the true parsed precision (> 38), but the `Decimal128` value itself is 
already corrupted, and most callers do not validate `precision` against 
`Decimal128::kMaxPrecision`.
   
   The same issue applies symmetrically in `Decimal256::FromString` when 
precision > 76 - both paths share `DecimalFromString<T>` in 
`cpp/src/arrow/util/decimal.cc`.
   
   **Why?**
   
   `DecimalFromString` feeds the digit string into `ShiftAndAdd`, which 
multiplies-and-adds into a fixed-size `uint64_t` array sized to the target 
decimal's bit width:
   
   ```cpp
   std::array<uint64_t, Decimal::kBitWidth / 64> little_endian_array{};
   ShiftAndAdd(dec.whole_digits, little_endian_array.data(),
               little_endian_array.size());
   ShiftAndAdd(dec.fractional_digits, little_endian_array.data(),
               little_endian_array.size());
   ```
   
   `ShiftAndAdd` carries high bits only through `out_size` limbs and drops the 
remaining carry on the floor - no overflow check. `DecimalFromString` then 
returns `Status::OK()` without validating `parsed_precision <= kMaxPrecision` 
(and also without validating `parsed_scale <= kMaxScale`).
   
   Relevant code (paths as of apache/arrow `main`):
   
   - `ShiftAndAdd` - `cpp/src/arrow/util/decimal.cc:747`
   - `DecimalFromString` - `cpp/src/arrow/util/decimal.cc:857`
   
   **Repro**
   ```cpp
   // arrow_decimal_repro.cc
   #include <arrow/util/decimal.h>
   #include <iostream>
   #include <string>
   
   int main() {
     const std::string input =
         "1.55555555555555555555555555555555555555555555555555";
   
     arrow::Decimal128 dec;
     int32_t precision = 0, scale = 0;
     auto st = arrow::Decimal128::FromString(input, &dec, &precision, &scale);
   
     std::cout << "input:     " << input << "\n"
               << "status:    " << st.ToString() << "\n"
               << "precision: " << precision << "\n"
               << "scale:     " << scale << "\n"
               << "dec (raw): " << dec.ToString(0) << "\n";
     return 0;
   }
   ```
   
   Observed output (Arrow `main`, macOS arm64, clang++ -std=c++17):
   
   ```
   input:     1.55555555555555555555555555555555555555555555555555
   status:    OK
   precision: 51
   scale:     50
   dec (raw): 151473969238762845885664163847966963939
   ```
   
   That unscaled value is exactly the input (51-digit integer 
`155555555555555555555555555555555555555555555555555`) reduced mod 2¹²⁸:
   
   ```python
   >>> 155555555555555555555555555555555555555555555555555 % (1 << 128)
   151473969238762845885664163847966963939
   ```
   
   **Desired (?) behavior**
   Return `Status::Invalid` when parsed precision exceeds `kMaxPrecision` (or 
parsed scale exceeds `kMaxScale`) for the target decimal type, matching the 
error-reporting contract of `Decimal::Rescale` and friends.
   
   **Impact (breaking change?)**
   Any caller that trusts `FromString` to fail on invalid input silently 
produces wrong answers for user-supplied decimal strings longer than 38 
significant digits. Downstream, Gandiva's `castDECIMAL_utf8` - and any compute 
kernel that routes VARCHAR-to-DECIMAL through `FromString` before rescaling via 
`Decimal128::Rescale` / `gandiva::decimalops::Convert` - returns numerically 
wrong results for perfectly legal SQL casts such as:
   
   ```sql
   CAST('1.55555555555555555555555555555555555555555555555555' AS 
DECIMAL(38,37))
   -- expected: 1.5555555555555555555555555555555555556  (HALF_UP)
   -- observed:  0.0000000000015147396923876284588566416
   ```
   
   
   ### Component(s)
   
   C++, Gandiva


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to