dmitry-chirkov-dremio opened a new issue, #49817:
URL: https://github.com/apache/arrow/issues/49817
### Describe the bug, including details regarding any error messages,
version, and platform.
`arrow::Decimal128::FromString` silently truncates (wraps mod 2¹²⁸) when the
input string has more than 38 significant digits, producing a garbage unscaled
value **and returning `Status::OK`**. The out-parameter `precision` does
reflect the true parsed precision (> 38), but the `Decimal128` value itself is
already corrupted, and most callers do not validate `precision` against
`Decimal128::kMaxPrecision`.
The same issue applies symmetrically in `Decimal256::FromString` when
precision > 76 - both paths share `DecimalFromString<T>` in
`cpp/src/arrow/util/decimal.cc`.
**Why?**
`DecimalFromString` feeds the digit string into `ShiftAndAdd`, which
multiplies-and-adds into a fixed-size `uint64_t` array sized to the target
decimal's bit width:
```cpp
std::array<uint64_t, Decimal::kBitWidth / 64> little_endian_array{};
ShiftAndAdd(dec.whole_digits, little_endian_array.data(),
little_endian_array.size());
ShiftAndAdd(dec.fractional_digits, little_endian_array.data(),
little_endian_array.size());
```
`ShiftAndAdd` carries high bits only through `out_size` limbs and drops the
remaining carry on the floor - no overflow check. `DecimalFromString` then
returns `Status::OK()` without validating `parsed_precision <= kMaxPrecision`
(and also without validating `parsed_scale <= kMaxScale`).
Relevant code (paths as of apache/arrow `main`):
- `ShiftAndAdd` - `cpp/src/arrow/util/decimal.cc:747`
- `DecimalFromString` - `cpp/src/arrow/util/decimal.cc:857`
**Repro**
```cpp
// arrow_decimal_repro.cc
#include <arrow/util/decimal.h>
#include <iostream>
#include <string>
int main() {
const std::string input =
"1.55555555555555555555555555555555555555555555555555";
arrow::Decimal128 dec;
int32_t precision = 0, scale = 0;
auto st = arrow::Decimal128::FromString(input, &dec, &precision, &scale);
std::cout << "input: " << input << "\n"
<< "status: " << st.ToString() << "\n"
<< "precision: " << precision << "\n"
<< "scale: " << scale << "\n"
<< "dec (raw): " << dec.ToString(0) << "\n";
return 0;
}
```
Observed output (Arrow `main`, macOS arm64, clang++ -std=c++17):
```
input: 1.55555555555555555555555555555555555555555555555555
status: OK
precision: 51
scale: 50
dec (raw): 151473969238762845885664163847966963939
```
That unscaled value is exactly the input (51-digit integer
`155555555555555555555555555555555555555555555555555`) reduced mod 2¹²⁸:
```python
>>> 155555555555555555555555555555555555555555555555555 % (1 << 128)
151473969238762845885664163847966963939
```
**Desired (?) behavior**
Return `Status::Invalid` when parsed precision exceeds `kMaxPrecision` (or
parsed scale exceeds `kMaxScale`) for the target decimal type, matching the
error-reporting contract of `Decimal::Rescale` and friends.
**Impact (breaking change?)**
Any caller that trusts `FromString` to fail on invalid input silently
produces wrong answers for user-supplied decimal strings longer than 38
significant digits. Downstream, Gandiva's `castDECIMAL_utf8` - and any compute
kernel that routes VARCHAR-to-DECIMAL through `FromString` before rescaling via
`Decimal128::Rescale` / `gandiva::decimalops::Convert` - returns numerically
wrong results for perfectly legal SQL casts such as:
```sql
CAST('1.55555555555555555555555555555555555555555555555555' AS
DECIMAL(38,37))
-- expected: 1.5555555555555555555555555555555555556 (HALF_UP)
-- observed: 0.0000000000015147396923876284588566416
```
### Component(s)
C++, Gandiva
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]