iabhi4 opened a new issue, #46589: URL: https://github.com/apache/arrow/issues/46589
### Describe the bug, including details regarding any error messages, version, and platform.

### Description

The `utf8_is_digit` kernel in `pyarrow.compute` does not fully replicate Python's `str.isdigit()` behavior, especially with certain Unicode digit characters. For example, the character `'³'` (U+00B3 SUPERSCRIPT THREE) returns `True` with Python’s `str.isdigit()` but returns `False` when passed to `pyarrow.compute.utf8_is_digit`.

This divergence leads to downstream inconsistencies, particularly in pandas when using `StringDtype(storage="pyarrow")`.

---

### Reproduction

```python
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(['3', '٣', '५', '123', '³'])
print(pc.utf8_is_digit(arr).to_pylist())
```

**Output:**

```
[True, True, True, True, False]  # <-- '³' incorrectly returns False
```

**Expected Output (matches `str.isdigit()`):**

```
[True, True, True, True, True]
```

---

### Notes

- The issue seems to stem from the implementation of `IsDigitUnicode::PredicateCharacterAll` not including characters in the Unicode "No" (Number, Other) category, such as superscript digits (`³`, `²`, etc.).
- Python's behavior can be verified as:

  ```python
  print("³".isdigit())  # True
  ```

---

### Impact

This affects pandas string operations like `.str.isdigit()` when using `pyarrow` storage. Python string-based behavior passes, but pyarrow-based behavior fails for characters like `'³'`.

---

### System Info

Tested with:

- PyArrow 20.0.0 (pip-installed)
- PyArrow `main` 0.1.dev17578+g218c886
- Python 3.12
- Debian-based Linux (Ubuntu)

### Component(s)

Python
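To make the point in the Notes concrete, here is a minimal sketch that uses only Python's standard `unicodedata` module (it does not call PyArrow) to show how Python classifies the characters involved. The helper name `describe` is hypothetical and only for illustration. It suggests that matching `str.isdigit()` probably depends on whether a character carries a digit value rather than on the "No" category as a whole, since `'½'` is also in "No" but is not a digit for Python.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Summarize how Python classifies a single character (illustrative helper)."""
    return {
        "char": ch,
        # General category, e.g. "Nd" (decimal digit) or "No" (other number).
        "category": unicodedata.category(ch),
        "isdecimal": ch.isdecimal(),
        "isdigit": ch.isdigit(),
        # unicodedata.digit returns the default (None) when no digit value exists.
        "has_digit_value": unicodedata.digit(ch, None) is not None,
    }


# The last character, '½', is an extra example that is not in the report above.
for ch in ["3", "٣", "५", "³", "½"]:
    print(describe(ch))
```

In this sketch `'³'` is reported with category `No`, `isdecimal` false, and `isdigit` true, while `'½'` shares the `No` category but is not a digit. A check based purely on the `Nd` category would therefore miss `'³'`, and one that accepted all of `No` would wrongly accept `'½'`.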