iabhi4 opened a new issue, #46589:
URL: https://github.com/apache/arrow/issues/46589

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   ### Description
   
   The `utf8_is_digit` kernel in `pyarrow.compute` does not fully replicate 
Python's `str.isdigit()` behavior, especially with certain Unicode digit 
characters.
   
   For example, the character `'³'` (U+00B3 SUPERSCRIPT THREE) returns `True` 
with Python’s `str.isdigit()` but returns `False` when passed to 
`pyarrow.compute.utf8_is_digit`.
   
   This divergence leads to downstream inconsistencies, particularly in pandas 
when using `StringDtype(storage="pyarrow")`.
   
   ---
   
   ### Reproduction
   
   ```python
   import pyarrow as pa
   import pyarrow.compute as pc
   
   arr = pa.array(['3', '٣', '५', '123', '³'])
   print(pc.utf8_is_digit(arr).to_pylist())
   ```
   
   **Output:**
   ```
   [True, True, True, True, False]  # <-- '³' incorrectly returns False
   ```
   
   **Expected Output (matches `str.isdigit()`):**
   ```
   [True, True, True, True, True]
   ```
   
   ---
   
   ### Notes
   
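   - The issue seems to stem from the implementation of 
`IsDigitUnicode::PredicateCharacterAll`, which appears not to accept characters 
in the Unicode "No" (Number, Other) category that Python treats as digits, such 
as superscript digits (`³`, `²`, etc.). Python's `str.isdigit()` accepts any 
character with `Numeric_Type=Digit` or `Numeric_Type=Decimal`, so only part of 
the "No" category qualifies (for example, `'½'` is "No" but is not a digit).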
   - Python's behavior can be verified as:
   
   ```python
   print("³".isdigit())  # True
   ```
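
To make the category distinction above concrete, here is a minimal sketch using only the Python standard library. It is an illustration of Python's side of the comparison, not of Arrow's implementation; the helper name `describe` and the extra sample `'½'` are assumptions added for this example and do not come from the Arrow code base.

```python
import unicodedata


def describe(ch):
    """Summarize how Python classifies a single character (hypothetical helper)."""
    return {
        "char": ch,
        "category": unicodedata.category(ch),  # Unicode general category, e.g. "Nd" or "No"
        "isdecimal": ch.isdecimal(),           # Numeric_Type=Decimal only
        "isdigit": ch.isdigit(),               # Numeric_Type=Decimal or Digit
        "isnumeric": ch.isnumeric(),           # any Numeric_Type
    }


# '3', '٣', '५' are decimal digits ("Nd"); '³' and '½' are both "No",
# but only '³' has Numeric_Type=Digit, so only it passes str.isdigit().
samples = ["3", "٣", "५", "³", "½"]

for ch in samples:
    print(describe(ch))
```

This suggests that matching `str.isdigit()` is not the same as accepting the whole "No" category: `'³'` is "No" and a digit, while `'½'` is "No" and only numeric.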
   
   ---
   
   ### Impact
   
   This affects pandas string operations like `.str.isdigit()` when using 
`pyarrow` storage. With Python string storage, characters like `'³'` are 
reported as digits, but with pyarrow-backed storage they are not.
   
   ---
   
   ### System Info
   
   Tested with:
   - PyArrow 20.0.0 (pip-installed)
   - PyArrow `main` 0.1.dev17578+g218c886
   - Python 3.12
   - Debian-based Linux (Ubuntu)
   
   ### Component(s)
   
   Python

