dxdc opened a new issue, #43741:
URL: https://github.com/apache/arrow/issues/43741
### Describe the bug, including details regarding any error messages,
version, and platform.
## Summary
When using `pyarrow.csv.read_csv` with `ReadOptions(use_threads=True)` and
encountering a `UnicodeDecodeError`, Python hangs indefinitely during the
shutdown process. This issue occurs consistently across multiple Python
versions and `pyarrow` versions. Due to the proprietary nature of the data, the
exact file causing the issue cannot be submitted. Attempts to substantially
alter the file to remove sensitive information result in the problem
disappearing, making it challenging to provide a reproducible test case.
I hope that someone familiar with the internals of the `pyarrow.csv` module,
particularly its threading and shutdown procedures, can help identify and
resolve this issue. Given the difficulty of reproducing the problem with a
non-sensitive file, insight from those with a deeper understanding of the
relevant code may be crucial to pinpointing the root cause.
## Steps to Reproduce
1. Run the following Python script:
```python
import atexit

import pyarrow.csv as pv


@atexit.register
def on_exit():
    print("Program exited successfully.")


# setting use_threads to False does not hang python
read_options = pv.ReadOptions(encoding="big5", use_threads=True)
parse_options = pv.ParseOptions(delimiter="|")

with open("sample.txt", "rb") as f:
    try:
        table = pv.read_csv(f, read_options=read_options,
                            parse_options=parse_options)
    except Exception as e:
        print(f"An error occurred: {e}")
        raise
```
2. Use a file (`sample.txt`) that contains data in an encoding (e.g., Big5,
Shift-JIS) likely to trigger a `UnicodeDecodeError`.
**Note:** The file used in testing contains proprietary data and cannot
be submitted. Efforts to create a similar file by altering the original have
resulted in the issue no longer being reproducible.
3. Observe that the script prints "Program exited successfully." but then
hangs indefinitely during the Python shutdown process.
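For reference, a file that triggers the same `UnicodeDecodeError` can be constructed by injecting an illegal byte into otherwise valid Big5 data: `0x96` is a Big5 lead byte, so placing it immediately before a newline (which is not a valid trail byte) makes the sequence illegal. Per the notes above, synthetic files like this reproduce the decode error but have so far not reproduced the hang, so this sketch is only a starting point:

```python
# Sketch of a candidate trigger file: valid Big5 rows with one illegal
# byte injected. 0x96 is a Big5 lead byte; followed by "\n" the codec
# raises "illegal multibyte sequence", matching the traceback in the
# Output section. Caveat: synthetic files like this have reproduced the
# decode error, but not the shutdown hang.
rows = ["a|b|c"] * 5 + ["x|y"]                 # include an inconsistent-width row
payload = "\n".join(rows).encode("big5")
payload = payload[:5] + b"\x96" + payload[5:]  # inject just before the first "\n"

with open("sample.txt", "wb") as f:
    f.write(payload)

# Confirm the big5 codec rejects the payload.
try:
    payload.decode("big5")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)
```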
## Expected Behavior
The script should exit cleanly after execution, even if a
`UnicodeDecodeError` occurs.
## Actual Behavior
The script hangs indefinitely during the logging shutdown process after
encountering a `UnicodeDecodeError`. This behavior is consistent when
`use_threads=True` is set.
## Output
The output includes a traceback ending with a `UnicodeDecodeError`, followed
by a hang during the logging shutdown process. Below is the detailed Pdb step
trace after the program exits:
```
File "pyarrow/_csv.pyx", line 1261, in pyarrow._csv.read_csv
File "pyarrow/_csv.pyx", line 1270, in pyarrow._csv.read_csv
File "pyarrow/error.pxi", line 155, in
pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
File "pyarrow/io.pxi", line 1973, in pyarrow.lib._cb_transform
File "pyarrow/io.pxi", line 2014, in pyarrow.lib.Transcoder.call
UnicodeDecodeError: 'big5' codec can't decode byte 0x96 in position 68:
illegal multibyte sequence
Program exited successfully.
--Return--
/path/to/your/script.py(15)on_exit()->None
-> pdb.set_trace()
(Pdb) n
--Call--
/usr/lib/python3.11/logging/__init__.py(2170)shutdown()
-> def shutdown(handlerList=_handlerList):
(Pdb) n
/usr/lib/python3.11/logging/__init__.py(2177)shutdown()
-> print(handlerList)
(Pdb) n
[<weakref at 0x10ac5d210; to '_StderrHandler' at 0x10abf3690>]
/usr/lib/python3.11/logging/__init__.py(2178)shutdown()
-> for wr in reversed(handlerList[:]):
(Pdb) n
/usr/lib/python3.11/logging/__init__.py(2181)shutdown()
-> try:
(Pdb) n
/usr/lib/python3.11/logging/__init__.py(2182)shutdown()
-> h = wr()
(Pdb) n
/usr/lib/python3.11/logging/__init__.py(2183)shutdown()
-> if h:
(Pdb) n
/usr/lib/python3.11/logging/__init__.py(2184)shutdown()
-> try:
(Pdb) n
/usr/lib/python3.11/logging/__init__.py(2185)shutdown()
-> h.acquire()
(Pdb) n
/usr/lib/python3.11/logging/__init__.py(2186)shutdown()
-> h.flush()
(Pdb) n
/usr/lib/python3.11/logging/__init__.py(2187)shutdown()
-> h.close()
(Pdb) n
/usr/lib/python3.11/logging/__init__.py(2195)shutdown()
-> h.release()
(Pdb) n
/usr/lib/python3.11/logging/__init__.py(2178)shutdown()
-> for wr in reversed(handlerList[:]):
(Pdb) n
^C #### if I don't CTRL-C it hangs forever
```
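When the hang occurs outside a debugger, a full stack dump of every live thread usually reveals which surviving worker thread is preventing interpreter exit. Below is a minimal, self-contained sketch of the technique using the stdlib `faulthandler` module; the sleeping thread is only a stand-in for a pyarrow CSV worker, and in the real scenario `faulthandler.dump_traceback_later()` armed before exit (or `SIGABRT` with `faulthandler.enable()` active) serves the same purpose:

```python
import faulthandler
import os
import tempfile
import threading
import time

# Diagnostic sketch: dump the stack of every live thread to see which
# one is still running when the interpreter should be exiting. The
# stand-in thread below plays the role of a pyarrow CSV worker.
t = threading.Thread(target=lambda: time.sleep(0.3))
t.start()

# faulthandler writes directly to a file descriptor, so use a real file.
with tempfile.NamedTemporaryFile(mode="w+", delete=False) as f:
    faulthandler.dump_traceback(file=f, all_threads=True)
    path = f.name

t.join()
with open(path) as f:
    dump = f.read()
os.unlink(path)

# One "(most recent call first)" entry appears per live thread.
print(dump.count("most recent call first") >= 2)
```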
## Environment
- Python versions tested: 3.9, 3.10, 3.11, 3.12
- `pyarrow` versions tested: 9.0.0 to 17.0.0
- OS tested: Mac and Linux, including AWS Fargate
## Additional Information
- The issue occurs consistently when a `UnicodeDecodeError` is raised during
CSV parsing with `use_threads=True`.
- The issue may also be related to rows with an inconsistent number of columns.
- Specifying a custom delimiter is not required to trigger the hang.
- Disabling threading (`use_threads=False`) resolves the issue.
- The problematic file contains non-ASCII data, but attempts to generate a
similar "fake" file with random ASCII data and specific byte substitutions have
not successfully reproduced the issue.
## Suggested Priority
High - The hang is significant as it prevents Python from exiting cleanly,
which could impact various applications relying on `pyarrow` for multi-threaded
CSV processing.
Please let me know if additional information is required.
### Component(s)
Python