dxdc opened a new issue, #43741: URL: https://github.com/apache/arrow/issues/43741
### Describe the bug, including details regarding any error messages, version, and platform.

## Summary

When using `pyarrow.csv.read_csv` with `ReadOptions(use_threads=True)` and encountering a `UnicodeDecodeError`, Python hangs indefinitely during the shutdown process. This issue occurs consistently across multiple Python versions and `pyarrow` versions.

Due to the proprietary nature of the data, the exact file causing the issue cannot be submitted. Attempts to substantially alter the file to remove sensitive information result in the problem disappearing, making it challenging to provide a reproducible test case.

I hope that someone familiar with the internals of the `pyarrow.csv` module, particularly with the threading and shutdown procedures, can help identify and resolve this issue. Given the difficulty in reproducing the problem with a non-sensitive file, insights from those who have a deeper understanding of the relevant code might be crucial in pinpointing the root cause.

## Steps to Reproduce

1. Run the following Python script:

   ```python
   import atexit

   import pyarrow.csv as pv

   @atexit.register
   def on_exit():
       print("Program exited successfully.")

   # setting use_threads to False does not hang python
   read_options = pv.ReadOptions(encoding="big5", use_threads=True)
   parse_options = pv.ParseOptions(delimiter="|")

   with open("sample.txt", "rb") as f:
       try:
           table = pv.read_csv(f, read_options=read_options, parse_options=parse_options)
       except Exception as e:
           print(f"An error occurred: {e}")
           raise
   ```

2. Use a file (`sample.txt`) that contains data in an encoding (e.g., Big5, Shift-JIS) likely to trigger a `UnicodeDecodeError`. **Note:** The file used in testing contains proprietary data and cannot be submitted. Efforts to create a similar file by altering the original have resulted in the issue no longer being reproducible.

3. Observe that the script prints "Program exited successfully." but then hangs indefinitely during the Python shutdown process.

## Expected Behavior

The script should exit cleanly after execution, even if a `UnicodeDecodeError` occurs.

## Actual Behavior

The script hangs indefinitely during the logging shutdown process after encountering a `UnicodeDecodeError`. This behavior is consistent when `use_threads=True` is set.

## Output

The output includes a traceback ending with a `UnicodeDecodeError`, followed by a hang during the logging shutdown process:

```
  File "pyarrow/_csv.pyx", line 1261, in pyarrow._csv.read_csv
  File "pyarrow/_csv.pyx", line 1270, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
  File "pyarrow/io.pxi", line 1973, in pyarrow.lib._cb_transform
  File "pyarrow/io.pxi", line 2014, in pyarrow.lib.Transcoder.call
UnicodeDecodeError: 'big5' codec can't decode byte 0x96 in position 68: illegal multibyte sequence
Program exited successfully.
```
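The pdb step trace that follows was captured with a variant of the script above whose `atexit` hook drops into the debugger after printing. A sketch of that hook, inferred from the `on_exit()->None -> pdb.set_trace()` frames in the trace (it is not part of the repro script itself):

```python
import atexit
import pdb

@atexit.register
def on_exit():
    print("Program exited successfully.")
    pdb.set_trace()  # step through interpreter shutdown from here
```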
Below is the detailed pdb step trace after the program exits:

```
--Return--
/path/to/your/script.py(15)on_exit()->None
-> pdb.set_trace()
(Pdb) n
--Call--
/usr/lib/python3.11/logging/__init__.py(2170)shutdown()
-> def shutdown(handlerList=_handlerList):
(Pdb) n
/usr/lib/python3.11/logging/__init__.py(2177)shutdown()
-> print(handlerList)
(Pdb) n
[<weakref at 0x10ac5d210; to '_StderrHandler' at 0x10abf3690>]
/usr/lib/python3.11/logging/__init__.py(2178)shutdown()
-> for wr in reversed(handlerList[:]):
(Pdb) n
/usr/lib/python3.11/logging/__init__.py(2181)shutdown()
-> try:
(Pdb) n
/usr/lib/python3.11/logging/__init__.py(2182)shutdown()
-> h = wr()
(Pdb) n
/usr/lib/python3.11/logging/__init__.py(2183)shutdown()
-> if h:
(Pdb) n
/usr/lib/python3.11/logging/__init__.py(2184)shutdown()
-> try:
(Pdb) n
/usr/lib/python3.11/logging/__init__.py(2185)shutdown()
-> h.acquire()
(Pdb) n
/usr/lib/python3.11/logging/__init__.py(2186)shutdown()
-> h.flush()
(Pdb) n
/usr/lib/python3.11/logging/__init__.py(2187)shutdown()
-> h.close()
(Pdb) n
/usr/lib/python3.11/logging/__init__.py(2195)shutdown()
-> h.release()
(Pdb) n
/usr/lib/python3.11/logging/__init__.py(2178)shutdown()
-> for wr in reversed(handlerList[:]):
(Pdb) n
^C
#### if I don't CTRL-C it hangs forever
```

## Environment

- Python versions tested: 3.9, 3.10, 3.11, 3.12
- `pyarrow` versions tested: 9.0.0 to 17.0.0
- OS tested: Mac and Linux, including AWS Fargate

## Additional Information

- The issue occurs consistently when a `UnicodeDecodeError` is raised during CSV parsing with `use_threads=True`.
- It may also be related to an inconsistent number of columns in the file.
- Specifying the delimiter is not required to trigger the hang.
- Disabling threading (`use_threads=False`) resolves the issue (see the workaround sketch at the end of this report).
- The problematic file contains non-ASCII data, but attempts to generate a similar "fake" file with random ASCII data and specific byte substitutions have not successfully reproduced the issue.

## Suggested Priority

High. The hang is significant because it prevents Python from exiting cleanly, which could impact applications relying on `pyarrow` for multi-threaded CSV processing.

Please let me know if additional information is required.

### Component(s)

Python
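## Workaround (sketch)

For reference, a minimal sketch of the workaround mentioned under Additional Information, assuming the same Big5-encoded, pipe-delimited input. Disabling threaded reads avoids the hang:

```python
import pyarrow.csv as pv

# Same options as the repro script, but with threaded reads disabled;
# in testing this allowed the interpreter to exit cleanly.
read_options = pv.ReadOptions(encoding="big5", use_threads=False)
parse_options = pv.ParseOptions(delimiter="|")

with open("sample.txt", "rb") as f:
    table = pv.read_csv(f, read_options=read_options, parse_options=parse_options)
```

If threaded reads are required, another option is to transcode to UTF-8 in pure Python first, so that any `UnicodeDecodeError` is raised before Arrow's threaded reader is involved (untested against the problematic file):

```python
import io

import pyarrow.csv as pv

with open("sample.txt", "rb") as f:
    text = f.read().decode("big5")  # any UnicodeDecodeError surfaces here instead

table = pv.read_csv(
    io.BytesIO(text.encode("utf-8")),            # read_csv defaults to utf8 encoding
    read_options=pv.ReadOptions(use_threads=True),
    parse_options=pv.ParseOptions(delimiter="|"),
)
```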