dxdc opened a new issue, #43741:
URL: https://github.com/apache/arrow/issues/43741

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   ## Summary
   When using `pyarrow.csv.read_csv` with `ReadOptions(use_threads=True)` and encountering a `UnicodeDecodeError`, Python hangs indefinitely during interpreter shutdown. The issue occurs consistently across multiple Python and `pyarrow` versions. Because the data is proprietary, the exact file causing the issue cannot be shared; substantially altering the file to remove sensitive information makes the problem disappear, which makes providing a reproducible test case difficult.
   
   I hope someone familiar with the internals of the `pyarrow.csv` module, particularly its threading and shutdown code, can help identify and resolve this issue. Given how hard the problem is to reproduce with a non-sensitive file, insights from those with a deeper understanding of the relevant code may be crucial to pinpointing the root cause.
   
   ## Steps to Reproduce
   1. Run the following Python script:
   
       ```python
       import atexit
       import pyarrow.csv as pv
   
       @atexit.register
       def on_exit():
           print("Program exited successfully.")
   
       # setting use_threads to False does not hang python
       read_options = pv.ReadOptions(encoding="big5", use_threads=True)
       parse_options = pv.ParseOptions(delimiter="|")
   
       with open("sample.txt", "rb") as f:
           try:
            table = pv.read_csv(f, read_options=read_options, parse_options=parse_options)
           except Exception as e:
               print(f"An error occurred: {e}")
               raise
       ```
   
   2. Use a file (`sample.txt`) that contains data in an encoding (e.g., Big5, Shift-JIS) likely to trigger a `UnicodeDecodeError`.
   
       **Note:** The file used in testing contains proprietary data and cannot be submitted. Efforts to create a similar file by altering the original have resulted in the issue no longer being reproducible.
   
   3. Observe that the script prints "Program exited successfully." but then hangs indefinitely during the Python shutdown process.
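
   For reference, here is a minimal sketch of how one might construct a candidate `sample.txt` containing a byte sequence that is illegal in Big5. This is a hypothetical stand-in for the proprietary file, and note that the reporter's own attempts at synthetic files like this did *not* reproduce the hang; it only demonstrates the `UnicodeDecodeError` itself:

```python
# Sketch: build a file whose contents cannot be decoded as Big5.
# 0x96 is a valid Big5 lead byte, but the newline (0x0A) that follows
# it is not a valid trail byte, so decoding raises UnicodeDecodeError
# ("illegal multibyte sequence").
data = b"col1|col2\nabc|\x96\n"

with open("sample.txt", "wb") as f:
    f.write(data)

# Confirm the bytes are undecodable as Big5 before feeding them to pyarrow.
try:
    data.decode("big5")
except UnicodeDecodeError as e:
    print(f"decode fails: {e}")
```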
   
   ## Expected Behavior
   The script should exit cleanly after execution, even if a `UnicodeDecodeError` occurs.
   
   ## Actual Behavior
   The script hangs indefinitely in the logging shutdown step of interpreter teardown after encountering a `UnicodeDecodeError`. The behavior is consistent whenever `use_threads=True` is set.
   
   ## Output
   The output includes a traceback ending with a `UnicodeDecodeError`, followed by a hang during the logging shutdown process. Below is the detailed pdb step trace captured after the atexit handler runs:
   
   ```
    File "pyarrow/_csv.pyx", line 1261, in pyarrow._csv.read_csv
    File "pyarrow/_csv.pyx", line 1270, in pyarrow._csv.read_csv
    File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
    File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
    File "pyarrow/io.pxi", line 1973, in pyarrow.lib._cb_transform
    File "pyarrow/io.pxi", line 2014, in pyarrow.lib.Transcoder.__call__
    UnicodeDecodeError: 'big5' codec can't decode byte 0x96 in position 68: illegal multibyte sequence
   Program exited successfully.
   --Return--
   
   /path/to/your/script.py(15)on_exit()->None
   -> pdb.set_trace()
   (Pdb) n
   --Call--
    /usr/lib/python3.11/logging/__init__.py(2170)shutdown()
    -> def shutdown(handlerList=_handlerList):
    (Pdb) n
    /usr/lib/python3.11/logging/__init__.py(2177)shutdown()
    -> print(handlerList)
    (Pdb) n
    [<weakref at 0x10ac5d210; to '_StderrHandler' at 0x10abf3690>]
    /usr/lib/python3.11/logging/__init__.py(2178)shutdown()
    -> for wr in reversed(handlerList[:]):
    (Pdb) n
    /usr/lib/python3.11/logging/__init__.py(2181)shutdown()
    -> try:
    (Pdb) n
    /usr/lib/python3.11/logging/__init__.py(2182)shutdown()
    -> h = wr()
    (Pdb) n
    /usr/lib/python3.11/logging/__init__.py(2183)shutdown()
    -> if h:
    (Pdb) n
    /usr/lib/python3.11/logging/__init__.py(2184)shutdown()
    -> try:
    (Pdb) n
    /usr/lib/python3.11/logging/__init__.py(2185)shutdown()
    -> h.acquire()
    (Pdb) n
    /usr/lib/python3.11/logging/__init__.py(2186)shutdown()
    -> h.flush()
    (Pdb) n
    /usr/lib/python3.11/logging/__init__.py(2187)shutdown()
    -> h.close()
    (Pdb) n
    /usr/lib/python3.11/logging/__init__.py(2195)shutdown()
    -> h.release()
    (Pdb) n
    /usr/lib/python3.11/logging/__init__.py(2178)shutdown()
    -> for wr in reversed(handlerList[:]):
   (Pdb) n
   ^C #### if I don't CTRL-C it hangs forever
   ```
   
   ## Environment
   - Python versions tested: 3.9, 3.10, 3.11, 3.12
   - `pyarrow` versions tested: 9.0.0 to 17.0.0
   - OS tested: Mac and Linux, including AWS Fargate
   
   ## Additional Information
   - The issue occurs consistently when a `UnicodeDecodeError` is raised during CSV parsing with `use_threads=True`.
   - The problem may also be related to rows with an inconsistent number of columns.
   - Specifying the delimiter is not required to trigger the hang.
   - Disabling threading (`use_threads=False`) resolves the issue.
   - The problematic file contains non-ASCII data, but attempts to generate a similar "fake" file with random ASCII data and specific byte substitutions have not successfully reproduced the issue.
   
   ## Suggested Priority
   High - The hang is significant because it prevents Python from exiting cleanly, which could impact any application that relies on `pyarrow` for multi-threaded CSV processing.
   
   Please let me know if additional information is required.
   
   
   ### Component(s)
   
   Python

