lasuomela opened a new issue, #44146:
URL: https://github.com/apache/arrow/issues/44146

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Hi,
   
   I'm using Hugging Face Datasets to read data from disk, which uses PyArrow internally. While reading, I intermittently run into the following error:
   
   ```
     pa_table = opened_stream.read_all()
     File "pyarrow/ipc.pxi", line 762, in pyarrow.lib.RecordBatchReader.read_all
     File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
   OSError: Expected to be able to read 1013159856 bytes for message body, got 513144496
   ```
   
   Any ideas about what typically produces this kind of error with PyArrow, or any debugging tips?
   
   Context:
   
   Hugging Face Datasets reads the data with an approach equivalent to:
   
   ```python
   import pyarrow as pa

   filename = 'some_path.arrow'
   memory_mapped_stream = pa.memory_map(filename)
   opened_stream = pa.ipc.open_stream(memory_mapped_stream)
   pa_table = opened_stream.read_all()
   ```
   
   I access the same data concurrently from 4 processes. I assume the problem isn't in the data itself, because at a later time I can read the same piece of data without any problem. Furthermore, the error only occurs on one of the computers I'm using. Could this be caused by something system-related? For example, the number of memory mappings a process can hold is limited by `/proc/sys/vm/max_map_count`.
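   
   Below is a minimal sketch of that setup: four processes reading the same file concurrently, plus a Linux-only check of how many memory mappings the process holds versus the limit. The file path and pool size are placeholders, and the connection to `max_map_count` is just my hypothesis:
   
   ```python
   import multiprocessing as mp
   
   import pyarrow as pa
   
   
   def read_table(filename):
       # Same read path as above: memory-map the file and read the
       # entire IPC stream into a table.
       with pa.memory_map(filename) as source:
           reader = pa.ipc.open_stream(source)
           return reader.read_all().num_rows
   
   
   def map_count():
       # Linux-only: number of memory mappings held by this process,
       # and the per-process system limit.
       with open('/proc/self/maps') as f:
           used = sum(1 for _ in f)
       with open('/proc/sys/vm/max_map_count') as f:
           limit = int(f.read().strip())
       return used, limit
   
   
   if __name__ == '__main__':
       filename = 'some_path.arrow'  # placeholder path
       with mp.Pool(4) as pool:
           # Four concurrent readers of the same file, mirroring the
           # setup in which the error shows up intermittently.
           print(pool.map(read_table, [filename] * 4))
       print('mappings used / limit:', map_count())
   ```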
   
   PyArrow version: tried 15.0 and 17.0
   
   ### Component(s)
   
   Python

