lasuomela opened a new issue, #44146:
URL: https://github.com/apache/arrow/issues/44146
### Describe the bug, including details regarding any error messages, version, and platform.
Hi,
I'm using Hugging Face Datasets to read data from disk, which internally
utilizes PyArrow. While reading, I intermittently run into the following
error:
```
pa_table = opened_stream.read_all()
File "pyarrow/ipc.pxi", line 762, in pyarrow.lib.RecordBatchReader.read_all
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Expected to be able to read 1013159856 bytes for message body, got 513144496
```
Any ideas about what typically produces this kind of error with PyArrow, or
any debugging tips?
Context:
Hugging Face Datasets reads the data with an approach equivalent to:
```
import pyarrow as pa

filename = 'some_path.arrow'
# Memory-map the file and read the whole IPC stream into a table
memory_mapped_stream = pa.memory_map(filename)
opened_stream = pa.ipc.open_stream(memory_mapped_stream)
pa_table = opened_stream.read_all()
```
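For debugging, reading the stream batch by batch instead of with `read_all()`
should surface the error on the specific failing message (a minimal sketch,
reusing the placeholder path from above):
```
import pyarrow as pa

filename = 'some_path.arrow'  # same placeholder path as above
with pa.memory_map(filename) as source:
    reader = pa.ipc.open_stream(source)
    n_batches = 0
    for batch in reader:
        # The OSError should be raised here on the failing message,
        # which shows how far into the stream the read gets.
        n_batches += 1
    print(f'read {n_batches} batches successfully')
```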
I access the same data concurrently from 4 processes (roughly the pattern in
the sketch below). I assume the problem isn't in the data itself, because at a
later time I can read the piece of data where the error occurs without any
problem. Furthermore, this error only occurs on one of the computers I'm
using. Could this be caused by some system-level limit? For example, the
maximum number of memory-mapped regions per process is capped by
`/proc/sys/vm/max_map_count`.
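A minimal sketch of that concurrent access pattern (the path and worker count
are placeholders matching the setup described above):
```
import multiprocessing as mp

import pyarrow as pa

FILENAME = 'some_path.arrow'  # placeholder path

def read_table(_):
    # Each worker memory-maps the same file and reads the full stream,
    # mirroring the 4-process access described above.
    with pa.memory_map(FILENAME) as source:
        return pa.ipc.open_stream(source).read_all().num_rows

if __name__ == '__main__':
    with mp.Pool(processes=4) as pool:
        print(pool.map(read_table, range(4)))
```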
PyArrow version: tried 15.0 and 17.0.
### Component(s)
Python