alippai opened a new issue, #45298: URL: https://github.com/apache/arrow/issues/45298
### Describe the enhancement requested

I'm using Parquet via PyArrow (versions 17.0 and 18.0). I have tried many options (setting `pre_buffer=True`, `use_threads=False`, increasing `buffer_size`) with `pq.read_table()`, but I can't find a way to reduce seeking and increase the size of `read()` calls.

As I understand it, the Parquet format can be read by fetching the footer once or twice, then streaming the row groups in order and handing the data off to decompress and decode. If I have a reasonable number of files, having a few sequential readers (one reader per file) should provide the most throughput when the storage is inefficient with small random reads.

Currently, I use `pq.read_table(BytesIO(Path().read_bytes()))` (sketched at the end of this issue). This is wasteful and clunky: it allocates too much contiguous memory, and the read doesn't free the processed row groups.

I understand this can get complex when filters and column projection are involved, but developing an I/O plan where I can specify in-order reads and a minimum read size could work (e.g., reading the small gaps and discarding unused data afterward).

### Component(s)

Parquet, Python
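
For context, here is a minimal sketch of what I've tried and of the workaround. The file path, buffer sizes, and batch size are placeholders, and this is not the API I'm proposing; it just shows the current options I'm aware of.

```python
import io
from pathlib import Path

import pyarrow.parquet as pq

# Placeholder path; substitute a real Parquet file.
path = Path("data.parquet")

# Options I have tried with read_table(); none of them let me force large,
# in-order read() calls against the underlying file.
table = pq.read_table(
    path,
    pre_buffer=True,
    use_threads=False,
    buffer_size=8 * 1024 * 1024,
)

# Current workaround: pull the whole file into one contiguous buffer so the
# storage layer sees a single large sequential read, then parse from memory.
# Wasteful: peak memory covers the raw bytes plus the decoded table, and
# already-processed row groups are never freed.
table = pq.read_table(io.BytesIO(path.read_bytes()))

# Closest bounded-memory approximation I know of: iterate batches in row-group
# order so decoded data can be dropped as it is consumed. The column-chunk
# reads underneath are still relatively small and non-contiguous.
pf = pq.ParquetFile(path, pre_buffer=True, buffer_size=8 * 1024 * 1024)
for batch in pf.iter_batches(batch_size=1 << 20):
    ...  # process and discard each batch
```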