alippai opened a new issue, #45298:
URL: https://github.com/apache/arrow/issues/45298

   ### Describe the enhancement requested
   
   I'm using Parquet via PyArrow (versions 17.0 and 18.0).
   
   I have tried many options (setting `pre_buffer=True`, `use_threads=False`, 
increasing `buffer_size`) with `pq.read_table()`, but I can't find a way to 
reduce seeking and increase the size of `read()` calls.
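   For reference, the calls I tried look roughly like this (the path is a placeholder):

   ```python
   import pyarrow.parquet as pq

   # None of these variations changed the I/O pattern much for me: the reader
   # still issues many small, scattered read() calls against the file.
   table = pq.read_table(
       "data.parquet",               # placeholder path
       pre_buffer=True,              # coalesce column-chunk reads
       use_threads=False,            # avoid parallel, out-of-order reads
       buffer_size=8 * 1024 * 1024,  # buffered input stream
   )
   ```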
   
   As I understand it, the Parquet format can be read by fetching the footer once or twice and then streaming the row groups in order, handing the data off for decompression and decoding.
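   In other words, the access pattern I'm hoping for is roughly the following (`process()` is a hypothetical consumer, the path is a placeholder):

   ```python
   import pyarrow.parquet as pq

   pf = pq.ParquetFile("data.parquet")        # one read (or two) for the footer/metadata
   for i in range(pf.metadata.num_row_groups):
       rg = pf.read_row_group(i)              # ideally one large sequential read per row group
       process(rg)                            # hypothetical downstream processing
       del rg                                 # processed row groups can be released
   ```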
   
   With a reasonable number of files, a few sequential readers (one reader per file) should give the best throughput when the storage handles small random reads poorly.
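   Something along these lines, where each worker streams one file front to back (file names are illustrative):

   ```python
   from concurrent.futures import ThreadPoolExecutor
   import pyarrow.parquet as pq

   def read_sequentially(path):
       # One reader per file, consuming row groups in order.
       pf = pq.ParquetFile(path)
       return [pf.read_row_group(i) for i in range(pf.metadata.num_row_groups)]

   paths = ["part-0.parquet", "part-1.parquet"]   # illustrative file list
   with ThreadPoolExecutor(max_workers=len(paths)) as pool:
       tables = list(pool.map(read_sequentially, paths))
   ```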
   
   Currently I use `pq.read_table(BytesIO(Path().read_bytes()))`. This is wasteful and clunky: it allocates the whole file as one contiguous buffer, and nothing is released as the row groups are processed.
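   Spelled out (the path is a placeholder), the workaround is:

   ```python
   from io import BytesIO
   from pathlib import Path
   import pyarrow.parquet as pq

   # The whole file is materialized as one contiguous bytes object before parsing,
   # and none of it can be released while the row groups are being decoded.
   buf = BytesIO(Path("data.parquet").read_bytes())
   table = pq.read_table(buf)
   ```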
   
   I understand this can get complex once filters and column projection are involved, but an I/O plan that lets me request in-order reads and a minimum read size could work (e.g., reading through the small gaps and discarding the unused bytes afterward).
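   If I read the dataset API correctly, `pyarrow.CacheOptions` passed through `ParquetFragmentScanOptions` is close to what I mean, but I couldn't confirm it yields in-order, coalesced reads; a sketch of my assumption (path is a placeholder):

   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds

   # Assumption: hole_size_limit controls how small a gap is read through rather
   # than seeked over, and range_size_limit caps the size of a coalesced read.
   fmt = ds.ParquetFileFormat(
       default_fragment_scan_options=ds.ParquetFragmentScanOptions(
           pre_buffer=True,
           cache_options=pa.CacheOptions(
               hole_size_limit=8 * 1024 * 1024,
               range_size_limit=64 * 1024 * 1024,
           ),
       )
   )
   table = ds.dataset("data.parquet", format=fmt).to_table()
   ```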
   
   ### Component(s)
   
   Parquet, Python

