bwi-earth opened a new issue, #45855:
URL: https://github.com/apache/arrow/issues/45855

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   I'm reading about the various methods to consume a `pyarrow.dataset.Dataset` in the case of a large dataset (`.to_table` is excluded).
   
   It seems that it is impossible to read a dataset chunk by chunk in an ordered manner: `to_batches` doesn't offer any guarantees about the order of the retrieved batches.
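   For context, this is roughly the streaming read I have in mind (the S3 path is just a placeholder):
   
   ```python
   import pyarrow.dataset as ds
   
   # Placeholder path; any large partitioned Parquet dataset on S3.
   dataset = ds.dataset("s3://my-bucket/my-table/", format="parquet")
   
   # Streams record batches without materializing the whole table,
   # but the batches may arrive in any order.
   for batch in dataset.to_batches():
       ...  # consume each batch as it arrives
   ```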
   
   The best I've come up with is to list the fragments of the dataset and read each one individually, then sort the partial outputs.
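   A minimal sketch of that approach (the dataset path and the sort column `timestamp` are placeholders, and each fragment is assumed small enough to materialize on its own):
   
   ```python
   import pyarrow.dataset as ds
   
   # Placeholder path and sort column, just to illustrate the idea.
   dataset = ds.dataset("s3://my-bucket/my-table/", format="parquet")
   
   for fragment in dataset.get_fragments():
       # A single fragment (one Parquet file) is small enough to materialize,
       # even though .to_table() on the whole dataset isn't an option.
       table = fragment.to_table()
       # Sort the partial output so rows come back in the desired order.
       table = table.sort_by("timestamp")
       ...  # consume `table` here
   ```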
   
   However, with that approach I'm losing the benefit of pyarrow loading data in the background.
   
   (I'm using Parquet stored in S3 as the backend; that doesn't seem to be relevant, though.)
   
   
   ### Component(s)
   
   Python

