[I] Sorted streaming of partitioned dataset? [arrow]

via GitHub Mon, 17 Feb 2025 05:25:27 -0800


cnoelle opened a new issue, #45553:
URL: https://github.com/apache/arrow/issues/45553


   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   I would like to stream data from a dataset that consists of multiple files 
and is partitioned by one column (time). The data should come out sorted by 
this time column, in either ascending or descending order. Is this possible 
with the Dataset API?
   
   The documentation of the `Dataset.sort_by()` method states that it returns an 
`InMemoryDataset` 
(https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.sort_by); 
in other words, it immediately reads all files into memory. For a partitioned 
dataset sorted on the partitioning column, I would expect `sort_by()` to only 
determine the order of the required input files and then read them lazily when 
I call `to_batches()` on the resulting dataset.
   
   ### Component(s)
   
   Python


