[GitHub] [arrow] eeroel opened a new issue, #37857: Allow passing file sizes to FileSystemDataset from Python

via GitHub Mon, 25 Sep 2023 06:53:32 -0700


eeroel opened a new issue, #37857:
URL: https://github.com/apache/arrow/issues/37857


   ### Describe the enhancement requested
   
   When reading Parquet files from table formats such as Delta Lake, file sizes 
are already known from the table format metadata. However, when building a 
dataset from fragments using 
`https://arrow.apache.org/docs/python/generated/pyarrow.dataset.FileFormat.html#pyarrow.dataset.FileFormat.make_fragment`,
 there is no way to pass those file sizes to PyArrow, so each file triggers an 
unnecessary HEAD request when reading from S3. Arrow already supports 
specifying the file size to avoid these requests to S3, but as far as I can 
see this is not exposed in PyArrow: https://github.com/apache/arrow/pull/7547
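
To make the cost concrete: a Parquet reader needs the file size because the footer metadata sits at the end of the file, so an unknown size means one extra lookup per file before the first read. The sketch below is a toy model of that behaviour, not PyArrow or Arrow code. `CountingStore`, `head`, `read_range`, and `read_tail` are hypothetical names invented for illustration, and the in-memory byte strings stand in for objects on S3.

```python
class CountingStore:
    """Hypothetical in-memory object store that counts size lookups."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.head_calls = 0

    def head(self, path):
        # Stands in for an S3 HEAD request: one round trip per call.
        self.head_calls += 1
        return len(self._objects[path])

    def read_range(self, path, start, length):
        # Stands in for a ranged GET.
        return self._objects[path][start:start + length]


def read_tail(store, path, tail_len, size=None):
    """Read the last `tail_len` bytes, looking up the size only if unknown."""
    if size is None:
        size = store.head(path)
    return store.read_range(path, size - tail_len, tail_len)


store = CountingStore({
    "a.parquet": b"data-a|FOOT",
    "b.parquet": b"data-bb|FOOT",
})

# Sizes unknown: every file costs one size lookup before the read.
for path in ("a.parquet", "b.parquet"):
    read_tail(store, path, tail_len=4)
lookups_without_sizes = store.head_calls

# Sizes supplied by the caller (e.g. from table format metadata): no lookups.
known_sizes = {"a.parquet": 11, "b.parquet": 12}
store.head_calls = 0
tails = [read_tail(store, path, 4, size=size) for path, size in known_sizes.items()]
lookups_with_sizes = store.head_calls
```

In this model the lookup count drops from one per file to zero once the caller supplies the sizes, which is the saving the requested option would allow for fragments.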
   
   (As a side note, it seems that those HEAD requests in S3FileSystem are 
always executed on the same thread, which leads to poor concurrency when 
reading multiple files. Is this a known issue?)
   
   I can try to put together a PR with an implementation.
   
   
   ### Component(s)
   
   Parquet, Python


