[ 
https://issues.apache.org/jira/browse/ARROW-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614148#comment-17614148
 ] 

David Li commented on ARROW-17961:
----------------------------------

[~westonpace] probably has better context here, but from what I understand, 
s3fs does readahead by default; PyArrow's filesystems do not. And since I don't 
think we enable pre-buffering by default, and the Parquet reader issues a 
separate I/O call for each column chunk, that's {{O(row groups * columns)}} 
read operations, which presumably get absorbed by s3fs's readahead, but which 
lead to individual HTTP requests on the PyArrow filesystem. (This is mostly an 
educated guess, I haven't actually sat down and profiled.)
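If pre-buffering is indeed the difference, it can be turned on explicitly. This is a sketch of the idea (using the {{pre_buffer}} flag on {{pyarrow.parquet.read_table}}, demonstrated against a local temp file since the S3 setup from the issue isn't reproducible here); the same flag also exists on the dataset path via {{ParquetFragmentScanOptions}}:

{code:python}
import os
import tempfile

import pyarrow as pa
import pyarrow.parquet as pq

# Create a small Parquet file to read back.
table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})
path = os.path.join(tempfile.mkdtemp(), "demo.parquet")
pq.write_table(table, path)

# pre_buffer=True asks the reader to coalesce the per-column-chunk
# reads into fewer, larger I/O calls -- the behavior that matters most
# on high-latency filesystems like S3.
result = pq.read_table(path, pre_buffer=True)
assert result.equals(table)
{code}

With pre-buffering off, each column chunk in each row group is a separate read, which is where the {{O(row groups * columns)}} request count above comes from.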

> Add read/write optimization for pyarrow.fs.S3FileSystem
> -------------------------------------------------------
>
>                 Key: ARROW-17961
>                 URL: https://issues.apache.org/jira/browse/ARROW-17961
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Volker Lorrmann
>            Priority: Minor
>
> I found large differences in loading time when loading data from AWS S3 
> using {{pyarrow.fs.S3FileSystem}} compared to {{s3fs.S3FileSystem}}. See 
> the example below.
> The difference comes from optimizations in {{s3fs}} that {{pyarrow.fs}} 
> does not (yet) use.
> {code:python}
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
> import pyarrow.fs as pafs
> import s3fs
> import load_credentials
> credentials = load_credentials()
> path = "path/to/data" # folder with about 300 small (~10kb) files
> fs1 = s3fs.S3FileSystem(
>     anon=False,
>     key=credentials["accessKeyId"],
>     secret=credentials["secretAccessKey"],
>     token=credentials["sessionToken"],
> )
> fs2 = pafs.S3FileSystem(
>     access_key=credentials["accessKeyId"],
>     secret_key=credentials["secretAccessKey"],
>     session_token=credentials["sessionToken"],
> )
> _ = ds.dataset(path, filesystem=fs1).to_table() # takes about 5 seconds
> _ = ds.dataset(path, filesystem=fs2).to_table() # takes about 25 seconds
> _ = pq.read_table(path, filesystem=fs1) # takes about 5 seconds
> _ = pq.read_table(path, filesystem=fs2) # takes about 10 seconds
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
