[
https://issues.apache.org/jira/browse/ARROW-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614148#comment-17614148
]
David Li commented on ARROW-17961:
----------------------------------
[~westonpace] probably has better context here, but from what I understand,
s3fs does readahead by default; PyArrow's filesystems do not. And since I don't
think we enable pre-buffering by default, and the Parquet reader issues a
separate I/O call for each column chunk, that's {{O(row groups * columns)}}
read operations, which presumably get absorbed by s3fs's readahead, but which
lead to individual HTTP requests on the PyArrow filesystem. (This is mostly an
educated guess, I haven't actually sat down and profiled.)
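As a back-of-the-envelope illustration of that request count, here is a toy model (not actual Arrow or s3fs code; the chunk sizes, block size, and coalescing behavior are simplifying assumptions) comparing one ranged read per column chunk against fixed-size readahead blocks:

{code:python}
# Toy model: count the read requests a Parquet reader would issue per
# file, with and without readahead. Numbers are illustrative only.

def requests_without_readahead(row_groups, columns):
    # One ranged GET per column chunk: O(row_groups * columns)
    return row_groups * columns

def requests_with_readahead(row_groups, columns, chunk_kb=1, block_kb=64):
    # Sequential readahead fetches fixed-size blocks, so many small
    # column chunks are absorbed into a single request.
    total_kb = row_groups * columns * chunk_kb
    return -(-total_kb // block_kb)  # ceiling division

# ~300 small (~10 kB) files, say 1 row group x 10 columns each
files = 300
naive = files * requests_without_readahead(row_groups=1, columns=10)
coalesced = files * requests_with_readahead(row_groups=1, columns=10)
print("per-chunk HTTP requests:", naive)      # 3000
print("with readahead blocks:", coalesced)    # 300
{code}

Under these assumptions the per-chunk strategy issues 10x the requests; the real gap depends on file layout and on whether pre-buffering coalesces adjacent ranges.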
> Add read/write optimization for pyarrow.fs.S3FileSystem
> -------------------------------------------------------
>
> Key: ARROW-17961
> URL: https://issues.apache.org/jira/browse/ARROW-17961
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Volker Lorrmann
> Priority: Minor
>
> I found large differences in loading time when loading data from AWS S3
> using {{pyarrow.fs.S3FileSystem}} compared to {{s3fs.S3FileSystem}}. See the
> example below.
> The difference comes from {{s3fs}}'s read optimizations (e.g. readahead
> caching), which {{pyarrow.fs}} does not (yet) use.
> {code:python}
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
> import pyarrow.fs as pafs
> import s3fs
> from load_credentials import load_credentials  # local helper, not a library
>
> credentials = load_credentials()
> path = "path/to/data"  # folder with about 300 small (~10kb) files
>
> fs1 = s3fs.S3FileSystem(
>     anon=False,
>     key=credentials["accessKeyId"],
>     secret=credentials["secretAccessKey"],
>     token=credentials["sessionToken"],
> )
> fs2 = pafs.S3FileSystem(
>     access_key=credentials["accessKeyId"],
>     secret_key=credentials["secretAccessKey"],
>     session_token=credentials["sessionToken"],
> )
>
> _ = ds.dataset(path, filesystem=fs1).to_table()  # takes about 5 seconds
> _ = ds.dataset(path, filesystem=fs2).to_table()  # takes about 25 seconds
> _ = pq.read_table(path, filesystem=fs1)  # takes about 5 seconds
> _ = pq.read_table(path, filesystem=fs2)  # takes about 10 seconds
> {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)