[
https://issues.apache.org/jira/browse/ARROW-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614189#comment-17614189
]
Weston Pace commented on ARROW-17961:
-------------------------------------
I think David's right. If you know you're going to read the entire Parquet
file, then you can be more efficient about it. If the file is only 10 KB, then
for peak performance you should issue only one read request.

However, reading whole files this way will use much more RAM if you have large
files (e.g. multiple GBs) and will perform worse if you only want to read parts
of those large files (e.g. column selection).

So I agree there is room for optimization. It's just not going to be clear-cut
and simple.
> Add read/write optimization for pyarrow.fs.S3FileSystem
> -------------------------------------------------------
>
> Key: ARROW-17961
> URL: https://issues.apache.org/jira/browse/ARROW-17961
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Volker Lorrmann
> Priority: Minor
>
> I found large differences in loading time when loading data from AWS S3
> using {{pyarrow.fs.S3FileSystem}} compared to {{s3fs.S3FileSystem}}. See the
> example below.
> The difference comes from an {{s3fs}} optimization that {{pyarrow.fs}} does
> not (yet) use.
> {code:python}
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
> import pyarrow.fs as pafs
> import s3fs
> # Reporter's own helper (not shown); assumed to expose a callable of the same name.
> from load_credentials import load_credentials
> credentials = load_credentials()
> path = "path/to/data" # folder with about 300 small (~10 KB) files
> # fsspec-based filesystem
> fs1 = s3fs.S3FileSystem(
>     anon=False,
>     key=credentials["accessKeyId"],
>     secret=credentials["secretAccessKey"],
>     token=credentials["sessionToken"],
> )
> # Arrow's native filesystem
> fs2 = pafs.S3FileSystem(
>     access_key=credentials["accessKeyId"],
>     secret_key=credentials["secretAccessKey"],
>     session_token=credentials["sessionToken"],
> )
> _ = ds.dataset(path, filesystem=fs1).to_table() # takes about 5 seconds
> _ = ds.dataset(path, filesystem=fs2).to_table() # takes about 25 seconds
> _ = pq.read_table(path, filesystem=fs1) # takes about 5 seconds
> _ = pq.read_table(path, filesystem=fs2) # takes about 10 seconds
> {code}
>