[ https://issues.apache.org/jira/browse/ARROW-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614189#comment-17614189 ]

Weston Pace commented on ARROW-17961:
-------------------------------------

I think David's right.  If you know you're going to read the entire Parquet 
file then you can be more efficient about it.  If the file is only 10 KB then 
for peak performance you should issue only one read request.
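
As a rough sketch of what that single-request path could look like with the 
current API (the bucket/key below is hypothetical, and I'm assuming 
{{readall()}} on a sequential input stream pulls the object down in one GET):

{code:python}
import pyarrow as pa
import pyarrow.fs as pafs
import pyarrow.parquet as pq

fs = pafs.S3FileSystem()  # credentials picked up from the environment

# Hypothetical key; for a ~10 KB object one sequential read fetches the
# whole file, and the Parquet reader then parses it from memory with no
# further S3 requests.
with fs.open_input_stream("bucket/path/to/file.parquet") as stream:
    buf = stream.readall()

table = pq.read_table(pa.BufferReader(buf))
{code}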

However, this will use much more RAM if you have large files (e.g. multiple 
GBs) and will have worse performance if you only want to read parts of those 
large files (e.g. column selection).
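
For contrast, here is the kind of partial read where ranged requests win 
(hypothetical file and column names; with {{columns=}} the reader only needs 
to fetch the footer plus the selected column chunks):

{code:python}
import pyarrow.fs as pafs
import pyarrow.parquet as pq

fs = pafs.S3FileSystem()  # credentials picked up from the environment

# Hypothetical multi-GB file: selecting a single column lets the reader
# issue ranged reads for just that column's chunks instead of
# downloading the entire object.
table = pq.read_table("bucket/big_file.parquet", columns=["foo"], filesystem=fs)
{code}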

So I agree there is room for optimization.  It's just not going to be 
clear-cut and simple.

> Add read/write optimization for pyarrow.fs.S3FileSystem
> -------------------------------------------------------
>
>                 Key: ARROW-17961
>                 URL: https://issues.apache.org/jira/browse/ARROW-17961
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Volker Lorrmann
>            Priority: Minor
>
> I found large differences in loading time when loading data from AWS S3 
> using {{pyarrow.fs.S3FileSystem}} compared to {{s3fs.S3FileSystem}}. See the 
> example below.
> The difference comes from optimizations in {{s3fs}} that {{pyarrow.fs}} is 
> not (yet) using.
> {code:python}
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
> import pyarrow.fs as pafs
> import s3fs
> from load_credentials import load_credentials  # local helper returning the AWS credentials as a dict
> credentials = load_credentials()
> path = "path/to/data"  # folder with about 300 small (~10 KB) files
> fs1 = s3fs.S3FileSystem(
>     anon=False,
>     key=credentials["accessKeyId"],
>     secret=credentials["secretAccessKey"],
>     token=credentials["sessionToken"],
> )
> fs2 = pafs.S3FileSystem(
>     access_key=credentials["accessKeyId"],
>     secret_key=credentials["secretAccessKey"],
>     session_token=credentials["sessionToken"],
> )
> _ = ds.dataset(path, filesystem=fs1).to_table()  # takes about 5 seconds
> _ = ds.dataset(path, filesystem=fs2).to_table()  # takes about 25 seconds
> _ = pq.read_table(path, filesystem=fs1)  # takes about 5 seconds
> _ = pq.read_table(path, filesystem=fs2)  # takes about 10 seconds
> {code}


