andrewthad opened a new issue, #47416:
URL: https://github.com/apache/arrow/issues/47416

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   I believe that projection pushdown does not work when using an S3 dataset 
consisting of arrow files. Here's what I am doing:
   
       >>> import pyarrow.fs
       >>> import pyarrow.dataset as ds
       >>> fs = pyarrow.fs.S3FileSystem(access_key='myaccesskey', secret_key='mysecretkey', region='us-east-2')
       >>> mydataset = ds.dataset('mybucket/fortigate_utm_20250821T172409.arrow', format="arrow", filesystem=fs)
   
   This arrow file is about 2.4GB. There are 512K records per batch and a total
   of 26 million records, so there are about 50 record batches. There are 10
   columns. One of the columns is an "original log" text value, and the others
   are smaller fields, mostly integers, parsed from that original log. All
   buffers are LZ4 compressed.
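
   For concreteness, here is a minimal sketch of how a file with this shape
   might be written. The schema and column names below are placeholders for
   illustration, not the real ones from our pipeline; the only detail that
   matters here is the LZ4 setting on the IPC writer.

       import pyarrow as pa
       import pyarrow.ipc

       # Placeholder schema: one timestamp, the raw log text, and a stand-in
       # for the other small parsed columns.
       schema = pa.schema([
           ('timestamp', pa.timestamp('us')),
           ('original_log', pa.string()),
           ('severity', pa.int32()),
       ])

       batch = pa.record_batch(
           [
               pa.array([0, 1], type=pa.timestamp('us')),
               pa.array(['raw log line 1', 'raw log line 2']),
               pa.array([3, 4], type=pa.int32()),
           ],
           schema=schema,
       )

       # LZ4-compress every buffer, matching the file described above.
       options = pa.ipc.IpcWriteOptions(compression='lz4')
       with pa.OSFile('example.arrow', 'wb') as sink:
           with pa.ipc.new_file(sink, schema, options=options) as writer:
               writer.write_batch(batch)  # the real file has ~50 batches of ~512K rows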
   
   Depending on how this dataset is consumed, the amount of data loaded from S3 
does not match what a user would expect. Here are a few behaviors that I have 
observed:
   
   * `mydataset.to_table(columns=['timestamp']).shape`: loads 115MB, this seems 
right
   * `mydataset.to_table(columns=[]).shape`: Python REPL exits with "malloc(): 
unsorted double linked list corrupted, Aborted (core dumped)"
   * `mydataset.head(1, columns=['timestamp']).shape`: loads 44MB, not great 
but not terrible
   * `mydataset.scanner(columns=['timestamp']).count_rows()`: loads 2.3GB, 
surprising
   * `mydataset.count_rows()`: also loads 2.3GB
   
   Here's my interpretation of what's happening. To be clear, this is only a
   problem when loading arrow files; I don't know whether pyarrow has any of
   these surprising behaviors when loading other formats.
   
   For the most part, `to_table` is the only function that loads the amount of
   data someone would expect it to. The 26M microsecond-precision timestamps
   (8 bytes each) would be 208MB uncompressed, and LZ4 is mildly effective on
   this kind of data, so 115MB is very believable. If we are only taking the
   count of the rows, it should be possible to project the empty list of
   columns, but a mistake in pyarrow (or, more likely, in the underlying C++
   library) prevents this from working.
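
   As a rough sanity check on that estimate, the in-memory size of the
   projected column can be inspected directly. This is just a sketch reusing
   the `mydataset` handle from the snippet above:

       tbl = mydataset.to_table(columns=['timestamp'])
       print(tbl.num_rows)   # expect about 26 million
       print(tbl.nbytes)     # expect on the order of 208MB for 8-byte timestamps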
   
   The `head` function is halfway right. It only loads the first record batch,
   but I think it's loading the buffers for all of the columns instead of just
   the buffers for the timestamp column. If it only loaded the buffers it was
   supposed to, I would expect to see 2MB or 3MB of inbound network traffic,
   not 44MB. This probably doesn't matter a whole lot because `head` is not
   likely to be used much in practice; it's uncommon to only be interested in
   some arbitrary subset of a dataframe. But still, it would be nice if it
   didn't load a bunch of buffers that it doesn't need.
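
   For reference, here is the comparison I am making, again as a sketch
   reusing `mydataset` from above. The materialized result is tiny, so nearly
   all of the ~44MB of observed transfer must be buffers that were fetched but
   never needed:

       t = mydataset.head(1, columns=['timestamp'])
       print(t.num_rows)  # 1
       print(t.nbytes)    # a handful of bytes, nowhere near the ~44MB pulled from S3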
   
   The behavior of the scanner (and of `count_rows()`, which I think just uses
   a scanner under the hood) is very surprising: it unconditionally loads all
   of the data from S3. (To pyarrow's credit, it does at least stream this
   data rather than blowing up with an out-of-memory error.) This makes
   scanners unnecessarily slow, and it also makes them financially expensive,
   since S3 bills per byte accessed.
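
   For context, this is roughly how I am consuming the scanner (a sketch); it
   streams batch by batch, but the whole 2.3GB object still comes over the
   wire even though only one column is projected:

       total = 0
       for batch in mydataset.scanner(columns=['timestamp']).to_batches():
           total += batch.num_rows  # streaming works, but S3 traffic is still ~2.3GB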
   
   For what I'm working on, this doesn't matter a whole lot. My organization
   is experimenting with an arrow data lake on S3 for ad hoc queries on
   archived security-appliance logs, and we have enough expertise internally
   to work around the scanner problems by using `to_table` and gluing the
   results together (sketched below).
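
   Concretely, the row-count workaround looks roughly like this, projecting
   one narrow column and counting the materialized table instead of calling
   `count_rows()`:

       # Workaround sketch: avoids mydataset.count_rows(), which pulls the
       # whole file from S3, at the cost of materializing one column in memory.
       row_count = mydataset.to_table(columns=['timestamp']).num_rows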
   
   Please let me know if there are any other workarounds (some kind of
   argument to pass to `scanner()` to make it do the right thing), or whether
   there is a better way to get the row count without having to load an
   arbitrary column of data. Thanks.
   
   ### Component(s)
   
   Python

