andrewthad opened a new issue, #47416:
URL: https://github.com/apache/arrow/issues/47416
### Describe the bug, including details regarding any error messages,
version, and platform.
I believe that projection pushdown does not work when using an S3 dataset
consisting of arrow files. Here's what I am doing:
```python
>>> import pyarrow.fs
>>> import pyarrow.dataset as ds
>>> fs = pyarrow.fs.S3FileSystem(access_key='myaccesskey', secret_key='mysecretkey', region='us-east-2')
>>> mydataset = ds.dataset('mybucket/fortigate_utm_20250821T172409.arrow', format="arrow", filesystem=fs)
```
This arrow file is about 2.4 GB. There are 512K records per batch and a total of 26 million records, which works out to roughly 50 record batches. There are 10 columns. One of the columns is an "original log" text value, and the other columns hold smaller values, mostly integers parsed from that original log. All buffers are LZ4 compressed.
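For context, a file shaped like this can be written with pyarrow's IPC file writer and LZ4 buffer compression. The sketch below is only illustrative (a tiny two-column stand-in for my real ten-column schema), not the code that produced the actual file:

```python
from datetime import datetime

import pyarrow as pa
import pyarrow.ipc as ipc

# Illustrative stand-in schema: one timestamp column plus the raw log text.
schema = pa.schema([
    ("timestamp", pa.timestamp("us")),
    ("original_log", pa.string()),
])

batch = pa.record_batch(
    [
        pa.array([datetime(2025, 8, 21)] * 512, type=pa.timestamp("us")),
        pa.array(["<raw log line>"] * 512),
    ],
    schema=schema,
)

# LZ4-compress every buffer in the IPC file, as in the real dataset.
options = ipc.IpcWriteOptions(compression="lz4")
with ipc.new_file("example.arrow", schema, options=options) as writer:
    writer.write_batch(batch)
```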
Depending on how this dataset is consumed, the amount of data loaded from S3
does not match what a user would expect. Here are a few behaviors that I have
observed:
* `mydataset.to_table(columns=['timestamp']).shape`: loads ~115 MB, which seems right
* `mydataset.to_table(columns=[]).shape`: the Python REPL exits with `malloc(): unsorted double linked list corrupted, Aborted (core dumped)`
* `mydataset.head(1, columns=['timestamp']).shape`: loads ~44 MB, not great but not terrible
* `mydataset.scanner(columns=['timestamp']).count_rows()`: loads ~2.3 GB, which is surprising
* `mydataset.count_rows()`: also loads ~2.3 GB
Here's my interpretation of what's happening. To be clear, this is only a problem when loading arrow files; I don't know whether pyarrow has any of these surprising behaviors when loading other formats.
For the most part, `to_table` is the only function that loads the amount of data one would expect. The 26M microsecond-precision timestamps (8 bytes each) would be about 208 MB uncompressed, and LZ4 is mildly effective on this kind of data, so 115 MB is very believable. If we only need a row count, it should be possible to project an empty list of columns, but a bug in pyarrow (or, more likely, in the underlying C++ library) prevents this from working.
The `head` function is halfway right. It only loads the first record batch, but I think it loads the buffers for all of the columns instead of just the buffers for the timestamp column. The first batch holds 512K timestamps at 8 bytes each, roughly 4 MB uncompressed, so if `head` loaded only the buffers it needed, I would expect to see 2 MB or 3 MB of inbound network traffic, not 44 MB. This probably doesn't matter much, because `head` is unlikely to see heavy use in practice; it's uncommon to be interested in only an arbitrary subset of a dataframe. Still, it would be nice if it didn't load a bunch of buffers it doesn't need.
The behavior of the scanner (and of `count_rows()`, which I think just uses a scanner under the hood) is very surprising. It unconditionally loads all of the data from S3. (To pyarrow's credit, it at least streams this data rather than blowing up with an out-of-memory error.) This makes scanners unnecessarily slow, and it also makes them financially expensive, since S3 bills for the bytes transferred.
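As a concrete illustration, here is a minimal sketch of streaming a row count through the scanner; the explicit loop is mine and purely illustrative, and `count_rows()` shows the same behavior:

```python
# Minimal sketch: count rows by streaming projected batches from the scanner.
# Even though only 'timestamp' is projected, roughly the whole ~2.3 GB of data
# appears to be transferred from S3 when the format is "arrow".
total = 0
for batch in mydataset.scanner(columns=["timestamp"]).to_batches():
    total += batch.num_rows
print(total)
```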
For what I'm working on, this doesn't matter a whole lot. My organization is experimenting with an arrow data lake on S3 for ad hoc queries on archived security-appliance logs, and we have enough expertise internally to work around the scanner problems by calling `to_table` with a column projection and gluing the results together.
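Concretely, the workaround looks something like the sketch below; the choice of `timestamp` as the projected column is arbitrary, and any narrow column would do:

```python
# Workaround sketch: avoid Scanner.count_rows() and instead materialize a
# single narrow column with to_table(), which does honor the projection.
table = mydataset.to_table(columns=["timestamp"])
n_rows = table.num_rows  # 26 million rows, ~115 MB transferred in my tests
```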
Please let me know if there are any other workarounds (perhaps an argument to pass to `scanner()` that makes it do the right thing), or if there is a better way to get the row count without having to load an arbitrary column of data. Thanks.
### Component(s)
Python