castedice opened a new issue, #344: URL: https://github.com/apache/iceberg-python/issues/344
### Feature Request / Improvement

I manage binary data ranging in size from 500 KB to 2 MB, together with meta information about this data, in one table. The number of rows ranges from hundreds of thousands to millions.

Saving the data from the table as a Parquet file via pyiceberg is no problem. However, when I fetch the data from the table via the `to_arrow` method, I get an OOM error while pyarrow's `combine_chunks` method runs. This is because pyarrow's `binary` type uses 32-bit offsets, so a single array cannot hold more than about 2 GB of data (in my case, I get the error after about 4000 rows).

This error is often reported when pyarrow is used in libraries other than pyiceberg ([arrow](https://github.com/apache/arrow/issues/33049), [ray](https://github.com/ray-project/ray/issues/41411), [vaex](https://github.com/vaexio/vaex/issues/2335), ...).

I don't think it is unusual to store and manage data such as images, sounds, or LLM tokens as binary. So why don't we change pyiceberg to read binary columns as `large_binary` when converting to pyarrow, so that it can handle data this large?

For now, I have solved the problem for my use case with a few modifications. When I tested them with a view to contributing, I realized that pyarrow's `binary` type is used internally in many places, so I think more fixes, and changes to the test code, are needed. However, I'm not sure this is the direction the pyiceberg maintainers want to go.

If this is not specific to my situation and the maintainers agree with this direction, I will submit a PR to address it.
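To make the limit concrete, here is a minimal sketch, not taken from pyiceberg. It assumes the only constraint is the signed 32-bit offset used by pyarrow's `binary` type, and that the caller still holds the table in chunked form. The helper names `rows_until_overflow` and `widen_binary_columns` are hypothetical and exist only for this illustration.

```python
# Arrow's `binary` type stores value offsets as signed 32-bit integers, so one
# contiguous array can address at most 2**31 - 1 bytes of binary data.
MAX_BINARY_BYTES = 2**31 - 1


def rows_until_overflow(avg_row_bytes):
    """Rows of the given average size that fit in a single `binary` array."""
    return MAX_BINARY_BYTES // avg_row_bytes


def widen_binary_columns(table):
    """Return a copy of a pyarrow.Table with top-level `binary` columns cast
    to `large_binary`, which uses 64-bit offsets.

    Hypothetical helper for illustration; nested binary fields are not handled.
    """
    # Imported here so the arithmetic above runs without pyarrow installed.
    import pyarrow as pa

    fields = [
        pa.field(f.name, pa.large_binary(), f.nullable, f.metadata)
        if pa.types.is_binary(f.type)
        else f
        for f in table.schema
    ]
    return table.cast(pa.schema(fields, metadata=table.schema.metadata))


if __name__ == "__main__":
    for size in (500_000, 2_000_000):
        print(size, "bytes per row ->", rows_until_overflow(size), "rows")
```

For 500 KB rows the arithmetic gives 4,294 rows, the same order as the roughly 4,000 rows reported above. Casting happens chunk by chunk, so calling `combine_chunks()` on the widened table no longer has to fit all values behind 32-bit offsets. This only helps where the chunked table is available to the caller; it does not change what `to_arrow` does internally.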