[I] Cannot load a binary column of many rows via the `to_arrow` method. [iceberg-python]

via GitHub Thu, 01 Feb 2024 04:48:28 -0800


castedice opened a new issue, #344:
URL: https://github.com/apache/iceberg-python/issues/344


   ### Feature Request / Improvement
   
   I keep binary objects ranging from 500 KB to 2 MB each, together with their 
metadata, in a single table.
   The table holds hundreds of thousands to millions of rows.
   Writing the table's data to Parquet files via pyiceberg works without problems.
   
   However, when I fetch the data from the table via the `to_arrow` method, I 
get an OOM error while pyarrow's `combine_chunks` method runs.
   This is because pyarrow's `binary` type uses 32-bit offsets, so a single 
array can't hold more than about 2 GB of data (in my case, the error appears 
after about 4000 rows).
   This error is often reported when pyarrow is used in other libraries besides 
pyiceberg ([arrow](https://github.com/apache/arrow/issues/33049), 
[ray](https://github.com/ray-project/ray/issues/41411), 
[vaex](https://github.com/vaexio/vaex/issues/2335), ...).
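
As a rough check of the row count above, the sketch below works out how many rows of a given size fit under a signed 32-bit offset limit. The helper name `max_rows_per_array` is made up for illustration, and the row sizes are simply the two ends of the range mentioned earlier, taken as 500 KiB and 2 MiB.

```python
# Rough arithmetic for a signed 32-bit offset limit (about 2 GiB per array).
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit offset can hold


def max_rows_per_array(bytes_per_row: int) -> int:
    """Upper bound on rows of `bytes_per_row` bytes that fit in one array."""
    return INT32_MAX // bytes_per_row


small = max_rows_per_array(500 * 1024)       # 500 KiB per row
large = max_rows_per_array(2 * 1024 * 1024)  # 2 MiB per row
print(small, large)  # 4194 1023
```

With rows of about 500 KiB the limit is reached a little past 4000 rows, which lines up with the failure point reported here.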
   
   Storing data such as images, audio, or LLM tokens as binary does not seem 
like an unusual use case to me.
   So why not change pyiceberg to read binary columns as `large_binary` when 
converting data to pyarrow, so that it can handle data this large?
   
   For now, I have solved the problem for my use case with a few modifications.
   While testing those changes with a view to contributing them, I realized 
that there are a lot of places where pyarrow uses binary internally, so I think 
more fixes, and changes to the test code, are needed.
   However, I'm not sure whether this is the direction the pyiceberg 
maintainers want to go.
   If this is not specific to my situation and the maintainers agree with this 
direction, I will submit a PR to address it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
