Fokko commented on issue #791:
URL: https://github.com/apache/iceberg-python/issues/791#issuecomment-2159294140

I agree that you cannot write a single field of 2GB+ to a Parquet file; in that case, Parquet is probably not the best way of storing such a big blob. The difference is in how the offsets are stored: with `large_binary` the offsets are 64-bit integers, and with `binary` they are 32-bit. When we create an array in Arrow, `[foo, bar, arrow]`, it is stored as:

```python
data = 'foobararrow'
offsets = [0, 3, 6, 11]
```

If the offsets are 32 bits, then you need to chunk the data into smaller buffers, which negatively impacts performance.
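As a minimal sketch of that offset-width difference (assuming pyarrow is installed; the values and variable names here are only illustrative, not from the issue), you can compare the offsets buffer of a `binary` array with that of a `large_binary` array:

```python
import pyarrow as pa

values = [b"foo", b"bar", b"arrow"]

small = pa.array(values, type=pa.binary())        # offsets stored as 32-bit integers
large = pa.array(values, type=pa.large_binary())  # offsets stored as 64-bit integers

# For variable-width types, buffers() returns [validity bitmap, offsets, data].
# The offsets here are [0, 3, 6, 11]: four 4-byte values vs. four 8-byte values,
# so the large_binary offsets buffer is twice as big.
print(small.buffers()[1].size, large.buffers()[1].size)
```

With 32-bit offsets the last offset cannot exceed 2^31 - 1, which is why a single chunk of `binary` data is capped at roughly 2GB and larger payloads must be split across chunks.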
I agree that you cannot write a single field of 2GB+ to a parquet file. In that case, Parquet is probably not the best way of storing such a big blob. The difference between how the offsets are stored. With the large binary, the offsets are 64 longs, and with the binary, they are 32 bits. When we create an array in Arrow: `[foo, bar, arrow]`, then this is stored as: ```python data = 'foobararrow' offsets = [0, 3, 6, 11] ``` If the offsets are 32 bits, then you need to chunk them into smaller buffers, which negatively impacts performance. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org