Fokko commented on issue #791:
URL: https://github.com/apache/iceberg-python/issues/791#issuecomment-2159294140

I agree that you cannot write a single field of 2GB+ to a Parquet file; in that case, Parquet is probably not the best way of storing such a big blob. The difference is in how the offsets are stored: with `large_binary` the offsets are 64-bit integers, and with `binary` they are 32-bit. When we create an array in Arrow, `[foo, bar, arrow]`, it is stored as:

```python
data = 'foobararrow'
offsets = [0, 3, 6, 11]
```

If the offsets are 32 bits, then you need to chunk the data into smaller buffers, which negatively impacts performance.
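As a minimal sketch of that offset-width difference (assuming pyarrow is installed; the values and variable names here are only illustrative, not from the issue), you can compare the offsets buffer of a `binary` array with that of a `large_binary` array:

```python
import pyarrow as pa

values = [b"foo", b"bar", b"arrow"]

small = pa.array(values, type=pa.binary())        # offsets stored as 32-bit integers
large = pa.array(values, type=pa.large_binary())  # offsets stored as 64-bit integers

# For variable-width types, buffers() returns [validity bitmap, offsets, data].
# The offsets here are [0, 3, 6, 11]: four 4-byte values vs. four 8-byte values,
# so the large_binary offsets buffer is twice as big.
print(small.buffers()[1].size, large.buffers()[1].size)
```

With 32-bit offsets the last offset cannot exceed 2^31 - 1, which is why a single chunk of `binary` data is capped at roughly 2GB and larger payloads must be split across chunks.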
I agree that you cannot write a single field of 2GB+ to a parquet file. In that case, Parquet is probably not the best way of storing such a big blob. The difference between how the offsets are stored. With the large binary, the offsets are 64 longs, and with the binary, they are 32 bits. When we create an array in Arrow: `[foo, bar, arrow]`, then this is stored as: ```python data = 'foobararrow' offsets = [0, 3, 6, 11] ``` If the offsets are 32 bits, then you need to chunk them into smaller buffers, which negatively impacts performance. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org