bigluck commented on issue #1491: URL: https://github.com/apache/iceberg-python/issues/1491#issuecomment-2598624396
@Fokko I'm getting older :D

BTW, I think there's an easy fix. This comment is pretty interesting: https://github.com/apache/arrow/issues/33049#issuecomment-1466401027

We tested it and it works. In our case we cast all the `string` types to `large_string` before passing the data to pyiceberg. Since pyiceberg already casts all strings to `large_string` during a scan operation, I think it's fine if pyiceberg does the same before writing the data into the lake (before passing the data to Arrow). It should be a no-op. A sketch (this code does not cover all the cases, e.g. strings nested inside structs, lists, or maps):

```python
import pyarrow as pa
import pyarrow.types as pt


def upcast_to_large_types(old_schema: pa.Schema) -> pa.Schema:
    """Return a copy of the schema with string/binary fields upcast to their large_* variants."""
    fields = []
    for field in old_schema:
        if pt.is_string(field.type):
            # with_type preserves the field's name, nullability, and metadata.
            fields.append(field.with_type(pa.large_string()))
        elif pt.is_binary(field.type):
            fields.append(field.with_type(pa.large_binary()))
        else:
            fields.append(field)
    return pa.schema(fields)
```
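For context, a minimal sketch of how this would be applied at the call site (the input table here is illustrative, not taken from the issue):

```python
import pyarrow as pa

# Hypothetical table; in practice this is whatever Arrow data is about
# to be handed to pyiceberg's write path.
table = pa.table({"name": pa.array(["a", "b"], type=pa.string())})

# Cast to the upcast schema. This should not rewrite the string values
# themselves; only the offsets buffer is widened from 32 to 64 bits.
table = table.cast(upcast_to_large_types(table.schema))
assert table.schema.field("name").type == pa.large_string()
```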