bigluck commented on issue #1491: URL: https://github.com/apache/iceberg-python/issues/1491#issuecomment-2598624396
@Fokko I'm getting older :D

BTW, I think there's an easy fix. This comment is pretty interesting: https://github.com/apache/arrow/issues/33049#issuecomment-1466401027

We tested it and it works. In our case we cast all the `string` types to `large_string` before passing the data to pyiceberg. Since pyiceberg already casts all strings to `large_string` during a scan operation, I think it's fine if pyiceberg does the same before writing the data into the lake (before passing the data to Arrow). It should be a no-op. A sketch (this code does not cover all the cases, e.g. strings nested inside structs, lists, or maps):

```python
import pyarrow as pa
import pyarrow.types as pt


def upcast_to_large_types(old_schema: pa.Schema) -> pa.Schema:
    """Return a copy of the schema with string/binary fields upcast to their large_* variants."""
    fields = []
    for field in old_schema:
        if pt.is_string(field.type):
            # with_type preserves the field's name, nullability, and metadata.
            fields.append(field.with_type(pa.large_string()))
        elif pt.is_binary(field.type):
            fields.append(field.with_type(pa.large_binary()))
        else:
            fields.append(field)
    return pa.schema(fields)
```
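For context, a minimal sketch of how this would be applied at the call site (the input table here is illustrative, not taken from the issue):

```python
import pyarrow as pa

# Hypothetical table; in practice this is whatever Arrow data is about
# to be handed to pyiceberg's write path.
table = pa.table({"name": pa.array(["a", "b"], type=pa.string())})

# Cast to the upcast schema. This should not rewrite the string values
# themselves; only the offsets buffer is widened from 32 to 64 bits.
table = table.cast(upcast_to_large_types(table.schema))
assert table.schema.field("name").type == pa.large_string()
```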