syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1469710203
########## pyiceberg/io/pyarrow.py: ########## @@ -906,6 +986,76 @@ def after_map_value(self, element: pa.Field) -> None: self._field_names.pop() +class _ConvertToIcebergWithFreshIds(PreOrderPyArrowSchemaVisitor[Union[IcebergType, Schema]]): Review Comment: > I would love to get this in with the 0.6.0 release to simplify the creation of tables. Sounds good 👍 I'll try my best to bring this review to a consensus in the next few days. We already have a working version, so we just need to agree on the final details 😄 >That's an okay approach, as long as the visitor to do this is hidden inside the package. We should not expose setting -1 field IDs to the outside. > > What I like about the current implementation is that the visitor can be used on its own. Converting a field with all -1 IDs doesn't provide much value on its own. I'm 🆗 with both approach, but I have a preference for the '-1' approach... and here's my current thought. Both approaches will work, and in both approaches, we are introducing an intermediate state of the Iceberg Schema that should never be exposed. But what I like about this approach is that we are creating an intermediate Iceberg Schema that is very visibly wrong, that we could use to check the validity of the schema before we allow the erroneous intermediate state to materialize a lasting result. An example could be to introduce a new sanity check before writing the individual files, or before making a schema update, that checks that no two IDs are the same, or that the IDs are greater than 0. Hence, I think the fact that the intermediate Iceberg Schema is "seemingly" correct actually feels more like a downside to me... I'm curious to hear if that argument resonates with you or the rest of the group. Hopefully it will be the last discussion item before we settle on the final approach 🎉🎈 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org