syun64 commented on code in PR #305:
URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1469710203
##########
pyiceberg/io/pyarrow.py:
##########
@@ -906,6 +986,76 @@ def after_map_value(self, element: pa.Field) -> None:
self._field_names.pop()
+class
_ConvertToIcebergWithFreshIds(PreOrderPyArrowSchemaVisitor[Union[IcebergType,
Schema]]):
Review Comment:
> I would love to get this in with the 0.6.0 release to simplify the
creation of tables.
Sounds good 👍 I'll try my best to bring this review to a consensus in the
next few days. We already have a working version, so we just need to agree on
the final details 😄
>That's an okay approach, as long as the visitor to do this is hidden inside
the package. We should not expose setting -1 field IDs to the outside.
>
> What I like about the current implementation is that the visitor can be
used on its own. Converting a field with all -1 IDs doesn't provide much value
on its own.
I'm 🆗 with both approach, but I have a preference for the '-1' approach...
and here's my current thought.
Both approaches will work, and in both approaches, we are introducing an
intermediate state of the Iceberg Schema that should never be exposed. But what
I like about this approach is that we are creating an intermediate Iceberg
Schema that is very visibly wrong, that we could use to check the validity of
the schema before we allow the erroneous intermediate state to materialize a
lasting result. An example could be to introduce a new sanity check before
writing the individual files, or before making a schema update, that checks
that no two IDs are the same, or that the IDs are greater than 0.
Hence, I think the fact that the intermediate Iceberg Schema is "seemingly"
correct actually feels more like a downside to me...
I'm curious to hear if that argument resonates with you or the rest of the
group. Hopefully it will be the last discussion item before we settle on the
final approach 🎉🎈
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]