HonahX commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1467303335
########## pyiceberg/io/pyarrow.py: ########## @@ -906,6 +986,76 @@ def after_map_value(self, element: pa.Field) -> None: self._field_names.pop() +class _ConvertToIcebergWithFreshIds(PreOrderPyArrowSchemaVisitor[Union[IcebergType, Schema]]): Review Comment: Thanks @syun64 for the summarization. I fully agree with your reasoning. I raised that question just to make sure we all agree that the minimum requirement for the new visitor here is to assign **arbitrarily unique IDs** for each field. Based on our discussions, there are the following options: 1. update `_ConvertToIceberg` to generate some arbitrarily unique IDs for each field in post-order 2. Add a `_ConvertToIcebergWithFreshIds` post-order visitor 3. Add a `_ConvertToIcebergWithFreshIds` pre-order visitor + `pre_order_visit_pyarrow` Option 1 has almost no code duplication and requires slight changes only. Option 2 requires some code duplication, which can be reduced by re-using `primitive()`(the current implementation) or reduced further by implementing some sort of inheritance as https://github.com/apache/iceberg-python/pull/219#discussion_r1456922888. Option 3 requires some code duplication, which can be reduced by re-using `primitive()` (the current implementation) Since the schema generated by option 3 will still have some difference in field ids assignment than that by `assign_fresh_schema_ids`, I am more inclined toward option 1 and 2. I think we might not need the `pre_order_visit_pyarrow` and `PreOrderPyArrowSchemaVisitor`. For option 1 and 2, I have no strong preference. If there is no further concern about mixing this special case for table creation with the visitor implementaion to read table, option 1 is optimal. If we still think it may be better to separate the logic from `_ConvertToIceberg`, considering this [concern](https://github.com/apache/iceberg-python/pull/219#issuecomment-1859273052), then option 2 can be a compromise choice. WDYT? > Is there a reason why we decided on this approach to assign_fresh_schema_ids, over using a simple pre-order traversal to assign IDs as we see the fields? Is this because we want to accommodate for potential duplicates? I also do not know the exact reason and context here. Similarly, Java side does something more than a simple pre-order traversal. Looks like the java implementation can be traced back to the very beginning of iceberg. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org