Re: [PR] create_table with PyArrow Schema [iceberg-python]

via GitHub Thu, 25 Jan 2024 23:36:01 -0800


HonahX commented on code in PR #305:
URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1467303335



##########
pyiceberg/io/pyarrow.py:
##########
@@ -906,6 +986,76 @@ def after_map_value(self, element: pa.Field) -> None:
         self._field_names.pop()
 
 
+class 
_ConvertToIcebergWithFreshIds(PreOrderPyArrowSchemaVisitor[Union[IcebergType, 
Schema]]):

Review Comment:
   Thanks @syun64 for the summarization. I fully agree with your reasoning. I 
raised that question just to make sure we all agree that the minimum 
requirement for the new visitor here is to assign **arbitrarily unique IDs** 
for each field. Based on our discussions, there are the following options:
   
   1. update `_ConvertToIceberg` to generate some arbitrarily unique IDs for 
each field in post-order
   2. Add a `_ConvertToIcebergWithFreshIds` post-order visitor
   3. Add a `_ConvertToIcebergWithFreshIds` pre-order visitor + 
`pre_order_visit_pyarrow`
   
   Option 1 has almost no code duplication and requires slight changes only. 
   Option 2 requires some code duplication, which can be reduced by re-using 
`primitive()`(the current implementation) or reduced further by implementing 
some sort of inheritance as 
https://github.com/apache/iceberg-python/pull/219#discussion_r1456922888. 
   Option 3 requires some code duplication, which can be reduced by re-using 
`primitive()` (the current implementation)
   
   Since the schema generated by option 3 will still have some difference in 
field ids assignment than that by `assign_fresh_schema_ids`, I am more inclined 
toward option 1 and 2. I think we might not need the `pre_order_visit_pyarrow` 
and `PreOrderPyArrowSchemaVisitor`.
   
   For option 1 and 2, I have no strong preference. If there is no further 
concern about mixing this special case for table creation with the visitor 
implementaion to read table, option 1 is optimal. If we still think it may be 
better to separate the logic from `_ConvertToIceberg`, considering this 
[concern](https://github.com/apache/iceberg-python/pull/219#issuecomment-1859273052),
 then option 2 can be a compromise choice. WDYT?
   
   
   
   > Is there a reason why we decided on this approach to 
assign_fresh_schema_ids, over using a simple pre-order traversal to assign IDs 
as we see the fields? Is this because we want to accommodate for potential 
duplicates?
   
   I also do not know the exact reason and context here. Similarly, Java side 
does something more than a simple pre-order traversal. Looks like the java 
implementation can be traced back to the very beginning of iceberg. 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] create_table with PyArrow Schema [iceberg-python]

Reply via email to