Re: [PR] create_table with PyArrow Schema [iceberg-python]

via GitHub Fri, 26 Jan 2024 06:57:08 -0800


syun64 commented on code in PR #305:
URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1467757335



##########
pyiceberg/io/pyarrow.py:
##########
@@ -906,6 +986,76 @@ def after_map_value(self, element: pa.Field) -> None:
         self._field_names.pop()
 
 
+class 
_ConvertToIcebergWithFreshIds(PreOrderPyArrowSchemaVisitor[Union[IcebergType, 
Schema]]):

Review Comment:
   These are all good suggestions @HonahX . Thank you for taking the time to 
write out the detailed response 😄 .
   
   I'm not too keen on assigning 'random' IDs in post-order in 
_ConvertToIceberg or in _ConvertToIcebergWithFreshIds because that'll make it 
more difficult to write out the correct 'expected' outcome for the test case of 
the visitor when it is assigning fresh and unique IDs. Although it is 
deterministic, I think it's a bit difficult to leave logic that exists only for 
intermediary purpose, that is difficult to understand.
   
   > I also do not know the exact reason and context here. Similarly, Java side 
does something more than a simple pre-order traversal. Looks like the java 
implementation can be traced back to the very beginning of iceberg.
   
   If there's no additional context, I'm taking the liberty to entertain the 
idea of refactoring **_SetFreshIDs** visitor by analyzing its existing 
functionality, which is just to:
   1. assign IDs in pre-order
   2. track old field_id to reserved_id mapping just to ensure that old 
identifier_fields are mapped to the new ones at the end
   
   If functionality (2) is the only reason we have the requirement of the 
field_ids needing to be unique... how do you folks feel about the idea of 
refactoring **_SetFreshIDs** to remove this constraint by using a set to keep 
track of identifier_field_ids when we **_get_and_increment** (assign) new IDs, 
and instead assign all field_ids as '-1' using 
**_ConvertToIceberg**/**_ConvertToIcebergFreshIDs**? And then use the 
refactored **_SetFreshIDs** visitor to assign the IDs in order?
   
   ```
   class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]):
       """Traverses the schema and assigns monotonically increasing ids."""
   
       identifier_field_ids: Set[int]
   
       def __init__(self, next_id_func: Optional[Callable[[], int]] = None) -> 
None:
           self.identifier_field_ids= set()
           counter = itertools.count(1)
           self.next_id_func = next_id_func if next_id_func is not None else 
lambda: next(counter)
   
       def _get_and_increment(self, current_id: int) -> int:
           new_id = self.next_id_func()
   
           if current_id in identifier_field_ids:
               identifier_field_ids.pop(current_id)
               identifier_field_ids.add(new_id)
           return new_id
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] create_table with PyArrow Schema [iceberg-python]

Reply via email to