syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1467757335
########## pyiceberg/io/pyarrow.py: ########## @@ -906,6 +986,76 @@ def after_map_value(self, element: pa.Field) -> None: self._field_names.pop() +class _ConvertToIcebergWithFreshIds(PreOrderPyArrowSchemaVisitor[Union[IcebergType, Schema]]): Review Comment: These are all good suggestions @HonahX . Thank you for taking the time to write out the detailed response 😄 . I'm not too keen on assigning 'random' IDs in post-order in _ConvertToIceberg or in _ConvertToIcebergWithFreshIds because that'll make it more difficult to write out the correct 'expected' outcome for the test case of the visitor when it is assigning fresh and unique IDs. Although it is deterministic, I think it's a bit difficult to leave logic that exists only for intermediary purpose, that is difficult to understand. > I also do not know the exact reason and context here. Similarly, Java side does something more than a simple pre-order traversal. Looks like the java implementation can be traced back to the very beginning of iceberg. If there's no additional context, I'm taking the liberty to entertain the idea of refactoring **_SetFreshIDs** visitor by analyzing its existing functionality, which is just to: 1. assign IDs in pre-order 2. track old field_id to reserved_id mapping just to ensure that old identifier_fields are mapped to the new ones at the end If functionality (2) is the only reason we have the requirement of the field_ids needing to be unique... how do you folks feel about the idea of refactoring **_SetFreshIDs** to remove this constraint by using a set to keep track of identifier_field_ids when we **_get_and_increment** (assign) new IDs, and instead assign all field_ids as '-1' using **_ConvertToIceberg**/**_ConvertToIcebergFreshIDs**? And then use the refactored **_SetFreshIDs** visitor to assign the IDs in order? ``` class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]): """Traverses the schema and assigns monotonically increasing ids.""" identifier_field_ids: Set[int] def __init__(self, next_id_func: Optional[Callable[[], int]] = None) -> None: self.identifier_field_ids= set() counter = itertools.count(1) self.next_id_func = next_id_func if next_id_func is not None else lambda: next(counter) def _get_and_increment(self, current_id: int) -> int: new_id = self.next_id_func() if current_id in identifier_field_ids: identifier_field_ids.pop(current_id) identifier_field_ids.add(new_id) return new_id ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org