syun64 commented on code in PR #305:
URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1467757335
##########
pyiceberg/io/pyarrow.py:
##########
@@ -906,6 +986,76 @@ def after_map_value(self, element: pa.Field) -> None:
self._field_names.pop()
+class
_ConvertToIcebergWithFreshIds(PreOrderPyArrowSchemaVisitor[Union[IcebergType,
Schema]]):
Review Comment:
These are all good suggestions @HonahX . Thank you for taking the time to
write out the detailed response 😄 .
I'm not too keen on assigning 'random' IDs in post-order in
_ConvertToIceberg or in _ConvertToIcebergWithFreshIds because that'll make it
more difficult to write out the correct 'expected' outcome for the test case of
the visitor when it is assigning fresh and unique IDs. Although it is
deterministic, I think it's a bit difficult to leave logic that exists only for
intermediary purpose, that is difficult to understand.
> I also do not know the exact reason and context here. Similarly, Java side
does something more than a simple pre-order traversal. Looks like the java
implementation can be traced back to the very beginning of iceberg.
If there's no additional context, I'm taking the liberty to entertain the
idea of refactoring **_SetFreshIDs** visitor by analyzing its existing
functionality, which is just to:
1. assign IDs in pre-order
2. track old field_id to reserved_id mapping just to ensure that old
identifier_fields are mapped to the new ones at the end
If functionality (2) is the only reason we have the requirement of the
field_ids needing to be unique... how do you folks feel about the idea of
refactoring **_SetFreshIDs** to remove this constraint by using a set to keep
track of identifier_field_ids when we **_get_and_increment** (assign) new IDs,
and instead assign all field_ids as '-1' using
**_ConvertToIceberg**/**_ConvertToIcebergFreshIDs**? And then use the
refactored **_SetFreshIDs** visitor to assign the IDs in order?
```
class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]):
"""Traverses the schema and assigns monotonically increasing ids."""
identifier_field_ids: Set[int]
def __init__(self, next_id_func: Optional[Callable[[], int]] = None) ->
None:
self.identifier_field_ids= set()
counter = itertools.count(1)
self.next_id_func = next_id_func if next_id_func is not None else
lambda: next(counter)
def _get_and_increment(self, current_id: int) -> int:
new_id = self.next_id_func()
if current_id in identifier_field_ids:
identifier_field_ids.pop(current_id)
identifier_field_ids.add(new_id)
return new_id
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]