geruh commented on code in PR #602:
URL: https://github.com/apache/iceberg-python/pull/602#discussion_r1565238973
##########
pyiceberg/table/__init__.py:
##########
@@ -3410,6 +3410,32 @@ def _readable_metrics_struct(bound_type: PrimitiveType)
-> pa.StructType:
schema=entries_schema,
)
+ def refs(self) -> "pa.Table":
+ import pyarrow as pa
+
+ ref_schema = pa.schema([
+ pa.field('name', pa.string(), nullable=False),
+ pa.field('type', pa.string(), nullable=False),
Review Comment:
Hey Fokko, I looked into this for both pandas and pyarrow and it offers a
few benefits like reduced memory usage and improved performance on sorting and
filtering. Since we're only dealing with `branch` or `tag`, this is ideal.
However, this does stray from the Java implementation which uses a [string
type](https://github.com/apache/iceberg/blob/fb657b413e2bb7f6c5e2c78465173df0426d3527/core/src/main/java/org/apache/iceberg/RefsTable.java#L36).
But I think we should be fine with this difference, because in both pyarrow
and pandas, this type seems to be well supported and can be easily converted to
string if needed. Also, this would be beneficial for huge tables with many
references.
In PyArrow, this would be implemented as a dictionary type mapping integers
to strings. Also, this could be added to the other metadata tables such as the
operation field in the snapshots table.
##########
pyiceberg/table/__init__.py:
##########
@@ -3410,6 +3410,32 @@ def _readable_metrics_struct(bound_type: PrimitiveType)
-> pa.StructType:
schema=entries_schema,
)
+ def refs(self) -> "pa.Table":
+ import pyarrow as pa
+
+ ref_schema = pa.schema([
+ pa.field('name', pa.string(), nullable=False),
+ pa.field('type', pa.string(), nullable=False),
Review Comment:
Hey Fokko, I looked into this for both pandas and pyarrow and it offers a
few benefits like reduced memory usage and improved performance on sorting and
filtering. Since we're only dealing with `branch` or `tag`, this is ideal.
However, this does stray from the Java implementation which uses a [string
type](https://github.com/apache/iceberg/blob/fb657b413e2bb7f6c5e2c78465173df0426d3527/core/src/main/java/org/apache/iceberg/RefsTable.java#L36).
But I think we should be fine with this difference, because in both pyarrow
and pandas, this type seems to be well supported and can be easily converted to
string if needed. Also, this would be beneficial for huge tables with many
references.
In pyarrow, this would be implemented as a dictionary type mapping integers
to strings. Also, this could be added to the other metadata tables such as the
operation field in the snapshots table.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]