geruh commented on code in PR #602: URL: https://github.com/apache/iceberg-python/pull/602#discussion_r1565238973
########## pyiceberg/table/__init__.py: ########## @@ -3410,6 +3410,32 @@ def _readable_metrics_struct(bound_type: PrimitiveType) -> pa.StructType: schema=entries_schema, ) + def refs(self) -> "pa.Table": + import pyarrow as pa + + ref_schema = pa.schema([ + pa.field('name', pa.string(), nullable=False), + pa.field('type', pa.string(), nullable=False), Review Comment: Hey Fokko, I looked into this for both pandas and pyarrow and it offers a few benefits like reduced memory usage and improved performance on sorting and filtering. Since we're only dealing with `branch` or `tag`, this is ideal. However, this does stray from the Java implementation which uses a [string type](https://github.com/apache/iceberg/blob/fb657b413e2bb7f6c5e2c78465173df0426d3527/core/src/main/java/org/apache/iceberg/RefsTable.java#L36). But I think we should be fine with this difference, because in both pyarrow and pandas, this type seems to be well supported and can be easily converted to string if needed. Also, this would be beneficial for huge tables with many references. In PyArrow, this would be implemented as a dictionary type mapping integers to strings. Also, this could be added to the other metadata tables such as the operation field in the snapshots table. ########## pyiceberg/table/__init__.py: ########## @@ -3410,6 +3410,32 @@ def _readable_metrics_struct(bound_type: PrimitiveType) -> pa.StructType: schema=entries_schema, ) + def refs(self) -> "pa.Table": + import pyarrow as pa + + ref_schema = pa.schema([ + pa.field('name', pa.string(), nullable=False), + pa.field('type', pa.string(), nullable=False), Review Comment: Hey Fokko, I looked into this for both pandas and pyarrow and it offers a few benefits like reduced memory usage and improved performance on sorting and filtering. Since we're only dealing with `branch` or `tag`, this is ideal. However, this does stray from the Java implementation which uses a [string type](https://github.com/apache/iceberg/blob/fb657b413e2bb7f6c5e2c78465173df0426d3527/core/src/main/java/org/apache/iceberg/RefsTable.java#L36). But I think we should be fine with this difference, because in both pyarrow and pandas, this type seems to be well supported and can be easily converted to string if needed. Also, this would be beneficial for huge tables with many references. In pyarrow, this would be implemented as a dictionary type mapping integers to strings. Also, this could be added to the other metadata tables such as the operation field in the snapshots table. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org