kevinjqliu commented on issue #1247: URL: https://github.com/apache/iceberg-python/issues/1247#issuecomment-2448076726
> From the [docs](https://iceberg.apache.org/docs/latest/spark-ddl/#alter-table-write-ordered-by) I'm understanding that sort order is purely suggestive, and it is up to the engines to decide if they will attempt to use the sort-order on write.

I don't think sort order is suggestive. My understanding is that I can declare a sort order that is maintained for the table; it's a contract, like the partition spec or the schema.

Looking at https://iceberg.apache.org/spec/#sorting,

"""
Users can sort their data within partitions by columns to gain performance. The information on how the data is sorted can be declared per data or delete file, by a sort order.
"""

The "default sort order id" might be optional.

"""
A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.
"""

In a scenario where I've declared a sort order for the table and then write a dataframe that is not sorted according to that sort order, pyiceberg will still write the data as it is laid out in the dataframe. I think this deviates from the expected behavior.
Here's an example:

```
import pyarrow as pa

from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.exceptions import NoSuchTableError
from pyiceberg.table.sorting import SortDirection, SortField, SortOrder
from pyiceberg.transforms import IdentityTransform

# Example data, already in ascending order
int_data = [1, 2, 3, 4, 5]
string_data = ["a", "b", "c", "d", "e"]

# Create a PyArrow Table
table = pa.table({"int_field": pa.array(int_data), "string_field": pa.array(string_data)})

# Display the table
print(table.to_pandas())
print()

warehouse_path = "/tmp/warehouse"
catalog = SqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)

identifier = "test.write_with_sort_order"
catalog.create_namespace_if_not_exists("test")
# Start from a clean slate if the table is left over from a previous run
try:
    catalog.drop_table(identifier)
except NoSuchTableError:
    pass

# Private helper, used here only to get an Iceberg schema with field IDs assigned
staged_table = catalog._create_staged_table(identifier, schema=table.schema)
# Declare a descending sort order on field ID 1 (int_field)
sort_order = SortOrder(SortField(source_id=1, transform=IdentityTransform(), direction=SortDirection.DESC))
tbl = catalog.create_table(identifier, schema=staged_table.schema(), sort_order=sort_order)
print(tbl)

# Write data that is sorted ascending, i.e. not in the declared order
tbl.overwrite(table)
print(tbl.scan().to_pandas())
```

Output:

```
➜ iceberg-python git:(kevinjqliu/deprecate-0.8.0) ✗ poetry run python write_sort_order.py
   int_field string_field
0          1            a
1          2            b
2          3            c
3          4            d
4          5            e

write_with_sort_order(
  1: int_field: optional long,
  2: string_field: optional string
),
partition by: [],
sort order: [1 DESC NULLS LAST],
snapshot: null
/Users/kevinliu/repos/iceberg-python/pyiceberg/table/__init__.py:558: UserWarning: Delete operation did not match any records
  warnings.warn("Delete operation did not match any records")
/Users/kevinliu/repos/iceberg-python/pyiceberg/utils/deprecated.py:51: DeprecationWarning: Deprecated in 0.8.0, will be removed in 0.9.0. Table.identifier property is deprecated. Please use Table.name() function instead.
  _deprecation_warning(message)
/Users/kevinliu/repos/iceberg-python/pyiceberg/utils/deprecated.py:51: DeprecationWarning: Deprecated in 0.8.0, will be removed in 0.9.0. Support for parsing catalog level identifier in Catalog identifiers is deprecated. Please refer to the table using only its namespace and its table name.
  _deprecation_warning(message)
/Users/kevinliu/repos/iceberg-python/pyiceberg/utils/deprecated.py:51: DeprecationWarning: Deprecated in 0.8.0, will be removed in 0.9.0. Table.identifier property is deprecated. Please use Table.name() function instead.
  _deprecation_warning(message)
   int_field string_field
0          1            a
1          2            b
2          3            c
3          4            d
4          5            e
```

In the example, I declared a `DESC` sort order, while all the data in the dataframe is in ASC order. The write is allowed, and the result is in the order of the dataframe, not the sort order of the table.
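To make the expected behavior concrete, here is a minimal, library-free sketch of what honoring `1 DESC NULLS LAST` would mean for the rows in the example. The helper `sort_rows` and its `(column, descending, nulls_first)` tuples are hypothetical names for illustration only; they are not pyiceberg API, and this is not a claim about how pyiceberg should implement it.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical stand-in for a sort field: (column name, descending?, nulls first?)
SortSpec = Tuple[str, bool, bool]


def sort_rows(rows: List[Dict[str, Any]], order: List[SortSpec]) -> List[Dict[str, Any]]:
    """Return a copy of `rows` sorted by `order` (most significant field first)."""
    result = list(rows)
    # list.sort is stable, so sorting by the least significant field first
    # and the most significant field last yields a correct multi-key sort.
    for column, descending, nulls_first in reversed(order):
        non_null = [r for r in result if r[column] is not None]
        nulls = [r for r in result if r[column] is None]
        non_null.sort(key=lambda r: r[column], reverse=descending)
        result = nulls + non_null if nulls_first else non_null + nulls
    return result


# Same data as the example above, in ascending order
rows = [{"int_field": i, "string_field": s} for i, s in zip([1, 2, 3, 4, 5], "abcde")]

# Equivalent of the declared sort order: int_field DESC NULLS LAST
sorted_rows = sort_rows(rows, [("int_field", True, False)])
print([r["int_field"] for r in sorted_rows])  # [5, 4, 3, 2, 1]
```

Until the write path sorts automatically, a caller could presumably get the same ordering by sorting the Arrow table first, for instance with `table.sort_by([("int_field", "descending")])`, before calling `tbl.overwrite(...)`. I have not verified this against the repro above.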