Re: [I] Block writing to sorted tables [iceberg-python]

via GitHub Wed, 30 Oct 2024 11:47:23 -0700


kevinjqliu commented on issue #1247:
URL: 
https://github.com/apache/iceberg-python/issues/1247#issuecomment-2448076726


   > From the 
[docs](https://iceberg.apache.org/docs/latest/spark-ddl/#alter-table-write-ordered-by)
 I'm understanding that sort order is purely suggestive, and it is up to the 
engines to decide if they will attempt to use the sort-order on write.
   
   I don't think sort order is purely suggestive. My understanding is that I 
can declare a sort order that is maintained for the table; it's a contract, 
like the partition spec or the schema.
   Looking at https://iceberg.apache.org/spec/#sorting, 
   """
   Users can sort their data within partitions by columns to gain performance. 
The information on how the data is sorted can be declared per data or delete 
file, by a sort order.
   """
   
   Applying the table's "default sort order id" on write might be optional, 
though:
   """
    A table could also be configured with a default sort order id, indicating 
how the new data should be sorted by default. Writers should use this default 
sort order to sort the data on write, but are not required to if the default 
order is prohibitively expensive, as it would be for streaming writes.
   """
   
   In a scenario where I've declared a sort order for the table, and then write 
a dataframe that is not sorted according to the sort order, pyiceberg will 
still write the data as it is laid out in the dataframe. I think this deviates 
from the expected behavior. 
   
   Here's an example:
   ```
   import pyarrow as pa
   
   from pyiceberg.table.sorting import SortDirection, SortField, SortOrder
   from pyiceberg.table.update import AddSortOrderUpdate
   from pyiceberg.transforms import IdentityTransform
   from pyiceberg.catalog.sql import SqlCatalog
   
   
   # Example data
   int_data = [1, 2, 3, 4, 5]
   string_data = ["a", "b", "c", "d", "e"]
   
   # Create a PyArrow Table
   table = pa.table(
       {"int_field": pa.array(int_data), "string_field": pa.array(string_data)}
   )
   
   # Display the table
   print(table.to_pandas())
   print()
   
   warehouse_path = "/tmp/warehouse"
   catalog = SqlCatalog(
       "default",
       **{
           "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
           "warehouse": f"file://{warehouse_path}",
       },
   )
   
   identifier = "test.write_with_sort_order"
   catalog.create_namespace_if_not_exists("test")
   try:
       catalog.drop_table(identifier)
   except Exception:  # e.g. the table does not exist yet on the first run
       pass
   
   staged_table = catalog._create_staged_table(identifier, schema=table.schema)
   
   sort_order = SortOrder(
       SortField(source_id=1, transform=IdentityTransform(), direction=SortDirection.DESC)
   )
   tbl = catalog.create_table(
       identifier, schema=staged_table.schema(), sort_order=sort_order
   )
   print(tbl)
   tbl.overwrite(table)
   print(tbl.scan().to_pandas())
   ```
   
   Output:
   ```
   ➜  iceberg-python git:(kevinjqliu/deprecate-0.8.0) ✗ poetry run python 
write_sort_order.py
      int_field string_field
   0          1            a
   1          2            b
   2          3            c
   3          4            d
   4          5            e
   
   write_with_sort_order(
     1: int_field: optional long,
     2: string_field: optional string
   ),
   partition by: [],
   sort order: [1 DESC NULLS LAST],
   snapshot: null
   /Users/kevinliu/repos/iceberg-python/pyiceberg/table/__init__.py:558: 
UserWarning: Delete operation did not match any records
     warnings.warn("Delete operation did not match any records")
   /Users/kevinliu/repos/iceberg-python/pyiceberg/utils/deprecated.py:51: 
DeprecationWarning: Deprecated in 0.8.0, will be removed in 0.9.0. 
Table.identifier property is deprecated. Please use Table.name() function 
instead.
     _deprecation_warning(message)
   /Users/kevinliu/repos/iceberg-python/pyiceberg/utils/deprecated.py:51: 
DeprecationWarning: Deprecated in 0.8.0, will be removed in 0.9.0. Support for 
parsing catalog level identifier in Catalog identifiers is deprecated. Please 
refer to the table using only its namespace and its table name.
     _deprecation_warning(message)
   /Users/kevinliu/repos/iceberg-python/pyiceberg/utils/deprecated.py:51: 
DeprecationWarning: Deprecated in 0.8.0, will be removed in 0.9.0. 
Table.identifier property is deprecated. Please use Table.name() function 
instead.
     _deprecation_warning(message)
      int_field string_field
   0          1            a
   1          2            b
   2          3            c
   3          4            d
   4          5            e
   ```
   
   In the example, I declared a `DESC` sort order while all the data in the 
dataframe is in `ASC` order. The write is allowed, and the result comes back in 
the order of the dataframe, not the sort order declared on the table.
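
To make the expectation concrete, here is a minimal sketch in plain Python of what "honoring the sort order" could look like on the caller's side: check whether the rows already follow the declared order, and sort them if not. Everything in it is an assumption for illustration: `SortSpec`, `sort_rows`, and `is_sorted` are hypothetical names, not pyiceberg APIs, and rows are modeled as plain dicts rather than an Arrow table.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical, simplified stand-in for a table sort order:
# one (column name, descending?, nulls_first?) tuple per sort field.
SortSpec = List[Tuple[str, bool, bool]]
Row = Dict[str, Any]


def sort_rows(rows: List[Row], spec: SortSpec) -> List[Row]:
    """Return a new list of rows ordered by spec.

    Python's sort is stable, so sorting by the last field first and the
    first field last yields a correct multi-field ordering.
    """
    result = list(rows)
    for column, descending, nulls_first in reversed(spec):
        # Nulls are placed explicitly so they never get compared to values.
        non_null = [r for r in result if r[column] is not None]
        nulls = [r for r in result if r[column] is None]
        non_null.sort(key=lambda r: r[column], reverse=descending)
        result = nulls + non_null if nulls_first else non_null + nulls
    return result


def is_sorted(rows: List[Row], spec: SortSpec) -> bool:
    """True if rows already follow spec (a stable sort leaves them unchanged)."""
    return sort_rows(rows, spec) == rows


# Same data as in the example above, in ascending order.
rows = [
    {"int_field": i, "string_field": s}
    for i, s in zip([1, 2, 3, 4, 5], ["a", "b", "c", "d", "e"])
]
# Mirrors "sort order: [1 DESC NULLS LAST]" on int_field.
spec: SortSpec = [("int_field", True, False)]

already_sorted = is_sorted(rows, spec)
sorted_rows = sort_rows(rows, spec)
print(already_sorted)
print([r["int_field"] for r in sorted_rows])
```

With a PyArrow table, a similar effect can be had before calling `tbl.overwrite` by sorting the input first, for example `table.sort_by([("int_field", "descending")])`. Whether a writer should sort automatically or reject unsorted input is the open question in this issue; the sketch only illustrates the two options.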


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

