zeddit commented on issue #132: URL: https://github.com/apache/iceberg-python/issues/132#issuecomment-1803131783
### 2. adding sorted_by properties, or the term of sort_order in iceberg because we want a sorted iceberg table to store and return time series data. so we create a sorted_by table in Trino with ``` CREATE TABLE test_table2( date date ) WITH ( format = 'PARQUET', location = 's3a://test/test_table2', sorted_by = ARRAY['date'] ); ``` and we check the table properties in pyiceberg CLI to make sure sort_order properties is set ``` $ pyiceberg describe test_table2 Table format version 2 Metadata location s3a://test/test_table2/metadata/00025-cd229550-87da-4b49-8013-28d862e8… Table UUID e9c38f74-f556-43b2-a9f9-6facdd884723 Last Updated 1699465355600 Partition spec [] Sort order [ 1 ASC NULLS FIRST ] Current schema Schema, id=0 └── 1: date: optional date ``` Then we insert the rows again to this new table. We still insert them one by one, with a same sequence like '2021-01-01', '2021-01-04', '2021-01-03', '2021-01-07', '2021-01-02', '2021-01-05', '2021-01-09', '2021-01-06', '2021-01-08'. Then we start to get data out of the new table. In trino, when using select * from test_table1, there still is no consistent order, the result returned varies between runs. In pyiceberg, it's good to see there is a fixed order, however, the order is not the Ascending order that we stated and want for the table. It's because we insert rows one by one, and they are in different data-files underlying. When we compact the table into one data-file with Trino `alter table test_table2 execute optimize;`. The result both in trino and pyiceberg is fixed and ordered. that is because there is only one data-file and the result order is the one that how data is written into that data-file. When specifying `sorted_by` properties, a local sort will be performed to ensure that in each data-file, there will be in order. Then we try to append the same amount of data into the sorted table with ``` insert into test_table2 select * from test_table1; ``` we got the result below: <img width="123" alt="截屏2023-11-09 12 04 17" src="https://github.com/apache/iceberg-python/assets/30164206/b05a3e25-eacd-45a1-8d63-7442fb3fe12d"> It's not correct, what we expected is <img width="126" alt="截屏2023-11-09 12 04 43" src="https://github.com/apache/iceberg-python/assets/30164206/976d1022-4b1e-488c-8866-4f465db40176"> it's because the second insert store rows in another data-file, which is just concat with the one before. when using `alter table test_table2 execute optimize;` to merge them, the result is as expected. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org