Re: [I] to_pandas() API which converts iceberg table scan to a pd.DataFrame will lost datetime data type and row order [iceberg-python]

via GitHub Wed, 08 Nov 2023 20:08:06 -0800


zeddit commented on issue #132:
URL: https://github.com/apache/iceberg-python/issues/132#issuecomment-1803131783


   ### 2. adding sorted_by properties, or the term of sort_order in iceberg
   because we want a sorted iceberg table to store and return time series data.
   so we create a sorted_by table in Trino with
   ```
   CREATE TABLE test_table2(
       date date
   )
   WITH (
       format = 'PARQUET',
       location = 's3a://test/test_table2',
       sorted_by = ARRAY['date']
   );
   ```
   and we check the table properties in pyiceberg CLI to make sure sort_order 
properties is set
   ```
   $ pyiceberg describe test_table2
   Table format version  2
   Metadata location     
s3a://test/test_table2/metadata/00025-cd229550-87da-4b49-8013-28d862e8…
   Table UUID            e9c38f74-f556-43b2-a9f9-6facdd884723
   Last Updated          1699465355600
   Partition spec        []
   Sort order            [
                           1 ASC NULLS FIRST
                         ]
   Current schema        Schema, id=0
                         └── 1: date: optional date
   ```
   Then we insert the rows again to this new table. We still insert them one by 
one, with a same sequence like '2021-01-01', '2021-01-04', '2021-01-03', 
'2021-01-07', '2021-01-02', '2021-01-05', '2021-01-09', '2021-01-06', 
'2021-01-08'.
   
   Then we start to get data out of the new table. 
   In trino, when using select * from test_table1, there still is no consistent 
order, the result returned varies between runs.
   In pyiceberg, it's good to see there is a fixed order, however, the order is 
not the Ascending order that we stated and want for the table. It's because we 
insert rows one by one, and they are in different data-files underlying.
   
   When we compact the table into one data-file with Trino `alter table 
test_table2 execute optimize;`. 
   The result both in trino and pyiceberg is fixed and ordered. that is because 
there is only one data-file and the result order is the one that how data is 
written into that data-file. 
   When specifying `sorted_by` properties, a local sort will be performed to 
ensure that in each data-file, there will be in order.
   
   Then we try to append the same amount of data into the sorted table with
   ```
   insert into test_table2 select * from test_table1;
   ```
   we got the result below:
   <img width="123" alt="截屏2023-11-09 12 04 17" 
src="https://github.com/apache/iceberg-python/assets/30164206/b05a3e25-eacd-45a1-8d63-7442fb3fe12d";>
   It's not correct, what we expected is 
   <img width="126" alt="截屏2023-11-09 12 04 43" 
src="https://github.com/apache/iceberg-python/assets/30164206/976d1022-4b1e-488c-8866-4f465db40176";>
   it's because the second insert store rows in another data-file, which is 
just concat with the one before.
   
   when using `alter table test_table2 execute optimize;` to merge them, the 
result is as expected.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [I] to_pandas() API which converts iceberg table scan to a pd.DataFrame will lost datetime data type and row order [iceberg-python]

Reply via email to