Re: [I] to_pandas() API which converts iceberg table scan to a pd.DataFrame will lost datetime data type and row order [iceberg-python]

via GitHub Tue, 07 Nov 2023 05:16:04 -0800


zeddit commented on issue #132:
URL: https://github.com/apache/iceberg-python/issues/132#issuecomment-1798481255


   Thanks a lot for your help.
   
   I am sorry that I don't know much about how to perform a global sort. 
Could you please point me to some documentation on how to do it with Trino 
or pyiceberg? I will try it and give you feedback.
   
   I have done some research on sorting and ordering. On Slack I was told 
that Iceberg does not guarantee row order, even when the table is created 
with `sorted_by`. Sorting is just a way of grouping rows with similar values 
together, so that the per-file statistics let a query skip more data files 
and reduce IO.
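
   To illustrate the file-skipping point, here is a minimal sketch in plain 
Python. It does not use Iceberg at all: the "data files" are just lists, and 
the helper names `split_into_files` and `files_to_read` are hypothetical. It 
only assumes that a reader can skip a file when the file's min/max range does 
not overlap the query range.

```python
import random


def split_into_files(values, rows_per_file):
    """Chunk values into simulated data files and record min/max per file."""
    files = []
    for start in range(0, len(values), rows_per_file):
        chunk = values[start:start + rows_per_file]
        files.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return files


def files_to_read(files, lo, hi):
    """Count files whose [min, max] range overlaps the query range [lo, hi]."""
    return sum(1 for f in files if f["max"] >= lo and f["min"] <= hi)


random.seed(0)
shuffled = list(range(1000))
random.shuffle(shuffled)

# Same rows, written once in arrival order and once after sorting.
unsorted_files = split_into_files(shuffled, 100)
sorted_files = split_into_files(sorted(shuffled), 100)

# Query: rows with 250 <= value <= 260.
unsorted_hits = files_to_read(unsorted_files, 250, 260)
sorted_hits = files_to_read(sorted_files, 250, 260)

print("files read without sorting:", unsorted_hits)
print("files read with sorting:", sorted_hits)
```

   With sorted input each file covers a narrow, non-overlapping range, so only 
one file overlaps the query. Nothing in this sketch says in which order the 
rows come back, which matches what I was told on Slack.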
   
   I am not sure whether that view is right, but it seems reasonable to me: 
Iceberg comes from the big data world, where computation is distributed and 
workers cooperate to aggregate and reduce a final result. A strict row order 
is of little use there, because the data is split and we don't know which 
worker will finish first.
   
   However, I think the situation is different for `pyiceberg`: it is a 
single point that reads the data out, with limited CPU and memory, so it 
should return the data in the same view as it is stored in the database, 
e.g. with rows in order.
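
   As a rough sketch of what "rows in order" could mean for a single reader, 
the plain-Python example below merges per-file results by a sort key. It 
assumes that each file is already sorted internally by that key, which is an 
assumption and not something Iceberg or pyiceberg is confirmed to guarantee. 
The names `read_in_arrival_order` and `read_in_key_order` are hypothetical and 
are not part of the pyiceberg API.

```python
import heapq
import random

# Simulated per-file results as (id, payload) rows.
# Assumption: each file is sorted by id on its own.
file_a = [(1, "a"), (4, "d"), (7, "g")]
file_b = [(2, "b"), (5, "e"), (8, "h")]
file_c = [(3, "c"), (6, "f"), (9, "i")]
chunks = [file_a, file_b, file_c]


def read_in_arrival_order(chunks, seed):
    """Concatenate chunks in a random order, like tasks finishing at random."""
    order = list(chunks)
    random.Random(seed).shuffle(order)
    return [row for chunk in order for row in chunk]


def read_in_key_order(chunks):
    """K-way merge of individually sorted chunks into one globally sorted list."""
    return list(heapq.merge(*chunks, key=lambda row: row[0]))


arrival = read_in_arrival_order(chunks, seed=1)
merged = read_in_key_order(chunks)

print("arrival order ids:", [row[0] for row in arrival])
print("merged order ids: ", [row[0] for row in merged])
```

   The merge only needs one row per file in memory at a time, so it would fit 
a reader with limited memory. If the files are not sorted internally, an 
explicit sort on the key after reading is the safer choice.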
   
   I am not sure whether pyiceberg keeps the order when I use a global sort; 
I will test it.
   Thanks a lot.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


