zeddit commented on issue #132: URL: https://github.com/apache/iceberg-python/issues/132#issuecomment-1798481255
Thanks a lot for your help. I'm sorry that I don't know much about how to perform a global sort. Could you please point me to some documentation on how to do it with Trino or PyIceberg? I will try it and give you feedback.

I have done some research on sorting and ordering. On Slack I was told that Iceberg does not guarantee row order, even when `sorted_by` is used to create the table. Sorting is just a way of grouping rows with similar values, so that each data file's statistics cover a narrower range. This can benefit queries because more data files can be skipped, which reduces IO.

I am not sure whether that view is right, but it seems reasonable to me: Iceberg comes from the big data world, where computation is distributed and workers cooperate to aggregate and reduce a final result. A strict ordering is of little use there, because the data is split and we don't know which worker will finish first.

However, I think the situation is different for `pyiceberg`. It is a single point where the data comes out, with limited CPU and memory, so it needs to return the data in the same order as it is stored in the table, e.g. with the rows in order. I am not sure whether pyiceberg will keep the order when I use a global sort. I will test it. Thanks again.
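
To make the point about statistics and file skipping concrete, here is a minimal pure-Python sketch. It does not use Iceberg, Trino, or PyIceberg. The "data files" are plain dicts, and the helper names `split_into_files` and `files_to_read` are hypothetical, invented only for this illustration. It assumes a reader that prunes files by comparing a range filter against per-file min/max values, and it shows that an explicit sort on the client side is the only thing that guarantees order once files are read back in arbitrary order.

```python
import random


def split_into_files(values, rows_per_file):
    """Chunk rows into toy 'data files' and record min/max statistics per file."""
    files = []
    for i in range(0, len(values), rows_per_file):
        chunk = values[i:i + rows_per_file]
        files.append({"rows": chunk, "min": min(chunk), "max": max(chunk)})
    return files


def files_to_read(files, lo, hi):
    """Keep only files whose [min, max] range overlaps the filter lo <= x <= hi."""
    return [f for f in files if f["max"] >= lo and f["min"] <= hi]


random.seed(0)
values = list(range(1000))
random.shuffle(values)

# Same rows, written once in arrival order and once after a global sort.
unsorted_files = split_into_files(values, 100)
sorted_files = split_into_files(sorted(values), 100)

# A narrow range filter: sorted files have tight min/max ranges, so most are skipped.
unsorted_hits = files_to_read(unsorted_files, 250, 260)
sorted_hits = files_to_read(sorted_files, 250, 260)
print("files read without sorting:", len(unsorted_hits))
print("files read with global sort:", len(sorted_hits))

# Files may be read back in any order, so a reader that needs ordered rows
# sorts explicitly instead of relying on how the files were written.
shuffled_files = sorted_files[:]
random.shuffle(shuffled_files)
rows = [r for f in shuffled_files for r in f["rows"]]
ordered = sorted(rows)
```

In this toy model the global sort only reduces how many files match the filter. The order of the rows that come back still depends on the order in which the files are read, which is why the last step sorts again.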