zeddit commented on issue #132: URL: https://github.com/apache/iceberg-python/issues/132#issuecomment-1802508681
@Fokko In conclusion, even though pyiceberg loads data deterministically, meaning the result is preserved between runs, the result is far from ordered. We therefore cannot get ordered results without changing how pyiceberg assembles the data blocks. However, it does point to a way to achieve this feature. The main problem is dealing with partitions: even if we perform a global sort, the final order in the manifest list is still random. I will describe what I found during my experiments below.

But first, let me argue for the importance of this feature, that is, why we need deterministic, in-order results for sorted tables. It mainly concerns time series analysis use cases such as quant finance, which is the one shown on the `pyiceberg` homepage. In these use cases the order of the data matters: every stock order and trade carries a timestamp and is always part of a time series, and a lot of data science work is done on these series to find trends and correlations in the time domain.
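
To make the distinction between deterministic and ordered results concrete, here is a minimal sketch in plain Python. It does not use pyiceberg at all: the `blocks` list, the `ts` and `px` fields, and the `concat_blocks` and `merge_sorted_blocks` helpers are hypothetical stand-ins for data blocks that are each sorted internally but arrive in a repeatable, unsorted sequence. Under that assumption, a k-way merge restores the global order without a full re-sort.

```python
import heapq
from itertools import chain
from operator import itemgetter

# Hypothetical data blocks: each is sorted by "ts" internally, but the
# sequence of blocks is not sorted and their time ranges overlap.
blocks = [
    [{"ts": 7, "px": 10.2}, {"ts": 9, "px": 10.4}],
    [{"ts": 1, "px": 10.0}, {"ts": 4, "px": 10.1}],
    [{"ts": 5, "px": 10.3}, {"ts": 8, "px": 10.5}],
]


def concat_blocks(blocks):
    """Concatenate blocks in arrival order: repeatable, but not globally sorted."""
    return list(chain.from_iterable(blocks))


def merge_sorted_blocks(blocks, key="ts"):
    """K-way merge of blocks that are already sorted on `key`."""
    return list(heapq.merge(*blocks, key=itemgetter(key)))


concatenated = [row["ts"] for row in concat_blocks(blocks)]
merged = [row["ts"] for row in merge_sorted_blocks(blocks)]

print(concatenated)  # [7, 9, 1, 4, 5, 8]
print(merged)        # [1, 4, 5, 7, 8, 9]
```

Because the blocks overlap in time here, no reordering of whole blocks would produce a sorted result, which is why the sketch merges rows rather than sorting the blocks themselves.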