zeddit commented on issue #132:
URL: https://github.com/apache/iceberg-python/issues/132#issuecomment-1802508681

   @Fokko  In conclusion, although pyiceberg loads data deterministically, 
so the results are the same between runs, the results are far from ordered. 
We cannot get ordered results without changing how pyiceberg assembles the 
data blocks.
   
   However, this does point to a way to achieve the feature.
   
   The main problem is handling partitions: even if we perform a global sort, 
the final order in the manifest list is still arbitrary. I will describe what 
I found in my experiments below.
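
   To make the problem concrete, here is a minimal toy sketch in plain Python. 
It does not use pyiceberg at all: the `blocks` list is a hypothetical stand-in 
for data files, and it assumes each block is already sorted by timestamp while 
the order of the blocks themselves is arbitrary. It only illustrates why a 
repeatable load order is not the same as a sorted one, and how a k-way merge 
could restore the global order under that assumption.

```python
import heapq

# Hypothetical stand-in for data files: each inner list is one block whose
# rows (timestamps) are already sorted; the outer order is arbitrary.
blocks = [
    [5, 6, 9],
    [1, 2, 8],
    [3, 4, 7],
]

# Concatenating the blocks in their listed order is repeatable between runs,
# but the result is not globally sorted.
concatenated = [ts for block in blocks for ts in block]

# A k-way merge restores the global order, relying on each block being
# internally sorted (an assumption of this sketch).
merged = list(heapq.merge(*blocks))

print(concatenated)
print(merged)
```

   If the blocks were not internally sorted, the merge would not be enough and 
a full sort of the loaded data would be needed instead.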
   
   But first, let me argue why this feature matters, that is, why we need 
deterministic, in-order results for sorted tables.
   It is mainly about time series analysis use cases, e.g. quant finance, 
which is the one shown on the `pyiceberg` homepage. In these use cases the 
order of the data matters: every stock order and trade has a timestamp, and 
together they always form a time series. A lot of data science work is done 
on these series to find trends and relationships in the time domain.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

