Selinix opened a new issue, #11618: URL: https://github.com/apache/iceberg/issues/11618
### Query engine

Spark for loading, Trino for querying

### Question

Hi,

I’m looking for guidance on the most efficient solution for maintaining full history and querying the latest versions of events without maintaining redundant copies of the data.

A use case is to be able to query either:

1. All versions of an event (e.g., `SELECT * FROM full_hist WHERE id = 'XXX'`)
2. Only the latest version of an event (e.g., `SELECT * FROM latest_slice WHERE id = 'XXX'`)

The latest version is determined by the maximum value in a `version` field for each `id`.

**Questions**

1. Is it better to maintain:
   - A single table with full history and periodically deduplicate it into a separate `latest_slice` table?
   - Or a single full history table with a view that computes the latest versions dynamically?
2. If the latter, does applying optimization techniques like partitioning, sorting, and ordering on the full history table significantly improve performance for querying the latest versions?
3. Given the preference to store only one copy of the data, what is the most performant and practical solution for this scenario?

Thank you for your guidance!
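
To make the second option (one full history table plus a view) concrete, here is a minimal sketch of what I have in mind. It is only an illustration and runs on Python's built-in `sqlite3` as a stand-in, not on Iceberg, Spark, or Trino. The `payload` column, the sample rows, and the helper function names are made up for the example; only `full_hist`, `latest_slice`, `id`, and `version` come from the description above. It shows the intended query semantics and says nothing about performance on Iceberg.

```python
import sqlite3

# Hypothetical stand-in for the Iceberg table: an in-memory SQLite database.
# Only `id` and `version` come from the question; `payload` is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE full_hist (id TEXT, version INTEGER, payload TEXT);

INSERT INTO full_hist VALUES
    ('A', 1, 'a-v1'), ('A', 2, 'a-v2'),
    ('B', 1, 'b-v1'),
    ('C', 1, 'c-v1'), ('C', 2, 'c-v2'), ('C', 3, 'c-v3');

-- The view keeps, for each id, the row whose version equals MAX(version).
-- No second copy of the data is stored; the view is computed at query time.
CREATE VIEW latest_slice AS
SELECT h.id, h.version, h.payload
FROM full_hist AS h
JOIN (
    SELECT id, MAX(version) AS version
    FROM full_hist
    GROUP BY id
) AS m
  ON h.id = m.id AND h.version = m.version;
""")


def all_versions(event_id):
    """Return every stored version of one event, oldest first."""
    return conn.execute(
        "SELECT id, version, payload FROM full_hist WHERE id = ? ORDER BY version",
        (event_id,),
    ).fetchall()


def latest_version(event_id):
    """Return only the highest-version row of one event, via the view."""
    return conn.execute(
        "SELECT id, version, payload FROM latest_slice WHERE id = ?",
        (event_id,),
    ).fetchall()


if __name__ == "__main__":
    print(all_versions("C"))
    print(latest_version("C"))
```

The same view could also be written with `ROW_NUMBER() OVER (PARTITION BY id ORDER BY version DESC)` and a filter on the first row. I used the `MAX`/`GROUP BY` join here only because it mirrors the "maximum `version` per `id`" rule directly.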