Selinix opened a new issue, #11618:
URL: https://github.com/apache/iceberg/issues/11618

   ### Query engine
   
   Spark for loading, Trino for querying
   
   ### Question
   
   Hi,
   
   I’m looking for guidance on the most efficient way to maintain the full history of events and query their latest versions without keeping redundant copies of the data.
   
   The use case is to query either:
   1. All versions of an event (e.g., `SELECT * FROM full_hist WHERE id = 'XXX'`)
   2. Only the latest version of an event (e.g., `SELECT * FROM latest_slice WHERE id = 'XXX'`)
   
   The latest version is determined by the maximum value in a `version` field for each `id`.
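
   To make the "latest version" rule concrete, here is a minimal sketch of `latest_slice` defined as a view over `full_hist`. It is only an illustration under assumptions: SQLite (through Python's `sqlite3`) stands in for Trino, the `payload` column and the sample rows are hypothetical, and the join on `MAX(version)` is just one possible way to write the rule.

```python
import sqlite3

# SQLite stands in for the real query engine; `payload` and the sample
# rows are hypothetical, only `id` and `version` come from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE full_hist (id TEXT, version INTEGER, payload TEXT);

    -- latest_slice keeps, for each id, only the row with the maximum version.
    CREATE VIEW latest_slice AS
    SELECT h.id, h.version, h.payload
    FROM full_hist AS h
    JOIN (
        SELECT id, MAX(version) AS max_version
        FROM full_hist
        GROUP BY id
    ) AS m
      ON h.id = m.id AND h.version = m.max_version;
""")

conn.executemany(
    "INSERT INTO full_hist VALUES (?, ?, ?)",
    [("XXX", 1, "created"), ("XXX", 2, "updated"), ("YYY", 1, "created")],
)

# Query 1: all versions of an event.
all_versions = conn.execute(
    "SELECT id, version, payload FROM full_hist WHERE id = 'XXX' ORDER BY version"
).fetchall()

# Query 2: only the latest version of the same event.
latest_only = conn.execute(
    "SELECT id, version, payload FROM latest_slice WHERE id = 'XXX'"
).fetchall()
```

   With a view like this only one copy of the data is stored, but every query against `latest_slice` has to compute the per-`id` maximum at read time.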
   
   **Questions**
   1. Is it better to maintain:
      - A single table with full history that is periodically deduplicated into a separate `latest_slice` table?
      - Or a single full history table with a view that computes the latest versions dynamically?
   2. If the latter, do optimization techniques such as partitioning and sort ordering on the full history table significantly improve performance when querying the latest versions?
   3. Given the preference to store only one copy of the data, what is the most performant and practical solution for this scenario?
   
   Thank you for your guidance!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

