Re: [PR] Spec: Clarify time travel implementation in Iceberg [iceberg]

via GitHub Mon, 22 Jul 2024 16:31:35 -0700


rdblue commented on code in PR #8982:
URL: https://github.com/apache/iceberg/pull/8982#discussion_r1687239911



##########
format/spec.md:
##########
@@ -1370,3 +1370,16 @@ Writing v2 metadata:
     * `sort_columns` was removed
 
 Note that these requirements apply when writing data to a v2 table. Tables that are upgraded from v1 may contain metadata that does not follow these requirements. Implementations should remain backward-compatible with v1 metadata requirements.
+
+## Appendix F: Implementation Notes
+
+This section covers topics not required by the specification but recommendations for systems implementing the Iceberg specification
+to help maintain a uniform experience.
+
+### Point in Time Reads (Time Travel)
+
+Iceberg supports two types of histories for tables. A history of previous "current snapshots" stored in ["snapshot-log" table metadata](#table-metadata-fields) and [parent-child lineage stored in "snapshots"](#table-metadata-fields). These two histories 
+might indicate different snapshot IDs for a specific timestamp. The discrepancies can be caused by a variety of table operations (e.g. updating the `current-snapshot-id` of the table).
+
+When processing point in time queries the Iceberg community has chosen to use "snapshot-log" metadata to lookup the table state

Review Comment:
   This spec is independent of the REST catalog protocol. The protocol defines 
how to exchange the table information described by this spec, while this spec 
defines how to track that information and, in this case, recommends how that 
information is used. Your suggestion assumes a condition, "when the catalog 
makes the snapshot history available in the metadata JSON", that always holds, 
because this spec defines the metadata JSON.
   
   I think it is valuable to say that engines are encouraged to use the 
information from `snapshot-log` for time travel by timestamp so that the 
results match what a query would have seen at that time. We made that choice 
for Spark because we think that is what users expect.
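
To illustrate the difference, here is a minimal sketch (not the Spark or Iceberg library implementation) of resolving a timestamp against `snapshot-log` versus walking the parent-child lineage from the current snapshot. The field names `timestamp-ms`, `snapshot-id`, and `parent-snapshot-id` follow the table metadata in the spec; the function names and the sample data (a rollback to snapshot 1 at t=250, followed by a new commit) are assumptions made up for the example.

```python
from typing import Optional


def snapshot_id_as_of(snapshot_log: list, timestamp_ms: int) -> Optional[int]:
    """Return the snapshot that was current at timestamp_ms per snapshot-log."""
    result = None
    for entry in sorted(snapshot_log, key=lambda e: e["timestamp-ms"]):
        if entry["timestamp-ms"] > timestamp_ms:
            break
        result = entry["snapshot-id"]
    return result


def ancestor_id_as_of(snapshots: list, current_id: int, timestamp_ms: int) -> Optional[int]:
    """Return the newest ancestor of the current snapshot committed at or before timestamp_ms."""
    by_id = {s["snapshot-id"]: s for s in snapshots}
    snapshot_id = current_id
    while snapshot_id is not None:
        snapshot = by_id[snapshot_id]
        if snapshot["timestamp-ms"] <= timestamp_ms:
            return snapshot_id
        snapshot_id = snapshot.get("parent-snapshot-id")
    return None


# Hypothetical table: snapshot 2 was committed, the table was rolled back
# to snapshot 1 at t=250, then snapshot 3 was committed on top of 1.
snapshots = [
    {"snapshot-id": 1, "timestamp-ms": 100},
    {"snapshot-id": 2, "timestamp-ms": 200, "parent-snapshot-id": 1},
    {"snapshot-id": 3, "timestamp-ms": 300, "parent-snapshot-id": 1},
]
snapshot_log = [
    {"timestamp-ms": 100, "snapshot-id": 1},
    {"timestamp-ms": 200, "snapshot-id": 2},
    {"timestamp-ms": 250, "snapshot-id": 1},
    {"timestamp-ms": 300, "snapshot-id": 3},
]

print(snapshot_id_as_of(snapshot_log, 220))     # 2: what a query saw at t=220
print(ancestor_id_as_of(snapshots, 3, 220))     # 1: lineage skips the rolled-back snapshot
```

In this made-up history, only the `snapshot-log` lookup returns the state a query would actually have read at t=220, which is the behavior described above.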



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


