Re: [PR] Spec: Clarify identity partition edge cases. [iceberg]

via GitHub Thu, 01 Aug 2024 12:02:43 -0700


rdblue commented on code in PR #10835:
URL: https://github.com/apache/iceberg/pull/10835#discussion_r1700684796



##########
format/spec.md:
##########
@@ -1393,4 +1398,8 @@ This section covers topics not required by the specification but recommendations
 Iceberg supports two types of histories for tables. A history of previous "current snapshots" stored in ["snapshot-log" table metadata](#table-metadata-fields) and [parent-child lineage stored in "snapshots"](#table-metadata-fields). These two histories 
 might indicate different snapshot IDs for a specific timestamp. The discrepancies can be caused by a variety of table operations (e.g. updating the `current-snapshot-id` can be used to set the snapshot of a table to any arbitrary snapshot, which might have a lineage derived from a table branch or no lineage at all).
 
-When processing point in time queries implementations should use "snapshot-log" metadata to lookup the table state at the given point in time. This ensures time-travel queries reflect the state of the table at the provided timestamp. For example a SQL query like `SELECT * FROM prod.db.table TIMESTAMP AS OF '1986-10-26 01:21:00Z';` would find the snapshot of the Iceberg table just prior to '1986-10-26 01:21:00 UTC' in the snapshot logs and use the metadata from that snapshot to perform the scan of the table. If no  snapshot exists prior to the timestamp given or "snapshot-log" is not populated (it is an optional field), then systems should raise an informative error message about the missing metadata.
\ No newline at end of file
+When processing point in time queries implementations should use "snapshot-log" metadata to lookup the table state at the given point in time. This ensures time-travel queries reflect the state of the table at the provided timestamp. For example a SQL query like `SELECT * FROM prod.db.table TIMESTAMP AS OF '1986-10-26 01:21:00Z';` would find the snapshot of the Iceberg table just prior to '1986-10-26 01:21:00 UTC' in the snapshot logs and use the metadata from that snapshot to perform the scan of the table. If no  snapshot exists prior to the timestamp given or "snapshot-log" is not populated (it is an optional field), then systems should raise an informative error message about the missing metadata.
+
+### Writing data files
+
+All columns should be written to data files even if they introduce redundancy with metadata stored in manifest file (e.g. columns with identity partition transforms). Writing all columns provides redundancy in case of corruption or bugs in the metadata layer.

Review Comment:
   I think that this should be in the main spec. Writers are required to write 
all columns, but readers must handle cases where columns are missing. Those are 
real requirements, not conventions.
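
   For reference, the point-in-time lookup described in the quoted hunk above could be sketched roughly as follows. This is only an illustration, not the reference implementation: `snapshot_id_as_of` and `SnapshotLookupError` are hypothetical names, the log is assumed to be the parsed "snapshot-log" list of entries with `timestamp-ms` and `snapshot-id` keys, and "just prior to" is assumed to include an entry whose timestamp equals the requested one.

```python
class SnapshotLookupError(Exception):
    """Hypothetical error type for a failed point-in-time lookup."""


def snapshot_id_as_of(snapshot_log, timestamp_ms):
    """Return the snapshot ID that was current at timestamp_ms.

    snapshot_log: parsed "snapshot-log" entries, each a dict with
    "timestamp-ms" and "snapshot-id" keys (assumed layout).
    """
    # "snapshot-log" is optional, so it may be missing or empty.
    if not snapshot_log:
        raise SnapshotLookupError(
            "snapshot-log is not populated; cannot resolve a point-in-time query"
        )

    best = None
    for entry in snapshot_log:
        # Assumption: an entry at exactly timestamp_ms counts as a match.
        if entry["timestamp-ms"] <= timestamp_ms:
            # On equal timestamps, the later log entry wins.
            if best is None or entry["timestamp-ms"] >= best["timestamp-ms"]:
                best = entry

    if best is None:
        raise SnapshotLookupError(
            f"no snapshot-log entry at or before timestamp-ms={timestamp_ms}"
        )
    return best["snapshot-id"]
```

   Note that the sketch only consults "snapshot-log" and never walks the parent-child lineage, since the two histories may disagree for a given timestamp.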



##########
format/spec.md:
##########
@@ -591,11 +597,10 @@ For example, an `events` table with a timestamp column named `ts` that is partit
 
 Scan predicates are also used to filter data and delete files using column bounds and counts that are stored by field id in manifests. The same filter logic can be used for both data and delete files because both store metrics of the rows either inserted or deleted. If metrics show that a delete file has no rows that match a scan predicate, it may be ignored just as a data file would be ignored [2].
 
-Data files that match the query filter must be read by the scan. 
+Data files that match the query filter must be read by the scan.

Review Comment:
   Nit: unnecessary whitespace change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

