rdblue opened a new pull request, #12672:
URL: https://github.com/apache/iceberg/pull/12672

   This adds support for first-row-id in manifests and manifest lists.
   
   Manifests are updated so that data files inherit/assign a `first-row-id` 
when the field is null, based on the record counts of previous data files. 
Currently, the `first-row-id` for a data file is based on the manifest's 
`first-row-id` and the number of records in files _without_ an assigned 
`first-row-id`. I think that this matches the expected behavior, which is based 
on the `record_count` of all `ADDED` data files. The spec states: "When 
reading, the first_row_id is assigned by replacing null with the manifest's 
first_row_id plus the sum of record_count for all added data files that 
preceded the file in the manifest."
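
As a rough illustration of the inheritance rule described above, here is a minimal Python sketch. It is not the code in this PR: `DataFile` and `inherit_first_row_ids` are hypothetical names, and a null field is modeled as `None`. It only assumes what the paragraph says, namely that a file with a null `first-row-id` gets the manifest's `first-row-id` plus the record counts of the preceding files that also had no assigned value.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a manifest entry's data file."""
    path: str
    record_count: int
    first_row_id: Optional[int] = None  # None models a null first-row-id


def inherit_first_row_ids(manifest_first_row_id: int,
                          files: List[DataFile]) -> List[DataFile]:
    """Fill in null first-row-id values in manifest order."""
    next_row_id = manifest_first_row_id
    result = []
    for f in files:
        if f.first_row_id is None:
            # Assign the running offset, then advance it by this file's rows.
            result.append(replace(f, first_row_id=next_row_id))
            next_row_id += f.record_count
        else:
            # Files with an assigned value keep it and do not move the offset.
            result.append(f)
    return result


files = [
    DataFile("a.parquet", 100),
    DataFile("b.parquet", 50, first_row_id=7),  # already assigned
    DataFile("c.parquet", 25),
]
assigned = inherit_first_row_ids(1000, files)
print([f.first_row_id for f in assigned])  # [1000, 7, 1100]
```

In this sketch the already-assigned file does not advance the offset, which is the "files _without_ an assigned `first-row-id`" behavior described above.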
   
   Manifest lists are updated so that `first-row-id` for a data manifest is 
always written, either because the manifest already has an assigned 
`first-row-id` or because a new one is assigned. The number of added records in a 
manifest assigned a new `first-row-id` is used to advance a `next-row-id`, which 
is either used for the next manifest or written as `next-row-id` in table 
metadata. This strategy advances the `next-row-id` by the number of added 
records in all new data manifests. **This is not what the spec currently says, 
so we may want to change this**. The spec states:
   
   > When adding a new data manifest file, its first_row_id field is assigned 
the value of the snapshot's first_row_id plus the sum of added_rows_count for 
all data manifests that preceded the manifest in the manifest list.
   
   I think this language would require allocating the number of rows in _added 
data files in the whole table_ for every commit, not just the added rows in the 
_new manifests_. What I've implemented allocates the number of added rows 
in the manifests that are assigned a new `first-row-id`, that is, in each new 
manifest.
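
To make the implemented strategy concrete, here is a small Python sketch under stated assumptions. `Manifest`, `is_data`, and `assign_manifest_first_row_ids` are hypothetical names chosen for illustration and are not the classes in this PR. The sketch only follows the description above: a data manifest without a `first-row-id` is assigned the current `next-row-id`, which then advances by that manifest's added row count.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Manifest:
    """Hypothetical stand-in for a manifest list entry."""
    path: str
    added_rows_count: int
    first_row_id: Optional[int] = None  # None models a null first-row-id
    is_data: bool = True                # False would model a delete manifest


def assign_manifest_first_row_ids(
        next_row_id: int,
        manifests: List[Manifest]) -> Tuple[List[Manifest], int]:
    """Assign first-row-id to unassigned data manifests, in list order."""
    result = []
    for m in manifests:
        if m.is_data and m.first_row_id is None:
            result.append(replace(m, first_row_id=next_row_id))
            # Only newly assigned manifests advance next-row-id.
            next_row_id += m.added_rows_count
        else:
            result.append(m)
    return result, next_row_id


manifests = [
    Manifest("m1.avro", added_rows_count=300, first_row_id=0),  # existing
    Manifest("m2.avro", added_rows_count=40),                   # new
    Manifest("m3.avro", added_rows_count=60),                   # new
]
written, new_next_row_id = assign_manifest_first_row_ids(500, manifests)
print([m.first_row_id for m in written])  # [0, 500, 540]
print(new_next_row_id)                    # 600
```

Here the commit allocates 100 row IDs (600 minus 500), the added rows of the two new manifests. By contrast, summing `added_rows_count` over every data manifest in this list gives 400, which is the larger allocation that the paragraph above reads the current spec wording as implying.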
   
   This PR also includes changes from #12593 and will be rebased when it is 
merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

