rdblue opened a new pull request, #12672: URL: https://github.com/apache/iceberg/pull/12672
This adds support for `first-row-id` in manifests and manifest lists.

Manifests are updated so that data files inherit or are assigned a `first-row-id` when the field is null, based on the record counts of previous data files. Currently, the `first-row-id` for a data file is the manifest's `first-row-id` plus the number of records in preceding files _without_ an assigned `first-row-id`. I think that this matches the expected behavior, which is based on the `record_count` of all `ADDED` data files. The spec states:

> When reading, the first_row_id is assigned by replacing null with the manifest's first_row_id plus the sum of record_count for all added data files that preceded the file in the manifest.

Manifest lists are updated so that `first-row-id` for a data manifest is always written, either because the manifest already has an assigned `first-row-id` or by assigning a new one. The number of added records in a manifest that is assigned a new `first-row-id` is used to advance a `next-row-id`, which is then used either for the next manifest or as `next-row-id` in table metadata.

This strategy advances `next-row-id` by the number of added records in all new data manifests. **This is not what the spec currently says, so we may want to change this**. The spec states:

> When adding a new data manifest file, its first_row_id field is assigned the value of the snapshot's first_row_id plus the sum of added_rows_count for all data manifests that preceded the manifest in the manifest list.

I think this language would require allocating the number of rows in _added data files in the whole table_ for every commit, not just the added rows in the _new manifests_. What I've implemented allocates only the added rows in manifests that are assigned a new `first-row-id`, which in practice means the new manifests.

This PR also includes changes from #12593 and will be rebased when it is merged.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
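
The two assignment rules in the PR description above can be illustrated with a minimal Python sketch. This is not the PR's Java code: `DataFile`, `Manifest`, and all function names are hypothetical stand-ins, and the numbers are made up. It assumes that only files and manifests with a null `first-row-id` receive a new value, as the description states.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DataFile:
    # Hypothetical stand-in for a manifest entry, not an Iceberg class.
    status: str  # "ADDED" or "EXISTING"
    record_count: int
    first_row_id: Optional[int] = None


@dataclass
class Manifest:
    # Hypothetical stand-in for a data manifest entry in a manifest list.
    added_rows_count: int
    first_row_id: Optional[int] = None


def inherit_as_described(manifest_first_row_id: int, files: List[DataFile]) -> List[int]:
    """Rule from the PR description: advance only past files that had no first-row-id."""
    next_id = manifest_first_row_id
    ids = []
    for f in files:
        if f.first_row_id is None:
            ids.append(next_id)
            next_id += f.record_count
        else:
            ids.append(f.first_row_id)
    return ids


def inherit_as_quoted(manifest_first_row_id: int, files: List[DataFile]) -> List[int]:
    """Rule from the quoted spec sentence: offset by all preceding ADDED files."""
    preceding_added = 0
    ids = []
    for f in files:
        if f.first_row_id is None:
            ids.append(manifest_first_row_id + preceding_added)
        else:
            ids.append(f.first_row_id)
        if f.status == "ADDED":
            preceding_added += f.record_count
    return ids


def assign_manifests(manifests: List[Manifest], next_row_id: int) -> Tuple[List[int], int]:
    """Manifest-list strategy from the description: only manifests that are
    assigned a new first-row-id advance next-row-id."""
    ids = []
    for m in manifests:
        if m.first_row_id is None:
            ids.append(next_row_id)
            next_row_id += m.added_rows_count
        else:
            ids.append(m.first_row_id)
    return ids, next_row_id


files = [DataFile("EXISTING", 10, first_row_id=0), DataFile("ADDED", 100), DataFile("ADDED", 50)]
print(inherit_as_described(1000, files))  # [0, 1000, 1100]
print(inherit_as_quoted(1000, files))     # [0, 1000, 1100]

manifests = [Manifest(500, first_row_id=0), Manifest(200), Manifest(300)]
print(assign_manifests(manifests, next_row_id=500))  # ([0, 500, 700], 1000)
```

In this sketch the two data-file rules agree because exactly the `ADDED` files have a null `first-row-id`; they could diverge if an `ADDED` file already carried one. For the manifest list, `next-row-id` advances by 500 (the rows in the two new manifests) and the existing manifest's 500 added rows are not counted again.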