amogh-jahagirdar commented on code in PR #13061: URL: https://github.com/apache/iceberg/pull/13061#discussion_r2114565265
########## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/ManifestFileBean.java: ##########

```diff
@@ -46,6 +47,7 @@ public static ManifestFileBean fromManifest(ManifestFile manifest) {
     bean.setAddedSnapshotId(manifest.snapshotId());
     bean.setContent(manifest.content().id());
     bean.setSequenceNumber(manifest.sequenceNumber());
+    bean.setFirstRowId(manifest.firstRowId());
```

Review Comment:
   Ok @RussellSpitzer, I looked into this a bit more. The issue I pointed out earlier regarding https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/BaseSparkAction.java#L156 is that when reading the manifests metadata table into a dataframe, we cannot deserialize its records into the `ManifestFileBean` structure, because the encoder expects the field to exist in the source records (the value may be null, but the column must be present).

   As a result, the main decision is whether to keep relying on `ManifestFileBean` in the base Spark procedure I linked, or to compose a slightly slimmed-down structure with a more minimal field set and all the getters/setters. My take is that we should continue re-using it; it's fine for some fields to lack explicit getters, since that means they are not always required from the `Encoder`'s perspective. Additionally, we'll need to add `first_row_id` to the manifests metadata table anyway. At that point we _could_ add the getter back, because we could then project that field when reading into this structure. But that still doesn't seem quite right, because many of the procedures don't need to project that field at all.

   TL;DR, I think there are three options (I'm preferring 1):

   1. Leave things as they are: specific fields like `first_row_id` won't have explicit getters, because deserializing into the bean via the `Encoder` creates the expectation that the record actually has that field, which isn't always true.
   2. Add a custom slimmed-down version of `ManifestFileBean` for the Spark case, or the inverse: introduce a bean that composes the existing bean plus a `first_row_id` for the distributed-planning case.
   3. Add the getter back when working on manifests metadata table support for `first_row_id`, and change the projection in the Spark procedure. This seems like a backwards way of addressing the issue, imo.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
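To illustrate the getter/setter asymmetry the options above hinge on: Spark's bean `Encoder` derives its schema from JavaBean-style properties, so a field exposed only through a setter is write-only and does not force the corresponding column to exist in the source rows. The sketch below is a minimal, hypothetical stand-in (the bean and its fields are invented for illustration, not Iceberg's actual `ManifestFileBean`) that uses the stdlib `java.beans.Introspector` to show how a setter-only field surfaces as a non-readable property.

```java
import java.beans.BeanInfo;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;

public class BeanIntrospectionDemo {

  // Hypothetical slimmed-down bean: `path` has a getter and setter,
  // while `firstRowId` has only a setter (mirroring option 1 above).
  public static class DemoManifestBean {
    private String path;
    private Long firstRowId;

    public String getPath() {
      return path;
    }

    public void setPath(String path) {
      this.path = path;
    }

    // Setter only: JavaBean introspection reports this property as
    // write-only, so an encoder built on bean properties need not
    // require a matching column when serializing the bean back out.
    public void setFirstRowId(Long firstRowId) {
      this.firstRowId = firstRowId;
    }
  }

  public static void main(String[] args) throws Exception {
    BeanInfo info = Introspector.getBeanInfo(DemoManifestBean.class, Object.class);
    for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
      System.out.println(
          pd.getName()
              + " readable=" + (pd.getReadMethod() != null)
              + " writable=" + (pd.getWriteMethod() != null));
    }
  }
}
```

Running this prints `firstRowId` as writable but not readable, while `path` is both, which is the property-level distinction the review comment relies on when arguing that omitting a getter keeps the field optional from the `Encoder`'s perspective.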