amogh-jahagirdar commented on code in PR #13061:
URL: https://github.com/apache/iceberg/pull/13061#discussion_r2114565265


##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/ManifestFileBean.java:
##########
@@ -46,6 +47,7 @@ public static ManifestFileBean fromManifest(ManifestFile manifest) {
     bean.setAddedSnapshotId(manifest.snapshotId());
     bean.setContent(manifest.content().id());
     bean.setSequenceNumber(manifest.sequenceNumber());
+    bean.setFirstRowId(manifest.firstRowId());

Review Comment:
   Ok @RussellSpitzer, I looked into it a bit more here. The issue I pointed 
out earlier regarding 
https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/BaseSparkAction.java#L156
 is that when reading the manifests metadata table into a dataframe, we cannot 
deserialize the records into the `ManifestFileBean` structure: the encoder 
expects the field to exist in the source records (the value may be null, but 
the column must exist). 
   
   As a result, I think the main decision is whether we want to keep relying on 
this `ManifestFileBean` in the base Spark procedure I linked, or compose a 
slightly slimmed-down structure with a more minimal field set and all the 
getters/setters. 
   
   My take is that we should just continue re-using it, and it's fine if some 
fields don't have explicit getters, since that means they aren't always 
required from the `Encoder`'s perspective. 
   
   Additionally, we'll need to add first_row_id to the manifests metadata table 
anyway. At that point, we _could_ add the getter back, because then we could 
project that field when reading into this structure. But that still doesn't 
seem quite right, because many of the procedures don't even need to project 
that field.
   
   TL;DR, I think there are 3 options (I prefer option 1):
   
   1. Leave things as is: specific fields like first_row_id won't have explicit 
getters, because deserializing into the bean via the `Encoder` creates the 
expectation that the record actually has that field, which isn't always true. 
   
   2. Add a custom slimmed-down version of `ManifestFileBean` for the Spark 
case, or the inverse: introduce a bean that composes the existing bean plus a 
first_row_id for the distributed-planning case.
   
   3. Add the getter back when working on manifests metadata table support for 
first_row_id and change the projection in the Spark procedure. This seems like 
a backwards way of addressing the issue, imo.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

