Re: [PR] Prototyping Spark 3.4 row lineage [iceberg]

via GitHub Thu, 20 Mar 2025 14:58:19 -0700


amogh-jahagirdar commented on PR #12592:
URL: https://github.com/apache/iceberg/pull/12592#issuecomment-2741759994


   Next steps:
   
   1. I'll primarily be looking at the Spark plan side of things. That means 
handling the rest of the cases (CoW/pure appends). To begin with, I'll focus on 
tests that don't rely on the core inheritance changes existing. These tests 
would explicitly write out Parquet files with different null/explicit values; 
we'd then test merging against those files and assert which 
rowIds/sequenceNumbers in the output file are null and which are not.
   
   2. We'll separate the core/reader changes from this PR. cc @rdblue  This 
includes the following:
      a. Writing out the first row ID for every manifest in the manifest list 
(in SnapshotProducer)
      b. Inheriting the first row ID for manifests, and passing the appropriate 
firstRowId through to data files during task planning (this also needs to work 
with distributed planning)
      c. Plumbing the lastUpdatedSequenceNumber and first row ID for a data file 
down to the file reader (can be pushed through via `idToConstant`) and having 
the reader surface the right row_id/last_updated_sequence_number. For each row, 
the reader needs to surface the value stored in the file if it's present or, if 
it's null, compute the row ID from the file's first row ID.
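
   The inheritance described in (b) and (c) can be sketched roughly as follows. 
This is a simplified model in Python, not the Iceberg implementation: the 
`DataFile` class and the `assign_first_row_ids` and `read_lineage` helpers are 
hypothetical names made up for illustration. It also assumes that a null row ID 
resolves to the file's first row ID plus the row's position in the file, and 
that a null sequence number resolves to the file's sequence number.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DataFile:
    """Hypothetical stand-in for a data file entry in a manifest."""
    record_count: int
    sequence_number: int
    first_row_id: Optional[int] = None  # None means "inherit from the manifest"


def assign_first_row_ids(manifest_first_row_id: int, files: List[DataFile]) -> None:
    """Step (b): hand each file without a first row ID the next free ID range."""
    next_row_id = manifest_first_row_id
    for f in files:
        if f.first_row_id is None:
            f.first_row_id = next_row_id
            next_row_id += f.record_count


def read_lineage(
    f: DataFile,
    stored: List[Tuple[Optional[int], Optional[int]]],
) -> List[Tuple[int, int]]:
    """Step (c): surface stored values when present, otherwise compute them.

    `stored` holds one (row_id, last_updated_sequence_number) pair per row,
    in file order, with None where the file has a null.
    """
    result = []
    for pos, (row_id, seq) in enumerate(stored):
        if row_id is None:
            # Assumed rule: first row ID of the file plus the row position.
            row_id = f.first_row_id + pos
        if seq is None:
            # Assumed rule: inherit the sequence number of the file.
            seq = f.sequence_number
        result.append((row_id, seq))
    return result


# Two new files in a manifest whose first row ID is 1000.
file_a = DataFile(record_count=3, sequence_number=7)
file_b = DataFile(record_count=2, sequence_number=7)
assign_first_row_ids(1000, [file_a, file_b])

# file_b: the first row has nulls, the second carries explicit values.
rows_b = read_lineage(file_b, [(None, None), (42, 5)])
```

   Here `file_b` would start at row ID 1003, so its first row resolves to 
(1003, 7) while the second row keeps its explicit (42, 5). Tests along the 
lines of step 1 could assert exactly this kind of null-versus-explicit outcome.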


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


