amogh-jahagirdar commented on PR #12592: URL: https://github.com/apache/iceberg/pull/12592#issuecomment-2741759994
Next steps:

1. I'll primarily be looking at the Spark plan side of things. This means handling the rest of the cases (CoW/pure appends). To begin with, I'll focus on tests that don't rely on the core inheritance changes existing. These tests would explicitly write out Parquet files with different null/explicit values, merge against them, and assert which rowIds/sequenceNumbers in the output file are null or not.

2. We'll separate the core/reader changes from this PR. cc @rdblue This includes the following:
   a. Writing out the first row ID for every manifest in the manifest list (in SnapshotProducer).
   b. Inheriting the first row ID for manifests, and passing the appropriate firstRowId through to data files during task planning (this also needs to just work with distributed planning).
   c. Plumbing the lastUpdatedSequenceNumber and first row ID for a data file down to the file reader (can be pushed through via `idToConstant`) and having the reader surface the right row_Id/last_updated_sequence_number. The reader needs to either surface the value in the file if it's present or, if it's null, compute the row ID.
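
The reader behavior in 2c could look roughly like the sketch below. It is illustrative Python, not the Iceberg reader code: the function names are hypothetical, and it assumes a missing row ID is computed as the file's first row ID plus the row's position in the file, and that a missing last-updated sequence number falls back to the data file's sequence number.

```python
from typing import Optional


def resolve_row_id(
    stored_row_id: Optional[int],
    first_row_id: Optional[int],
    pos: int,
) -> Optional[int]:
    """Return the row ID a reader would surface for one row (hypothetical helper)."""
    # A value materialized in the file always wins.
    if stored_row_id is not None:
        return stored_row_id
    # Without an inherited first row ID there is nothing to compute from.
    if first_row_id is None:
        return None
    # Assumption: computed row ID = file's first row ID + row position in the file.
    return first_row_id + pos


def resolve_last_updated_seq(
    stored_seq: Optional[int],
    file_seq: Optional[int],
) -> Optional[int]:
    """Return the last-updated sequence number a reader would surface."""
    # Assumption: a null stored value inherits the data file's sequence number.
    return stored_seq if stored_seq is not None else file_seq


# Rows as (stored_row_id, stored_seq) pairs; position is the index in the file.
rows = [(None, None), (42, 3), (None, None)]
resolved = [
    (resolve_row_id(rid, 1000, pos), resolve_last_updated_seq(seq, 7))
    for pos, (rid, seq) in enumerate(rows)
]
```

The tests in item 1 would exercise this kind of mix: some rows carrying explicit values and some carrying nulls, with expectations on which ones stay null or get filled in.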