wypoon commented on PR #10935: URL: https://github.com/apache/iceberg/pull/10935#issuecomment-2294958003
@pvary thank you for your interest and for the Flink scenarios, which are very helpful as I am unfamiliar with Flink.

Regarding https://github.com/apache/iceberg/pull/9888, please read my comments there. I put up https://github.com/apache/iceberg/pull/10954 only as a reference for @manuzhang, so he can see the tests I added which fail with the implementation of `BaseIncrementalChangelogScan` in https://github.com/apache/iceberg/pull/9888. https://github.com/apache/iceberg/pull/10954 is not really for consideration or review.

Next week I'll look into the scenarios you listed, see what gaps there are in my implementation, and add tests as necessary.

Regarding `DeletedRowsScanTask`, when @aokolnychyi added the API, he figured that it was necessary to know what deletes existed before the snapshot and what deletes are added in the snapshot. I'll have to think through the equality delete case, but for the positional delete case, I believe that it is not necessary to know the existing deletes (which is the reason for my `// not used` comment). For the positional delete case, I do not believe that deleting the same position in the same data file again can happen, so added deletes should always be new positions. Thus, we only need to scan the data file and emit the rows that are deleted by the added deletes (which I do by using the `_pos` metadata column and the `PositionDeleteIndex` that a `DeleteFilter` constructs for the data file). The `_pos` metadata column is automatically added to the schema if there are any position delete files to be applied to the data file, and for a `DeletedRowsScanTask`, the `DeleteFilter` is constructed with the added delete files (so those are the position delete files to be applied).

I admit that I hadn't thought through equality deletes carefully. Just to clarify, in your scenario1, for example, does ED1 contain `PK=PK1`? In other words, `PK1` is the value of the primary key in the table, right?
And by `V1` you simply mean the values of the other columns? And then ED2 again contains `PK=PK1`, so that you can update the other columns of the same row with `V2`, right?
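
To make the positional delete argument concrete, here is a minimal sketch in plain Java. It is an illustration under stated assumptions, not the code in this PR and not the Iceberg API: `DeletedRowsSketch` and `deletedRows` are hypothetical names, a `Set<Long>` stands in for the `PositionDeleteIndex` built from the added delete files, and the list index stands in for the `_pos` metadata column. It also assumes, as argued above, that a position is never deleted twice, so the existing deletes are not consulted.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: plain-Java stand-ins, not Iceberg classes.
final class DeletedRowsSketch {

  // Scans the data file rows in order and returns (position -> row) for every
  // row whose position is among the deletes ADDED by the snapshot.
  // Deletes that existed before the snapshot are deliberately ignored, on the
  // assumption that the same position cannot be deleted a second time.
  static Map<Long, String> deletedRows(List<String> dataFileRows, Set<Long> addedDeletePositions) {
    Map<Long, String> deleted = new LinkedHashMap<>();
    for (long pos = 0; pos < dataFileRows.size(); pos++) {
      if (addedDeletePositions.contains(pos)) {
        deleted.put(pos, dataFileRows.get((int) pos));
      }
    }
    return deleted;
  }

  public static void main(String[] args) {
    List<String> rows = List.of("a", "b", "c", "d");
    // The snapshot adds position deletes for positions 1 and 3.
    System.out.println(deletedRows(rows, Set.of(1L, 3L))); // prints {1=b, 3=d}
  }
}
```

If the "never deleted twice" assumption did not hold, a row already removed by an existing delete would be emitted again here, which is the case where the existing deletes would have to be taken into account.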