Re: [PR] Support changelog scan for table with delete files [iceberg]

via GitHub Sat, 17 Aug 2024 12:46:45 -0700


wypoon commented on PR #10935:
URL: https://github.com/apache/iceberg/pull/10935#issuecomment-2294958003


   @pvary thank you for your interest and for the Flink scenarios, which are 
very helpful as I am unfamiliar with Flink.
   
   Regarding https://github.com/apache/iceberg/pull/9888, please read my 
comments there. I put up https://github.com/apache/iceberg/pull/10954 only as a 
reference for @manuzhang so he can see the tests I added which fail with the 
implementation of `BaseIncrementalChangelogScan` in 
https://github.com/apache/iceberg/pull/9888. 
https://github.com/apache/iceberg/pull/10954 is not really for consideration or 
review.
   
   Next week I'll look into the scenarios you listed and see what gaps there 
are in my implementation and add tests as necessary. Regarding 
`DeletedRowsScanTask`, when @aokolnychyi added the API, he figured that it was 
necessary to know what deletes existed before the snapshot and what deletes are 
added in the snapshot. I'll have to think through the equality delete case, but 
for the positional delete case, I believe that it is not necessary to know the 
existing deletes (which is the reason for my `// not used` comment). For the 
positional delete case, I do not believe deleting the same position in the same 
data file again can happen, so added deletes should always be new positions. 
Thus, we only need to scan the data file and emit the rows that are deleted by 
the added deletes (which I do by using the _pos metadata column and the 
`PositionDeleteIndex` a `DeleteFilter` constructs for the data file). The _pos 
metadata column is automatically added to the schema if there are any position 
delete files to be applied to the data file, and for a 
`DeletedRowsScanTask`, the `DeleteFilter` is constructed with the added delete 
files (so those are the position delete files to be applied).
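   
   The positional delete reasoning above can be sketched in plain Java. This is 
only an illustration under stated assumptions: it does not use any Iceberg 
classes, and `DeletedRowsSketch`, `Row`, and `deletedRows` are hypothetical 
names. A `Set<Long>` stands in for the index of positions built from the added 
delete files, and the `pos` field stands in for the _pos metadata column. It 
also assumes, as argued above, that added deletes never repeat a position that 
was already deleted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only; none of these types are Iceberg APIs.
public class DeletedRowsSketch {

  // Hypothetical stand-in for a data file row read with its position projected.
  public record Row(long pos, String value) {}

  // Emits the rows whose position appears in the deletes added by the snapshot.
  // Existing deletes are never consulted: under the assumption that a position
  // is not deleted twice, every added position refers to a previously live row.
  public static List<Row> deletedRows(List<Row> dataFileRows, Set<Long> addedDeletePositions) {
    List<Row> result = new ArrayList<>();
    for (Row row : dataFileRows) {
      if (addedDeletePositions.contains(row.pos())) {
        result.add(row);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Row> rows =
        List.of(new Row(0, "a"), new Row(1, "b"), new Row(2, "c"), new Row(3, "d"));
    // Suppose position 1 was deleted before the snapshot; it is not needed here.
    Set<Long> addedDeletes = Set.of(0L, 3L);
    // Prints the rows at positions 0 and 3, i.e. the rows deleted in the snapshot.
    System.out.println(deletedRows(rows, addedDeletes));
  }
}
```

   If the assumption did not hold and a position could be deleted again, the 
sketch would emit that row a second time, which is the case where the existing 
deletes would have to be consulted.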
   
   I admit that I hadn't thought through equality deletes carefully. Just to 
clarify, in your scenario1, for example, does ED1 contain `PK=PK1`? In other 
words, `PK1` is the value of the primary key in the table, right? And by `V1` 
you simply mean the values of the other columns? And then ED2 again contains 
`PK=PK1`, so that you can update the other columns of the same row with `V2`, 
right?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

