Fokko commented on issue #6956:
URL: https://github.com/apache/iceberg/issues/6956#issuecomment-1447709562

   Ah, I see, using merge-on-read with Flink makes sense.
   
   > And I have a question: with merge on read mode, in the worst case, does an 
executor have to read all delete records (in my case maybe all the rows before 
the whole table delete)?
   
   There is some logic involved to optimize this, but equality deletes aren't 
the best choice when it comes to performance. At some point Flink will write 
an equality delete (`id=5`), and readers have to apply it to all subsequent 
data files, which is quite costly, as you might imagine. This is limited to 
the partitions you're reading, though: deletes belonging to partitions 
outside the scope of the query are pruned.
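   To make the cost concrete, here is a toy model of a merge-on-read scan (an illustrative sketch, not Iceberg's actual implementation; the file layout and field names are made up): every row read after an equality delete was written has to be checked against the accumulated delete records, while deletes for partitions outside the query are pruned.

   ```python
   # Hypothetical table state: data files and equality-delete files, per partition.
   data_files = {
       "p=1": [[{"id": 1}, {"id": 5}], [{"id": 7}]],
       "p=2": [[{"id": 5}, {"id": 9}]],
   }
   equality_deletes = {
       "p=1": [{"id": 5}],   # Flink wrote an equality delete for id=5 in p=1
       "p=2": [],
   }

   def scan(partitions):
       """Read the given partitions, applying equality deletes on the fly."""
       rows = []
       for part in partitions:
           deletes = equality_deletes[part]  # deletes pruned to this partition
           for data_file in data_files[part]:
               for row in data_file:
                   # Each row is matched against every applicable delete record,
                   # which is why a large backlog of equality deletes is costly.
                   if not any(row["id"] == d["id"] for d in deletes):
                       rows.append((part, row["id"]))
       return rows

   print(scan(["p=1"]))  # id=5 is filtered out; p=2's files are never touched
   ```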
   
   What would also work is to periodically compact the table with a Spark job 
(ideally on the partitions that are no longer being written to). That way 
you get rid of the deletes.
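   Continuing the toy model above (again an assumed sketch, not the real Spark `rewrite_data_files` action): compaction rewrites a partition's data files with the equality deletes already applied and drops the delete files, so later reads no longer pay the per-row delete check.

   ```python
   def compact(partition, data_files, equality_deletes):
       """Rewrite one partition's data files with deletes applied."""
       deletes = equality_deletes[partition]
       merged = [row
                 for f in data_files[partition]
                 for row in f
                 if not any(row["id"] == d["id"] for d in deletes)]
       data_files[partition] = [merged]  # one rewritten data file
       equality_deletes[partition] = []  # the delete files are gone

   data_files = {"p=1": [[{"id": 1}, {"id": 5}], [{"id": 7}]]}
   equality_deletes = {"p=1": [{"id": 5}]}

   compact("p=1", data_files, equality_deletes)
   print(data_files["p=1"])        # [[{'id': 1}, {'id': 7}]]
   print(equality_deletes["p=1"])  # []
   ```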


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

