GitHub user danny0405 edited a comment on the discussion: Support Concurrent 
Clustering of Small MOR File Groups with Upserts Using pendingReplacedFileIdMap

Thanks for the ideas.

For Flink, in scenario#2 and scenario#3 the clustering may fail continuously 
because of overlapping file group modifications, since the Flink ingestion 
job is long-running and its upserts are spread randomly across 
partitions/file groups. The same goes for `DeltaStreamer`, I think.
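
To make the overlap concern concrete, here is a rough sketch (not Hudi code). The function names are hypothetical, and it assumes each upsert hits one file group chosen uniformly at random and independently of the others, which is a simplification of real ingestion traffic.

```python
def clustering_conflicts(plan_file_groups, upserted_file_groups):
    """File groups touched by both the clustering plan and concurrent upserts."""
    return set(plan_file_groups) & set(upserted_file_groups)


def chance_plan_untouched(total_groups, plan_groups, upserts):
    """Probability that no upsert lands on a file group in the clustering plan.

    Assumption (illustrative only): every upsert picks one of `total_groups`
    file groups uniformly at random, independently of the others.
    """
    return ((total_groups - plan_groups) / total_groups) ** upserts


if __name__ == "__main__":
    print(clustering_conflicts({"fg-1", "fg-2"}, ["fg-2", "fg-7"]))
    print(chance_plan_untouched(total_groups=100, plan_groups=10, upserts=50))
```

Under that assumption, the chance of a plan staying untouched shrinks quickly as the number of upserts grows while the plan is pending, which is why a long-running job can keep invalidating the clustering.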

BTW, in the consistent-hashing bucket index, the ingestion job does a dual 
write to both the replaced file group id and the new target one until the 
clustering finishes (consistent hashing utilizes clustering to merge/split the 
file groups). The dual write is there to ensure the visibility of the dataset 
for readers. I think there are many similarities between the two (consistent 
hashing and the solution proposed here). The consistent hashing ring plays a 
similar role to the `pendingReplacedFileIdMap` you mentioned here.
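
A minimal sketch of the dual-write idea, for illustration only. `DualWriteRouter` and its methods are hypothetical names, not Hudi classes, and the one-to-one mapping is a simplification, since a real merge/split maps between several file groups.

```python
class DualWriteRouter:
    """Routes a write to one or two file groups depending on clustering state."""

    def __init__(self):
        # Replaced file group id -> new target id, for clustering still in flight.
        self.pending = {}
        # Replaced file group id -> new target id, for clustering that has finished.
        self.committed = {}

    def start_clustering(self, replaced_id, target_id):
        self.pending[replaced_id] = target_id

    def finish_clustering(self, replaced_id):
        target = self.pending.pop(replaced_id, None)
        if target is not None:
            self.committed[replaced_id] = target

    def targets_for(self, file_group_id):
        if file_group_id in self.pending:
            # Dual write: readers still see the old group until clustering finishes.
            return [file_group_id, self.pending[file_group_id]]
        if file_group_id in self.committed:
            # Clustering is done, so only the new file group receives writes.
            return [self.committed[file_group_id]]
        return [file_group_id]


if __name__ == "__main__":
    router = DualWriteRouter()
    router.start_clustering("fg-old", "fg-new")
    print(router.targets_for("fg-old"))
    router.finish_clustering("fg-old")
    print(router.targets_for("fg-old"))
```

In this sketch the `pending` map stands in for the part that both the consistent hashing ring and a `pendingReplacedFileIdMap` would play: it tells the writer which file groups are being replaced and where the data should also go.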

GitHub link: 
https://github.com/apache/hudi/discussions/18433#discussioncomment-16417743
