GitHub user danny0405 edited a comment on the discussion: Support Concurrent Clustering of Small MOR File Groups with Upserts Using pendingReplacedFileIdMap
Thanks for the ideas. For Flink, the clustering in scenario #2 and scenario #3 may fail continuously because of overlapping file group modifications: the Flink ingestion job is long-running, and its upserts land randomly across partitions and file groups. I think the same applies to `DeltaStreamer`.

BTW, with the consistent hashing bucket index, the ingestion job dual-writes to both the replaced file group and the new target file group until the clustering finishes (consistent hashing utilizes clustering to merge/split file groups). The dual write ensures that the dataset stays visible to readers. I think there are many similarities between these two use cases (consistent hashing and the solution proposed here).

GitHub link: https://github.com/apache/hudi/discussions/18433#discussioncomment-16417743
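To make the dual-write idea concrete, here is a minimal toy sketch in Python. It is not Hudi code and does not use any Hudi API: `ToyTable`, `schedule_clustering`, `complete_clustering`, and the in-memory dictionaries are all hypothetical names for illustration, and `pending_replaced` only loosely mirrors the `pendingReplacedFileIdMap` from the discussion title. It assumes that readers see only the source file groups while clustering is pending, and only the target once it completes.

```python
from collections import defaultdict


class ToyTable:
    """Hypothetical in-memory stand-in for a table with file groups."""

    def __init__(self):
        # file group id -> {record key: value}
        self.file_groups = defaultdict(dict)
        # replaced (source) file group id -> new target file group id
        self.pending_replaced = {}

    def schedule_clustering(self, source_ids, target_id):
        # Mark each source file group as pending replacement by the target.
        for source_id in source_ids:
            self.pending_replaced[source_id] = target_id

    def upsert(self, file_group_id, key, value):
        # Always write to the file group the record currently maps to.
        self.file_groups[file_group_id][key] = value
        # Dual write: while clustering is pending, also write to the target.
        target_id = self.pending_replaced.get(file_group_id)
        if target_id is not None:
            self.file_groups[target_id][key] = value

    def complete_clustering(self, target_id):
        sources = [s for s, t in self.pending_replaced.items() if t == target_id]
        target = self.file_groups[target_id]
        for source_id in sources:
            # Copy records, but never overwrite newer dual-written values.
            for key, value in self.file_groups.pop(source_id, {}).items():
                target.setdefault(key, value)
            del self.pending_replaced[source_id]

    def read(self):
        # Pending targets are hidden from readers until clustering completes.
        hidden = set(self.pending_replaced.values())
        view = {}
        for file_group_id, records in self.file_groups.items():
            if file_group_id not in hidden:
                view.update(records)
        return view
```

In this sketch an upsert that arrives between `schedule_clustering` and `complete_clustering` does not conflict with the clustering plan, because it is applied to both sides, and `read()` returns the same records before and after completion. Whether a real implementation can afford the extra write amplification is a separate question that the sketch does not address.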
