Re: [D] Support Concurrent Clustering of Small MOR File Groups with Upserts Using pendingReplacedFileIdMap [hudi]

via GitHub Wed, 01 Apr 2026 00:42:40 -0700


GitHub user kbuci added a comment to the discussion: Support Concurrent 
Clustering of Small MOR File Groups with Upserts Using pendingReplacedFileIdMap


Thanks for the discussion; I was also curious how we could prevent 
clustering <-> upsert conflicts for frequent writes to MOR tables.
Just to clarify:
- Unlike the current clustering plan, which just stores the set of all replaced 
files per partition, the `pendingReplacedFileIdMap` structure here needs to 
store a mapping of `[input file groups] -> output file`, right? So this isn't 
intended for all clustering strategies (such as sorting all records within a 
partition), only for types of clustering where we explicitly "know" that 
multiple specific input files map to one output file (like the recent 
clustering strategy that uses parquet APIs to combine multiple small files 
into one)?
- Once a clustering plan is scheduled, it can never be rolled back, right?
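
To make the first question concrete, here is a minimal Java sketch of the two 
shapes as I understand them. Every class, field, and method name below is 
hypothetical and not a Hudi API. Flattening `[input file groups] -> output file` 
into one map entry per input file group, and using the map to redirect a lookup 
to the output file group, are both my assumptions about the proposal.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch only; none of these names exist in Hudi.
final class PendingReplacedSketch {
    // Shape described for the current plan: partition -> set of replaced file group ids.
    final Map<String, Set<String>> replacedFileIdsByPartition = new HashMap<>();

    // Assumed shape of the proposal: input file group id -> output file group id.
    final Map<String, String> pendingReplacedFileIdMap = new HashMap<>();

    // Record that several small input file groups will be merged into one output.
    void scheduleMerge(String partition, List<String> inputFileIds, String outputFileId) {
        Set<String> replaced =
            replacedFileIdsByPartition.computeIfAbsent(partition, p -> new HashSet<>());
        for (String inputFileId : inputFileIds) {
            replaced.add(inputFileId);
            pendingReplacedFileIdMap.put(inputFileId, outputFileId);
        }
    }

    // Return the mapped output file group if the input is pending replacement,
    // otherwise return the file group id unchanged.
    String resolveTarget(String fileId) {
        return pendingReplacedFileIdMap.getOrDefault(fileId, fileId);
    }
}
```

The per-partition set alone cannot answer "which output does this input map 
to", which is why I assume the explicit mapping only works for strategies with 
a known many-to-one relationship between input files and an output file.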

GitHub link: 
https://github.com/apache/hudi/discussions/18433#discussioncomment-16405112

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]
