GitHub user kbuci added a comment to the discussion: Support Concurrent Clustering of Small MOR File Groups with Upserts Using pendingReplacedFileIdMap
Thanks for the discussion (since I was also curious how we could prevent clustering <-> upsert conflicts for frequent writes to MOR). Just to clarify:

- Unlike the current clustering plan, which just stores a set of all replaced files per partition, here the structure `pendingReplacedFileIdMap` needs to store a mapping of `[input file groups] -> output file`, right? So this isn't intended for all clustering strategies (like sorting all records within a partition), right? Just the types of clustering where we can explicitly "know" that multiple specific input files map to one output file (like the recent clustering strategy that uses parquet APIs to combine multiple small files into one)?
- Once a clustering plan is scheduled, it can now never be rolled back, right?

GitHub link: https://github.com/apache/hudi/discussions/18433#discussioncomment-16405112

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]
