GitHub user cshuo added a comment to the discussion: Support Concurrent Clustering of Small MOR File Groups with Upserts Using pendingReplacedFileIdMap
Thanks for writing this up. I have a few concerns about the proposal:

* Writer routing: today the writer may choose F1 based on the small-file profile policy. With this proposal, if we redirect that write directly to F3, is it possible that F3 would not actually qualify as a valid small-file target under the existing writer logic? In other words, this seems to bypass the current file-group selection policy and force writes into a pending replacement file group that might not have been chosen otherwise. Is F3 intended to be treated as a special routing target outside the normal small-file policy?
* Is the redirection intended only for inserts, or also for updates? Redirecting updates seems much trickier, because the record location/index would still point to F1/F2 until the replace commit completes.
* For the Flink writer, committing and table service scheduling happen asynchronously on the coordinator (similar to the Spark driver) when it receives the checkpoint success event, while data ingestion continues without blocking. So scenarios 2 and 3 will probably occur frequently, and there is a risk that clustering keeps retrying and never makes progress.

GitHub link: https://github.com/apache/hudi/discussions/18433#discussioncomment-16423230

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]
