GitHub user cshuo added a comment to the discussion: Support Concurrent 
Clustering of Small MOR File Groups with Upserts Using pendingReplacedFileIdMap

Thanks for writing this up. I have a few concerns with respect to the proposal:
* One concern on the writer routing side: today the writer may choose F1 based 
on the small-file profile policy. With this proposal, if we directly redirect 
that write to F3, is it possible that F3 would not actually qualify as a valid 
small-file target under the existing writer logic? In other words, this seems 
to bypass the current file-group selection policy and force writes into a 
pending replacement file group that may not have been chosen otherwise. Is F3 
intended to be treated as a special routing target outside the normal 
small-file policy?
* Is the redirection intended only for inserts, or also for updates? 
Redirecting updates seems much trickier, because the record location/index 
would still point to F1/F2 until the replace commit completes.
* For the Flink writer, committing and table service scheduling happen 
asynchronously on the coordinator (similar to the Spark driver) once it 
receives the checkpoint success event, while the data ingestion flow continues 
without blocking. So scenarios 2/3 will probably happen frequently, and there 
is a risk that clustering keeps retrying and never makes progress.
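
To make the first concern concrete, here is a rough Python sketch. It is not 
Hudi code: `FileGroup`, `route_insert`, the purely size-based small-file check, 
and the idea that F3 has an estimated size equal to F1 + F2 are all 
hypothetical assumptions for illustration. Only the name 
`pendingReplacedFileIdMap` comes from the proposal.

```python
from dataclasses import dataclass
from typing import Dict, Optional

MB = 1024 * 1024


@dataclass
class FileGroup:
    """Hypothetical stand-in for a file group; not a Hudi class."""
    file_id: str
    size_bytes: int  # for F3 this is only an assumed estimate (F1 + F2)


def is_small_file(fg: FileGroup, small_file_limit: int) -> bool:
    # Assumption: the small-file policy is reduced to a plain size threshold.
    return fg.size_bytes < small_file_limit


def route_insert(
    chosen: FileGroup,
    pending_replaced: Dict[str, str],
    file_groups: Dict[str, FileGroup],
    small_file_limit: int,
) -> Optional[str]:
    """Return the file id an insert would land in, or None when the
    redirected target would not pass the small-file check itself."""
    # Follow the redirect if the chosen group is pending replacement.
    target_id = pending_replaced.get(chosen.file_id, chosen.file_id)
    target = file_groups[target_id]
    if not is_small_file(target, small_file_limit):
        return None
    return target_id


# F1 and F2 are each below the limit, but their clustered target F3 is not.
groups = {
    "F1": FileGroup("F1", 60 * MB),
    "F2": FileGroup("F2", 70 * MB),
    "F3": FileGroup("F3", 130 * MB),
}
pending = {"F1": "F3", "F2": "F3"}  # shaped like pendingReplacedFileIdMap
limit = 100 * MB

without_redirect = route_insert(groups["F1"], {}, groups, limit)
with_redirect = route_insert(groups["F1"], pending, groups, limit)
```

In this sketch F1 is a valid small-file target on its own, but after the 
redirect the same insert points at F3, which fails the same check. That is the 
case I would like the proposal to clarify.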



GitHub link: 
https://github.com/apache/hudi/discussions/18433#discussioncomment-16423230
