yujun777 opened a new pull request, #26428: URL: https://github.com/apache/doris/pull/26428
A schema change (sc) or rollup running concurrently with a txn may cause a deadlock. An example:

1. Txn T is committed but not yet published.
2. A schema change or rollup runs on a partition related to T and adds an alter replica R.
3. The sc/rollup job adds a sched txn watermark M.
4. The FE is restarted.
5. After the FE restart, T's loadedTblIndexes is empty, because it is not saved to disk.
6. T publishes its version to all tablets, including the sc/rollup job's new alter replica R.
7. Since R does not contain T's data, publishing T fails, and T keeps waiting for R's data.
8. The sc/rollup job waits for the txns before M to finish; only after that does it let R copy the history data.
9. Since T has not finished, the sc/rollup job waits forever, so R never copies the history data.
10. T and the sc/rollup job wait for each other forever, which is a deadlock.

Fix: a sc/rollup job ensures double write after the sched watermark M. So when finishing a transaction and checking an alter replica:

1. If the txn id is bigger than M, check the replica just like a normal replica.
2. Otherwise, skip the check for this replica; the BE will modify its history data later.

## Proposed changes

Issue Number: close #xxx

<!--Describe your changes.-->

## Further comments

If this is a relatively large or complex change, kick off the discussion at [d...@doris.apache.org](mailto:d...@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...
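The fix rule described above can be illustrated with a minimal Java sketch. This is not the code from the pull request: the class `AlterReplicaPublishCheck`, the `Replica` type, and the methods `shouldCheckReplica` and `canFinishTxn` are hypothetical names chosen for illustration. It also assumes, as a simplification, that a txn can finish only when every checked replica holds its data; the real FE logic may use different success criteria.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the rule in the description; no name here is taken from Doris.
final class AlterReplicaPublishCheck {

    /** Minimal stand-in for a replica as seen by the finish-transaction check. */
    static final class Replica {
        final long id;
        final boolean isAlterReplica; // created by a schema change or rollup job
        final long watermarkTxnId;    // the watermark M; only meaningful for alter replicas
        final boolean hasTxnData;     // whether the replica already holds the txn's data

        Replica(long id, boolean isAlterReplica, long watermarkTxnId, boolean hasTxnData) {
            this.id = id;
            this.isAlterReplica = isAlterReplica;
            this.watermarkTxnId = watermarkTxnId;
            this.hasTxnData = hasTxnData;
        }
    }

    /**
     * Normal replicas are always checked. An alter replica is checked only when
     * the txn id is bigger than M, because double write is ensured after M.
     */
    static boolean shouldCheckReplica(long txnId, Replica replica) {
        if (!replica.isAlterReplica) {
            return true;
        }
        return txnId > replica.watermarkTxnId;
    }

    /** Simplified assumption: every checked replica must hold the txn's data. */
    static boolean canFinishTxn(long txnId, List<Replica> replicas) {
        for (Replica replica : replicas) {
            if (shouldCheckReplica(txnId, replica) && !replica.hasTxnData) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Replica normal = new Replica(1L, false, 0L, true);
        Replica alterR = new Replica(2L, true, 100L, false); // M = 100, no txn data yet
        List<Replica> replicas = Arrays.asList(normal, alterR);

        // Txn 90 is not bigger than M, so R is skipped and the txn is not blocked by it.
        System.out.println(canFinishTxn(90L, replicas));
        // Txn 120 is bigger than M, so R is checked like a normal replica.
        System.out.println(canFinishTxn(120L, replicas));
    }
}
```

In this sketch, a txn with id 90 is no longer blocked by R, which corresponds to breaking the wait cycle between T and the sc/rollup job in the example scenario.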