zhouxiz9 commented on PR #10979: URL: https://github.com/apache/pinot/pull/10979#issuecomment-1661155624
> @jtao15 @zhouxiz9 I synced up with @jtao15 yesterday. It looks like we need to address 2 things:
>
> 1. detect the failure earlier than 24 hours.
> 2. optimize the runtime by only running the failed portion.
>
> It looks like this PR potentially improves 2; however, it may not address the issue that @zhouxiz9 and @jtao15 are currently facing. We will revisit the PR once the root cause of the ongoing issue is identified.

Hi @snleee,

I synced with @jtao15 today and understand that a more comprehensive design is needed to cover cases such as late events and spill-over. The current fix solves only part of the problem. I'll close this PR and create a new one once that design is ready.

Since we are facing the minion delay issue in production (caused by repeatedly merging already-merged segments), a short-term fix is needed. I've created another [PR](https://github.com/apache/pinot/pull/11243) to make `MaxAttemptsPerTask` configurable so that we can increase this value and better handle transient errors. Please let me know if that works.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org