zhouxiz9 commented on PR #11243:
URL: https://github.com/apache/pinot/pull/11243#issuecomment-1663033673

> High level question: why do you need multiple attempts? Currently minion task is modeled as idempotent. Task failures should be retried in the next schedule. If this is for the ad-hoc task, we probably need to also change the failure threshold.

Currently we are facing the following problem: we have enabled rollup for a large table, and we have 30 minions in our cluster. Each time, the controller schedules around 600 to 1000 tasks, which take 10 to 16 hours to complete. If one or a few of the tasks fail during this period, due to a timeout, a minion restart, or a transient controller error, the watermark cannot be bumped up. As a result, the next time the controller schedules tasks, it schedules the same tasks again, including segments that are already merged.

I previously created a [PR](https://github.com/apache/pinot/pull/10979) to resolve this issue, but it turned out that a more comprehensive design is needed. So now I'm trying to increase the `MaxAttemptsPerTask` number, in the hope that some retries will partially solve the problem.
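
A rough back-of-the-envelope sketch of why a few retries could help at this scale. It is only an illustration under stated assumptions: the 0.5% per-attempt failure rate is hypothetical (not a measured number), failures are assumed to be independent across tasks and attempts, and 800 tasks is simply a value inside the 600 to 1000 range mentioned above. The function name is made up for this example and is not part of Pinot or Helix.

```python
def batch_success_probability(num_tasks: int,
                              per_attempt_failure: float,
                              max_attempts: int) -> float:
    """Probability that every task in a batch succeeds within max_attempts tries.

    Assumes each attempt fails independently with probability
    per_attempt_failure (a simplification; real failures such as a
    minion restart are likely correlated).
    """
    # A task fails overall only if all of its attempts fail.
    task_failure = per_attempt_failure ** max_attempts
    # The watermark can only advance if no task in the batch fails.
    return (1.0 - task_failure) ** num_tasks


if __name__ == "__main__":
    for attempts in (1, 2, 3):
        p = batch_success_probability(
            num_tasks=800,              # within the 600-1000 range above
            per_attempt_failure=0.005,  # hypothetical failure rate
            max_attempts=attempts,
        )
        print(f"max attempts = {attempts}: "
              f"P(all tasks succeed) = {p:.4f}")
```

Under these assumptions, a single attempt per task makes a fully clean batch unlikely, while two or three attempts make it very likely. Correlated failures (for example, a controller outage hitting many tasks at once) would weaken this effect, which is consistent with retries being only a partial fix.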