zhouxiz9 commented on PR #11243:
URL: https://github.com/apache/pinot/pull/11243#issuecomment-1663033673

   > High level question: why do you need multiple attempts? Currently minion 
task is modeled as idempotent. Task failures should be retried in the next 
schedule. If this is for the ad-hoc task, we probably need to also change the 
failure threshold
   
   We are currently facing the following problem:
   We have enabled rollup for a large table, and our cluster has 30 minions. 
On each run the controller schedules roughly 600 to 1000 tasks, which take 10 
to 16 hours to complete. If even one task fails during that window, whether 
from a timeout, a minion restart, or a transient controller error, the 
watermark cannot be advanced. The next run then schedules the same tasks 
again, including segments that have already been merged. 
   
   I previously opened a [PR](https://github.com/apache/pinot/pull/10979) to 
address this, but it turned out that a more comprehensive design is needed. 
For now I'm increasing `MaxAttemptsPerTask`, in the hope that retries will at 
least partially mitigate the problem.
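   
   To illustrate the reasoning, here is a minimal, self-contained sketch of 
why a single failed task blocks the watermark and how a higher attempt limit 
can help. It is not Pinot or Helix code: `WatermarkSketch`, 
`countFailedTasks`, and `canAdvanceWatermark` are hypothetical names, each 
task is modeled as a predicate over its attempt number, and `maxAttempts` 
merely stands in for `MaxAttemptsPerTask`.

```java
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical model, not the actual minion/Helix implementation.
class WatermarkSketch {

    // Runs one task up to maxAttempts times; attempt numbers start at 1.
    // The predicate answers "does this attempt succeed?".
    static boolean runWithRetries(IntPredicate attemptSucceeds, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (attemptSucceeds.test(attempt)) {
                return true;
            }
        }
        return false;
    }

    // Runs every task in the batch and counts those that never succeeded.
    static int countFailedTasks(List<IntPredicate> tasks, int maxAttempts) {
        int failed = 0;
        for (IntPredicate task : tasks) {
            if (!runWithRetries(task, maxAttempts)) {
                failed++;
            }
        }
        return failed;
    }

    // Assumption from the comment above: the watermark moves forward only
    // when no task in the batch has failed.
    static boolean canAdvanceWatermark(List<IntPredicate> tasks, int maxAttempts) {
        return countFailedTasks(tasks, maxAttempts) == 0;
    }
}
```

   In this model, a batch containing one task that fails only on its first 
attempt (for example because of a minion restart) cannot advance the watermark 
with `maxAttempts = 1`, but can with `maxAttempts = 2`. A task that fails on 
every attempt still blocks the watermark, which is why retries only partially 
address the problem.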


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
