]] Santiago Vila 

> While building a lot of packages today, I found that this is happening again:
> 
> 503  Quorum weight not reached [IP: 151.101.112.204 80]

Hi,

yeah, we saw an outage on the 9th, need to investigate why it hit us.  I
suspect it was due to (all dates are the 9th, all times are UTC):

21:08:21 mirror push on mirror-conova is done
21:08:22 mirror-skroutz declares itself unhealthy due to being out of date
21:08:58 mirror-accumu declares itself unhealthy due to being out of date
21:09:10 mirror-bytemark declares itself unhealthy due to being out of date
21:09:56 mirror-conova declares itself unhealthy due to scheduled shutdown
21:16:44 mirror push on mirror-skroutz is done
21:17:22 mirror-skroutz declares itself healthy
21:17:37 mirror push on mirror-accumu is done
21:17:58 mirror-accumu declares itself healthy
21:19:32 mirror-skroutz hits a bug in mirror-health and declares itself
         unhealthy due to connect timeouts to mirror-conova
21:20:07 mirror-accumu hits a bug in mirror-health and declares itself
         unhealthy due to connect timeouts to mirror-conova
21:26:10 mirror-accumu comes back and is healthy
21:26:11 mirror-skroutz comes back and is healthy
21:26:53 mirror-conova comes back and is healthy

Dec 10:
03:45:34 mirror-bytemark push is done
03:45:47 mirror-bytemark comes back and is healthy

So what happened was essentially that we did a reboot *just* after a
push had finished, leading to a mirror declaring itself unhealthy.

I've now fixed the bug that caused the second outage, from ~21:20 until
~21:26, so we should not run into that one again at least.

I wonder if the right fix is to ignore timestamps from any unhealthy
hosts, and to reset latest_ts on each iteration through check_uptodate.
If we initialise latest_ts to 0, that will mean we do consider ourselves
healthy in the case of nobody else being healthy.  If somebody else is
newer than us and healthy, we consider ourselves unhealthy.
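
To make the proposal concrete, here is a rough sketch of what I have in
mind. It is not the actual mirror-health code: only check_uptodate and
latest_ts are real names, while the Peer class, its fields and the
own_ts parameter are made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Peer:
    """Hypothetical view of another mirror, as seen by the health check."""
    name: str
    healthy: bool
    ts: int  # timestamp of the peer's last completed mirror push


def check_uptodate(own_ts: int, peers: list[Peer]) -> bool:
    """Return True if we should consider ourselves up to date."""
    # Reset on every iteration, so a maximum remembered from an earlier
    # run cannot keep us unhealthy.
    latest_ts = 0
    for peer in peers:
        if not peer.healthy:
            # Ignore timestamps from unhealthy hosts entirely.
            continue
        latest_ts = max(latest_ts, peer.ts)
    # With no healthy peers latest_ts stays 0, so we count as healthy.
    return own_ts >= latest_ts
```

With this, a host that is the only one left standing stays in rotation
even if an unhealthy peer has a newer timestamp, and it only steps down
when a healthy peer is strictly newer.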

DSA, thoughts on this?  Sounds reasonable?

Cheers,
-- 
Tollef Fog Heen
UNIX is user friendly, it's just picky about who its friends are
