On 12/12/2017 03:39 PM, Tollef Fog Heen wrote:
> DSA, thoughts on this? Sounds reasonable?

I think the issue is also made worse by mirror-bytemark being consistently much slower than the other backends, and how ftpsync behaves in a pathological way when mirrors have very different speeds.

When doing a staged push, we start stage1 for all downstreams. Each of those, when it's done, waits for up to $PUSHDELAY, by default 10 minutes, for its siblings to signal they're also done with stage1. If all goes well, everyone waits for everyone else to be done with stage1, then within 5 seconds of each other they run stage2.

If on the other hand one mirror gets ahead (as often happens with mirror-conova), or one mirror gets behind (as always happens with mirror-bytemark), then the first mirror to finish stage1 waits 10 minutes, then removes its stage1 lock and starts stage2. The other mirrors are never going to find the stage1 lock from the fast mirror, so they'll each wait the whole 10 minutes, even though they could have started stage2 at the same time as the first because they finished within the 10 minutes.

So we're in something of a worst case here where
- the mirror-conova push is local, so it's pretty fast
- mirror-skroutz and mirror-accumu are reasonably fast, but still take a few minutes more than mirror-conova for stage1
- mirror-bytemark is way slow so consistently needs over 10 minutes more than mirror-conova for stage1
- after 10 minutes, mirror-conova gives up on waiting for bytemark, and starts stage2
- by that time mirror-accumu and mirror-skroutz are done with their stage1, but they still each wait for 10 minutes before starting stage2, so the delay between mirror-conova and them for finishing stage1 is preserved for stage2, and they end up out of sync for a while for no good reason
- at some point later, mirror-bytemark catches up

So I wonder if we should:
- kill mirror-bytemark until we fix the I/O issue that plagues us there
- increase PUSHDELAY
- change ftpsync so the first mirror that times out waiting for stage1 locks touches ${LOCKDIR}/all_stage1 so its siblings don't wait in vain making the problem worse
- a combination of the above

Cheers,
Julien
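
To make the timing described above concrete, here is a rough Python sketch of the behaviour. It is not ftpsync code. It is a simplified model with these assumptions: the stage1 completion times are made-up illustrative numbers (minutes after the push starts), the function name `stage2_starts` and the `touch_all_stage1` switch are hypothetical, and a late mirror under the proposed change is assumed to start stage2 as soon as its own stage1 finishes.

```python
PUSHDELAY = 10  # minutes, the default mentioned above


def stage2_starts(stage1_done, pushdelay=PUSHDELAY, touch_all_stage1=False):
    """Return the time at which each mirror starts stage2.

    stage1_done maps mirror name -> minute at which its stage1 finished.
    touch_all_stage1=True models the proposed change, where the first
    mirror to time out signals its siblings (assumed behaviour).
    """
    first_timeout = min(stage1_done.values()) + pushdelay
    last_done = max(stage1_done.values())

    # Everyone finished before the fastest mirror gave up:
    # all mirrors start stage2 together once the slowest one is done.
    if last_done <= first_timeout:
        return {mirror: last_done for mirror in stage1_done}

    if touch_all_stage1:
        # Proposed: mirrors already done start when the first one times out;
        # a late mirror starts when its own stage1 finishes (assumption).
        return {m: max(t, first_timeout) for m, t in stage1_done.items()}

    # Current behaviour as described: the fast mirror's lock is gone,
    # so every mirror waits out its own full PUSHDELAY.
    return {m: t + pushdelay for m, t in stage1_done.items()}


# Illustrative stage1 completion times in minutes (not measured values).
example = {
    "mirror-conova": 2,
    "mirror-skroutz": 5,
    "mirror-accumu": 6,
    "mirror-bytemark": 15,
}

if __name__ == "__main__":
    print("current :", stage2_starts(example))
    print("proposed:", stage2_starts(example, touch_all_stage1=True))
```

With these made-up numbers, the current behaviour starts stage2 at minutes 12, 15, 16 and 25, so the three fast mirrors stay out of sync by the same few minutes they differed by in stage1. With the all_stage1 signal, the three fast mirrors all start at minute 12 and only mirror-bytemark lags.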