On 12/12/2017 03:39 PM, Tollef Fog Heen wrote:
> DSA, thoughts on this?  Sounds reasonable?
> 
I think the issue is also made worse by mirror-bytemark being
consistently much slower than the other backends, and by the pathological
way ftpsync behaves when mirrors have very different speeds.

When doing a staged push, we start stage1 for all downstreams.  Each of
those, when it's done, waits for up to $PUSHDELAY, by default 10
minutes, for its siblings to signal they're also done with stage1.  If
all goes well, everyone waits for everyone else to be done with stage1,
then within 5 seconds of each other they run stage2.

If on the other hand one mirror gets ahead (as often happens with
mirror-conova), or one mirror gets behind (as always happens with
mirror-bytemark), then the first mirror to finish stage1 waits 10
minutes, then removes its stage1 lock and starts stage2.  The other
mirrors are never going to find the stage1 lock from the fast mirror, so
they'll each wait the whole 10 minutes, even though they could have
started stage2 at the same time as the first because they finished
within the 10 minutes.

So we're in something of a worst case here where
- the mirror-conova push is local, so it's pretty fast
- mirror-skroutz and mirror-accumu are reasonably fast, but still take a
few minutes more than mirror-conova for stage1
- mirror-bytemark is very slow, so it consistently needs over 10 minutes
more than mirror-conova for stage1
- after 10 minutes, mirror-conova gives up on waiting for bytemark, and
starts stage2
- by that time mirror-accumu and mirror-skroutz are done with their
stage1, but they still each wait for 10 minutes before starting stage2,
so the delay between mirror-conova and them for finishing stage1 is
preserved for stage2, and they end up out of sync for a while for no
good reason
- at some point later, mirror-bytemark catches up
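
A rough sketch of the timing described above, as a toy Python model.
This is not ftpsync's code: the function name and the stage1 finish
times (minutes after the push starts) are made up for illustration, and
the model only encodes the behaviour as described in this mail.

```python
PUSHDELAY = 10  # minutes; the default mentioned above


def stage2_start_times(stage1_done, pushdelay=PUSHDELAY):
    """Toy model: when does each mirror start stage2?

    stage1_done maps mirror name -> minute at which stage1 finished.
    """
    first = min(stage1_done.values())
    last = max(stage1_done.values())
    if last - first <= pushdelay:
        # Everyone finishes within the first mirror's wait window, so all
        # stage1 locks are visible together and stage2 starts in step.
        return {m: last for m in stage1_done}
    # The first mirror times out and removes its stage1 lock. Nobody else
    # can ever see the full set of locks, so each waits its own PUSHDELAY.
    return {m: t + pushdelay for m, t in stage1_done.items()}


# Hypothetical finish times, chosen only to match the scenario above.
done = {
    "mirror-conova": 5,
    "mirror-skroutz": 8,
    "mirror-accumu": 9,
    "mirror-bytemark": 20,
}

for mirror, start in stage2_start_times(done).items():
    print(f"{mirror}: stage1 done at {done[mirror]}, stage2 starts at {start}")
```

With these made-up numbers, skroutz and accumu start stage2 at minutes
18 and 19 although both were ready when conova started at minute 15, so
the stage1 gap carries over into stage2.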

So I wonder if we should:
- kill mirror-bytemark until we fix the I/O issue that plagues us there
- increase PUSHDELAY
- change ftpsync so the first mirror that times out waiting for stage1
locks touches ${LOCKDIR}/all_stage1, so its siblings don't wait in vain
and make the problem worse
- a combination of the above
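
To get a feel for what the all_stage1 option would buy us, here is a toy
Python model of it. It is only a sketch under assumptions: the function
name and finish times are invented, and I assume a mirror that finishes
stage1 after the marker already exists goes straight to stage2.

```python
def stage2_start_with_marker(stage1_done, pushdelay=10):
    """Toy model of the proposed change (not ftpsync's actual code).

    stage1_done maps mirror name -> minute at which stage1 finished.
    The first mirror to time out touches an all_stage1 marker; siblings
    that see the marker stop waiting and start stage2.
    """
    first = min(stage1_done.values())
    last = max(stage1_done.values())
    if last - first <= pushdelay:
        # Nobody times out: everyone starts stage2 together.
        return {m: last for m in stage1_done}
    marker_time = first + pushdelay  # first mirror gives up and touches marker
    # Mirrors already done with stage1 start when the marker appears;
    # later ones (assumption) start as soon as their own stage1 is done.
    return {m: max(t, marker_time) for m, t in stage1_done.items()}


# Hypothetical stage1 finish times in minutes, for illustration only.
done_at = {
    "mirror-conova": 5,
    "mirror-skroutz": 8,
    "mirror-accumu": 9,
    "mirror-bytemark": 20,
}

for mirror, start in stage2_start_with_marker(done_at).items():
    print(f"{mirror}: stage2 starts at minute {start}")
```

In this model the three fast mirrors all start stage2 at minute 15, and
only bytemark lags. Raising pushdelay to 20 in the same model has
everyone start together at minute 20, at the cost of the fast mirrors
waiting longer.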

Cheers,
Julien
