On 2/28/14, 5:24 PM, Hal Wine wrote:
tl;dr: what is the balance point between pushes to try taking too long
and losing repository history of recent try pushes?

Summary:
--------

As most developers have experienced, pushing to try can sometimes take a
long time. Once it takes "too long" (as measured by screams of pain in
#releng), a
"try [repository] reset" is scheduled. This hurts productivity and
increases frustration for everyone involved (devs, IT, RelEng). We don't
want to do this anymore.

A reset of the try repository deletes the existing contents, and
replaces them with a fresh clone from mozilla-central. While the tbpl
information will remain valid for any completed build, any attempt to
view the diffs for a try build will fail (unless you already had them in
your local repository).

Progress on resolution of the root cause:
-----------------------------------------

IT has made tremendous progress in reducing the occurrence of "long push
times", but they still are not predictable. Various attempts at
monitoring[1] and auto correction[2] have not been successful in
improving the situation. Work continues on additional changes that
should improve the situation[3].

The most recent mitigation strategy is to trade the "unknown timing"
disruption of push times increasing to a pain threshold for the
"known timing" of resetting the try repository every TCW (tree closing
window - every 6 wks currently). However, we heard from some folks that
this is too often.

The most recent try-reset-triggered-by-pain came after a duration of 6
months[4]. There was at least one report of problems just 3 months
after reset[5].

So, the question is - what say developers -- what's the balance point
between:
  - too often, making collaborating on try pushes hard
  - too infrequent, introducing increasing push times

I wouldn't have such a big issue with Try resets if we didn't lose information in the process. I believe every time there's been a Try reset, I've lost data from a recent (<1 week) Try push and I needed to re-run that job - incurring extra cost to Mozilla and wasting my time. I also periodically find myself wanting to answer questions like "what percentage of tree closures are due to pushes that didn't go to Try first." Data loss stinks.

I'd say the goal should be "no data loss." I have an idea that will enable us to achieve this.

Let's expose every newly-reset instance of the Try repo as a separate URL. We would still push to ssh://hg.mozilla.org/try, but the URLs printed and the URLs used by automation would be URLs to repos that would never go away, e.g. https://hg.mozilla.org/tries/try1/rev/840f122d1286 ("try1" being the important bit in there). When we reset Try, we'd hand out URLs to "try2." We could reset the writable Try repo as frequently as we desired and, aside from a slightly different repo URL being given out, nobody should notice.
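
To make the idea concrete, here is a minimal Python sketch of how the URL that gets handed out could follow a generation counter. Only the "tries/tryN/rev/<rev>" layout comes from the example URL above; the class name, the counter, and the second changeset ID are hypothetical and not part of any existing tool.

```python
from dataclasses import dataclass

# Base of the never-deleted, read-only repos, taken from the example URL above.
ARCHIVE_BASE = "https://hg.mozilla.org/tries"


@dataclass
class TryGenerations:
    """Hypothetical bookkeeping: which archive name the writable Try repo maps to."""

    current: int = 1  # "try1" is the first generation

    def reset(self) -> int:
        """Model a Try reset: old archives stay put, new pushes get a new name."""
        self.current += 1
        return self.current

    def permanent_url(self, rev: str) -> str:
        """URL to print for a push made during the current generation."""
        return f"{ARCHIVE_BASE}/try{self.current}/rev/{rev}"


gens = TryGenerations()
before = gens.permanent_url("840f122d1286")  # handed out before the reset
gens.reset()
after = gens.permanent_url("0123456789ab")   # made-up changeset ID, after the reset
```

The URL stored in "before" keeps pointing at try1 no matter how many resets follow, which is the "no data loss" property.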

The main drawbacks of this approach that I can think of are all in automation: parts of automation are very repo/URL centric and having effectively dynamic URLs might break assumptions. But making automation work against arbitrary URLs is a good thing. It makes automation more flexible, which allows people to experiment with alternate repo hosting, landing tools, landing-integrated code review tools, etc. without requiring special involvement from RelEng. "Everything is a web service and is self-service," etc.
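
As a rough illustration of "automation that works against arbitrary URLs", the sketch below derives the repo and the revision from whatever changeset URL it is given instead of hard-coding a repo name. The function name is hypothetical, and the only assumption about the URL is the "/rev/<rev>" suffix seen in the example above.

```python
from typing import Tuple
from urllib.parse import urlsplit


def split_rev_url(url: str) -> Tuple[str, str]:
    """Split a changeset URL into (repo_url, revision) without assuming a repo name."""
    parts = urlsplit(url)
    # Everything before the last "/rev/" is the repo path, everything after is the rev.
    repo_path, sep, rev = parts.path.rpartition("/rev/")
    if not sep or not rev:
        raise ValueError(f"not a changeset URL: {url}")
    return f"{parts.scheme}://{parts.netloc}{repo_path}", rev


repo, rev = split_rev_url("https://hg.mozilla.org/tries/try1/rev/840f122d1286")
```

Code written this way does not care whether the repo is try, try1, try2, or something hosted elsewhere.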
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
