On 2/28/14, 5:24 PM, Hal Wine wrote:
tl;dr: what is the balance point between pushes to try taking too long
and losing repository history of recent try pushes?
Summary:
--------
As most developers have experienced, pushing to try can sometimes take a
long time. Once it takes "too long" (as measured by screams of pain in
#releng), a "try [repository] reset" is scheduled. This hurts productivity and
increases frustration for everyone involved (devs, IT, RelEng). We don't
want to do this anymore.
A reset of the try repository deletes the existing contents and
replaces them with a fresh clone from mozilla-central. While the tbpl
information will remain valid for any completed build, any attempt to
view the diffs for a try build will fail (unless you already had them in
your local repository).
Progress on resolution of the root cause:
-----------------------------------------
IT has made tremendous progress in reducing the occurrence of "long push
times", but they still are not predictable. Various attempts at
monitoring[1] and auto correction[2] have not been successful in
improving the situation. Work continues on additional changes that
should improve the situation[3].
The most recent mitigation strategy is to trade the "unknown timing"
disruption of push times increasing to a pain threshold for the
"known timing" of resetting the try repository every TCW (tree closing
window - every 6 wks currently). However, we heard from some folks that
this is too often.
The most recent try-reset-triggered-by-pain came after 6 months[4].
There was at least one report of problems just 3 months after a
reset[5].
So, the question is - what say developers -- what's the balance point
between:
- too often, making collaborating on try pushes hard
- too infrequent, introducing increasing push times
I wouldn't have such a big issue with Try resets if we didn't lose
information in the process. I believe every time there's been a Try
reset, I've lost data from a recent (<1 week) Try push and I needed to
re-run that job - incurring extra cost to Mozilla and wasting my time. I
also periodically find myself wanting to answer questions like "what
percentage of tree closures are due to pushes that didn't go to Try
first." Data loss stinks.
I'd say the goal should be "no data loss." I have an idea that will
enable us to achieve this.
Let's expose every newly-reset instance of the Try repo as a separate
URL. We would still push to ssh://hg.mozilla.org/try, but the URLs
printed and the URLs used by automation would be URLs to repos that
would never go away. e.g.
https://hg.mozilla.org/tries/try1/rev/840f122d1286 ("try1" being the
important bit in there). When we reset Try, you'd hand out URLs to
"try2." You could reset the writable Try repo as frequently as you
desired and aside from a slightly different repo URL being given out,
nobody should notice.
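To make the idea concrete, here is a minimal sketch of how a permanent URL could be derived from a reset counter. The function name, the ARCHIVE_BASE constant, and the idea of a numeric "generation" are all assumptions for illustration; only the URL shape comes from the example above, and nothing here reflects existing hg.mozilla.org tooling.

```python
# Hypothetical sketch: none of these names exist in Mozilla's tooling.
# The base path is taken from the example URL in this message.
ARCHIVE_BASE = "https://hg.mozilla.org/tries"


def permanent_rev_url(generation: int, changeset: str) -> str:
    """Build the read-only URL for a changeset pushed during a given Try generation.

    `generation` is assumed to be a counter that goes up by one at each reset,
    so pushes made before the first reset live under "try1", the next under
    "try2", and so on.
    """
    if generation < 1:
        raise ValueError("generation numbers start at 1")
    return f"{ARCHIVE_BASE}/try{generation}/rev/{changeset}"


if __name__ == "__main__":
    # The push target stays ssh://hg.mozilla.org/try; only the printed URL changes.
    print(permanent_rev_url(1, "840f122d1286"))
```

After a reset, the server side would only need to bump the counter; URLs already handed out for earlier generations would keep pointing at the archived repos.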
The main drawbacks of this approach that I can think of are all in
automation: parts of automation are very repo/URL centric and having
effectively dynamic URLs might break assumptions. But making automation
work against arbitrary URLs is a good thing: it makes automation more
flexible and lets people experiment with alternate repo hosting, landing
tools, landing-integrated code review tools, etc. without requiring
special involvement from RelEng. "Everything is a web service and is
self-service," etc.
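As a rough illustration of what "work against arbitrary URLs" could mean, the sketch below splits a revision URL into a repo URL and a changeset without assuming any particular host or repo name. The function name is hypothetical, and the only assumption about URL shape is the ".../rev/<changeset>" suffix seen in the example earlier in this message.

```python
from typing import Tuple
from urllib.parse import urlsplit


def split_rev_url(url: str) -> Tuple[str, str]:
    """Split '<repo>/rev/<changeset>' into (repo URL, changeset).

    Hypothetical helper: it hard-codes no host or repo path, so the same
    code would handle try1, try2, or a repo hosted somewhere else entirely.
    """
    parts = urlsplit(url)
    repo_path, sep, changeset = parts.path.rpartition("/rev/")
    if not sep or not changeset:
        raise ValueError(f"not a revision URL: {url}")
    return f"{parts.scheme}://{parts.netloc}{repo_path}", changeset


if __name__ == "__main__":
    repo, rev = split_rev_url("https://hg.mozilla.org/tries/try1/rev/840f122d1286")
    print(repo, rev)
```

Automation that takes the repo URL as data in this way would not care how often the writable Try repo is reset.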
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform