I think the crux of reducing the time m-i is closed is one of: adding
better monitoring of things like memory usage during testing, so we can
see whether it is growing across tests; changing the tests so that we
don't care about that assert; or hiding the test on TBPL so the
sheriffs ignore it. The best answer is the first one: add more
monitoring so we catch this sooner.
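To make the monitoring option concrete, here is a minimal sketch of
what a per-test memory check could look like, assuming a Python-based
harness with psutil available; the threshold and the
run_single_test/tests names are hypothetical placeholders, not anything
TBPL or our harnesses provide today:

import psutil

# Flag the run if resident memory grows by more than ~50 MB over baseline.
MEMORY_GROWTH_LIMIT = 50 * 1024 * 1024

def run_with_memory_check(tests, run_single_test):
    proc = psutil.Process()            # the process being measured
    baseline = proc.memory_info().rss
    for test in tests:
        run_single_test(test)
        growth = proc.memory_info().rss - baseline
        # Log every sample so a trend is visible before any limit trips.
        print("%s: growth=%d bytes" % (test, growth))
        if growth > MEMORY_GROWTH_LIMIT:
            raise AssertionError("memory grew by %d bytes after %s"
                                 % (growth, test))

Something along those lines would let us see the growth in the logs
long before we hit the OOM asserts.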
The one thing I think we should note is that we shouldn't push the
sheriffs to re-open the tree when it's in a bad state. The sheriffs
never take closing the tree lightly, but if something needs to land
urgently you can always add the checkin-needed keyword to the bug and
the sheriffs will land it for you ASAP.
David
On 20/11/2013 16:20, Robert Kaiser wrote:
Nicholas Nethercote wrote:
It also assumes that we can backout stuff to fix
the problem; we tried that to some extent with the first OOM closure
-- it is the standard response to test failure, of course -- but it
didn't work.
Yes, the OOM issues that caused this closure are probably just a
symptom of a larger problem.
We've been seeing a step-by-step rise in OOM issues for quite some
time now, most visibly as an increase in crashes with empty dumps. I
raised the alarm about that in bug 837835, but we couldn't track down
a decent regression range (we mostly know in which 6-week cycle we had
regressions, and we can make some assumptions to narrow things down a
bit further on trunk, but not nearly well enough to get to candidate
checkins). Because of that, this has been lingering without any real
attempts to fix it, and from what I saw in the data, things have
actually gotten worse recently - and that's on the release channel, so
whatever might have increased trouble on trunk around this closure
comes on top of that.
Since in a lot of the cases we're seeing there's apparently too little
memory available for Windows to even create a minidump, we have very
little info about those issues - but we do have the additional
annotations we send along with the crash report, and those give us
enough info to suggest that in many cases we're running out of virtual
memory space but not necessarily of physical memory. As I'm told, that
can happen with VM fragmentation, for example, as well as with bugs
that map the same physical page into virtual memory over and over.
We're not sure if that's all in our code, or whether system code or
(graphics?) driver code is exposing issues to us there.
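To illustrate the distinction (just a rough sketch in Python via
ctypes, not our actual crash-annotation code): on Windows you can ask
kernel32's GlobalMemoryStatusEx how much physical memory and how much
of the process's virtual address space is still available, and in the
cases we're talking about the former can look healthy while the latter
is nearly gone:

import ctypes
import ctypes.wintypes as wt

class MEMORYSTATUSEX(ctypes.Structure):
    _fields_ = [("dwLength", wt.DWORD),
                ("dwMemoryLoad", wt.DWORD),
                ("ullTotalPhys", ctypes.c_ulonglong),
                ("ullAvailPhys", ctypes.c_ulonglong),
                ("ullTotalPageFile", ctypes.c_ulonglong),
                ("ullAvailPageFile", ctypes.c_ulonglong),
                ("ullTotalVirtual", ctypes.c_ulonglong),
                ("ullAvailVirtual", ctypes.c_ulonglong),
                ("ullAvailExtendedVirtual", ctypes.c_ulonglong)]

status = MEMORYSTATUSEX()
status.dwLength = ctypes.sizeof(MEMORYSTATUSEX)
ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(status))
# A 32-bit process can report plenty of free physical RAM (ullAvailPhys)
# while its own address space (ullAvailVirtual) is exhausted or fragmented.
print("avail physical:", status.ullAvailPhys,
      "avail virtual:", status.ullAvailVirtual)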
From what I know, bsmedberg and dmajor are looking into those issues
more closely, both because of the tree closure problem and because
this has been a lingering stability issue for months. I'm sure any
help with those efforts is appreciated, as those are tough issues, and
it may well be multiple problems that each contribute a share to the
overall picture.
Making us more memory-efficient sounds like a worthwhile goal overall
anyhow (even though the bullet of running out of VM space can be
dodged by switching to Win64, and/or by e10s giving us multiple
processes that each have their own 32-bit virtual address space - but
I'm not sure those should or will be our primary solutions).
I think in other cases, where the cause of the tree-closing issues is
easy to identify, a backout-based process can work better, but with
these OOM issues there's no clear patch or patch set to point to.
IMHO, we should work on the overall cluster of OOM issues, though.
KaiRo
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform