I think the crux of reducing the time m-i is closed is either to have better monitoring of things like memory usage during testing, so we can see whether it is growing as the tests run, or to change the tests so that we no longer care about that assertion, or to hide the test on TBPL so the sheriffs ignore it.

The best answer is the first one: add more monitoring so we catch this sooner.
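Something like this is what I have in mind (a rough, hypothetical sketch in Python, assuming psutil is available on the test machines - not existing harness code): poll the browser process's resident memory between test chunks and emit a warning as soon as it grows past a threshold, so the trend shows up on TBPL well before we hit OOM oranges.

  import psutil  # assumption: psutil installed on the test slaves

  GROWTH_LIMIT_MB = 200  # hypothetical per-chunk growth threshold

  def check_memory_growth(browser_pid, baseline_rss_mb):
      # Return current RSS in MB; warn if it grew past the threshold.
      rss_mb = psutil.Process(browser_pid).memory_info().rss / (1024.0 * 1024.0)
      growth = rss_mb - baseline_rss_mb
      if growth > GROWTH_LIMIT_MB:
          # Start as a non-fatal log line; promote it to an orange once
          # we trust the numbers and thresholds.
          print("WARNING: RSS grew %.1f MB during this chunk" % growth)
      return rss_mb

The harness would call check_memory_growth() after each chunk, carrying the returned value forward as the new baseline, so steady growth across a run becomes visible in the logs.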

The one thing I think we should note is that we shouldn't push the sheriffs to re-open the tree when it's in a bad state. The sheriffs never take closing the tree lightly, but if something needs to land urgently you can always add the checkin-needed keyword to the bug and the sheriffs will land it for you ASAP.

David



On 20/11/2013 16:20, Robert Kaiser wrote:
Nicholas Nethercote schrieb:
It also assumes that we can back out stuff to fix
the problem; we tried that to some extent with the first OOM closure
-- it is the standard response to test failure, of course -- but it
didn't work.

Yes, the OOM issues that caused this closure are probably just a symptom of a larger problem.

We've seen a steady rise in OOM issues over quite some time now, most visibly as an increase in crashes with empty dumps. I called attention to that in bug 837835, but we couldn't track down a decent regression range (we mostly know in which 6-week cycle we had regressions, and we can make some assumptions to narrow things down a bit further on trunk, but not nearly well enough to get to candidate checkins). Because of that, this has been lingering without any real attempt at a fix, and from what I saw in the data, things have even gotten worse recently - and that's on the release channel, so whatever may have increased troubles on trunk around this closure comes on top of that.

Since in a lot of the cases we're seeing there is apparently too little memory available for Windows to even create a minidump, we have very little info about those issues - but we do have the additional annotations we send along with the crash report, and AFAIK those suggest that in many cases we're running out of virtual memory space but not necessarily out of physical memory. As I'm told, that can happen, for example, through VM fragmentation, or through bugs that map the same physical page over and over into virtual memory. We're not sure whether that's all in our own code or whether system code or (graphics?) driver code exposes issues to us there.
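To illustrate that second scenario, here is a minimal, purely illustrative sketch (Windows-only, Python with ctypes; not anything from our code): every view below maps the same 64 KiB of pagefile-backed physical pages, yet each view claims a fresh slice of the process's virtual address space, so a 32-bit process runs out of address space long before it runs out of physical memory.

  import ctypes
  from ctypes import wintypes

  kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
  kernel32.CreateFileMappingW.argtypes = [wintypes.HANDLE, ctypes.c_void_p,
                                          wintypes.DWORD, wintypes.DWORD,
                                          wintypes.DWORD, wintypes.LPCWSTR]
  kernel32.CreateFileMappingW.restype = wintypes.HANDLE
  kernel32.MapViewOfFile.argtypes = [wintypes.HANDLE, wintypes.DWORD,
                                     wintypes.DWORD, wintypes.DWORD,
                                     ctypes.c_size_t]
  kernel32.MapViewOfFile.restype = ctypes.c_void_p

  PAGE_READWRITE = 0x04
  FILE_MAP_READ = 0x0004
  VIEW_SIZE = 64 * 1024  # one allocation-granularity unit

  # One small pagefile-backed section: a single set of physical pages.
  section = kernel32.CreateFileMappingW(wintypes.HANDLE(-1), None,
                                        PAGE_READWRITE, 0, VIEW_SIZE, None)

  views = []
  while True:
      # Each view maps the same physical pages but claims a new 64 KiB
      # region of virtual address space.
      view = kernel32.MapViewOfFile(section, FILE_MAP_READ, 0, 0, 0)
      if not view:
          break  # address space exhausted; physical memory barely moved
      views.append(view)

  print("views mapped before exhausting address space: %d" % len(views))

(Run it as a 32-bit process; with a 2 GB address space it gives up after at most ~32k views, while Task Manager shows almost no growth in physical memory use.)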

From what I know, bsmedberg and dmajor are looking into those issues more closely, both because of the tree closure and because this has been a lingering stability issue for months. I'm sure any help with those efforts is appreciated, as these are tough issues, and there may be multiple problems that each contribute a share to the overall picture.

Making ourselves more efficient with memory sounds like a worthwhile goal overall anyhow (even though the specific bullet of running out of VM space can be dodged by switching to Win64, and/or by e10s giving us multiple processes that each have their own 32-bit virtual memory space - though I'm not sure whether those should or will be our primary solutions).

I think in other cases, where a clear cause of the tree-closing issue is easy to identify, a backout-based process can work better, but with these OOM issues there is no clear patch or patch set to point to. IMHO, we should work on the overall cluster of OOM issues, though.

KaiRo
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
