Hi All,

TL;DR: The necko team has for months been chasing a Windows-only top
crasher. It is a shutdown hang - Bug 1158189. The crash stopped happening
on nightly-48 back in March and that ‘fixed state’ has been riding the
normal trains. Last weekend it returned to crashing on aurora-48 but not on
nightly-49. The data indicates that the toolchain changes are the reason.
We should talk about whether to put MSVC-2015 back on aurora-48 or to live
with the crashes for an extra 6 weeks.

This is a Windows-only bug and essentially boils down to non-blocking
networking operations sometimes blocking (maybe forever) inside system
calls. It impacts a range of calls - send, recv, poll, connect, etc. Often
LSPs and AV software are involved, but not always. Chrome has seen behavior
like this from time to time in the past, but anecdotally it is worse for
us. They aren’t sure if they have dealt with it since they changed to
MSVC-2015.

On 46.0.1 this is the #18 top crasher (about 0.8% of crashes). On 47 this
is the #2 crasher (about 3.5% of crashes). On 48 over the last 3 days it
is the #10 top crasher (1.25% of crashes), but it is in the noise for 48
when measured over the last few weeks because it only just started
recurring. It is not a factor on 49.

We honestly don’t know if this is only a shutdown hang or not. It certainly
could be triggered by the shutdown path, but it could just as easily be
happening during normal browsing, in which case the user’s reaction would
be to shut down the browser once networking (i.e. the socket thread)
appears hung.

During the 48 cycle we hadn’t yet figured out a plan to attack it directly,
and while we were inserting diagnostics for it, we also cleaned up every
somewhat related issue we could find. When the hang disappeared from the
nightly crash stats we attributed it to a second-order effect of a
different bugfix that landed at about the same time. Attempts to uplift
that fix to 47 did not help with crashes on 47. We attributed that to the
complex dependencies of the bug we uplifted (and eventually backed out of
47), but it now seems the primary reason was that the toolchain on 47 was
different: when the toolchain on 48 went back to MSVC-2013 last Friday,
the crashes returned on aurora-48. Version 49 (still MSVC-2015) has not
seen a crash.

The last nightly crash was in build 20160324030447 - the MSVC-2015 patch
landed 215 csets later on nightly-48. The crash was not seen again on 48 or
49 until aurora-48 build 20160514004011, which had the reversion to
MSVC-2013 just 31 csets earlier. Nightly-49, which has only ever had
MSVC-2015 as its compiler, has not seen the crash.

I’m not sure how to compare the size of the population impacted by the
crash vs the size of the population impacted by the SSE dependency. My
intuition says the no-SSE population is very small and we might be better
off overall with MSVC-2015 on the 48 channel. We’re going to orphan that
population eventually anyhow, but perhaps we want to live with the crashes
while we prep the infrastructure to deal with it, as Nathan mentions in a
different thread. I'm really torn.

Beyond the product tradeoffs, I am acutely aware that changing toolchains
is a real pain for everyone and going back and forth is kind of insane. I’m
sorry to even float the idea at this point - we hadn’t hypothesized that
the crash improved because of the change in msvc until it returned over the
weekend.


Thoughts?


-Patrick and Dragana

[This is a resend because filters hate me. My apologies if you receive it
twice.]