Hi All,

TL;DR: The necko team has for months been chasing a Windows-only top crasher. It is a shutdown hang - Bug 1158189. The crash stopped happening on nightly-48 back in March and that ‘fixed state’ has been riding the normal trains. Last weekend it returned to crashing on aurora-48 but not on nightly-49. The data indicates that the toolchain changes are the reason. We should talk about whether to put MSVC 2015 back on aurora-48 or to live with the crashes for an extra 6 weeks.

This is a Windows-only bug and essentially boils down to non-blocking networking operations sometimes blocking (maybe forever) inside system calls. It impacts a range of calls - send, recv, poll, connect, etc. Often LSPs and AV software are involved, but not always. Chrome has seen behavior like this from time to time in the past, but anecdotally it is worse for us. They aren’t sure if they have dealt with it since they changed to MSVC 2015.

On 46.0.1 this is the #18 top crasher (about 0.8% of crashes). On 47 this is the #2 crasher (about 3.5% of crashes). On 48 over the last 3 days it is the #10 top crasher (1.25% of crashes), but it is just in the noise for 48 when measured over the last few weeks because it only just started recurring. It is not a factor on 49.

We honestly don’t know if this is only a shutdown hang or not. It certainly could be triggered by the shutdown path, but just as easily this could be happening during normal browsing, and the user’s reaction would be to shut down the browser once networking (i.e. the socket thread) appears hung.

During the 48 cycle we hadn’t yet figured out a plan to attack it directly, and while we were inserting diagnostics for it, we also cleaned up every somewhat related issue we could find. When the hang disappeared from the nightly crash stats we attributed it to a second-order impact of a different bugfix that landed at about the same time. Attempts to uplift that fix to 47 did not help with crashes on 47, which we attributed to the complex dependencies of the bug we uplifted (and eventually backed out of 47). It now seems the primary reason was that the toolchain on 47 was different: when the toolchain on 48 went back to MSVC 2013 last Friday, the crashes returned on aurora-48. Version 49 (still MSVC 2015) has not seen a crash.

The last nightly crash was 20160324030447 - the MSVC 2015 patch landed 215 csets later on nightly-48.
The crash was not seen again on 48 or 49 until aurora-48 20160514004011, which had the reversion to MSVC 2013 just 31 csets earlier. Nightly-49, which has only ever had MSVC 2015 as its compiler, has not seen the crash.

I’m not sure how to compare the size of the population impacted by the crash vs the size of the population impacted by the SSE dependency. My intuition says the no-SSE population is very small and we might be better off overall with MSVC 2015 on the 48 channel. We’re going to orphan that population eventually anyhow, but perhaps we want to live with the crashes while we prep the infrastructure to deal with it, as Nathan mentions in a different thread. I’m really torn.

Beyond the product tradeoffs, I am acutely aware that changing toolchains is a real pain for everyone and going back and forth is kind of insane. I’m sorry to even float the idea at this point - we hadn’t hypothesized that the crash improved because of the change in MSVC until it returned over the weekend.

Thoughts?

-Patrick and Dragana

[This is a resend because filters hate me. My apologies if you receive it twice.]