Right now our stability efforts are primarily focused on crashes. However, as we have been very successful at reducing our crash rate, some other stability issues which are not crashes have come to be more prominent. Some examples:

* very slow startup
* very slow/hung shutdown
* hangs while running
* JS errors which cause parts or all of the UI to stop functioning properly
* localization errors, especially entity/DTD errors which cause parts of the UI to be ugly or missing

Prompted by several discussions during the stability work week, we need to broaden our focus within stability and deal with many more of these kinds of events.

The first technical step in this effort needs to be a unified log of failure events which includes all types of failure events. This will enable two new features right away:

* When a failure event happens and then there is a crash, all the failure events leading up to the crash should be contained within the crash report.
* Support-facing mechanisms (about:support or perhaps the web frontend for FHR) will be able to display recent error events to the user and allow the log to be copied into SUMO issues or bug reports.
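
As a rough illustration of the first point (a sketch only, not a proposed design): a bounded buffer of recent failure events that can be serialized into a single crash-report field. The `FailureLog` class, the event kind names, and the `to_crash_annotation` method are all hypothetical names invented for this example; the kinds simply mirror the examples listed above.

```python
import json
import time
from collections import deque

# Hypothetical event kinds, mirroring the examples earlier in this message.
EVENT_KINDS = {"slow-startup", "slow-shutdown", "hang", "js-error", "l10n-error"}


class FailureLog:
    """Keeps only the most recent failure events in memory (illustrative)."""

    def __init__(self, max_events=100):
        # A deque with maxlen silently drops the oldest entry once it is full.
        self._events = deque(maxlen=max_events)

    def record(self, kind, message, timestamp=None):
        """Store one failure event and return the stored record."""
        if kind not in EVENT_KINDS:
            raise ValueError("unknown failure kind: %s" % kind)
        event = {
            "time": time.time() if timestamp is None else timestamp,
            "kind": kind,
            "message": message,
        }
        self._events.append(event)
        return event

    def recent(self):
        """Return the retained events, oldest first."""
        return list(self._events)

    def to_crash_annotation(self):
        """Serialize the retained events as one JSON string."""
        return json.dumps(self.recent(), sort_keys=True)
```

The same `recent()` list could back a support-facing view, since both features read from one log.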

After we've tested logging features in the wild, we will likely build this out into a more complete support mechanism:

* include counts/histograms of error events within the FHR payload itself, to correlate errors across user populations and identify common causes
* combine the about:support and FHR user interfaces into a unified troubleshooting UI that allows users to submit error reports for non-crash events, including comments about their issues, and hopefully provides users with automated solutions to common problems (on B2G, this will be a support/troubleshooting app built into the system?)
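
To make the counts/histograms idea concrete, here is a minimal sketch that reduces a list of events to per-day counts by kind. The event shape (a dict with `time` in seconds since the epoch and a `kind` string) and the `daily_counts` function are assumptions made for this example; this is not the actual FHR payload format.

```python
from collections import Counter
from datetime import datetime, timezone


def daily_counts(events):
    """Group events by UTC day and count them per kind.

    Returns a dict such as {"2013-05-01": {"hang": 2, "js-error": 1}}.
    """
    counts = {}
    for event in events:
        # Bucket by UTC calendar day so counts are comparable across users.
        day = datetime.fromtimestamp(event["time"], tz=timezone.utc).strftime("%Y-%m-%d")
        counts.setdefault(day, Counter())[event["kind"]] += 1
    return {day: dict(kinds) for day, kinds in sorted(counts.items())}
```

Reporting only counts keeps the payload small and avoids shipping message text that might contain user data.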

Technically, I'm not exactly sure how to accomplish this kind of logging, but whatever system we have should be fairly robust:

* the log must be writable from multiple processes, for B2G, multiprocess Firefox, and even Firefox webapp support (note that hopefully soon we'll be collecting crash reports from every process on B2G devices using debuggerd, not just the B2G/app/content processes)
* the log must be writable from multiple threads (even if the main thread is deadlocked) so that we can monitor and write hang-detector information to the log
* individual log entries such as hang reports may need to contain data (such as SPS profiles, or invalid responses from Mozilla services)
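
One possible starting point for these requirements, offered as a sketch under stated assumptions rather than an answer: every writer opens the same file in append mode and emits each entry as a single JSON line in a single `write()` call, with binary data base64-encoded inline. The function names and the `data_b64` field are hypothetical. The approach assumes a local filesystem where appends from different processes and threads do not interleave; that is not guaranteed everywhere (network filesystems, for example), and a very large entry such as a full profile could be written only partially, so the reader skips lines it cannot parse.

```python
import base64
import json
import os
import time


def append_event(path, kind, message, data=None):
    """Append one failure event to the shared log file as a JSON line."""
    entry = {
        "time": time.time(),
        "pid": os.getpid(),
        "kind": kind,
        "message": message,
    }
    if data is not None:
        # Binary payloads are base64-encoded so each entry stays on one line.
        entry["data_b64"] = base64.b64encode(data).decode("ascii")
    line = (json.dumps(entry) + "\n").encode("utf-8")
    # O_APPEND asks the OS to position every write at the current end of file.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        os.write(fd, line)
    finally:
        os.close(fd)


def read_events(path):
    """Read all parseable entries, skipping truncated or corrupt lines."""
    events = []
    with open(path, "rb") as log_file:
        for raw in log_file:
            try:
                events.append(json.loads(raw.decode("utf-8")))
            except ValueError:
                # Covers both invalid JSON and invalid UTF-8.
                continue
    return events
```

Nothing here takes a lock or touches the main thread, so a hang detector on a separate thread could still call `append_event` while the main thread is deadlocked.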

Does anyone know of prior art that we could apply to this problem, or have suggestions for how to implement this kind of logging safely, correctly, and efficiently? It's possible that the system will need to be different across platforms, using a logging service on B2G, some kind of native logging system on Android, and a custom-built system on desktop.

If people have suggestions for other types of error log events that should be included in this system, please let me know.

--BDS

