Right now our stability efforts are primarily focused on crashes.
However, as we have been very successful at reducing our crash rate,
other stability issues which are not crashes have become more
prominent. Some examples:
* very slow startup
* very slow/hung shutdown
* hangs while running
* JS errors which cause parts or all of the UI to stop functioning properly
* localization errors, especially entity/DTD errors which cause parts of
the UI to be ugly or missing
Prompted by several discussions during the stability work week, we need
to broaden our focus within stability and deal with many more of these
kinds of events.
The first technical step in this effort needs to be a unified log of
failure events which includes all types of failure events. This will
enable two new features right away:
* When a failure event happens and then there is a crash, all the
failure events leading up to the crash should be contained within the
crash report.
* Support-facing mechanisms (about:support or perhaps the web frontend
for FHR) will be able to display recent error events to the user and
allow the log to be copied into SUMO issues or bug reports.
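
To make the idea concrete, here is a minimal sketch of what a unified
failure-event record and a bounded "recent events" buffer might look
like. Everything here is hypothetical: the names FailureEvent and
FailureLog, the event kinds, and the JSON shape are assumptions for
illustration, not an existing Mozilla API. The serialized form is the
sort of thing that could be attached to a crash report or shown in
about:support.

```python
import json
import time
from collections import deque
from dataclasses import dataclass, field, asdict


@dataclass
class FailureEvent:
    # Hypothetical event kinds: "hang", "js-error", "l10n-error", ...
    kind: str
    message: str
    timestamp: float = field(default_factory=time.time)
    data: dict = field(default_factory=dict)


class FailureLog:
    """Keeps only the most recent max_events failure events in memory."""

    def __init__(self, max_events=100):
        # A deque with maxlen silently drops the oldest entry when full.
        self._events = deque(maxlen=max_events)

    def record(self, kind, message, **data):
        event = FailureEvent(kind, message, data=data)
        self._events.append(event)
        return event

    def recent(self, limit=None):
        events = list(self._events)
        if limit is None:
            return events
        return events[max(0, len(events) - limit):]

    def to_json(self):
        # Text form suitable for copying into a bug or a crash annotation.
        return json.dumps([asdict(e) for e in self._events])
```

A bounded buffer is only one possible choice; it keeps the size of the
crash report predictable at the cost of losing older events.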
After we've tested logging features in the wild, we will likely build
this out into a more complete support mechanism:
* include counts/histograms of error events within the FHR payload
itself, to correlate errors across user populations and identify common
causes
* combine about:support and FHR user interfaces into a unified
troubleshooting UI which allows users to submit error reports for
non-crash events, including comments about their issues, and which
hopefully provides users with automated solutions to common problems
(on B2G, this will be a support/troubleshooting app built into the
system?)
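
As a rough illustration of the counts idea, the summary could be as
simple as a per-kind tally over logged events. The function name and
the "kind" field are assumptions made for this sketch; the real FHR
payload format is not specified here.

```python
from collections import Counter


def count_events_by_kind(events):
    """Return a {kind: count} summary for an iterable of event dicts."""
    return dict(Counter(event["kind"] for event in events))


# Hypothetical sample data, for illustration only.
sample = [
    {"kind": "hang"},
    {"kind": "js-error"},
    {"kind": "hang"},
]
summary = count_events_by_kind(sample)
```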
Technically, I'm not exactly sure how to accomplish this kind of
logging, but whatever system we have should be fairly robust:
* the log must be writable from multiple processes, for B2G,
multiprocess Firefox, and even Firefox webapp support (note that
hopefully soon we'll be collecting crash reports from every process on
B2G devices using debuggerd, not just the B2G/app/content processes)
* the log must be writable from multiple threads (even if the main
thread is deadlocked) so that we can monitor and write hang-detector
information to the log
* individual log entries such as hang reports may need to contain data
(such as SPS profiles, or invalid responses from Mozilla services)
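
One possible starting point for the multi-process and multi-thread
requirement is an append-only file where each event is written as a
single line with one write() call on a descriptor opened with O_APPEND.
This is only a sketch under assumptions: it relies on POSIX-style
append semantics on a local filesystem, it does not handle partial
writes or very large entries, and the function names and record fields
are made up for illustration. It takes no lock, so a writer on a
background thread does not depend on the main thread.

```python
import json
import os
import threading
import time


def append_event(path, kind, message, data=None):
    """Append one failure event to a shared log as a single JSON line."""
    record = {
        "time": time.time(),
        "pid": os.getpid(),
        "thread": threading.get_ident(),
        "kind": kind,
        "message": message,
        "data": data or {},
    }
    line = (json.dumps(record) + "\n").encode("utf-8")
    # O_APPEND makes the kernel position each write at the end of the file,
    # so writers in different processes do not need to coordinate offsets.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        os.write(fd, line)  # assumption: small records go out in one write
    finally:
        os.close(fd)


def read_events(path):
    """Read back all events, one JSON object per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Larger payloads such as profiles would probably be better stored in
separate files and referenced from the log entry, but that is a design
question this sketch does not try to answer.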
Does anyone know of prior art that we could apply to this problem, or
suggestions for how to implement this kind of logging safely, correctly,
and efficiently? It's possible that the system will need to be different
across platforms, using a logging service on B2G, some kind of native
logging system on Android, and a custom-built system on desktop.
If people have suggestions for other types of error log events that
should be included in this system, please let me know.
--BDS