Logging of "quality events"

Benjamin Smedberg Wed, 18 Sep 2013 13:54:45 -0700

Right now our stability efforts are primarily focused on crashes. However, as we have been very successful at reducing our crash rate, some other stability issues which are not crashes have come to be more prominent. Some examples:


* very slow startup
* very slow/hung shutdown
* hangs while running
* JS errors which cause parts or all of the UI to stop functioning properly
* localization errors, especially entity/DTD errors which cause parts of the UI to be ugly or missing

Prompted by several discussions during the stability work week, we need to broaden our focus within stability and deal with many more of these kinds of events.

The first technical step in this effort needs to be a unified log of failure events which includes all types of failure events. This will enable two new features right away:

* When a failure event happens and then there is a crash, all the failure events leading up to the crash should be contained within the crash report.
* Support-facing mechanisms (about:support or perhaps the web frontend for FHR) will be able to display recent error events to the user and allow the log to be copied into SUMO issues or bug reports.
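
To make the first point concrete, here is a rough sketch (in Python, purely for illustration) of keeping a bounded list of recent failure events and attaching it to a crash report. Everything in it is an assumption rather than an existing API: the `RecentEvents` class, the `annotate_crash_report` helper, and the `FailureEvents` field name are all hypothetical.

```python
import collections
import json
import threading
import time


class RecentEvents:
    """Bounded in-memory record of the most recent failure events."""

    def __init__(self, capacity=100):
        # A deque with maxlen silently drops the oldest entry when full.
        self._events = collections.deque(maxlen=capacity)
        self._lock = threading.Lock()

    def record(self, kind, message):
        with self._lock:
            self._events.append(
                {"time": time.time(), "kind": kind, "message": message}
            )

    def snapshot(self):
        # Copy under the lock so callers get a stable list.
        with self._lock:
            return list(self._events)


def annotate_crash_report(report, recent):
    """Return a copy of `report` with recent events attached.

    "FailureEvents" is a made-up field name used only for this sketch.
    """
    annotated = dict(report)
    annotated["FailureEvents"] = json.dumps(recent.snapshot())
    return annotated
```

A real implementation probably could not rely on in-process memory being usable at crash time, so this only shows the shape of the data, not how it would be captured.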

After we've tested logging features in the wild, we will likely build this out into a more complete support mechanism:

* include counts/histograms of error events within the FHR payload itself, to correlate errors across user populations and identify common causes
* combine about:support and FHR user interfaces into a unified troubleshooting UI and allow users to submit error reports for non-crash events, including comments about their issues and hopefully provide users with automated solutions to common problems (on B2G, this will be a support/troubleshooting app built into the system?)

Technically, although I'm not exactly sure how to accomplish this kind of logging, whatever system we have should be fairly robust:

* the log must be writable from multiple processes, for B2G, multiprocess Firefox, and even Firefox webapp support (note that hopefully soon we'll be collecting crash reports from every process on B2G devices using debuggerd, not just the B2G/app/content processes)
* the log must be writable from multiple threads (even if the main thread is deadlocked) so that we can monitor and write hang-detector information to the log
* individual log entries such as hang reports may need to contain data (such as SPS profiles, or invalid responses from Mozilla services)
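
As a starting point for discussion, the requirements above could be sketched roughly like this (Python, for illustration only). The assumptions: the log is a regular file opened with `O_APPEND`, each event is a single JSON line written with one `os.write` call, and binary data is base64-encoded into the entry. The function names and record fields are hypothetical. On POSIX systems `O_APPEND` positions every write at the end of the file, so writers in different threads or processes do not overwrite each other, but this sketch does not handle partial writes, very large payloads, or platforms with weaker append guarantees.

```python
import base64
import json
import os
import threading
import time


def open_event_log(path):
    # O_APPEND: the kernel moves to end-of-file before every write.
    return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)


def log_event(fd, kind, message, payload=None):
    """Append one event as a single JSON line (hypothetical record layout)."""
    record = {
        "time": time.time(),
        "pid": os.getpid(),
        "thread": threading.get_ident(),
        "kind": kind,
        "message": message,
    }
    if payload is not None:
        # Binary data (e.g. a profile) is stored as base64 text.
        record["payload_b64"] = base64.b64encode(payload).decode("ascii")
    # json.dumps escapes embedded newlines, so one record is one line.
    line = json.dumps(record).encode("utf-8") + b"\n"
    # A single write call per record; no user-space lock is taken, so a
    # deadlocked thread elsewhere does not block this one.
    os.write(fd, line)


def read_events(path):
    """Read the log back, e.g. for a support page or a crash report."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A hang detector thread could then call something like `log_event(fd, "hang", "main thread unresponsive", payload=profile_bytes)` without touching the main thread.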

Does anyone know of prior art that we could apply to this problem, or suggestions for how to implement this kind of logging safely, correctly, and efficiently? It's possible that the system will need to be different across platforms, using a logging service on B2G, some kind of native logging system on Android, and a custom-built system on desktop.

If people have suggestions for other types of error log events that should be included in this system, please let me know.


--BDS

_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
