Re: Logging of "quality events"

Gregory Szorc Thu, 19 Sep 2013 10:38:20 -0700

On 9/18/2013 1:52 PM, Benjamin Smedberg wrote:
> Right now our stability efforts are primarily focused on crashes.
> However, as we have been very successful at reducing our crash rate,
> some other stability issues which are not crashes have come to be more
> prominent. Some examples:
> 
> * very slow startup
> * very slow/hung shutdown
> * hangs while running
> * JS errors which cause parts or all of the UI to stop functioning properly
> * localization errors, especially entity/DTD errors which cause parts of
> the UI to be ugly or missing
> 
> Prompted by several discussions during the stability work week, we need
> to broaden our focus within stability and deal with many more of these
> kinds of events.
> 
> The first technical step in this effort needs to be a unified log of
> failure events which includes all types of failure events. This will
> enable two new features right away:
> 
> * When a failure event happens and then there is a crash, all the
> failure events leading up to the crash should be contained within the
> crash report.
> * Support-facing mechanisms (about:support or perhaps the web frontend
> for FHR) will be able to display recent error events to the user and
> allow the log to be copied into SUMO issues or bug reports.
> 
> After we've tested logging features in the wild, we will likely build
> this out into a more complete support mechanism:
> * include counts/histograms of error events within the FHR payload
> itself, to correlate errors across user populations and identify common
> causes
> * combine about:support and FHR user interfaces into a unified
> troubleshooting UI and allow users to submit error reports for non-crash
> events, including comments about their issues and hopefully provide
> users with automated solutions to common problems (on B2G, this will be
> a support/troubleshooting app built into the system?)
> 
> Technically, though, I'm not exactly sure how to accomplish this kind of
> logging. Whatever system we have should be fairly robust:
> 
> * the log must be writable from multiple processes, for B2G,
> multiprocess Firefox, and even Firefox webapp support (note that
> hopefully soon we'll be collecting crash reports from every process on
> B2G devices using debuggerd, not just the B2G/app/content processes)
> * the log must be writable from multiple threads (even if the main
> thread is deadlocked) so that we can monitor and write hang-detector
> information to the log
> * individual log entries such as hang reports may need to contain data
> (such as SPS profiles, or invalid responses from Mozilla services)
> 
> Does anyone know of prior art that we could apply to this problem, or
> suggestions for how to implement this kind of logging safely, correctly,
> and efficiently? It's possible that the system will need to be different
> across platforms, using a logging service on B2G, some kind of native
> logging system on android, and a custom-built system on desktop.
> 
> If people have suggestions for other types of error log events that
> should be included in this system, please let me know.


Operating systems provide various logging facilities that are arguably
suitable for our needs. e.g. Windows has Events [1] with a few different
APIs depending on the version of Windows. If we weren't interested in
reading events in process, I'd say just emit events using the "native"
API on the platform, ship those logs over the wire, and let a
"downstream" system worry about decoding them. However, we want to
consume this output inside Firefox, so a unified API across platforms is
certainly less work to implement (by the time you have written middleware
to manage the differences between all the platforms' native APIs, I think
we would have been better off writing something independent of them).

The requirement for handling events from different processes/threads and
writing to a unified log is going to be challenging, especially if you
are worried about deadlocks and allocator issues. This seems to require
an out-of-process solution.
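
A minimal sketch of the single-writer idea, in Python for brevity. Threads
and an in-memory queue stand in for separate processes and a real IPC
channel (a pipe or socket); that substitution is an assumption made to keep
the example self-contained, and the names `producer`, `collector`, and
`run_demo` are hypothetical. The point is that producers only hand events
off, and exactly one collector ever touches the log.

```python
import json
import queue
import threading
import time


def producer(channel, source, count):
    """Emit `count` events; a producer never touches the log directly."""
    for seq in range(count):
        channel.put({
            "source": source,      # which process/thread reported the event
            "seq": seq,            # per-source sequence number
            "time": time.time(),
            "type": "hang",        # illustrative event type
        })


def collector(channel, sink):
    """Sole writer: drain the channel and append one JSON line per event."""
    while True:
        event = channel.get()
        if event is None:          # sentinel: all producers have finished
            break
        sink.append(json.dumps(event, sort_keys=True))


def run_demo(num_producers=4, events_each=25):
    channel = queue.Queue()        # stand-in for a pipe or socket
    log_lines = []                 # stand-in for an append-only log file
    writer = threading.Thread(target=collector, args=(channel, log_lines))
    writer.start()
    workers = [
        threading.Thread(target=producer,
                         args=(channel, "worker-%d" % n, events_each))
        for n in range(num_producers)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    channel.put(None)              # tell the collector to stop
    writer.join()
    return log_lines
```

Because only the collector writes, a producer that deadlocks or crashes
cannot leave the log half-written; at worst its later events are missing.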

One question I didn't see answered in your original post is durability.
How long do events need to be persisted? Do they need to survive process
restarts? Machine restarts?

There is no shortage of standalone projects/daemons that handle log
aggregation. When you require Windows support, want to minimize shipping
size, and rule out Java (dependency issues), that leaves us with C/C++
and any languages that compile down with minimal dependencies. I'll
throw Go and Lua into the mix. Possibly even JS if we link against SM.
Possibly even Rust?

Some ideas:

* syslog/syslog-ng/rsyslog/d-bus. I /think/ you can run these as
standalone services and I /think/ some have been ported to Windows.
Whether the message format is suitable or not is an open question.
* Heka [2]. Probably overkill. And the current binary is a little large.
But we could probably convince people to prune the feature set into
something shippable with Firefox.
* zippylog [3]. A little project I wrote a few years ago. You could
steal the message code and replace the 0MQ-based IPC with something
else. Or, we could ship 0MQ (or one of its derivatives) in the tree so
we can finally have a sane, high-performance IPC mechanism there.
* Roll your own. It's really not that much work. Hardest part is
agreeing on the message storage format and IPC. I've done lots of work
on these types of systems on the server side and could provide technical
advice.
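
To make the "message storage format" question concrete, here is one
possible framing, sketched in Python: each record is a 4-byte
little-endian length followed by a UTF-8 JSON payload. The format and the
names `write_record` and `read_records` are assumptions for illustration,
not anything zippylog, Heka, or Firefox actually uses. Length-prefixing
lets a reader detect and skip a record that was cut short when the writer
died.

```python
import io
import json
import struct

HEADER = struct.Struct("<I")  # 4-byte little-endian payload length


def write_record(stream, event):
    """Append one event as a length-prefixed JSON payload."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    # Build header and payload together so the record goes out in one write.
    stream.write(HEADER.pack(len(payload)) + payload)


def read_records(stream):
    """Yield complete records; stop quietly at a truncated tail."""
    while True:
        header = stream.read(HEADER.size)
        if len(header) < HEADER.size:
            return  # clean end of log, or a header that was cut short
        (length,) = HEADER.unpack(header)
        payload = stream.read(length)
        if len(payload) < length:
            return  # writer died mid-record; ignore the partial entry
        yield json.loads(payload.decode("utf-8"))
```

Usage: write a few events to an `io.BytesIO()` (or a file opened in binary
append mode), then iterate `read_records()` over the same bytes to get the
events back in order. Entries that carry large binary data, such as
profiles, would probably want a different payload encoding than JSON.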

[1] http://msdn.microsoft.com/en-us/library/windows/desktop/aa964766%28v=vs.85%29.aspx
[2] https://blog.mozilla.org/services/2013/07/16/heka-0-3-released/
[3] https://github.com/indygreg/zippylog/wiki
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
