Quantum Flow Engineering Newsletter #3

Ehsan Akhgari Thu, 23 Mar 2017 11:25:41 -0700

Hi everyone,

Another week, another Quantum Flow engineering newsletter!  We have a lot
to cover, so let me get started.

Michael Layzell is getting really close on his work on bug 1346415
<https://bugzilla.mozilla.org/show_bug.cgi?id=1346415> in order to collect
native stacks from Background Hang Reports through telemetry on Nightly.
There are several practical concerns around this data collection, things
such as not blowing up our telemetry ping size, and also the processing of
this data on the server side, and we have some ideas on how we can improve
this in the future. Since some data is better than no data, we're trying
to start by having each client send a maximum of 300 of these native
stacks in each ping, and will hopefully grow this limit in
the future to be able to collect more data. He has also been helping with
writing some scripts for post-processing this data so that we can have an
automatically generated nightly report set from these pings to triage. The
triage itself, of course, will be a manual, excruciating (read: "fun"!)
process for now, until we think of something better.

We have finished an initial round of triage of the Quantum Flow bugs. We
are using a few tags, which are all described here
<https://docs.google.com/document/d/1Ka8eNAISQodT1mS_OXapFG-_kk94GoXyo4eKH1j7EV4/edit#heading=h.g074di4nyf2m>.
The most important bug tag to pay attention to at this point is [qf:p1] in
the status whiteboard field. This tag means we believe this bug may have a
large impact on performance, and it needs to be fixed *now*. We try our
best to make it obvious why we believe this to be the case, and of course
not all [qf:p1] bugs are of the same level of importance, but if you
believe there is strong evidence why a [qf:p1] bug isn't of utmost
importance for performance, please feel free to raise the issue on the bug;
it's best to correct any possible triage mistakes as soon as we can.
Otherwise, we really appreciate your assistance in addressing these bugs.
Note that we are dealing with a massive project (making the entire web
browser faster for all users in all usage scenarios) under a very strict
timeline (by Firefox 57!) and the longer we let these bugs live in Firefox,
the longer they can mask smaller and less severe performance issues,
putting the entire effort at risk.

Next week we are going to have a work week around Quantum Flow in the
Toronto office. There are many people attending from different parts of
Mozilla and it's going to be a really exciting and super packed week.
Several things excite me personally. I expect to spend some more time
profiling and delving down into technical issues. I also expect to spend
some time talking to people on various teams about how we can facilitate
getting more help from even more engineers on fixing the bugs that we are
finding. One of my goals is to make the bottleneck of our pipeline be the
discovering of new issues to fix, and I hope to get closer to achieving
that after next week. Another exciting thing happening next week is that
we have some members from the Quantum DOM team also attending the work week
(including myself, as I'm still involved in that project as well.) We're
hopefully going to have a more concrete plan around cooperative
scheduling of JavaScript running on web pages, which is a really important
part of the overall picture of improving the performance of the
browser. I don't expect to be able to send out one of these newsletters
next week though, so expect the next one in two weeks!

Now I want to talk a bit about our synchronous IPCs. I've talked about
them before, but they deserve more air time, as based on the data we have
so far, they are one of our biggest performance issues at this point. I
have been thinking about good ways of making the extent of the problem more
obvious. We already have a tracker bug
<https://bugzilla.mozilla.org/show_bug.cgi?id=SyncIPC>, and some people
have been helping with a few of these bugs (see below), but I still think
our progress on this issue could be better. So let's open up this closet
and take a look at our skeletons, shall we?

I have prepared a Sync IPC Report for 2017-03-23
<https://docs.google.com/spreadsheets/d/1x_BWVlnQPg0DHbsrvPFX7g89lnFGa3lAIHWD_pLa_dE/edit#gid=844442583&fvid=785100780>.
It's a spreadsheet, with a chart! So cool. The first thing you'll notice
is that I'm not great at data visualization. :-) With that out of the
way, let's look at the data. We could sort this data in various ways, but
I have chosen to stick to something super simple: sorting it in descending
order of the median time of the sync IPC multiplied by the number of times
it happens in the wild. You can inspect the data yourself, but here is a
human-readable summary of where we are now:

- PCookieService::Msg_GetCookieString
<https://bugzilla.mozilla.org/show_bug.cgi?id=1331680> (aka, what
happens when a page calls document.cookie!) at 34%. This is the most
horrible sync IPC that we have (and it's one of the most popular APIs on
the web.) Amy Chung is actively working on fixing this, and Josh Matthews
is helping her with providing feedback on her patch. Thanks to you both!
- PContent::Msg_RpcMessage and PBrowser::Msg_RpcMessage at 26.9%. Together,
these two form a big bucket consisting of all of the sync IPCs
triggered from JS. In order to stop flying blind here, bug 1348113
<https://bugzilla.mozilla.org/show_bug.cgi?id=1348113> was filed to
collect specific telemetry on this bucket. I recently found out that a
page calling navigator.userAgent to do UA sniffing (which is also super
common) can result in sync IPC
<https://bugzilla.mozilla.org/show_bug.cgi?id=1347425> that happens
through JS and this stayed hidden from us for a long time in this telemetry
data...
- A number of PScreenManager sync IPC messages
<https://bugzilla.mozilla.org/show_bug.cgi?id=1194751> at 12.8%. Kan-Ru
Chen has done some amazing work to fix all of them, and the patch set is
really close to landing.
- Then there is a bit of a longer tail, and I have looked at some of
them in some detail:
   - CPOW overhead: basically PJavaScript and anything under it. Some
   of this could be caused by add-ons that aren't e10s compatible yet. I
   need to investigate more to get a better sense of how true this
   statement is!
   - Graphics initialization sync IPCs: PContent::Msg_GetGfxVars
   <https://bugzilla.mozilla.org/show_bug.cgi?id=1337062> and
   PContent::Msg_GetGraphicsDeviceInitData
   <https://bugzilla.mozilla.org/show_bug.cgi?id=1337063>. These should
   be easy to fix but we've had a bit of a difficult time getting help in
   fixing them. Gerald Squelart has recently stepped up to the task,
   thanks Gerald! These are important for navigation performance, as I
   mentioned in my previous newsletter.
   - PContent::Msg_CreateWindow
   <https://bugzilla.mozilla.org/show_bug.cgi?id=1343728>. This one
   also has a pretty bad impact on navigation, even when we don't need to
   start a new content process! I have a patch that fixes this enough to
   make things work for basic browsing, but it's far from passing tests
   still...
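
As a rough illustration of the ordering used for the list above (median
time of a sync IPC multiplied by how often it happens, sorted in descending
order), here is a minimal Python sketch. The class and function names, the
message names and the numbers are all hypothetical, and treating each
percentage as a message's share of the summed scores is an assumption, not
something the report states.

```python
from dataclasses import dataclass


@dataclass
class SyncIpcStat:
    """One row of a hypothetical sync IPC report."""
    name: str
    median_ms: float  # median duration of one call (unit is an assumption)
    count: int        # number of times the message was observed


def rank_by_total_cost(stats):
    """Return (name, share) pairs sorted by median * count, largest first.

    share is each message's fraction of the summed scores; reading the
    newsletter's percentages this way is an assumption.
    """
    scored = [(s.name, s.median_ms * s.count) for s in stats]
    total = sum(score for _, score in scored)
    if total == 0:
        return [(name, 0.0) for name, _ in scored]
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return [(name, score / total) for name, score in ranked]


# Made-up numbers, for illustration only.
sample = [
    SyncIpcStat("MessageA", median_ms=2.0, count=100),
    SyncIpcStat("MessageB", median_ms=10.0, count=50),
    SyncIpcStat("MessageC", median_ms=1.0, count=300),
]
ranking = rank_by_total_cost(sample)
for name, share in ranking:
    print(f"{name}: {share:.1%}")
```

With this scoring, a message that is individually cheap but very frequent
can still rank above a slow but rare one, which is why both factors are
multiplied rather than sorting on either one alone.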

If you see an IPC message on this list that looks familiar to you and
doesn't have a bug that tracks fixing it already, please feel free to file
one. If you are familiar with an area of the code where one of these
messages is being used, please consider fixing one or two. :-)

Now, it's time for our performance story of the week! This time we're
going to look at how not to do off-main-thread I/O. Usually when people
talk about avoiding main thread I/O, the goal is to make it so that the
main thread doesn't end up calling a function that could end up being
blocked until the (potentially spinning) disk finishes an I/O operation.
Typically this is done in one of two ways: either using a non-blocking
I/O API that the underlying OS provides (to get the OS to call you back
when the I/O is finished), or making a background thread call the blocking
function and notify the main thread itself when it is done. In our
implementation of XMLHttpRequest in Gecko, in order to support the blob
response type, we
need to open a temporary file to write the incoming data to. Opening this
file is an I/O operation, and we use the second strategy in order to avoid
main-thread I/O. Now, it turns out that we had this code
<http://searchfox.org/mozilla-central/rev/a5c2b278897272497e14a8481513fee34bbc7e2c/dom/file/MutableBlobStorage.cpp#123>
which was expecting NS_OpenAnonymousTemporaryFile() to fail in the
sandboxed content process where, the author expected, opening the temporary
file handle would fail. But that wasn't what that function was doing
at all! That function was doing everything in its power to do what the caller
asked it to, that is, to open an anonymous temporary file. The way that
the function did it
<http://searchfox.org/mozilla-central/rev/2d24acd7f3e087c5a506f325684487013e1f1744/xpcom/io/nsAnonymousTemporaryFile.cpp#118>
in the content process in a background thread was to dispatch a synchronous
runnable to the main thread, blocking the calling thread (in this case, the
Gecko IO thread) and then dispatching a synchronous IPC message
<http://searchfox.org/mozilla-central/rev/2d24acd7f3e087c5a506f325684487013e1f1744/xpcom/io/nsAnonymousTemporaryFile.cpp#100>
to the parent process. At this point, two threads would be blocked in the
content process. As if that weren't enough, the handler for the sync IPC
in the parent process would then call the same function
<http://searchfox.org/mozilla-central/rev/a5c2b278897272497e14a8481513fee34bbc7e2c/dom/ipc/ContentParent.cpp#4063>
on the parent process main thread leading to main-thread I/O
<http://searchfox.org/mozilla-central/rev/2d24acd7f3e087c5a506f325684487013e1f1744/xpcom/io/nsAnonymousTemporaryFile.cpp#160>
on our UI thread! Of course, all of this was an unintended interaction
between different parts of the code when combined, and I'm glad to report
that this is all now fixed
<https://bugzilla.mozilla.org/show_bug.cgi?id=1347031> on Nightly. :-)
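
To make the second strategy concrete, here is a minimal Python sketch of
what it is supposed to look like: the background thread performs the
blocking open itself and only posts a notification to the main thread,
without ever waiting on it. This is an illustration under stated
assumptions, not Gecko code; open_temp_file_off_main_thread, main_queue and
on_file_ready are hypothetical names, and a queue stands in for the main
thread's event loop.

```python
import queue
import tempfile
import threading


def open_temp_file_off_main_thread(on_done, main_queue):
    """Open an anonymous temporary file on a worker thread.

    The worker does the blocking open itself, then posts a callback to
    main_queue. It never waits on the main thread.
    """
    def worker():
        f = tempfile.TemporaryFile()  # the blocking I/O happens here
        main_queue.put(lambda: on_done(f))

    t = threading.Thread(target=worker, name="io-worker")
    t.start()
    return t


# A toy "main thread" event loop: it only runs callbacks posted to it.
main_queue = queue.Queue()
results = []


def on_file_ready(f):
    # Runs on the main thread once the file is already open.
    f.write(b"incoming data")
    f.seek(0)
    results.append(f.read())
    f.close()


worker_thread = open_temp_file_off_main_thread(on_file_ready, main_queue)
# Stand-in for one turn of the event loop. A real loop would keep
# processing other events instead of waiting on this queue.
callback = main_queue.get(timeout=5)
callback()
worker_thread.join()
```

The problematic path described above is the inverse of this: the
background thread handed the work back to the main thread and waited for
it, so the I/O ended up on a main thread anyway while two content process
threads sat blocked.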

Last but not least, time for the credits section again. I would like to
thank the following individuals for their help in making Firefox faster
this past week. As always, apologies to those who I'm forgetting to name
here.

- Kris Maglione did some heroic work
<https://bugzilla.mozilla.org/show_bug.cgi?id=1333990> to avoid
reparsing our content scripts every time we run them. This was a pretty
severe performance issue that impacts a lot of add-ons that rely on content
scripts, but fixing it wasn't very easy, and honestly when the bug was
filed I wasn't very hopeful to see it fixed any time soon given the amount
of work that was involved.
- Sam Foster has been attacking a synchronous reflow
<https://bugzilla.mozilla.org/show_bug.cgi?id=1334642> that can happen
when we (de)activate a browser window. The work is ongoing, but these
types of front-end bugs, even though they may not be much fun to work on,
are very important to fix and can remove a lot of jank that we won't be
able to get rid of in any other way. Thank you Sam!
- Mike Conley landed some instrumentation
<https://bugzilla.mozilla.org/show_bug.cgi?id=1340842> for tab closing.
In case you're wondering, this means we're taking tab closing performance
very seriously <https://bugzilla.mozilla.org/show_bug.cgi?id=1344302>.
- Mike Conley also made us create the about:blank placeholder document
for lazily restored tabs after a session restore in the content process
<https://bugzilla.mozilla.org/show_bug.cgi?id=1256472>. If that sounds
boring, how about this: he improved session restore times for users with
hundreds of tabs by a lot. Users are reporting improvements on the scale
of *minutes* (you read that right.)
- Mike de Boer has been helping with triaging some session restore
performance bugs <https://bugzilla.mozilla.org/show_bug.cgi?id=1330635>.
- Kearwood (kip) Gilbert has been continuing his work on removing the
synchronous <https://bugzilla.mozilla.org/show_bug.cgi?id=1346923> IPCs
<https://bugzilla.mozilla.org/show_bug.cgi?id=1346926> used in the WebVR
implementation.
- Michael Layzell removed a synchronous IPC
<https://bugzilla.mozilla.org/show_bug.cgi?id=1337056> which was used to
initialize the permission manager's database. As an additional privacy
win, the content process now only knows about the permissions belonging to
the websites that you have visited, not all of the permissions stored in
your profile!
- Michael Layzell also added telemetry for IPC message
serialization/deserialization
<https://bugzilla.mozilla.org/show_bug.cgi?id=1342635> that happens on
the main thread. There's some evidence that this can be expensive, and
this probe will help us find the IPC messages where this can be problematic
in the wild.
- Chris Pearce made media cache initialization use asynchronous IPC
<https://bugzilla.mozilla.org/show_bug.cgi?id=1347031>.
- Jeff Muizelaar removed an async pan/zoom logging message
<https://bugzilla.mozilla.org/show_bug.cgi?id=1346585> which was slowing
us down to log information that nobody was looking at!
- Olli Pettay brought the performance of accessing MouseEvent.offsetX/Y
on simulated click events
<https://bugzilla.mozilla.org/show_bug.cgi?id=1339758> on par with other
engines.
- Edgar Chen and Boris Zbarsky worked on a
<https://bugzilla.mozilla.org/show_bug.cgi?id=1347634> few
<https://bugzilla.mozilla.org/show_bug.cgi?id=1347639> optimizations
<https://bugzilla.mozilla.org/show_bug.cgi?id=1347640> for improving our
innerHTML setter performance.
- Henry Chang fixed a severe UI jank
<https://bugzilla.mozilla.org/show_bug.cgi?id=1325054> that could occur
when using tracking protection (for example in private browsing windows).

Until next time, happy hacking!
--
Ehsan
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
