On 08/30/2013 06:27 AM, Nathan Froyd wrote:
> ----- Original Message -----
>> On 08/30/2013 11:07 AM, Robert O'Callahan wrote:
>>> Being able to sample uniformly whether or not the thread is in a syscall is
>>> important sometimes, so let's not lose that.
>> I am not proposing to lose that facility.  Am only proposing (really,
>> merely parroting Taras' suggestion) that it would be nice to have the option
>> of some kind of adaptive backoff for threads blocked in syscalls.
> One way of handling this is the following: When you take a sample, replace 
> the return address for the current stack frame with the address of a known 
> function inside your profile (call this a "trampoline function").  Now the 
> call stack looks something like:

Another approach with similar benefits is what I was thinking of in my
comment to jseward's blog post. I implemented something very similar as
an optimization for our dynamic rooting analysis, which has similar
stackwalking requirements.

In my scheme, you periodically push "markers" onto the stack. The
existing SPS markers would be good for this, though we might want to
insert more for this purpose. They will end up scattered throughout the
real stack. Chain the markers together in a simple linked list, with a
global or thread-local list header updated separately on every push/pop.
(I think SPS markers already do this.) Keep a "seen" status bit in each
marker, initialized to 0 when the marker is pushed.

When sampling, scan up the seen bits, flipping them from 0 to 1 until
you hit the first that is already 1. That marks the deepest point you
need to search in the stack, since you've already seen the other frames.

Note that if you take a sample, pop a couple of frames, push those same
frames back, and take another sample, this scheme will force you to
re-scan those new frames again. On the bright side, this allows you to
distinguish this case from the more usual case of those frames never
having been popped in between samples. But only at a marker granularity;
I don't know if this information can be usefully surfaced in a UI or not.

This scheme would reduce the cost of scanning sleeping threads, since
you'd only need to scan up to the first marker.

The major drawback of this scheme is that it relies on manually placing
markers in the code so that they occur frequently enough in the stack to
be useful. That has side benefits too, though, since you can use
semantic labeling, a la SPS markers, to better categorize the work being
done.

Still, if you want to lift this restriction, you could combine the
Finkish and Froydulent schemes: in addition to the linked list of
markers, you also tag native frames during a sample by replacing their
return addresses with jumps to trampolines. (The trampoline in this case
is much simpler: it's just a direct jump to the original return
address.) You can now derive a per-frame "seen" bit by checking whether
it's within your trampoline buffer's address range. It's automatically
and implicitly 0 when pushing a frame normally, set to 1 during sampling
by swapping in the trampoline address, and deleted when the frame is
popped. No special logic to massage the stack is required, as in the
Froydulent trampolines.

One drawback of any trampoline approach is that it would confuse
debuggers. I'm not sure if there's a way with CFI to label the whole
trampoline region with the rule "pull the return address out of the
operand to this jump instruction, and leave all registers as-is."
Similarly, Breakpad and any other stack unwinders would need to be
taught to deal with them.

A potential advantage of manual markers is that they can serve as
synchronization points in your stackwalk -- if your native frame walk
gets lost somewhere, you can store enough info in the markers to be able
to pick things up again at the next marker's frame. (I'm skeptical that
this would be worth the effort, though -- the unwind state you'd need to
save is platform-specific and hairy. You might be better off turning on
exception unwind info and relying on that.)

With this in place, a sleeping thread sample would be walked once, and
then on the next sample only the first return address would be considered.

Finally, one minor wrinkle with manual markers that I hit when
implementing this for the rooting analysis -- if you have multiple
markers in one frame, you can't predict which one the compiler will put
on the stack first. So when scanning in linked list order, you're ok --
all seen markers will be visited before all unseen markers. But that
order may not be in monotonic address order. And you may get multiple
markers in one frame from inlining, so it's not always easy to predict
or avoid. Maybe that doesn't matter for this application?

>
>   A --> B --> C --> D --Trampoline--> E
>
> where the sample was taken in E.  The trampoline function enables us to keep 
> track of what some portion of the stack looks like without having to unwind 
> it.  Unwinding on subsequent samples proceeds until we encounter a frame 
> whose return address is the trampoline function.  At that point, we know 
> everything else above that frame is the same as whatever it was the last time 
> we took a sample.
>
> There are now three interesting scenarios of what can happen before we take 
> the next sample.
>
> 1) E does not return.  It may call other functions, it may do some 
> computation, or it may just be blocked on a syscall.  Whatever the case, we 
> take a sample, find that E's return address is our trampoline function, and 
> stop.
>
> 2) E calls F calls G, and then a sample is taken.  Just prior to the sample, 
> we have:
>
>   A --> B --> C --> D --Trampoline--> E --> F --> G
>
> The profiler fixes things up to do:
>
>   A --> B --> C --> D --> E --> F --Trampoline--> G
>
> and we go about our merry way.
>
> 3) E returns.  Instead of returning to D, it returns to our trampoline 
> function.  Said function then replaces the return address of D's stack frame 
> with the address of the trampoline function, and returns to the appropriate 
> PC in D:
>
>   A --> B --> C --Trampoline--> D
>
> This can happen several times (D returns to C, etc.).
>
> Depending on how deep the call stack is and how frequently we push and pop 
> stack frames, being able to not unwind some number of frames can be a 
> significant savings.
>
> -Nathan
> _______________________________________________
> dev-platform mailing list
> dev-platform@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-platform

_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform

Reply via email to