On 08/30/2013 06:27 AM, Nathan Froyd wrote: > ----- Original Message ----- >> On 08/30/2013 11:07 AM, Robert O'Callahan wrote: >>> Being able to sample uniformly whether or not the thread is in a syscall is >>> important sometimes, so let's not lose that. >> I am not proposing to lose that facility. Am only proposing (really, >> merely parroting Taras' suggestion) that it would be nice to have the option >> of some kind of adaptive backoff for threads blocked in syscalls. > One way of handling this is the following: When you take a sample, replace > the return address for the current stack frame with the address of a known > function inside your profile (call this a "trampoline function"). Now the > call stack looks something like:
Another approach with similar benefits is what I was thinking of in my comment to jseward's blog post. I implemented something very similar as an optimization for our dynamic rooting analysis, which has similar stackwalking requirements. In my scheme, you periodically push "markers" onto the stack. The existing SPS markers would be good for this, though we might want to insert more for this purpose. They will end up scattered throughout the real stack. Chain the markers together in a simple linked list, with a global or thread-local list header updated separately on every push/pop. (I think SPS markers already do this.) Keep a "seen" status bit in each marker, initialized to 0 when the marker is pushed. When sampling, scan up the seen bits, flipping them from 0 to 1 until you hit the first that is already 1. That marks the deepest point you need to search in the stack, since you've already seen the other frames. Note that if you take a sample, pop a couple of frames, push those same frames back, and take another sample, this scheme will force you to re-scan those new frames again. On the bright side, this allows you to distinguish this case from the more usual case of those frames never having been popped in between samples. But only at a marker granularity; I don't know if this information can be usefully surfaced in a UI or not. This scheme would reduce the cost of scanning sleeping threads, since you'd only need to scan up to the first marker. The major drawback of this scheme is that it relies on manually placing markers in the code so that they occur frequently enough in the stack to be useful. That has side benefits too, though, since you can use semantic labeling, a la SPS markers, to better categorize the work being done. Still, if you want to lift this restriction, you could combine the Finkish and Froydulent schemes: in addition to the linked list of markers, you also tag native frames during a sample by replacing their return addresses with jumps to trampolines. (The trampoline in this case is much simpler: it's just a direct jump to the original return address.) You can now derive a per-frame "seen" bit by checking whether it's within your trampoline buffer's address range. It's automatically and implicitly 0 when pushing a frame normally, set to 1 during sampling by swapping in the trampoline address, and deleted when the frame is popped. No special logic to massage the stack is required, as in the Froydulent trampolines. One drawback of any trampoline approach is that it would confuse debuggers. I'm not sure if there's a way with CFI to label the whole trampoline region with the rule "pull the return address out of the operand to this jump instruction, and leave all registers as-is." Similarly, Breakpad and any other stack unwinders would need to be taught to deal with them. A potential advantage of manual markers is that they can serve as synchronization points in your stackwalk -- if your native frame walk gets lost somewhere, you can store enough info in the markers to be able to pick things up again at the next marker's frame. (I'm skeptical that this would be worth the effort, though -- the unwind state you'd need to save is platform-specific and hairy. You might be better off turning on exception unwind info and relying on that.) With this in place, a sleeping thread sample would be walked once, and then on the next sample only the first return address would be considered. Finally, one minor wrinkle with manual markers that I hit when implementing this for the rooting analysis -- if you have multiple markers in one frame, you can't predict which one the compiler will put on the stack first. So when scanning in linked list order, you're ok -- all seen markers will be visited before all unseen markers. But that order may not be in monotonic address order. And you may get multiple markers in one frame from inlining, so it's not always easy to predict or avoid. Maybe that doesn't matter for this application? > > A --> B --> C --> D --Trampoline--> E > > where the sample was taken in E. The trampoline function enables us to keep > track of what some portion of the stack looks like without having to unwind > it. Unwinding on subsequent samples proceeds until we encounter a frame > whose return address is the trampoline function. At that point, we know > everything else above that frame is the same as whatever it was the last time > we took a sample. > > There are now three interesting scenarios of what can happen before we take > the next sample. > > 1) E does not return. It may call other functions, it may do some > computation, or it may just be blocked on a syscall. Whatever the case, we > take a sample, find that E's return address is our trampoline function, and > stop. > > 2) E calls F calls G, and then a sample is taken. Just prior to the sample, > we have: > > A --> B --> C --> D --Trampoline--> E --> F --> G > > The profiler fixes things up to do: > > A --> B --> C --> D --> E --> F --Trampoline--> G > > and we go about our merry way. > > 3) E returns. Instead of returning to D, it returns to our trampoline > function. Said function then replaces the return address of D's stack frame > with the address of the trampoline function, and returns to the appropriate > PC in D: > > A --> B --> C --Trampoline--> D > > This can happen several times (D returns to C, etc.). > > Depending on how deep the call stack is and how frequently we push and pop > stack frames, being able to not unwind some number of frames can be a > significant savings. > > -Nathan > _______________________________________________ > dev-platform mailing list > dev-platform@lists.mozilla.org > https://lists.mozilla.org/listinfo/dev-platform _______________________________________________ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform