On Wed, Oct 14, 2015 at 9:15 PM, Bernd Schmidt <bschm...@redhat.com> wrote:
> On 10/14/2015 07:43 PM, Jeff Law wrote:
>>
>> Obviously some pessimization relative to current code is necessary to
>> fix some of the problems WRT thread safety and avoiding things like
>> introducing faults in code which did not previously fault.
>
> Huh? This patch is purely an (attempt at) optimization, not something that
> fixes any problems.
>
>> However, pessimization of safe code is, err, um, bad and needs to be
>> avoided.
>
> Here's an example:
>
> > subq    $16, %rsp
>   [...]
> > leaq    8(%rsp), %r8
> > leaq    256(%rax), %rdx
>   cmpq    256(%rax), %rcx    | cmpq    256(%rax), %rsi
>   jne     .L97               <
>   movq    $0, 256(%rax)      <
> .L97:                        <
> > movq    %rdx, %rax
> > cmovne  %r8, %rax
> > movq    $0, (%rax)
>   [...]
> > addq    $16, %rsp
>
> In the worst case that executes six more instructions, and always causes
> unnecessary stack frame bloat. This on x86, where AFAIK it's doubtful
> whether cmov is a win at all anyway. I think this shows the approach is
> just bad, even ignoring problems like that it could allocate multiple
> scratchpads when one would suffice, or allocate one and end up not using
> it because the transformation fails.
>
> I can't test valgrind right now, it fails to run on my machine, but I
> guess it could adapt to allow stores slightly below the stack (maybe
> warning once)? It seems like a bit of an edge case to worry about, but if
> supporting it is critical and it can't be changed to adapt to new
> optimizations, then I think we're probably better off entirely without
> this scratchpad transformation.
>
> Alternatively I can think of a few other possible approaches which
> wouldn't require this kind of bloat:
> * add support for allocating space in the stack redzone. That could be
>   interesting for the register allocator as well. Would help only
>   x86_64, but that's a large fraction of gcc's userbase.
> * add support for opportunistically finding unused alignment padding
>   in the existing stack frame. Less likely to work, but would produce
>   better results when it does.
> * on embedded targets we probably don't have to worry about valgrind,
>   so do the optimal (sp - x) thing there.
> * allocate a single global as the dummy target. Might be more
>   expensive to load the address on some targets, though.
> * at least find a way to express costs for this transformation.
>   Difficult since you don't yet necessarily know if the function is
>   going to have a stack frame. Hence, IMO this approach is flawed.
>   (You'll still want cost estimates even when not allocating stuff in
>   the normal stack frame, because generated code will still execute
>   between two and four extra instructions.)
Btw, I agree with all your concerns, and indeed finding a better place
we can always store to is required. I wonder if we can do the actual
location assignment after reload (searching for (temporarily) unused,
already-allocated stack space, for example). Using the red zone also
crossed my mind.

As for performance of the resulting code, I'd allow the target to decide
whether to never do this, to do it only with profile feedback (and a
50%/50% chance), or to do it always. Most branch predictors have issues
with high branch density, so if you have a sequence like

  if (a) *p = ..; if (b) *q = ..; if (c) *r = ..;

then the jump form may carry a big penalty (or need a lot of padding to
avoid it). I'm not sure whether conditional moves behave better in that
case (I have no idea whether they also enter the prediction machinery
for speculative execution).

Richard.

>
> Bernd