On Wed, Oct 14, 2015 at 9:15 PM, Bernd Schmidt <bschm...@redhat.com> wrote:
> On 10/14/2015 07:43 PM, Jeff Law wrote:
>>
>> Obviously some pessimization relative to current code is necessary to
>> fix some of the problems WRT thread safety and avoiding things like
>> introducing faults in code which did not previously fault.
>
> Huh? This patch is purely an (attempt at) optimization, not something that
> fixes any problems.
>
>> However, pessimization of safe code is, err, um, bad and needs to be
>> avoided.
>
> Here's an example:
>
> > subq    $16, %rsp
>   [...]
> > leaq    8(%rsp), %r8
> > leaq    256(%rax), %rdx
>   cmpq    256(%rax), %rcx    | cmpq    256(%rax), %rsi
>   jne     .L97               <
>   movq    $0, 256(%rax)      <
> .L97:                        <
> > movq    %rdx, %rax
> > cmovne  %r8, %rax
> > movq    $0, (%rax)
>   [...]
> > addq    $16, %rsp
>
> In the worst case that executes six more instructions, and always causes
> unnecessary stack frame bloat. This on x86, where AFAIK it's doubtful
> whether cmov is a win at all anyway. I think this shows the approach is
> just bad, even ignoring problems like that it could allocate multiple
> scratchpads when one would suffice, or allocate one and end up not using
> it because the transformation fails.
>
> I can't test valgrind right now, it fails to run on my machine, but I
> guess it could adapt to allow stores slightly below the stack (maybe
> warning once)? It seems like a bit of an edge case to worry about, but if
> supporting it is critical and it can't be changed to adapt to new
> optimizations, then I think we're probably better off entirely without
> this scratchpad transformation.
>
> Alternatively I can think of a few other possible approaches which
> wouldn't require this kind of bloat:
> * add support for allocating space in the stack redzone. That could be
>   interesting for the register allocator as well. Would help only
>   x86_64, but that's a large fraction of gcc's userbase.
> * add support for opportunistically finding unused alignment padding
>   in the existing stack frame. Less likely to work, but would produce
>   better results when it does.
> * on embedded targets we probably don't have to worry about valgrind,
>   so do the optimal (sp - x) thing there.
> * allocate a single global as the dummy target. Might be more
>   expensive to load the address on some targets, though.
> * at least find a way to express costs for this transformation.
>   Difficult since you don't yet necessarily know if the function is
>   going to have a stack frame. Hence, IMO this approach is flawed.
>   (You'll still want cost estimates even when not allocating stuff in
>   the normal stack frame, because generated code will still execute
>   between two and four extra instructions.)
Btw, I agree with all your concerns, and indeed finding a better place
we can always store to is required. I wonder if we can do the actual
location assignment after reload (searching for (temporarily) unused,
already-allocated stack space, for example). Using the red zone also
crossed my mind.

As for performance of the resulting code, I'd allow the target to decide
whether to never do this, to do it only with profile feedback (and a
50%/50% chance), or to do it always. Most branch predictors have issues
with high branch density, so if you have a sequence like

  if (a) *p = ..; if (b) *q = ..; if (c) *r = ..;

then the jump form may carry a big penalty (or need a lot of padding to
avoid it). I'm not sure whether conditional moves behave better in that
case (I have no idea whether they also enter the prediction machinery
for speculative execution).

Richard.

>
> Bernd