https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67782
Bug ID: 67782
Summary: [SH] Improve bit tests of values loaded from memory
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: olegendo at gcc dot gnu.org
Target Milestone: ---
Target: sh*-*-*

The following example

int test (int* x)
{
  return (*x & (1 << 14)) == 0;
}

compiled with -O2 -m4 -ml:

        mov.l   @r4,r1
        mov.w   .L2,r2
        tst     r2,r1
        rts
        movt    r0
        .align 1
.L2:
        .short  16384

compiled with -Os -m4 -ml (uses some constant optimization in the tstsi_t
pattern):

        mov.l   @r4,r0
        swap.b  r0,r0
        tst     #64,r0
        rts
        movt    r0

Instead of loading the whole 32 bit value from memory, loading one byte is
enough:

        mov.b   @(1,r4),r0
        tst     #64,r0
        rts
        movt    r0

Because the value has to go into R0 anyway, using a displacement mov.b is OK,
as long as the displacement is in range and no further address calculations
are needed (e.g. for a mode other than displacement addressing).  If the
constant is not shared with anything else, this can be a win.

Actually, the SLOW_BYTE_ACCESS macro has a similar effect.  Defining it to 0
makes some optimizations try transformations like the one above.  Although
this particular case does not improve, there are some hits in the CSiBE set.
For example, in linux tcp_input.c:

SLOW_BYTE_ACCESS = 1:
        mov.l   @(32,r4),r3
        mov.l   @(12,r3),r3
        tst     r10,r3

SLOW_BYTE_ACCESS = 0:
        mov.l   @(32,r4),r0
        mov.b   @(13,r0),r0
        tst     #192,r0

tcp_input.c seems to have quite a few such cases.  There are also other
cases, like binfmt_script.s:

SLOW_BYTE_ACCESS = 1:
        mov.l   @r4,r1
        add     #-68,r15
        mov.w   .L54,r2
        extu.w  r1,r1
        cmp/eq  r2,r1
        bf/s    .L67

SLOW_BYTE_ACCESS = 0:
        mov.w   .L54,r1
        add     #-68,r15
        mov.w   @r4,r2
        cmp/eq  r1,r2
        bf/s    .L67

However, overall the code seems to get a bit worse.  This kind of
transformation probably has to take a bit more context into account.  One
idea would be to do it rather late, before/during peephole2, although
utilizing tst #imm,R0 might then be difficult.

It would also be possible to do this during combine, by using some special
patterns/predicates that accept a memory operand before register allocation
and splitting out the memory load in split1.  However, there are quite a few
patterns involved, and the final tstsi_t pattern is formed during split1.  So
maybe tstsi_t could be extended to look for a memory load of the operand and
its addressing mode, and convert the memory load accordingly.  Although that
wouldn't catch the cmp/eq case above.
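
For reference, a minimal source-level sketch of the byte access described
above (the function name and the cast are illustrative only), assuming a
little-endian configuration (-ml), where bit 14 of the 32 bit value sits in
byte 1 as mask 0x40:

int test_byte (int* x)
{
  /* Equivalent to (*x & (1 << 14)) == 0 on a little-endian target:
     only byte 1 is loaded and bit 6 (0x40) of it is tested.  */
  return (((unsigned char*)x)[1] & 0x40) == 0;
}

Written in this form, the code should come out as the mov.b @(1,r4),r0 /
tst #64,r0 sequence shown above.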
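
The tcp_input.c case above roughly corresponds to a source shape like the
following sketch.  The struct layout is hypothetical and only chosen to match
the displacements in the assembly (pointer at offset 32, tested word at
offset 12); it is not the actual tcp_input.c code:

struct inner
{
  int a, b, c;
  unsigned int flags;   /* offset 12; bits 14..15 are tested */
};

struct outer
{
  char pad[32];
  struct inner* p;      /* offset 32 */
};

int test_flags (struct outer* x)
{
  /* Mask 0xC000 lives entirely in byte 1 of 'flags' (little-endian),
     so a mov.b @(13,rX),r0 / tst #192,r0 sequence is sufficient.  */
  return (x->p->flags & 0xC000) != 0;
}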