https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67782
Bug ID: 67782
Summary: [SH] Improve bit tests of values loaded from memory
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: olegendo at gcc dot gnu.org
Target Milestone: ---
Target: sh*-*-*

The following example

int test (int* x)
{
  return (*x & (1 << 14)) == 0;
}

compiled with -O2 -m4 -ml:

        mov.l   @r4,r1
        mov.w   .L2,r2
        tst     r2,r1
        rts
        movt    r0
        .align 1
.L2:
        .short  16384

compiled with -Os -m4 -ml (uses some constant optimization in the tstsi_t
pattern):

        mov.l   @r4,r0
        swap.b  r0,r0
        tst     #64,r0
        rts
        movt    r0

Instead of loading the whole 32 bit value from memory, loading one byte is
enough:

        mov.b   @(1,r4),r0
        tst     #64,r0
        rts
        movt    r0

Because the value has to go into R0 anyway, using a displacement mov.b is OK,
as long as the displacement is in range and no further address calculations
are needed (e.g. for a mode other than displacement addressing).  If the
constant is not shared with anything else, this can be a win.

Actually, the SLOW_BYTE_ACCESS macro has a similar effect.  Defining it to 0
makes some optimizations try transformations like the one above.  Although
this particular case does not improve, there are some hits in the CSiBE set.
For example, in linux tcp_input.c:

SLOW_BYTE_ACCESS = 1:
        mov.l   @(32,r4),r3
        mov.l   @(12,r3),r3
        tst     r10,r3

SLOW_BYTE_ACCESS = 0:
        mov.l   @(32,r4),r0
        mov.b   @(13,r0),r0
        tst     #192,r0

tcp_input.c seems to have quite a few such cases.  There are also other
cases, like binfmt_script.s:

SLOW_BYTE_ACCESS = 1:
        mov.l   @r4,r1
        add     #-68,r15
        mov.w   .L54,r2
        extu.w  r1,r1
        cmp/eq  r2,r1
        bf/s    .L67

SLOW_BYTE_ACCESS = 0:
        mov.w   .L54,r1
        add     #-68,r15
        mov.w   @r4,r2
        cmp/eq  r1,r2
        bf/s    .L67

However, overall the code seems to get a bit worse.  This kind of
transformation probably has to take a bit more context into account.  One
idea would be to do it rather late, before/during peephole2, although
utilizing tst #imm,R0 might then be difficult.

It would also be possible to do this during combine, by using some special
patterns/predicates that accept a memory operand before register allocation
and splitting out the memory load in split1.  However, there are quite a few
patterns involved, and the final tstsi_t pattern is formed during split1.  So
maybe tstsi_t could be extended to look for a memory load of the operand and
its addressing mode, and convert the memory load accordingly.  Although that
wouldn't catch the cmp/eq case above.
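
For reference, a minimal source-level sketch of the byte access described
above (the function name and the cast are illustrative only), assuming a
little-endian configuration (-ml), where bit 14 of the 32 bit value sits in
byte 1 as mask 0x40:

int test_byte (int* x)
{
  /* Equivalent to (*x & (1 << 14)) == 0 on a little-endian target:
     only byte 1 is loaded and bit 6 (0x40) of it is tested.  */
  return (((unsigned char*)x)[1] & 0x40) == 0;
}

Written in this form, the code should come out as the mov.b @(1,r4),r0 /
tst #64,r0 sequence shown above.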
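
The tcp_input.c case above roughly corresponds to a source shape like the
following sketch.  The struct layout is hypothetical and only chosen to match
the displacements in the assembly (pointer at offset 32, tested word at
offset 12); it is not the actual tcp_input.c code:

struct inner
{
  int a, b, c;
  unsigned int flags;   /* offset 12; bits 14..15 are tested */
};

struct outer
{
  char pad[32];
  struct inner* p;      /* offset 32 */
};

int test_flags (struct outer* x)
{
  /* Mask 0xC000 lives entirely in byte 1 of 'flags' (little-endian),
     so a mov.b @(13,rX),r0 / tst #192,r0 sequence is sufficient.  */
  return (x->p->flags & 0xC000) != 0;
}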