https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70088
            Bug ID: 70088
           Summary: ARM/THUMB unnecessarily typecasts some rvalues on memory
                    store
           Product: gcc
           Version: 5.2.1
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: aik at aol dot com.au
  Target Milestone: ---

When a multiplication is part of an rvalue that is stored to memory, gcc
casts/wraps values narrower than the native width (32 bits) to the type the
destination pointer dereferences to, and then expands them back to 32 bits
(via LSL+LSR, LSL+ASR, or AND), even if the rvalue is used for nothing more
than the store.

Since stores narrower than the native register width ignore the upper bits of
the source register, this behaviour is unnecessary and results in bloat inside
hotspots.

Additionally, there is a strange behaviour when a variable is bitwise-ORed
with a constant that is narrower than the native register width and would be
negative if its type were signed: gcc emits code to sign-extend the constant
(even in unsigned cases), making the code inefficient.

For example:
  // typeof(m) = unsigned short *
  // typeof(x) = unsigned short
  *m++ = x|0x4000U;
  *m++ = x|0x8000U;

This is trivially translated to ARM assembly as:
  ; r0: &m
  ; r1: x
  ORR  r2, r1, #0x4000   ; t1 = x|0x4000U
  ORR  r1, r1, #0x8000   ; t2 = x|0x8000U
  STRH r2, [r0], #2      ; *m++ = t1
  STRH r1, [r0], #2      ; *m++ = t2

However, gcc is generating the following instead:
  ; r0: &m
  ; r1: x
  MVN  r3, r1, lsl #17   ; t2 = x|0xFFFF8000
  MVN  r3, r3, lsr #17
  ORR  r1, r1, #0x4000   ; t1 = x|0x4000U
  STRH r1, [r0, #0]      ; m[0] = t1
  STRH r3, [r0, #2]      ; m[1] = t2

In that instance, it's not too awful (just one extra instruction). However,
when these sign-extended values cannot be generated in two instructions, gcc
resorts to a literal pool to fetch the OR constant.

The C code:
  *m++ = x|0x4100U;
  *m++ = x|0x8100U;

The trivial interpretation:
  ORR  r2, r1, #0x4100   ; t1 = x|0x4100U
  ORR  r1, r1, #0x8100   ; t2 = x|0x8100U
  STRH r2, [r0], #2      ; *m++ = t1
  STRH r1, [r0], #2      ; *m++ = t2

The generated assembly (instructions sorted for readability):
  ORR  ip, r1, #0x4100   ; t1 = x|0x4100U
  LDR  r3, =0xFFFF8100   ; t2 = x|0xFFFF8100
  ORR  r3, r1, r3
  STRH ip, [r0, #4]      ; m[2] = t1
  STRH r3, [r0, #6]      ; m[3] = t2

Not only is this slower (due to the extra instruction and the memory access),
but it also takes up more memory (and the more constants you have that require
a memory load for sign-extension, the worse it gets).

--- Comment #1 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
*** Bug 70089 has been marked as a duplicate of this bug. ***
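
For anyone trying to reproduce this, the fragments in the description can be collected into one translation unit. This is only a sketch: the function names (store_small, store_large, low_halfword_matches) are made up for illustration and do not appear in the report. The report also does not state the compiler flags used, so compiling with something like -O2 -S on an ARM-targeted gcc is an assumption. The last function simply checks in portable C that ORing with the sign-extended constant and with the zero-extended constant gives the same low halfword, which is the reason the extension is unnecessary before a halfword store.

```c
#include <assert.h>

/* First example from the report: both constants fit an ARM immediate. */
unsigned short *store_small(unsigned short *m, unsigned short x)
{
    *m++ = x | 0x4000U;
    *m++ = x | 0x8000U;
    return m;
}

/* Second example from the report: the sign-extended form of 0x8100
 * (0xFFFF8100) is what reportedly ends up in a literal pool. */
unsigned short *store_large(unsigned short *m, unsigned short x)
{
    *m++ = x | 0x4100U;
    *m++ = x | 0x8100U;
    return m;
}

/* Hypothetical helper (not from the report): returns 1 when the value
 * a 16-bit store would write is the same for the zero-extended and the
 * sign-extended OR constant. */
int low_halfword_matches(unsigned short x)
{
    unsigned long zero_ext = (unsigned long)x | 0x8100UL;
    unsigned long sign_ext = (unsigned long)x | 0xFFFF8100UL;
    return (unsigned short)zero_ext == (unsigned short)sign_ext;
}
```

Inspecting the assembly generated for store_small and store_large should show whether a given gcc version still emits the MVN pair or the literal-pool load described above.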