https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91569
            Bug ID: 91569
           Summary: Optimisation test case and unnecessary XOR-OR pair
                    instead of MOV.
           Product: gcc
           Version: 9.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: cubitect at gmail dot com
  Target Milestone: ---

I wasn't entirely sure where to post this, but I have a very simple test
problem that shows some missed optimisation potential. The task is to cast an
integer to a long and replace the second lowest byte of the result with a
constant (4). Below are three ways to achieve this:

long opt_test1(int num)     //  opt_test1:
{                           //      movslq  %edi, %rax
    union {                 //      movb    $4, %ah
        long q;             //      ret
        struct { char l,h; };
    } a;
    a.q = num;
    a.h = 4;
    return a.q;
}

The union here is modelled after the structure of an r?x register, which
contains the low and high byte registers: ?l and ?h. The cast and the second
byte assignment can be done in one instruction each. The optimiser manages to
understand this and gives the optimal instructions.

long opt_test2(int num)     //  opt_test2:
{                           //      movl    %edi, %eax
    long a = num;           //      xor     %ah, %ah
    a &= (-1UL ^ 0xff00);   //      orb     $4, %ah
    a |= (4 << 8);          //      cltq
    return a;               //      ret
}

This solution, based on a bitwise AND and OR, is interesting. The optimiser
recognised that I am interested in the second byte and makes use of the 'ah'
register, but why is there a XOR and an OR rather than a single, equivalent
MOV? Similarly, the MOV + CLTQ pair can be replaced outright with MOVSLQ.
Notable here is that some older versions (such as "gcc-4.8.5 -O3") give
results that correspond more closely to the C code:

    andl    $-65281, %edi
    orl     $1024, %edi
    movslq  %edi, %rax
    ret

which is actually better than the output for gcc-9.2.

long opt_test3(int num)     //  opt_test3:
{                           //      movslq  %edi, %rdi
    long a = num;           //      movq    %rdi, -8(%rsp)
    ((char*)&a)[1] = 4;     //      movb    $4, -7(%rsp)
    return a;               //      movq    -8(%rsp), %rax
}                           //      ret

This is the straightforward approach, addressing the second byte in memory.
I am including this because LLVM manages to recognise that the stack is not
actually necessary and goes for a register-based solution.

As far as I could tell, these results seem quite consistent across most GCC
versions and across all optimisation levels above -O0. I obtained the assembly
code above using:

    $ gcc-9.2 opt_tests.c -S -O3 -Wall -Wextra -pedantic
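
For anyone who wants to confirm that the three variants really compute the
same value before comparing the generated code, a minimal self-check might look
like the sketch below. It assumes a little-endian target and a compiler that
accepts anonymous structs and union type punning (as GCC does). The names
reference, variants_agree and run_checks are made up for this illustration and
are not part of the original test file.

```c
#include <assert.h>

/* The three variants from the report, behaviour unchanged. */
long opt_test1(int num)
{
    union {
        long q;
        struct { char l, h; };  /* anonymous struct: C11 / GNU C */
    } a;
    a.q = num;
    a.h = 4;                    /* second lowest byte on little-endian */
    return a.q;
}

long opt_test2(int num)
{
    long a = num;
    a &= (-1UL ^ 0xff00);       /* clear bits 8..15 */
    a |= (4 << 8);              /* set the second byte to 4 */
    return a;
}

long opt_test3(int num)
{
    long a = num;
    ((char*)&a)[1] = 4;         /* byte 1 in memory; little-endian assumed */
    return a;
}

/* Hypothetical reference: sign-extend, clear byte 1, then store 4 there. */
long reference(int num)
{
    return ((long)num & ~0xff00L) | 0x0400L;
}

/* Hypothetical helper: returns 1 if all three variants match the reference. */
int variants_agree(int num)
{
    long want = reference(num);
    return opt_test1(num) == want
        && opt_test2(num) == want
        && opt_test3(num) == want;
}

/* Hypothetical driver covering zero, a negative and a positive input. */
void run_checks(void)
{
    assert(variants_agree(0));
    assert(variants_agree(-1));
    assert(variants_agree(0x12345678));
}
```

Calling run_checks() from a small main is enough to exercise it. The check only
establishes that the three functions are interchangeable, so any difference in
the assembly is down to the optimiser rather than to the C semantics.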