https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82242
Bug ID: 82242
Summary: x86_64 bad optimization with -march
Product: gcc
Version: 7.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: guillaume at morinfr dot org
Target Milestone: ---

Created attachment 42200
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42200&action=edit
"slow" testcase

I have attached a test case (slow_add.cpp) that allocates a large vector of
doubles, initializes each double, then adds them up.

When I compile the attached test case with "g++ -std=c++11 -march=nehalem -O3"
(gcc 7.1), the adding part (the program prints the timing) takes ~14 seconds on
my machine. The same program written without a vector (fast_add.cpp) takes
~5 seconds to add the doubles. Both programs print the same result at the end.

Slow add loop (slow_add compiled with the options above):

    const double * ptr = array.data();
    const double *const end = array.data() + array.size();
    double result = 0.0;
    startTime = std::chrono::system_clock::now();
    while (ptr != end) {
        result += *ptr;
        ++ptr;
    }
    endTime = std::chrono::system_clock::now();

  400c00:  e8 eb fd ff ff        callq  4009f0 <std::chrono::_V2::system_clock::now()@plt>
  400c05:  49 89 c4              mov    %rax,%r12
  400c08:  4c 39 f3              cmp    %r14,%rbx
  400c0b:  0f 84 09 01 00 00     je     400d1a <main+0x27a>
  400c11:  48 8b 2d 78 03 00 00  mov    0x378(%rip),%rbp        # 400f90 <_IO_stdin_used+0x28>
  400c18:  4c 89 f0              mov    %r14,%rax
  400c1b:  0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
  400c20:  66 48 0f 6e cd        movq   %rbp,%xmm1    <== BAD
  400c25:  f2 0f 58 08           addsd  (%rax),%xmm1
  400c29:  48 83 c0 08           add    $0x8,%rax
  400c2d:  66 48 0f 7e cd        movq   %xmm1,%rbp    <== BAD
  400c32:  48 39 d8              cmp    %rbx,%rax
  400c35:  75 e9                 jne    400c20 <main+0x180>
  400c37:  e8 b4 fd ff ff        callq  4009f0 <std::chrono::_V2::system_clock::now()@plt>

As you can see, the result is copied back and forth between %xmm1 and %rbp
unnecessarily on every iteration.

Pretty much the same program without a vector produces a much better version:

Fast add loop (fast_add compiled with the options above):

    startTime = std::chrono::system_clock::now();
    while (ptr != end) {
        result += *ptr;
        ++ptr;
    }
    endTime = std::chrono::system_clock::now();

  400a68:  e8 93 fe ff ff        callq  400900 <std::chrono::_V2::system_clock::now()@plt>
  400a6d:  66 0f ef c9           pxor   %xmm1,%xmm1
  400a71:  49 89 c4              mov    %rax,%r12
  400a74:  48 b8 00 00 00 00 08  movabs $0x800000000,%rax
  400a7b:  00 00 00
  400a7e:  48 01 e8              add    %rbp,%rax
  400a81:  eb 09                 jmp    400a8c <main+0x11c>
  400a83:  0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
  400a88:  48 83 c3 08           add    $0x8,%rbx
  400a8c:  f2 0f 58 4d 00        addsd  0x0(%rbp),%xmm1
  400a91:  48 89 dd              mov    %rbx,%rbp
  400a94:  48 39 c3              cmp    %rax,%rbx
  400a97:  75 ef                 jne    400a88 <main+0x118>
  400a99:  f2 0f 11 4c 24 08     movsd  %xmm1,0x8(%rsp)
  400a9f:  e8 5c fe ff ff        callq  400900 <std::chrono::_V2::system_clock::now()@plt>

If I remove -march when compiling slow_add.cpp, the performance and the
generated assembly are in line with fast_add.cpp. Compiling with
-fno-exceptions but keeping -march also solves the issue.

I also tried -march=native on both Westmere and Haswell machines, and it
produces slow code on both. Removing -march or adding -fno-exceptions fixes the
issue on both.

I see the same issue with gcc 6.3.