https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82242

            Bug ID: 82242
           Summary: x86_64 bad optimization with -march
           Product: gcc
           Version: 7.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: guillaume at morinfr dot org
  Target Milestone: ---

Created attachment 42200
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42200&action=edit
"slow" testcase

I have attached a test case (slow_add.cpp) that allocates a large vector of
doubles, initializes each element, and then adds them.

When I compile the attached test case with "g++ -std=c++11 -march=nehalem -O3"
(gcc 7.1), the adding part (the program prints the timing) takes ~14 seconds on
my machine.  The same program written without a vector (fast_add.cpp) takes ~5
seconds to add the doubles.  Both programs print the same result at the end.


Slow add loop (slow_add compiled with the options above):
    const double * ptr = array.data();
    const double *const end = array.data() + array.size();

    double result = 0.0;
    startTime = std::chrono::system_clock::now();
    while (ptr != end) {
        result += *ptr;
        ++ptr;
    }
    endTime = std::chrono::system_clock::now();
  400c00:       e8 eb fd ff ff          callq  4009f0
<std::chrono::_V2::system_clock::now()@plt>
  400c05:       49 89 c4                mov    %rax,%r12
  400c08:       4c 39 f3                cmp    %r14,%rbx
  400c0b:       0f 84 09 01 00 00       je     400d1a <main+0x27a>
  400c11:       48 8b 2d 78 03 00 00    mov    0x378(%rip),%rbp        # 400f90
<_IO_stdin_used+0x28>
  400c18:       4c 89 f0                mov    %r14,%rax
  400c1b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  400c20:       66 48 0f 6e cd          movq   %rbp,%xmm1              <== BAD
  400c25:       f2 0f 58 08             addsd  (%rax),%xmm1
  400c29:       48 83 c0 08             add    $0x8,%rax
  400c2d:       66 48 0f 7e cd          movq   %xmm1,%rbp              <== BAD
  400c32:       48 39 d8                cmp    %rbx,%rax
  400c35:       75 e9                   jne    400c20 <main+0x180>
  400c37:       e8 b4 fd ff ff          callq  4009f0
<std::chrono::_V2::system_clock::now()@plt>

As you can see, the result is unnecessarily copied back and forth between
%xmm1 and %rbp on every iteration.

Pretty much the same program without a vector produces a much better version:
Fast add loop: (fast_add compiled with the options above):
    startTime = std::chrono::system_clock::now();
    while (ptr != end) {
        result += *ptr;
        ++ptr;
    }
    endTime = std::chrono::system_clock::now();
  400a68:       e8 93 fe ff ff          callq  400900
<std::chrono::_V2::system_clock::now()@plt>
  400a6d:       66 0f ef c9             pxor   %xmm1,%xmm1
  400a71:       49 89 c4                mov    %rax,%r12
  400a74:       48 b8 00 00 00 00 08    movabs $0x800000000,%rax
  400a7b:       00 00 00 
  400a7e:       48 01 e8                add    %rbp,%rax
  400a81:       eb 09                   jmp    400a8c <main+0x11c>
  400a83:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  400a88:       48 83 c3 08             add    $0x8,%rbx
  400a8c:       f2 0f 58 4d 00          addsd  0x0(%rbp),%xmm1
  400a91:       48 89 dd                mov    %rbx,%rbp
  400a94:       48 39 c3                cmp    %rax,%rbx
  400a97:       75 ef                   jne    400a88 <main+0x118>
  400a99:       f2 0f 11 4c 24 08       movsd  %xmm1,0x8(%rsp)
  400a9f:       e8 5c fe ff ff          callq  400900
<std::chrono::_V2::system_clock::now()@plt>

If I remove -march when compiling slow_add.cpp, the performance and the
generated assembly are in line with fast_add.cpp.  Compiling with
-fno-exceptions but keeping -march also solves the issue.

I also tried -march=native on both Westmere and Haswell machines, and it
produces the slow code on both.  Removing -march or adding -fno-exceptions
fixes the issue on both.
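For reference, the invocations described above (slow_add.cpp is the attached testcase; timings are from my machine and will vary):

```shell
# Reproduces the slow loop (~14 s add phase):
g++ -std=c++11 -march=nehalem -O3 slow_add.cpp -o slow_add

# Either of these produces the fast loop (~5 s add phase):
g++ -std=c++11 -O3 slow_add.cpp -o slow_add                   # -march removed
g++ -std=c++11 -march=nehalem -O3 -fno-exceptions slow_add.cpp -o slow_add
```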

I see the same issue with gcc 6.3.
