https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89582

            Bug ID: 89582
           Summary: Suboptimal code generated for floating point struct in
                    -O3 compare to -O2
           Product: gcc
           Version: 8.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

When testing the code for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89581 on
Linux x64, I noticed that the generated code seems suboptimal when compiled
with -O3 rather than -O2.

```
typedef struct {
    double x1;
    double x2;
} vdouble __attribute__((aligned(16)));

vdouble f(vdouble x, vdouble y)
{
    return (vdouble){x.x1 + y.x1, x.x2 + y.x2};
}
```

Compiling with `-O2` produces
```
f:
        addsd   %xmm3, %xmm1
        addsd   %xmm2, %xmm0
        ret
```

With `-O3` or `-Ofast`, however, the code produced is

```
f:
        movq    %xmm0, -40(%rsp)
        movq    %xmm1, -32(%rsp)
        movapd  -40(%rsp), %xmm4
        movq    %xmm2, -24(%rsp)
        movq    %xmm3, -16(%rsp)
        addpd   -24(%rsp), %xmm4
        movaps  %xmm4, -40(%rsp)
        movsd   -32(%rsp), %xmm1
        movsd   -40(%rsp), %xmm0
        ret
```

It seems that GCC tries to use a vector instruction (`addpd`) but has to go
through the stack to pack the operands and unpack the result. A quick benchmark
confirms that the -O3 version is much slower than the -O2 version.
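
The benchmark itself is not included in this report. A minimal sketch of how
such a comparison could be set up is shown here, assuming GCC or Clang on
x86-64. The helper `bench_f`, the iteration count and the use of `clock()` are
illustrative assumptions, not the code that was actually measured.

```c
#include <assert.h>
#include <time.h>

typedef struct {
    double x1;
    double x2;
} vdouble __attribute__((aligned(16)));

/* Same function as in the report. noinline (an addition for this sketch)
 * keeps the call in the loop so the generated body of f() is what runs. */
__attribute__((noinline)) vdouble f(vdouble x, vdouble y)
{
    return (vdouble){x.x1 + y.x1, x.x2 + y.x2};
}

/* Hypothetical helper: calls f() `iters` times in a dependent chain,
 * stores the elapsed CPU time in *seconds (if non-NULL) and returns the
 * accumulated value so the work cannot be discarded. */
vdouble bench_f(long iters, double *seconds)
{
    assert(iters >= 0);

    vdouble acc = {0.0, 0.0};
    vdouble step = {1.0, 2.0};

    clock_t start = clock();
    for (long i = 0; i < iters; i++)
        acc = f(acc, step);
    clock_t end = clock();

    if (seconds)
        *seconds = (double)(end - start) / CLOCKS_PER_SEC;
    return acc;
}
```

Calling `bench_f` from a small `main` with a large iteration count, building
the file once with `-O2` and once with `-O3`, and comparing the reported times
should give a rough idea of the difference. Because each call depends on the
previous result, this mostly measures latency rather than throughput.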

Clang produces

```
f:
        addsd   %xmm2, %xmm0
        addsd   %xmm3, %xmm1
        retq
```

Clang generates this whenever any optimization is enabled, which seems
appropriate.
