https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108608

--- Comment #12 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
(In reply to fengfei.xi from comment #11)
> could you please explain under what specific circumstances this change might
> lead to slower performance?
> Also, is there a more complete fix or any plans for further optimization?
The log message was a bit cryptic, sorry.  The problem isn't that the patch
makes things slower.  Instead, it's the feature that the patch is fixing that
makes things slower.

If the vectoriser vectorises something like:

  #include <stdint.h>

  int64_t f(int32_t *x, int32_t *y) {
    int64_t res = 0;
    for (int i = 0; i < 100; ++i)
      res += x[i] * y[i];
    return res;
  }

one option is to have one vector of 32-bit integers for each of x and y and two
vectors of 64-bit integers for res (so that the total number of elements is the
same).  With this approach, the two additions per iteration (one into each res
vector) are independent and can execute in parallel.
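
A purely scalar analogue of that strategy (the split into res0/res1 below is
mine, for illustration only; it is not what the vectoriser actually emits)
would look like:

  #include <stdint.h>

  // Two independent partial sums, combined only after the loop, so the
  // two additions per iteration do not form one serial dependency chain.
  int64_t f_split(int32_t *x, int32_t *y) {
    int64_t res0 = 0, res1 = 0;
    for (int i = 0; i < 100; i += 2) {
      res0 += x[i] * y[i];
      res1 += x[i + 1] * y[i + 1];
    }
    return res0 + res1;
  }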

In contrast, single def-use cycles replace the two res vectors with one res
vector but add to it twice.  You can see the effect in
https://godbolt.org/z/o11zrMbWs .  The main loop is:

        ldr     q30, [x1, x2]
        ldr     q29, [x0, x2]
        add     x2, x2, 16
        mul     v29.4s, v30.4s, v29.4s
        saddw   v31.2d, v31.2d, v29.2s   // both widening adds accumulate into
        saddw2  v31.2d, v31.2d, v29.4s   // v31, forming one serial chain
        cmp     x2, 400
        bne     .L2

This adds to v31 twice, doubling the loop-carried latency.  Ideally we would
do:

        ldr     q30, [x1, x2]
        ldr     q29, [x0, x2]
        add     x2, x2, 16
        mul     v29.4s, v30.4s, v29.4s
        saddw   v31.2d, v31.2d, v29.2s   // two independent accumulators
        saddw2  v28.2d, v28.2d, v29.4s   // (v31 and v28) run in parallel
        cmp     x2, 400
        bne     .L2
        add     v31.2d, v31.2d, v28.2d   // combine partial sums after the loop

instead.
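
For concreteness, the same two-accumulator shape written by hand with ACLE NEON
intrinsics (the function name and accumulator names are mine; this is a sketch,
not current GCC output) would be something like:

  #include <arm_neon.h>
  #include <stdint.h>

  int64_t f_neon(int32_t *x, int32_t *y) {
    int64x2_t acc_lo = vdupq_n_s64(0);  // accumulates the low halves
    int64x2_t acc_hi = vdupq_n_s64(0);  // accumulates the high halves
    for (int i = 0; i < 100; i += 4) {
      int32x4_t prod = vmulq_s32(vld1q_s32(x + i), vld1q_s32(y + i));
      acc_lo = vaddw_s32(acc_lo, vget_low_s32(prod));  // saddw
      acc_hi = vaddw_high_s32(acc_hi, prod);           // saddw2
    }
    // Combine the two partial sums once, after the loop.
    return vaddvq_s64(vaddq_s64(acc_lo, acc_hi));
  }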

The vectoriser specifically chooses the first (serial) version over the second
(parallel) one.  The commit message was complaining about that.  But the patch
doesn't change that decision.  It just makes both versions work.
