https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108608
--- Comment #13 from fengfei.xi at horizon dot auto ---
(In reply to Richard Sandiford from comment #12)
> (In reply to fengfei.xi from comment #11)
> > Could you please explain under what specific circumstances this change
> > might lead to slower performance? Also, is there a more complete fix or
> > any plans for further optimization?
>
> The log message was a bit cryptic, sorry. The problem isn't that the
> patch makes things slower. Instead, it's the feature that the patch is
> fixing that makes things slower.
>
> If the vectoriser vectorises something like:
>
>   int64_t f(int32_t *x, int32_t *y) {
>     int64_t res = 0;
>     for (int i = 0; i < 100; ++i)
>       res += x[i] * y[i];
>     return res;
>   }
>
> one option is to have one vector of 32-bit integers for each of x and y
> and two vectors of 64-bit integers for res (so that the total number of
> elements is the same). With this approach, the vectoriser can do two
> parallel additions on each res vector.
>
> In contrast, single def-use cycles replace the two res vectors with one
> res vector but add to it twice. You can see the effect in
> https://godbolt.org/z/o11zrMbWs . The main loop is:
>
>   ldr     q30, [x1, x2]
>   ldr     q29, [x0, x2]
>   add     x2, x2, 16
>   mul     v29.4s, v30.4s, v29.4s
>   saddw   v31.2d, v31.2d, v29.2s
>   saddw2  v31.2d, v31.2d, v29.4s
>   cmp     x2, 400
>   bne     .L2
>
> This adds to v31 twice, doubling the loop-carried latency. Ideally we
> would do:
>
>   ldr     q30, [x1, x2]
>   ldr     q29, [x0, x2]
>   add     x2, x2, 16
>   mul     v29.4s, v30.4s, v29.4s
>   saddw   v31.2d, v31.2d, v29.2s
>   saddw2  v28.2d, v28.2d, v29.4s
>   cmp     x2, 400
>   bne     .L2
>   add     v31.2d, v31.2d, v28.2d
>
> instead.
>
> The vectoriser specifically chooses the first (serial) version over the
> second (parallel) one. The commit message was complaining about that.
> But the patch doesn't change that decision. It just makes both versions
> work.

OK, thank you very much for your detailed explanation. I understand.

Best regards,
Fengfei.Xi