https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115252
Bug ID: 115252
Summary: The SLP vectorizer failed to perform automatic vectorization on pixel_sub_wxh of x264
Product: gcc
Version: 14.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: hkzhang455 at gmail dot com
Target Milestone: ---

Test case (from https://github.com/mirror/x264/blob/master/common/dct.c):

#include <stdint.h>

void pixel_sub_wxh(int16_t *diff, uint8_t *pix1, uint8_t *pix2) {
  for (int y = 0; y < 4; y++) {
    for (int x = 0; x < 4; x++)
      diff[x + y * 4] = pix1[x] - pix2[x];
    pix1 += 16;
    pix2 += 32;
  }
}

This is a simplified version, because in the original code the function is inlined and some of its parameters are constants.

When I compile with `-O3 -mavx2` or `-O3 -msse4.2`, the inner loop is unrolled. After that, the unrolled code should be vectorizable, but the SLP vectorizer fails to vectorize it. Adding `-fopt-info-vec-all` gives the following messages:

<source>:6:21: optimized: loop vectorized using 8 byte vectors
<source>:6:21: optimized: loop versioned for vectorization because of possible aliasing
<source>:5:6: note: vectorized 1 loops in function.
<source>:5:6: note: ***** Analysis failed with vector mode V8SI
<source>:5:6: note: ***** The result for vector mode V32QI would be the same
<source>:5:6: note: ***** Re-trying analysis with vector mode V16QI
<source>:5:6: note: ***** Analysis failed with vector mode V16QI
<source>:5:6: note: ***** Re-trying analysis with vector mode V8QI
<source>:5:6: note: ***** Analysis failed with vector mode V8QI
<source>:5:6: note: ***** Re-trying analysis with vector mode V4QI
<source>:5:6: note: ***** Analysis failed with vector mode V4QI

If I rewrite the code by hand using the type declarations provided by `immintrin.h`, I get the following, which is what I hope the SLP vectorizer would be able to produce:

#include <stdint.h>
#include <immintrin.h>

void pixel_sub_wxh_vec(int16_t *diff, uint8_t *pix1, uint8_t *pix2) {
  for (int y = 0; y < 4; y++) {
    /* Gather one row of four pixels from each source, widened to 16 bits. */
    __v4hi pix1_v = {pix1[0], pix1[1], pix1[2], pix1[3]};
    __v4hi pix2_v = {pix2[0], pix2[1], pix2[2], pix2[3]};
    __v4hi diff_v = pix1_v - pix2_v;
    /* Store the four 16-bit differences with a single 8-byte write. */
    *(long long *)(diff + y * 4) = (long long)diff_v;
    pix1 += 16;
    pix2 += 32;
  }
}

I already raised this issue on the GCC mailing list, and Biener gave some analysis: pix1 and pix2 are both of type uint8_t, and their iterations are scalar, so this issue will exist. I am still filing a bug here and hope it can be followed up.
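For anyone who wants to check that a row-wise vector version is equivalent to the scalar loop in this report, a minimal sketch follows. It is not the code from the report: it uses GCC/Clang generic vector extensions (`__attribute__((vector_size))` and `__builtin_convertvector`, assumed available, GCC 9 or later) instead of `__v4hi` from `immintrin.h`, and it stores through `memcpy` rather than a pointer cast. The names `v4qu`, `v4hi`, `pixel_sub_wxh_rows` and `count_mismatches` are hypothetical, chosen only for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative vector types (not from the report): 4 x uint8_t and 4 x int16_t. */
typedef uint8_t v4qu __attribute__((vector_size(4)));
typedef int16_t v4hi __attribute__((vector_size(8)));

/* Scalar reference, same body as the test case in the report. */
void pixel_sub_wxh(int16_t *diff, uint8_t *pix1, uint8_t *pix2)
{
    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++)
            diff[x + y * 4] = pix1[x] - pix2[x];
        pix1 += 16;
        pix2 += 32;
    }
}

/* Row-wise sketch: one 4-byte load per source row, widen each lane
   to 16 bits, subtract, then one 8-byte store per output row. */
void pixel_sub_wxh_rows(int16_t *diff, const uint8_t *pix1, const uint8_t *pix2)
{
    for (int y = 0; y < 4; y++) {
        v4qu a, b;
        memcpy(&a, pix1, sizeof a);
        memcpy(&b, pix2, sizeof b);
        /* uint8_t -> int16_t conversion zero-extends each lane. */
        v4hi d = __builtin_convertvector(a, v4hi)
               - __builtin_convertvector(b, v4hi);
        memcpy(diff + y * 4, &d, sizeof d);
        pix1 += 16;
        pix2 += 32;
    }
}

/* Runs both versions on the same arbitrary input and returns the number
   of output elements that differ. */
int count_mismatches(void)
{
    uint8_t p1[4 * 16], p2[4 * 32];
    int16_t ref[16], out[16];
    int bad = 0;

    for (int i = 0; i < 4 * 16; i++)
        p1[i] = (uint8_t)(i * 7 + 3);
    for (int i = 0; i < 4 * 32; i++)
        p2[i] = (uint8_t)(255 - i * 5);

    pixel_sub_wxh(ref, p1, p2);
    pixel_sub_wxh_rows(out, p1, p2);

    for (int i = 0; i < 16; i++)
        bad += (ref[i] != out[i]);
    return bad;
}
```

Calling `count_mismatches()` from a small `main` and compiling with `-O3 -mavx2 -fopt-info-vec-all` lets the two versions be compared directly. The `memcpy` stores are a deliberate choice in this sketch: they express the same 8-byte write as the pointer cast without relying on type punning through `long long *`.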