https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115252
Bug ID: 115252
Summary: The SLP vectorizer failed to perform automatic
vectorization on pixel_sub_wxh of x264
Product: gcc
Version: 14.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: hkzhang455 at gmail dot com
Target Milestone: ---
Test case: (from https://github.com/mirror/x264/blob/master/common/dct.c)
void pixel_sub_wxh(int16_t *diff, uint8_t *pix1, uint8_t *pix2) {
  for (int y = 0; y < 4; y++) {
    for (int x = 0; x < 4; x++)
      diff[x + y * 4] = pix1[x] - pix2[x];
    pix1 += 16;
    pix2 += 32;
  }
}
This is a simplified version; in the original code the function is inlined and
some of the parameters are constant.
When I compile with `-O3 -mavx2` or `-O3 -msse4.2`, the inner loop is unrolled.
The unrolled code should then be vectorizable, but the SLP vectorizer fails to
vectorize it. Adding `-fopt-info-vec-all` gives the following messages:
<source>:6:21: optimized: loop vectorized using 8 byte vectors
<source>:6:21: optimized: loop versioned for vectorization because of
possible aliasing
<source>:5:6: note: vectorized 1 loops in function.
<source>:5:6: note: ***** Analysis failed with vector mode V8SI
<source>:5:6: note: ***** The result for vector mode V32QI would be the same
<source>:5:6: note: ***** Re-trying analysis with vector mode V16QI
<source>:5:6: note: ***** Analysis failed with vector mode V16QI
<source>:5:6: note: ***** Re-trying analysis with vector mode V8QI
<source>:5:6: note: ***** Analysis failed with vector mode V8QI
<source>:5:6: note: ***** Re-trying analysis with vector mode V4QI
<source>:5:6: note: ***** Analysis failed with vector mode V4QI
If I rewrite the code by hand using the vector types declared in
`immintrin.h`, I get the following, which is what I hope the SLP
vectorizer could produce:
void pixel_sub_wxh_vec(int16_t *diff, uint8_t *pix1, uint8_t *pix2) {
  for (int y = 0; y < 4; y++) {
    __v4hi pix1_v = {pix1[0], pix1[1], pix1[2], pix1[3]};
    __v4hi pix2_v = {pix2[0], pix2[1], pix2[2], pix2[3]};
    __v4hi diff_v = pix1_v - pix2_v;
    *(long long *)(diff + y * 4) = (long long)diff_v;
    pix1 += 16;
    pix2 += 32;
  }
}
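For comparison, here is a minimal sketch of the same per-row operation written with GCC's generic vector extensions instead of the `immintrin.h` types. It assumes a compiler that provides `__attribute__((vector_size))` and `__builtin_convertvector` (GCC 9 or later, or Clang). The names `u8x4`, `i16x4`, `pixel_sub_wxh_ref`, `pixel_sub_wxh_generic` and `check_pixel_sub_variants` are hypothetical and come neither from x264 nor from GCC. It only illustrates the 4-lane load, widen, subtract and store pattern per row, and is not a claim about what the SLP vectorizer would emit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type names: generic GCC/Clang vector types, 4 lanes each. */
typedef uint8_t u8x4 __attribute__((vector_size(4)));
typedef int16_t i16x4 __attribute__((vector_size(8)));

/* Scalar reference: the same loop nest as the test case. */
void pixel_sub_wxh_ref(int16_t *diff, const uint8_t *pix1,
                       const uint8_t *pix2) {
  for (int y = 0; y < 4; y++) {
    for (int x = 0; x < 4; x++)
      diff[x + y * 4] = pix1[x] - pix2[x];
    pix1 += 16;
    pix2 += 32;
  }
}

/* Sketch: one 4-lane load, widen, subtract and store per row. */
void pixel_sub_wxh_generic(int16_t *diff, const uint8_t *pix1,
                           const uint8_t *pix2) {
  for (int y = 0; y < 4; y++) {
    u8x4 a, b;
    /* memcpy avoids alignment and aliasing assumptions on the inputs. */
    memcpy(&a, pix1, sizeof a);
    memcpy(&b, pix2, sizeof b);
    /* Zero-extend each uint8_t lane to int16_t, then subtract lane-wise. */
    i16x4 d = __builtin_convertvector(a, i16x4) -
              __builtin_convertvector(b, i16x4);
    memcpy(diff + y * 4, &d, sizeof d);
    pix1 += 16;
    pix2 += 32;
  }
}

/* Compare both variants on a deterministic input pattern. */
void check_pixel_sub_variants(void) {
  uint8_t pix1[3 * 16 + 4], pix2[3 * 32 + 4];
  int16_t want[16], got[16];
  for (size_t i = 0; i < sizeof pix1; i++)
    pix1[i] = (uint8_t)(i * 37 + 11);
  for (size_t i = 0; i < sizeof pix2; i++)
    pix2[i] = (uint8_t)(i * 101 + 5);
  pixel_sub_wxh_ref(want, pix1, pix2);
  pixel_sub_wxh_generic(got, pix1, pix2);
  assert(memcmp(want, got, sizeof want) == 0);
}
```

Calling `check_pixel_sub_variants()` from a small driver confirms that the two variants agree. The buffer sizes follow from the strides in the test case: the last row reads `pix1[48..51]` and `pix2[96..99]`.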
I have already raised this issue on the GCC mailing list, and Biener gave some
analysis: pix1 and pix2 are both of type uint8_t and are iterated as scalars,
so the issue is expected. I am still filing a bug here and hope to follow up.