https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, we would also be able to vectorize just the red and green channel:

t.c:18:27: note: ***** Analysis succeeded with vector mode V4SF
t.c:18:27: note: SLPing BB part
t.c:18:27: note: Costing subgraph:
t.c:18:27: note: node 0x420b6c8 (max_nunits=2, refcnt=1) vector(2) unsigned
char
t.c:18:27: note: op template: q_45(D)->red = _29;
t.c:18:27: note:        stmt 0 q_45(D)->red = _29;
t.c:18:27: note:        stmt 1 q_45(D)->green = _31;
t.c:18:27: note:        children 0x420b750
t.c:18:27: note: node (external) 0x420b750 (max_nunits=2, refcnt=1) vector(2)
unsigned char
t.c:18:27: note:        stmt 0 _29 = (unsigned char) pixel$red_78;
t.c:18:27: note:        stmt 1 _31 = (unsigned char) pixel$green_84;
t.c:18:27: note:        children 0x420b7d8
t.c:18:27: note: node 0x420b7d8 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: pixel$red_78 = PHI <_142(11),
pixel$red_60(D)(10)>
t.c:18:27: note:        stmt 0 pixel$red_78 = PHI <_142(11),
pixel$red_60(D)(10)>
t.c:18:27: note:        stmt 1 pixel$green_84 = PHI <_144(11),
pixel$green_61(D)(10)>
t.c:18:27: note:        children 0x420b860 0x420be38
t.c:18:27: note: node 0x420b860 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: _142 = PHI <_143(4)>
t.c:18:27: note:        stmt 0 _142 = PHI <_143(4)>
t.c:18:27: note:        stmt 1 _144 = PHI <_145(4)>
t.c:18:27: note:        children 0x420b8e8
t.c:18:27: note: node 0x420b8e8 (max_nunits=2, refcnt=2) vector(2) float
t.c:18:27: note: op template: _143 = PHI <_12(3)>
t.c:18:27: note:        stmt 0 _143 = PHI <_12(3)>
t.c:18:27: note:        stmt 1 _145 = PHI <_17(3)>
t.c:18:27: note:        children 0x420b970
t.c:18:27: note: node 0x420b970 (max_nunits=2, refcnt=2) vector(2) float
t.c:18:27: note: op template: _12 = _11 + pixel$red_80;
t.c:18:27: note:        stmt 0 _12 = _11 + pixel$red_80;
t.c:18:27: note:        stmt 1 _17 = _16 + pixel$green_82;
t.c:18:27: note:        children 0x420b9f8 0x420bca0
t.c:18:27: note: node 0x420b9f8 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: _11 = _4 * _10;
t.c:18:27: note:        stmt 0 _11 = _4 * _10;
t.c:18:27: note:        stmt 1 _16 = _4 * _15;
t.c:18:27: note:        children 0x420ba80 0x420bb08
t.c:18:27: note: node (external) 0x420ba80 (max_nunits=1, refcnt=1) vector(2)
float
t.c:18:27: note:        { _4, _4 }
t.c:18:27: note: node 0x420bb08 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: _10 = (float) _9;
t.c:18:27: note:        stmt 0 _10 = (float) _9;
t.c:18:27: note:        stmt 1 _15 = (float) _14;
t.c:18:27: note:        children 0x420bb90
t.c:18:27: note: node (external) 0x420bb90 (max_nunits=2, refcnt=1) vector(2)
int
t.c:18:27: note:        stmt 0 _9 = (int) _8;
t.c:18:27: note:        stmt 1 _14 = (int) _13;
t.c:18:27: note:        children 0x420bc18
t.c:18:27: note: node 0x420bc18 (max_nunits=2, refcnt=1) vector(2) unsigned
char
t.c:18:27: note: op template: _8 = _7->red;
t.c:18:27: note:        stmt 0 _8 = _7->red;
t.c:18:27: note:        stmt 1 _13 = _7->green;
t.c:18:27: note: node 0x420bca0 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: pixel$red_80 = PHI <_12(9), pixel$red_79(5)>
t.c:18:27: note:        stmt 0 pixel$red_80 = PHI <_12(9), pixel$red_79(5)>
t.c:18:27: note:        stmt 1 pixel$green_82 = PHI <_17(9), pixel$green_85(5)>
t.c:18:27: note:        children 0x420b970 0x420bd28
t.c:18:27: note: node 0x420bd28 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: pixel$red_79 = PHI <_143(8), pixel$red_60(D)(7)>
t.c:18:27: note:        stmt 0 pixel$red_79 = PHI <_143(8), pixel$red_60(D)(7)>
t.c:18:27: note:        stmt 1 pixel$green_85 = PHI <_145(8),
pixel$green_61(D)(7)>
t.c:18:27: note:        children 0x420b8e8 0x420bdb0
t.c:18:27: note: node (external) 0x420bdb0 (max_nunits=1, refcnt=1) vector(2)
float
t.c:18:27: note:        { pixel$red_60(D), pixel$green_61(D) }
t.c:18:27: note: node (external) 0x420be38 (max_nunits=1, refcnt=1) vector(2)
float
t.c:18:27: note:        { pixel$red_60(D), pixel$green_61(D) }

But the '(external)' nodes show that we're missing support for some operations:

t.c:18:27: note:   ==> examining statement: _29 = (unsigned char) pixel$red_78;
t.c:18:27: note:   vect_is_simple_use: operand pixel$red_78 = PHI <_142(11),
pixel$red_60(D)(10)>, type of def: internal
t.c:18:27: missed:   conversion not supported by target.
t.c:18:27: note:   vect_is_simple_use: operand pixel$red_78 = PHI <_142(11),
pixel$red_60(D)(10)>, type of def: internal
t.c:18:27: missed:   no optab.
t.c:18:27: missed:   not vectorized: relevant stmt not supported: _29 =
(unsigned char) pixel$red_78;
t.c:18:27: note:   Building vector operands of 0x4215e90 from scalars instead

That's the float -> unsigned char conversion feeding the stores:

                    q->red=pixel.red;
                    q->green=pixel.green;

We then cut the SLP graph off at that node; we're not considering keeping
the remainder and materializing the sources of the conversions from vector
components.  That is, we're not trying to split the SLP graph at
such edges but simply throw away the unreachable bits.
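
To illustrate what splitting at such an edge could look like, here is a
minimal sketch using GCC's generic vector extensions.  The struct layout,
the function name blend_red_green and its parameters are assumptions made
for illustration (t.c itself is not shown here); this is not what the
vectorizer emits today.  The float arithmetic stays in a two-lane vector,
and the lanes are extracted only at the unsupported conversion:

```c
/* Hypothetical types and names; the real t.c is not part of this comment. */
typedef float v2sf __attribute__((vector_size(8)));  /* GCC/clang vector extension */
typedef struct { unsigned char red, green, blue, alpha; } Pixel8;

/* Accumulate weighted red and green channels in a 2-lane float vector,
   then split at the float -> unsigned char edge with scalar conversions. */
static void blend_red_green(const Pixel8 *p, const float *w, int n, Pixel8 *q)
{
    v2sf acc = { 0.0f, 0.0f };                 /* lanes: { red, green } */
    for (int i = 0; i < n; i++) {
        v2sf chan = { (float)p[i].red, (float)p[i].green };
        v2sf weight = { w[i], w[i] };          /* like the { _4, _4 } splat in the dump */
        acc += weight * chan;                  /* vector multiply and add */
    }
    /* Split point: extract each lane and convert it as a scalar. */
    q->red = (unsigned char)acc[0];
    q->green = (unsigned char)acc[1];
}
```

The blue channel and the constant alpha store are left out since the dump
above only covers red and green.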

So there's this BB SLP issue, the issue that we're not vectorizing the loop,
and possibly the issue that we're not able to vectorize this conversion.

You btw didn't show me whether clang vectorizes the store (and this
conversion).  clang 13 does

        vcvttps2dq      %xmm1, %xmm1
        vpackusdw       %xmm1, %xmm1, %xmm1
        vpackuswb       %xmm1, %xmm1, %xmm1
        vcvttss2si      %xmm0, %eax
        jmp     .LBB0_9
.LBB0_1:
                                        # implicit-def: $al
                                        # implicit-def: $xmm1
.LBB0_9:
        vpextrb $0, %xmm1, (%r8)
        vpextrb $1, %xmm1, 1(%r8)
        movb    %al, 2(%r8)
        movb    $-1, 3(%r8)

so it doesn't vectorize the stores, and it vectorizes the conversions
by converting to int and then packing twice, first to short and then to char.
I suppose that since it extracts the bytes one by one anyway, clang would
have been faster extracting the two floats and doing scalar conversions,
like it does for blue.
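
For reference, a scalar model of what one lane goes through in that
vcvttps2dq / vpackusdw / vpackuswb sequence.  This is only a sketch: the
helper names are made up, and it assumes the input float fits in a 32-bit
int.  For values in 0..255 the result matches the plain C cast; outside
that range the packs saturate, whereas the C cast is undefined.

```c
#include <stdint.h>

/* vcvttps2dq, one lane: truncate toward zero.  Assumes f fits in int32_t. */
static int32_t cvtt_lane(float f)
{
    return (int32_t)f;
}

/* vpackusdw, one lane: signed 32-bit to unsigned 16-bit, saturating. */
static uint16_t packusdw_lane(int32_t v)
{
    if (v < 0) return 0;
    if (v > 65535) return 65535;
    return (uint16_t)v;
}

/* vpackuswb, one lane: the 16 input bits are read as a signed value,
   the output saturates to 0..255. */
static uint8_t packuswb_lane(uint16_t bits)
{
    int32_t v = bits >= 0x8000 ? (int32_t)bits - 0x10000 : (int32_t)bits;
    if (v < 0) return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* Hypothetical helper chaining the three steps for a single value. */
static uint8_t float_to_u8_via_packs(float f)
{
    return packuswb_lane(packusdw_lane(cvtt_lane(f)));
}
```

Note that the second pack reads its input as signed, so an intermediate
value of 32768 or more ends up as 0 rather than 255.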
