https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

Wilco <wilco at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilco at gcc dot gnu.org

--- Comment #8 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Jiangning Liu from comment #0)
> For the small case below, GCC -O3 can't vectorize the small loop to do byte
> comparison in func2.
> 
> void *malloc(long unsigned int);
> typedef struct {
>         unsigned char *buffer;
> } data;
> 
> static unsigned char *func1(data *d)
> {
>         return d->buffer;
> }
> 
> static int func2(int max, int pos, unsigned char *cur)
> {
>         unsigned char *p = cur + pos;
>         int len = 0;
>         while (++len != max)
>                 if (p[len] != cur[len])
>                         break;
>         return cur[len];
> }
> 
> int main (int argc) {
>         data d;
>         d.buffer = malloc(2*argc);
>         return func2(argc, argc, func1(&d));
> }
> 
> At the moment, the following code is generated for this loop,
> 
>   4004d4:       38616862        ldrb    w2, [x3,x1]
>   4004d8:       6b00005f        cmp     w2, w0
>   4004dc:       540000a1        b.ne    4004f0 <main+0x50>
>   4004e0:       38616880        ldrb    w0, [x4,x1]
>   4004e4:       6b01027f        cmp     w19, w1
>   4004e8:       91000421        add     x1, x1, #0x1
>   4004ec:       54ffff41        b.ne    4004d4 <main+0x34>
> 
> In fact, this loop can be vectorized by checking if the comparison size is
> aligned to SIMD register length. It may introduce run time overhead, but
> cost model could make decision on doing it or not.

The only optimization that can be done here is unrolling. For this kind of
string matching the number of bytes that match is will be small on average, so
even if vectorization was feasible, the startup overhead alone would kill
performance. With unrolling you can remove the comparison with max each
iteration and do 4 bytes per iteration like this:

loop4:
ldrb    w2, [x3,4]!
ldrb    w0, [x4,4]!
cmp     w0, w2
bne     exitloop1
ldrb    w2, [x3,1]
ldrb    w0, [x4,1]
cmp     w0, w2
bne     exitloop2
ldrb    w2, [x3,2]
ldrb    w0, [x4,2]
cmp     w0, w2
bne     exitloop3
ldrb    w2, [x3,3]
ldrb    w0, [x4,3]
cmp     w0, w2
bne     exitloop4
add     x1, x1, 4
cmp     x1, w19
blt     loop4

Reply via email to