https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398
Wilco <wilco at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilco at gcc dot gnu.org

--- Comment #8 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Jiangning Liu from comment #0)
> For the small case below, GCC -O3 can't vectorize the small loop to do byte
> comparison in func2.
> 
> void *malloc(long unsigned int);
> typedef struct {
>     unsigned char *buffer;
> } data;
> 
> static unsigned char *func1(data *d)
> {
>     return d->buffer;
> }
> 
> static int func2(int max, int pos, unsigned char *cur)
> {
>     unsigned char *p = cur + pos;
>     int len = 0;
>     while (++len != max)
>         if (p[len] != cur[len])
>             break;
>     return cur[len];
> }
> 
> int main (int argc) {
>     data d;
>     d.buffer = malloc(2*argc);
>     return func2(argc, argc, func1(&d));
> }
> 
> At the moment, the following code is generated for this loop,
> 
>   4004d4: 38616862  ldrb  w2, [x3,x1]
>   4004d8: 6b00005f  cmp   w2, w0
>   4004dc: 540000a1  b.ne  4004f0 <main+0x50>
>   4004e0: 38616880  ldrb  w0, [x4,x1]
>   4004e4: 6b01027f  cmp   w19, w1
>   4004e8: 91000421  add   x1, x1, #0x1
>   4004ec: 54ffff41  b.ne  4004d4 <main+0x34>
> 
> In fact, this loop can be vectorized by checking if the comparison size is
> aligned to SIMD register length. It may introduce run time overhead, but
> cost model could make decision on doing it or not.

The only optimization that can be done here is unrolling. For this kind of
string matching the number of bytes that match will be small on average, so
even if vectorization were feasible, the startup overhead alone would kill
performance. With unrolling you can remove the comparison with max each
iteration and do 4 bytes per iteration like this:

loop4:
        ldrb    w2, [x3, #4]!
        ldrb    w0, [x4, #4]!
        cmp     w0, w2
        bne     exitloop1
        ldrb    w2, [x3, #1]
        ldrb    w0, [x4, #1]
        cmp     w0, w2
        bne     exitloop2
        ldrb    w2, [x3, #2]
        ldrb    w0, [x4, #2]
        cmp     w0, w2
        bne     exitloop3
        ldrb    w2, [x3, #3]
        ldrb    w0, [x4, #3]
        cmp     w0, w2
        bne     exitloop4
        add     x1, x1, #4
        cmp     w1, w19
        blt     loop4
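For illustration, the unrolling idea above can be sketched in C against the
original func2. This is only a hypothetical hand-unrolled variant (the name
func2_unrolled and the tail-loop structure are mine, not from the report):
the bound check against max runs once per four byte comparisons instead of
once per byte, and a scalar tail handles the leftover 0-3 iterations.

```c
#include <string.h>

/* Original scalar loop from the report, kept for reference. */
static int func2(int max, int pos, unsigned char *cur)
{
    unsigned char *p = cur + pos;
    int len = 0;
    while (++len != max)
        if (p[len] != cur[len])
            break;
    return cur[len];
}

/* Hypothetical 4-way unrolled variant: compares p[len] and cur[len]
   for len = 1 .. max-1, stopping at the first mismatch, like func2,
   but only tests the loop bound once per four comparisons. */
static int func2_unrolled(int max, int pos, unsigned char *cur)
{
    unsigned char *p = cur + pos;
    int len = 1;

    /* Main unrolled body: all four indices are below max here. */
    while (len + 3 < max) {
        if (p[len]     != cur[len])     goto done;
        if (p[len + 1] != cur[len + 1]) { len += 1; goto done; }
        if (p[len + 2] != cur[len + 2]) { len += 2; goto done; }
        if (p[len + 3] != cur[len + 3]) { len += 3; goto done; }
        len += 4;
    }

    /* Scalar tail for the remaining 0-3 iterations. */
    while (len != max && p[len] == cur[len])
        len++;

done:
    return cur[len];
}
```

The mismatch-index bookkeeping in the gotos mirrors the four exitloop
labels in the assembly sketch: each exit must know how far into the
unrolled group the mismatch occurred.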