https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398
Jiu Fu Guo <guojiufu at gcc dot gnu.org> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |guojiufu at gcc dot gnu.org --- Comment #25 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> --- (In reply to Jakub Jelinek from comment #10) > If the compiler knew say from PGO that pos is usually a multiple of certain > power of two and that the loop usually iterates many times (I guess the > latter can be determined from comparing the bb count of the loop itself and > its header), it could emit something like: > static int func2(int max, int pos, unsigned char *cur) > { > unsigned char *p = cur + pos; > int len = 0; > if (max > 32 && (pos & 7) == 0) > { > int l = ((1 - ((uintptr_t) cur)) & 7) + 1; > while (++len != l) > if (p[len] != cur[len]) > goto end; > unsigned long long __attribute__((may_alias)) *p2 = (unsigned long > long *) &p[len]; > unsigned long long __attribute__((may_alias)) *cur2 = (unsigned long > long *) &cur[len]; > while (len + 8 < max) > { > if (*p2++ != *cur2++) > break; > len += 8; > } > --len; > } > while (++len != max) > if (p[len] != cur[len]) > break; > end: > return cur[len]; > } > > or so (untested). Of course, it could be done using SIMD too if there is a > way to terminate the loop if any of the elts is different and could be done > in that case at 16 or 32 or 64 characters at a time etc. > But, without knowing that pos is typically some power of two this would just > waste code size, dealing with the unaligned cases would be more complicated > (one can't read the next elt until proving that the current one is all > equal), so it would need to involve some rotations (or permutes for SIMD). With this kind of widening, we can see ~5% performance improvement for xz on some platforms.