https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91201
--- Comment #3 from Joel Yliluoma <bisqwit at iki dot fi> ---
For the record, for this particular case (an 8-bit checksum of an array, 16
bytes in this case), there is even better SIMD code, which ICC (version 18 or
later) generates automatically:

        vmovups   xmm0, XMMWORD PTR bytes[rip]                  #5.9
        vpxor     xmm2, xmm2, xmm2                              #4.41
        vpaddb    xmm0, xmm2, xmm0                              #4.41
        vpsrldq   xmm1, xmm0, 8                                 #4.41
        vpaddb    xmm3, xmm0, xmm1                              #4.41
        vpsadbw   xmm4, xmm2, xmm3                              #4.41
        vmovd     eax, xmm4                                     #4.41
        movsx     rax, al                                       #4.41
        ret                                                     #7.16
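
For readers who prefer intrinsics, the same idea can be sketched in C as below. This is only an illustration of the technique (fold the upper 8 bytes onto the lower 8 with a byte-wise add, then let psadbw against zero sum the remaining 8 bytes horizontally), not the source that produced the listing. The function name checksum16 and the uint8_t return type are assumptions. The movsx at the end of the listing suggests the original function returns a signed value, which this sketch does not try to reproduce.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: 8-bit wrapping sum of exactly 16 bytes. */
static uint8_t checksum16(const uint8_t *p)
{
    /* Unaligned 16-byte load (movups/movdqu). */
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    __m128i zero = _mm_setzero_si128();

    /* Add bytes 8..15 onto bytes 0..7; each lane wraps modulo 256 (psrldq + paddb). */
    __m128i folded = _mm_add_epi8(v, _mm_srli_si128(v, 8));

    /* psadbw against zero sums the 8 low bytes into a 16-bit value
       in the low 64-bit lane. */
    __m128i sad = _mm_sad_epu8(zero, folded);

    /* Take the low 32 bits (movd) and keep only the low byte. */
    return (uint8_t)_mm_cvtsi128_si32(sad);
}
```

Because every step is taken modulo 256, the byte-wise fold loses nothing that the final 8-bit truncation would have kept, so the result should match a plain scalar loop over the 16 bytes.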