https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68109
            Bug ID: 68109
           Summary: GCC fails to vectorize popcount on x86_64
           Product: gcc
           Version: 5.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
          Assignee: unassigned at gcc dot gnu.org
          Reporter: haneef503 at gmail dot com
  Target Milestone: ---

Created attachment 36595
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36595&action=edit
Clang Vectorized Assembly Output

The following code is an SSCCE that GCC doesn't vectorize on x86_64:

#include <stdlib.h>
#include <stdint.h>

size_t hd (const uint8_t *restrict a, const uint8_t *restrict b, size_t l) {
  size_t r = 0, x;

  for (x = 0; x < l; x++)
    r += __builtin_popcount (a[x] ^ b[x]);

  return r;
}

On other architectures, such as power8, GCC successfully vectorizes the loop.
On x86_64, however, there is no vector version of the `popcnt` instruction.
Despite this, as shown by [http://wm.ite.pl/articles/sse-popcount.html], it is
possible to vectorize popcount using SSE2 or SSSE3 instructions.

Further research on
[https://software.intel.com/sites/landingpage/IntrinsicsGuide/] suggests that
further performance gains may be possible on the latest architectures by using
AVX2 instructions along the same lines as in the article, with 256-bit YMM
registers in place of the 128-bit XMM registers used there.

Since GCC often supports architectures that have not been released yet, I did
a bit more research in the Intel Intrinsics Guide mentioned above and found
that the same could likely also be done using AVX-512 with the 512-bit ZMM
registers, if you guys are interested.

Anyway, I did find that clang has been doing these optimizations since
~clang 3.5. I've attached the resulting [vectorized] assembly emitted by
clang 3.7 for the above function, since it appears to be done relatively
thoroughly and cleanly.

In both GCC and Clang, I used the following flags:

-xc -O2 -ftree-vectorize -D_GNU_SOURCE -std=gnu11 -fverbose-asm
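
To illustrate the kind of transformation being asked for, here is a minimal sketch of a per-byte, bit-parallel popcount applied to the `hd` loop above. It is not the code from the linked article, nor what clang emits in the attachment. It assumes GCC's generic vector extensions (`vector_size`) with 16-byte vectors, so it needs no x86-specific intrinsics. The names `v16u8`, `splat`, `popcount_bytes`, and `hd_vec` are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 16 unsigned bytes per vector (GCC/Clang generic vector extension).
   The 16-byte width is an assumption matching one XMM register. */
typedef uint8_t v16u8 __attribute__ ((vector_size (16)));

/* Hypothetical helper: broadcast one byte value to all 16 lanes. */
static v16u8
splat (uint8_t b)
{
  v16u8 v;
  memset (&v, b, sizeof v);
  return v;
}

/* Population count of every byte lane, using only shifts, masks,
   adds and subtracts (no popcnt instruction required). */
static v16u8
popcount_bytes (v16u8 v)
{
  /* Counts of each 2-bit field. */
  v = v - ((v >> splat (1)) & splat (0x55));
  /* Counts of each 4-bit field. */
  v = (v & splat (0x33)) + ((v >> splat (2)) & splat (0x33));
  /* Count of the whole byte (0..8). */
  v = (v + (v >> splat (4))) & splat (0x0f);
  return v;
}

/* Same result as hd() from the report, 16 bytes per iteration. */
size_t
hd_vec (const uint8_t *restrict a, const uint8_t *restrict b, size_t l)
{
  size_t r = 0, x = 0;

  for (; x + 16 <= l; x += 16)
    {
      v16u8 va, vb, c;

      /* memcpy keeps the loads valid for unaligned pointers. */
      memcpy (&va, a + x, sizeof va);
      memcpy (&vb, b + x, sizeof vb);
      c = popcount_bytes (va ^ vb);

      /* Horizontal sum of the 16 per-byte counts. */
      for (int i = 0; i < 16; i++)
        r += c[i];
    }

  /* Scalar tail for the remaining l % 16 bytes. */
  for (; x < l; x++)
    r += __builtin_popcount (a[x] ^ b[x]);

  return r;
}
```

The horizontal sum on every iteration keeps the sketch short. Since each lane holds at most 8, a tuned version could accumulate the byte counts over up to 31 iterations before widening. How well this lowers to SSE2 or AVX2 code depends on the compiler version and target flags.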