https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68109

            Bug ID: 68109
           Summary: GCC fails to vectorize popcount on x86_64
           Product: gcc
           Version: 5.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
          Assignee: unassigned at gcc dot gnu.org
          Reporter: haneef503 at gmail dot com
  Target Milestone: ---

Created attachment 36595
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36595&action=edit
Clang Vectorized Assembly Output

The following code is an SSCCE that GCC doesn't vectorize on x86_64:

#include <stdlib.h>
#include <stdint.h>

size_t hd (const uint8_t *restrict a, const uint8_t *restrict b, size_t l) {
  size_t r = 0, x;
  for (x = 0; x < l; x++)
    r += __builtin_popcount (a[x] ^ b[x]);

  return r;
}

On other architectures, such as power8, GCC successfully vectorizes the loop.
On x86_64, however, there is no vector version of the `popcnt` instruction.
Despite this, as shown by [http://wm.ite.pl/articles/sse-popcount.html], it is
possible to vectorize popcount using SSE2 or SSSE3 instructions. Further
research on [https://software.intel.com/sites/landingpage/IntrinsicsGuide/]
suggests that additional performance gains may be possible on the latest
architectures by using AVX2 instructions along the same lines as in the
article, albeit with 256-bit YMM registers in place of the 128-bit XMM
registers used in the article. Since GCC often supports as-yet unreleased
architectures, I did a bit more research in the Intel Intrinsics Guide
mentioned above for future architectures and found that the same could likely
also be done using AVX-512 with the 512-bit ZMM registers, if you guys are
interested.
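
To make the SSSE3 idea concrete, here is a minimal hand-written sketch of the
nibble-lookup approach (`pshufb` as a 16-entry table, `psadbw` to sum the
per-byte counts). It is only an illustration under my own assumptions: the
names hd_scalar, hd_ssse3 and hd_sketch are hypothetical, the code is not the
assembly clang emits, and it is not necessarily what GCC's vectorizer should
generate. It relies on the GCC/Clang `target` attribute and
`__builtin_cpu_supports`, and falls back to the scalar loop elsewhere.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <immintrin.h>
#endif

/* Scalar reference: the same loop as hd() in this report. */
size_t hd_scalar (const uint8_t *a, const uint8_t *b, size_t l) {
  size_t r = 0, x;
  for (x = 0; x < l; x++)
    r += (size_t) __builtin_popcount (a[x] ^ b[x]);
  return r;
}

#if defined(__x86_64__)
/* Hypothetical SSSE3 variant: 16 bytes per iteration. */
__attribute__ ((target ("ssse3")))
size_t hd_ssse3 (const uint8_t *a, const uint8_t *b, size_t l) {
  /* lut[i] = number of set bits in the 4-bit value i. */
  const __m128i lut = _mm_setr_epi8 (0, 1, 1, 2, 1, 2, 2, 3,
                                     1, 2, 2, 3, 2, 3, 3, 4);
  const __m128i low4 = _mm_set1_epi8 (0x0f);
  const __m128i zero = _mm_setzero_si128 ();
  __m128i total = zero;              /* two 64-bit partial sums */
  uint64_t lanes[2];
  size_t r, x = 0;

  for (; x + 16 <= l; x += 16) {
    __m128i va = _mm_loadu_si128 ((const __m128i *) (a + x));
    __m128i vb = _mm_loadu_si128 ((const __m128i *) (b + x));
    __m128i v = _mm_xor_si128 (va, vb);
    /* Split every byte into its low and high nibble. */
    __m128i lo = _mm_and_si128 (v, low4);
    __m128i hi = _mm_and_si128 (_mm_srli_epi16 (v, 4), low4);
    /* pshufb looks up the bit count of each nibble; add them per byte. */
    __m128i cnt = _mm_add_epi8 (_mm_shuffle_epi8 (lut, lo),
                                _mm_shuffle_epi8 (lut, hi));
    /* psadbw against zero sums each group of 8 bytes into a 64-bit lane. */
    total = _mm_add_epi64 (total, _mm_sad_epu8 (cnt, zero));
  }

  _mm_storeu_si128 ((__m128i *) lanes, total);
  r = (size_t) (lanes[0] + lanes[1]);

  /* Scalar tail for the remaining (l % 16) bytes. */
  for (; x < l; x++)
    r += (size_t) __builtin_popcount (a[x] ^ b[x]);
  return r;
}
#endif

/* Use the SSSE3 path when the CPU supports it, else the scalar loop. */
size_t hd_sketch (const uint8_t *a, const uint8_t *b, size_t l) {
#if defined(__x86_64__)
  if (__builtin_cpu_supports ("ssse3"))
    return hd_ssse3 (a, b, l);
#endif
  return hd_scalar (a, b, l);
}
```

An AVX2 version would presumably follow the same pattern with 256-bit
registers and the corresponding `_mm256_*` intrinsics, but I have not written
or measured one.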

Anyway, I found that clang has been doing these optimizations since
~clang 3.5. I've attached the resulting [vectorized] assembly emitted by
clang 3.7 for the above function, since it appears to be done relatively
thoroughly and cleanly.

In both GCC and Clang, I used the following flags:

-xc -O2 -ftree-vectorize -D_GNU_SOURCE  -std=gnu11 -fverbose-asm
