Hi Juzhe, > +/* Expand Vector POPCOUNT by parallel popcnt: > + > + int parallel_popcnt(uint32_t n) { > + #define POW2(c) (1U << (c)) > + #define MASK(c) (static_cast<uint32_t>(-1) / (POW2(POW2(c)) + 1U)) > + #define COUNT(x, c) ((x) & MASK(c)) + (((x)>>(POW2(c))) & MASK(c)) > + n = COUNT(n, 0); > + n = COUNT(n, 1); > + n = COUNT(n, 2); > + n = COUNT(n, 3); > + n = COUNT(n, 4); > + // n = COUNT(n, 5); // uncomment this line for 64-bit integers > + return n; > + #undef COUNT > + #undef MASK > + #undef POW2 > + }
That's quite a heavy implementation but I suppose with the proper cost function it can still be worth it. Did you also try some alternatives? WWG comes to mind: uint64_t c1 = 0x5555555555555555; uint64_t c2 = 0x3333333333333333; uint64_t c4 = 0x0F0F0F0F0F0F0F0F; uint64_t wwg (uint64_t x) { x -= (x >> 1) & c1; x = ((x >> 2) & c2) + (x & c2); x = (x + (x >> 4) ) & c4; x *= 0x0101010101010101; return x >> 56; } >From my recollection this is usually 30-40% faster than the naive tree adder and also amenable to vectorization. As long as the multiplication is not terribly slow, that is. Mula's algorithm should be significantly faster even, another 30% IIRC. I'm not against continuing with the more well-known approach for now but we should keep in mind that might still be potential for improvement. > } // namespace riscv_vector > diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/widen/popcount-1.c > b/gcc/testsuite/gcc.target/riscv/rvv/autovec/widen/popcount-1.c Any particular reason why the tests are in widen? > +extern void abort (void) __attribute__ ((noreturn)); Why no __builtin_unreachable as in the other tests? > + asm volatile ("" ::: "memory"); Is this necessary? I doesn't hurt of course, just wondering. All in all LGTM in case you'd rather get this upstream now. We can always improve later. Regards Robin