https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91201
--- Comment #10 from Marc Glisse <glisse at gcc dot gnu.org> ---
For AVX512, I wonder if we could use vpsadbw to compute the sums for each 64-bit part, then vcompressb to collect them in the lower 64 bits, then vpsadbw to conclude. Or whatever other faster variant (is Peter Cordes around?). But that's not required for this patch.
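
To make the three steps concrete, here is a rough scalar C model of the sequence, not actual vector code and not what the patch does. It assumes only the low 8 bits of the total are wanted (a byte-typed reduction), which is what makes it safe to keep just the low byte of each 64-bit partial sum. All function names are hypothetical. A real version would presumably use intrinsics such as _mm512_sad_epu8 against a zero vector and _mm512_maskz_compress_epi8 (AVX512_VBMI2) with mask 0x0101010101010101; whether that beats the usual shuffle/add reduction is exactly the open question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of vpsadbw against an all-zero vector: each 64-bit lane
   receives the sum of its own 8 unsigned bytes (at most 8 * 255 = 2040). */
static void model_sad_vs_zero(const uint8_t *src, uint64_t *lanes, size_t nlanes)
{
    assert(nlanes <= 8);              /* a 512-bit vector has 8 qword lanes */
    for (size_t i = 0; i < nlanes; i++) {
        uint64_t s = 0;
        for (size_t j = 0; j < 8; j++)
            s += src[8 * i + j];
        lanes[i] = s;
    }
}

/* Scalar model of a zero-masking vcompressb whose mask selects byte 0 of
   every 64-bit lane: the 8 selected bytes are packed into the low 64 bits
   and everything above them is zeroed. */
static void model_compress_low_bytes(const uint64_t lanes[8], uint8_t out[64])
{
    for (size_t i = 0; i < 64; i++)
        out[i] = 0;
    for (size_t i = 0; i < 8; i++)
        out[i] = (uint8_t)(lanes[i] & 0xff);
}

/* Hypothetical helper: sum of 64 bytes modulo 256 using the sequence
   vpsadbw -> vcompressb -> vpsadbw, modelled in scalar code. */
uint8_t model_byte_sum_512(const uint8_t v[64])
{
    uint64_t partial[8];
    uint8_t packed[64];
    uint64_t total[8];

    model_sad_vs_zero(v, partial, 8);           /* step 1: per-lane sums     */
    model_compress_low_bytes(partial, packed);  /* step 2: gather low bytes  */
    model_sad_vs_zero(packed, total, 8);        /* step 3: sum the 8 bytes   */

    /* Only the low byte is meaningful: step 2 discarded the high bits. */
    return (uint8_t)total[0];
}

/* Plain reference reduction, for comparison. */
uint8_t reference_byte_sum(const uint8_t *v, size_t n)
{
    uint8_t s = 0;
    for (size_t i = 0; i < n; i++)
        s = (uint8_t)(s + v[i]);
    return s;
}
```

If the full 16-bit (or wider) sum were needed instead, the compress step would have to keep two bytes per lane and the final vpsadbw would no longer be enough, so this shape only fits the byte-result case.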