https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115843
--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
Hmm, interesting.  We even vectorize this with just -mavx512f but end up using
vector(16) int besides vector(8) long and equality compares of vector(16) int:

  vpcmpd $0, %zmm7, %zmm0, %k2

which according to the docs is fine with AVX512F.  But then for both long and
double you need byte masks, so I wonder why kmovb isn't in AVX512F ...  I will
adjust the testcase to use only AVX512F and push the fix now.  I can't
reproduce the runfail in a different worktree.

Note I don't see all-zero masks, but

  vect_patt_22.11_6 = .MASK_LOAD (&MEM <BITBOARD[64]> [(void *)&KingSafetyMask1 + 8B], 64B, { -1, 0, 0, 0, 0, 0, 0, 0 });

could be optimized to movq $mem, %zmmN (that applies whenever just a single
element, or a power-of-two number of initial elements, is read).  I'm not sure
whether the corresponding

  vect_patt_20.17_34 = .MASK_LOAD (&MEM <BITBOARD[64]> [(void *)&KingSafetyMask1 + -8B], 64B, { 0, 0, 0, 0, 0, 0, 0, -1 });

is worth optimizing to xor %zmmN, %zmmN plus pinsr $MEM, %zmmN.  Eliding
constant masks might help to avoid STLF issues due to false dependences on
masked-out elements (IIRC all uarchs currently suffer from that).  Note that
even all-zero masks cannot be optimized on GIMPLE currently, since the value
of the masked-out lanes isn't well-defined there (we're working on that).
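For reference, a minimal sketch of the movq transform suggested for the first
.MASK_LOAD above, written with AVX-512F intrinsics.  This is not the PR's
testcase: the array name 'table', the offset and the function names are made
up for illustration, and _mm512_zextsi128_si512 is assumed to be available
(recent GCC headers provide it).

/* Sketch of the suggestion: with constant mask 0x01 only lane 0 is read, so
   the masked load can be done as a plain 8-byte load zero-extended into the
   zmm register, i.e. a single vmovq.  Compile with -O2 -mavx512f.  */
#include <immintrin.h>

extern long long table[64];            /* stand-in for KingSafetyMask1 */

/* What the vectorizer emits: a masked load with constant mask 0x01; the
   maskz form zeroes lanes 1..7.  */
__m512i masked_form (void)
{
  return _mm512_maskz_loadu_epi64 (0x01, &table[1]);
}

/* The suggested replacement: a scalar 8-byte load zero-extended to 512 bits,
   touching only the one element the mask allows, so there is no false
   store-to-load-forwarding dependence on the masked-out bytes.  */
__m512i movq_form (void)
{
  return _mm512_zextsi128_si512 (_mm_loadl_epi64 ((const __m128i *) &table[1]));
}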