https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91198
--- Comment #4 from Moritz Kreutzer <moritz.kreutzer at siemens dot com> ---
> How would a vectorized version with the intrinsic look like?

Something along the lines of (assuming insize is a multiple of 16):

========================================================
__mmask16 mask;
__m512 vin;
__m512 const thr = _mm512_set1_ps(threshold);
int o = 0;
for (int i = 0; i < insize; i+=16) {
  vin = _mm512_loadu_ps(&input[i]);
  mask = _mm512_cmplt_ps_mask(vin, thr);
  _mm512_mask_compressstoreu_ps(&output[o], mask, vin);
  o += __builtin_popcount(_mm512_mask2int(mask));
}
*outsize = o;
========================================================

I don't really understand your other two questions, but maybe the intrinsics
code will help.
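
For comparison, a scalar loop with the same effect as the intrinsics code above could be sketched roughly as follows. The function name compress_below_threshold is hypothetical and only there to make the snippet self-contained; the parameter names are taken from the variables used above, and their types are assumed (float data, int sizes).

```c
/* Scalar sketch of the compress-store loop.
 * The function name is hypothetical; the parameter types are assumptions
 * inferred from the intrinsics snippet (float elements, int counts). */
void compress_below_threshold(const float *input, int insize, float threshold,
                              float *output, int *outsize)
{
    int o = 0;
    for (int i = 0; i < insize; i++) {
        /* Keep only elements strictly below the threshold, packed
         * contiguously and in their original order. */
        if (input[i] < threshold)
            output[o++] = input[i];
    }
    *outsize = o;
}
```

Unlike the intrinsics version, this sketch does not need insize to be a multiple of 16. In both versions, output must have room for up to insize elements.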