https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91198
--- Comment #5 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 19 Jul 2019, moritz.kreutzer at siemens dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91198
>
> --- Comment #4 from Moritz Kreutzer <moritz.kreutzer at siemens dot com> ---
> > How would a vectorized version with the intrinsic look like?
>
> Something along the lines of (assuming insize is a multiple of 16):
>
> ========================================================
>   __mmask16 mask;
>   __m512 vin;
>   __m512 const thr = _mm512_set1_ps(threshold);
>   int o = 0;
>   for (int i = 0; i < insize; i+=16) {
>     vin = _mm512_loadu_ps(&input[i]);
>     mask = _mm512_cmplt_ps_mask(vin, thr);
>     _mm512_mask_compressstoreu_ps(&output[o], mask, vin);
>     o += __builtin_popcount(_mm512_mask2int(mask));
>   }
>   *outsize = o;
> ========================================================
>
>
> I don't really understand your other two questions, but maybe the intrinsics
> code will help.

Yeah, it helps.  I missed

  o += __builtin_popcount(_mm512_mask2int(mask));

so the output vector address is computed via a reduction over the
number of 'true' conditions met.
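
To make that dependence explicit, here is a rough scalar model of the quoted loop, written so it runs without AVX-512 hardware. It is only a sketch: the function name compress_lt and the VLEN macro are made up for illustration and do not appear in the bug report, and it assumes GCC or Clang for __builtin_popcount. Each outer iteration stands in for one 16-lane vector step, and the store offset o for the next step depends on the popcount of every previous mask.

```c
/* Hypothetical scalar model of the quoted intrinsics loop.
   VLEN and compress_lt are illustrative names, not from the report.  */
#define VLEN 16

static void
compress_lt (const float *input, int insize, float threshold,
             float *output, int *outsize)
{
  int o = 0;

  /* Assumes insize is a multiple of VLEN, as in the quoted code.  */
  for (int i = 0; i < insize; i += VLEN)
    {
      /* Models _mm512_cmplt_ps_mask: bit l is set when lane l
         is below the threshold.  */
      unsigned mask = 0;
      for (int l = 0; l < VLEN; l++)
        if (input[i + l] < threshold)
          mask |= 1u << l;

      /* Models _mm512_mask_compressstoreu_ps: the selected lanes are
         written contiguously, in lane order, starting at &output[o].  */
      int k = 0;
      for (int l = 0; l < VLEN; l++)
        if (mask & (1u << l))
          output[o + k++] = input[i + l];

      /* The reduction: the output index advances by the number of
         'true' lanes, so it is loop-carried across vector steps.  */
      o += __builtin_popcount (mask);
    }

  *outsize = o;
}
```

With this model, an input of 0..31 and a threshold of 20 should leave the first 20 values in output and set *outsize to 20.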