https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91198

--- Comment #4 from Moritz Kreutzer <moritz.kreutzer at siemens dot com> ---
> How would a vectorized version with the intrinsic look like?

Something along the lines of (assuming insize is a multiple of 16):

========================================================
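/* Needs #include <immintrin.h> and AVX-512F (e.g. -mavx512f). */
__mmask16 mask;
__m512 vin;
__m512 const thr = _mm512_set1_ps(threshold);  /* broadcast threshold to all 16 lanes */
int o = 0;                                     /* next free slot in output */
for (int i = 0; i < insize; i+=16) {
  vin = _mm512_loadu_ps(&input[i]);
  mask = _mm512_cmplt_ps_mask(vin, thr);       /* bit k set if input[i+k] < threshold */
  /* store only the selected lanes, packed contiguously, in order */
  _mm512_mask_compressstoreu_ps(&output[o], mask, vin);
  o += __builtin_popcount(_mm512_mask2int(mask));  /* advance by the number of elements stored */
}
*outsize = o;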
========================================================
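
For comparison, here is a scalar loop that I would expect to produce the same
output as the intrinsics code above. This is only a sketch: the function name
filter_below_scalar and the exact parameter types are my assumptions, chosen to
mirror the names used in the snippet, and are not taken from the bug report.

```c
/* Scalar reference (hypothetical helper): copy every element that is
   below `threshold` to `output`, preserving order, and report how many
   elements were written. */
static void filter_below_scalar(const float *input, int insize,
                                float threshold,
                                float *output, int *outsize)
{
    int o = 0;
    for (int i = 0; i < insize; i++) {
        /* same predicate as _mm512_cmplt_ps_mask(vin, thr) */
        if (input[i] < threshold)
            output[o++] = input[i];
    }
    *outsize = o;
}
```

Unlike the vector loop, this version does not need insize to be a multiple of 16.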


I don't really understand your other two questions, but maybe the intrinsics
code will help.
