https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91198

--- Comment #5 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 19 Jul 2019, moritz.kreutzer at siemens dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91198
> 
> --- Comment #4 from Moritz Kreutzer <moritz.kreutzer at siemens dot com> ---
> > What would a vectorized version with the intrinsic look like?
> 
> Something along the lines of (assuming insize is a multiple of 16):
> 
> ========================================================
> __mmask16 mask;
> __m512 vin;
> __m512 const thr = _mm512_set1_ps(threshold);
> int o = 0;
> for (int i = 0; i < insize; i+=16) {
>   vin = _mm512_loadu_ps(&input[i]);
>   mask = _mm512_cmplt_ps_mask(vin, thr);
>   _mm512_mask_compressstoreu_ps(&output[o], mask, vin);
>   o += __builtin_popcount(_mm512_mask2int(mask));
> }
> *outsize = o;
> ========================================================
> 
> 
> I don't really understand your other two questions, but maybe the intrinsics
> code will help.

Yeah, it helps.  I missed

  o += __builtin_popcount(_mm512_mask2int(mask));

so the output store address is computed via a reduction (a running sum)
over the number of lanes in which the condition is 'true'.
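
To make that dependency explicit, here is a minimal scalar sketch of the
same loop, assuming (as in comment #4) that insize is a multiple of 16.
The names compress_block16 and compress_lt are hypothetical helpers made
up for illustration; they are not part of GCC or the intrinsics headers,
and the per-block helper only models the mask compare and compress-store
in plain C.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 16-lane step (hypothetical helper):
   build the compare mask, store the selected lanes contiguously,
   and return how many lanes were selected. */
static int compress_block16(const float *in, float threshold, float *out)
{
    uint16_t mask = 0;

    /* Models _mm512_cmplt_ps_mask: bit 'lane' is set if in[lane] < threshold. */
    for (int lane = 0; lane < 16; lane++)
        if (in[lane] < threshold)
            mask |= (uint16_t)(1u << lane);

    /* Models _mm512_mask_compressstoreu_ps: selected lanes are packed
       to the front of 'out' in lane order. */
    int k = 0;
    for (int lane = 0; lane < 16; lane++)
        if (mask & (1u << lane))
            out[k++] = in[lane];

    /* Number of 'true' conditions in this block (equals k). */
    return __builtin_popcount(mask);
}

/* Hypothetical driver: the output offset 'o' for each block is the sum
   of the popcounts of all earlier blocks, i.e. a reduction carried
   across iterations. */
void compress_lt(const float *input, int insize, float threshold,
                 float *output, int *outsize)
{
    assert(insize % 16 == 0);   /* assumption taken from comment #4 */

    int o = 0;
    for (int i = 0; i < insize; i += 16)
        o += compress_block16(&input[i], threshold, &output[o]);
    *outsize = o;
}
```

The point of the sketch is only that &output[o] in iteration i depends on
the popcounts of all previous iterations, so the store address is not an
affine function of i.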
