Zach,

Which implementation are you comparing against? I assume you are
comparing with C++, right? And which compiler are you using?

The key to performance in your case is the exp() call. If the C++ compiler
uses the SVML library, that would explain the difference. ISPC scalarizes the
exp() call by default. It looks like we can do a little better by using the
llvm.exp.v4f64() intrinsic, but that would still be worse than SVML. ISPC
does support generating SVML calls, but you'll need ICC to link the program
in that case.
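For reference, here is a rough scalar C++ sketch of what I assume your first
kernel looks like on the C++ side (the function name is my guess). With ICC,
or with clang/gcc plus -fveclib=SVML and the SVML library available, the
exp() in a loop like this can be turned into one SVML vector call per AVX2
vector instead of four scalar libm calls:

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical scalar equivalent of update_frequencies_sum_n().
// A vectorizing compiler with SVML can replace the std::exp() call
// in this loop with a packed SVML call (e.g. 4 doubles at a time
// on AVX2); ISPC instead scalarizes exp() by default.
double update_frequencies_sum_n_scalar(std::size_t len, double* n_arr,
                                       const double* w_arr,
                                       double growth_factor) {
    double sum_n = 0.0;
    for (std::size_t i = 0; i < len; ++i) {
        n_arr[i] *= std::exp(w_arr[i] * growth_factor);
        sum_n += n_arr[i];
    }
    return sum_n;
}
```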

Also, what hardware are you running on? My experiments on a Core i9-9980HK
(an AVX2 machine) show that the avx2-i64x4 target is more beneficial than
avx2-i32x8. Note that this does not affect precision - all computations are
still done using the types specified by the program - but masking is
optimized for either 32-bit or 64-bit types.

Can you share a more complete example (i.e., including the C++ part of the
program), so we are on the same page when discussing the measurements?

Dmitry.

On Mon, Aug 3, 2020 at 5:51 AM Zach Matson <[email protected]>
wrote:

> I am tuning a population genetics simulation, and after replacing the two
> main computational kernels (formerly written with intrinsics and SLEEF)
> with ISPC code, my reference benchmark regressed from ~4.5s to ~19s.
>
> Here is the ISPC code:
>
>
> // Update the frequencies performing n *= exp(w * growth_factor) on count 
> elements
> export uniform double update_frequencies_sum_n(
>     uniform uint len,
>     uniform double n_arr[],
>     uniform const double w_arr[],
>     uniform double growth_factor
> ) {
>     double sum_n = 0;
>
>     foreach (i = 0 ... len) {
>         double temp = w_arr[i] * growth_factor;
>         temp = exp(temp);
>         n_arr[i] *= temp;
>         sum_n += n_arr[i];
>     }
>
>     return reduce_add(sum_n);
> }
>
> /// Normalize n to sum of 1 and sum w weighted by n
> export uniform double renormalize_n_weighted_sum_w(
>     uniform uint len,
>     uniform double n_arr[],
>     uniform const double w_arr[],
>     uniform double sum_n
> ) {
>     double sum_w = 0;
>
>     foreach (i = 0 ... len) {
>         n_arr[i] /= sum_n;
>         sum_w += n_arr[i] * w_arr[i];
>     }
>
>     return reduce_add(sum_w);
> }
>
>
> Both this and the original kernels with SLEEF + intrinsics were using AVX2
> (i64x4 settings for ISPC/SLEEF). The original kernels included a scalar
> peel loop for alignment, a pure vector loop, and a scalar remainder loop to
> process the leftover elements.
> I noticed that the performance difference was much milder when using floats
> instead of doubles for the processed data (~1.2s for the original kernels
> switched to floats vs. ~2s for ISPC), but the higher precision is needed
> here, and the performance is still much worse with the ISPC code.
> Is there some performance optimization I'm missing in this ISPC code?
>
> --
> You received this message because you are subscribed to the Google Groups
> "Intel SPMD Program Compiler Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/ispc-users/e5303487-9c0c-40f9-91a4-51a1b93f0bb0o%40googlegroups.com
> .
>
