https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79151
--- Comment #5 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #3)

> The question is of course whether vector division has comparable latency /
> throughput as the scalar one.

Here's a test case on a rather old CPU, a Core 2 Q8200:

$ cat foo.c
#include <stdio.h>

double foo(double a, double b)
#ifdef SCALAR
{
  return 1/a + 1/b;
}
#else
{
  typedef double v2do __attribute__((vector_size (16)));
  v2do x, y;

  x[0] = a;
  x[1] = b;
  y = 1/x;
  return y[0] + y[1];
}
#endif

#define NMAX 1000000000

int main()
{
  double a, b, s;

  s = 0.0;
  for (a=1.0; a<NMAX+0.5; a++)
    {
      b = a+0.5;
      s += foo(a,b);
    }
  printf("%f\n", s);
}

$ gcc -DSCALAR -O3 foo.c && time ./a.out
41.987257

real    0m19,508s
user    0m19,500s
sys     0m0,000s

$ gcc -O3 foo.c && time ./a.out
41.987257

real    0m9,146s
user    0m9,140s
sys     0m0,000s
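
One way to check what the two builds actually emit (not measured above, just a
suggestion for inspecting the generated code) would be:

$ gcc -DSCALAR -O3 -S -o - foo.c | grep div
$ gcc -O3 -S -o - foo.c | grep div

Presumably the scalar build shows two divsd per call and the vectorized build a
single divpd, which would fit the roughly 2x difference in runtime.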