https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79151
--- Comment #5 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #3)

> The question is of course whether vector division has comparable latency /
> throughput as the scalar one.

Here's a test case on a rather old CPU, a Core 2 Q8200:

$ cat foo.c
#include <stdio.h>

double foo(double a, double b)
#ifdef SCALAR
{
  return 1/a + 1/b;
}
#else
{
  typedef double v2do __attribute__((vector_size (16)));
  v2do x, y;

  x[0] = a;
  x[1] = b;
  y = 1/x;
  return y[0] + y[1];
}
#endif

#define NMAX 1000000000

int main()
{
  double a, b, s;

  s = 0.0;
  for (a=1.0; a<NMAX+0.5; a++)
    {
      b = a+0.5;
      s += foo(a,b);
    }
  printf("%f\n", s);
}

$ gcc -DSCALAR -O3 foo.c && time ./a.out
41.987257

real    0m19,508s
user    0m19,500s
sys     0m0,000s

$ gcc -O3 foo.c && time ./a.out
41.987257

real    0m9,146s
user    0m9,140s
sys     0m0,000s
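
One way to check what the two builds actually emit (not measured above, just a
suggestion for inspecting the generated code) would be:

$ gcc -DSCALAR -O3 -S -o - foo.c | grep div
$ gcc -O3 -S -o - foo.c | grep div

Presumably the scalar build shows two divsd per call and the vectorized build a
single divpd, which would fit the roughly 2x difference in runtime.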