https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118818

            Bug ID: 118818
           Summary: Optimization of divps to rcpps + newton can cause slow
                    down
           Product: gcc
           Version: 14.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: benjamin.meier70 at gmail dot com
  Target Milestone: ---

Hey

I work a lot with SSE-vectorized code, mainly with floats.

gcc optimizes most of the code very well. When I compute reciprocals, I've
noticed that it replaces `divps` with `rcpps` plus a Newton-Raphson step. It
looks like a smart optimization, but on many machines it's actually slower
than `divps`.
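
For reference, here is a rough sketch of the two variants written by hand with intrinsics. The helper names (`recip_div`, `recip_rcp_nr`, `max_rel_diff`) are made up for illustration, and the exact instruction sequence gcc emits may be arranged differently; the refinement shown is only the textbook Newton-Raphson step r1 = r0 * (2 - x * r0).

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Exact reciprocal: a single divps. */
static inline __m128 recip_div(__m128 x)
{
    return _mm_div_ps(_mm_set1_ps(1.0f), x);
}

/* Approximate reciprocal: rcpps followed by one Newton-Raphson step,
   r1 = r0 * (2 - x * r0). Illustrative only; not gcc's exact output. */
static inline __m128 recip_rcp_nr(__m128 x)
{
    __m128 r0 = _mm_rcp_ps(x);
    __m128 t = _mm_sub_ps(_mm_set1_ps(2.0f), _mm_mul_ps(x, r0));
    return _mm_mul_ps(r0, t);
}

/* Largest relative difference between the two variants over a few
   hand-picked positive inputs. */
float max_rel_diff(void)
{
    const float in[8] = {1.0f, 3.0f, 7.5f, 10.0f, 123.456f, 1e-3f, 1e6f, 0.3f};
    float worst = 0.0f;
    for (size_t i = 0; i < 8; i += 4) {
        float a[4], b[4];
        __m128 x = _mm_loadu_ps(&in[i]);
        _mm_storeu_ps(a, recip_div(x));
        _mm_storeu_ps(b, recip_rcp_nr(x));
        for (int k = 0; k < 4; ++k) {
            float rel = (a[k] - b[k]) / a[k];
            if (rel < 0.0f)
                rel = -rel;
            if (rel > worst)
                worst = rel;
        }
    }
    return worst;
}
```

The refined result replaces one divide with an `rcpps`, two multiplies and a subtract. `max_rel_diff()` should return a small value (well below 1e-5) for these inputs, so the two variants differ mainly in speed rather than in usable precision.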

The following test program reproduces this:
------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdint.h>
#include <math.h>
#include <xmmintrin.h>
#include <unistd.h>

#define N     (1024 * 16)

#define FORCE_INLINE inline __attribute__((always_inline))

__attribute__((aligned(16))) float y[N] = {0};

static FORCE_INLINE __m128 inverse(__m128 x, __m128 one)
{
    // good old 1.0f/x
    return _mm_div_ps(
        one,
        x
    );
}

__attribute__((noinline))
void f(const float *restrict in, float *restrict out)
{
    const __m128 one = _mm_set1_ps(1.0f);
    for (size_t i = 0; i < N; i += 4)
    {
        __m128 v_in = _mm_load_ps(&in[i]);
        __m128 v_out = inverse(v_in, one);
        _mm_store_ps(&out[i], v_out);
    }
}

uint64_t takeMonotonicTimestampNs()
{
    struct timespec tv_start;
    clock_gettime(CLOCK_MONOTONIC_RAW, &tv_start);
    // widen before multiplying so the result cannot overflow a 32-bit long
    return ((uint64_t)tv_start.tv_sec * 1000000000u) + (uint64_t)tv_start.tv_nsec;
}

void test_lat(const float *restrict values) {

  f(values, y);
  uint64_t tsa = takeMonotonicTimestampNs();
  for (int i = 0; i < 100000; ++i)
  {
      f(values, y);
  }
  uint64_t tsb = takeMonotonicTimestampNs();

  printf("%.10f\n", y[N - 1]);
  printf("%.3fms (slow)\n", (tsb - tsa) / 1e6);
}


int main()
{
    // generate some "random" inputs
    srand(0);
    float *values = aligned_alloc(16, N * sizeof(values[0]));
    for (int i = 0; i < N; ++i) {
        // convert to float first: the int product would overflow
        values[i] = ((float)rand() + 1.0f) * ((float)rand() + 1.0f);
    }

    while (1) {
        test_lat(values);
    }
}
------------------------------

Compile with `divps`: gcc -O3 ./main.c -msse4.2

Compile with `rcpps` + newton: gcc -Ofast ./main.c -msse4.2

With `divps` it's about 25% faster (tested on an `Intel(R) Xeon(R) Platinum
8275CL` CPU).

Can this specific optimization be disabled? I mean only the replacement of
div by rcp plus Newton. In general gcc's optimizations work very well, so I
would rather not disable anything else.

Also, is there a reason why this optimization is still used? I assume it was
faster at some point, but maybe that's no longer the case. I can also see
that icx does not do this optimization.

Thanks a lot
