https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118818
Bug ID: 118818
Summary: Optimization of divps to rcpps + newton can cause slow down
Product: gcc
Version: 14.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: benjamin.meier70 at gmail dot com
Target Milestone: ---

Hey,

I work a lot with SSE-vectorized code, mainly with floats. GCC optimizes most of the code very well. When I compute reciprocals, I've noticed that it replaces `divps` with `rcpps` + a Newton step. It seems to be a smart optimization, but on many machines it's actually slower than `divps`.

E.g. the following test program can be used to test that:

------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdint.h>
#include <math.h>
#include <xmmintrin.h>
#include <unistd.h>

#define N (1024 * 16)
#define FORCE_INLINE inline __attribute__((always_inline))

__attribute__((aligned(16))) float y[N] = {0};

static FORCE_INLINE __m128 inverse(__m128 x, __m128 one) {
    // good old 1.0f/x
    return _mm_div_ps( one, x );
}

__attribute__((noinline))
void f(const float *restrict in, float *restrict out) {
    const __m128 one = _mm_set1_ps(1.0f);
    for (size_t i = 0; i < N; i += 4) {
        __m128 v_in = _mm_load_ps(&in[i]);
        __m128 v_out = inverse(v_in, one);
        _mm_store_ps(&out[i], v_out);
    }
}

unsigned long takeMonotonicTimestampNs() {
    struct timespec tv_start;
    clock_gettime(CLOCK_MONOTONIC_RAW, &tv_start);
    return ((tv_start.tv_sec * 1000000000) + tv_start.tv_nsec);
}

void test_lat(const float *restrict values) {
    f(values, y);
    uint64_t tsa = takeMonotonicTimestampNs();
    for (int i = 0; i < 100000; ++i) {
        f(values, y);
    }
    uint64_t tsb = takeMonotonicTimestampNs();
    printf("%.10f\n", y[N - 1]);
    printf("%.3fms (slow)\n", (tsb - tsa) / 1e6);
}

int main() {
    // generate some "random" inputs
    srand(0);
    float *values = aligned_alloc(16, N * sizeof(values[0]));
    for (int i = 0; i < N; ++i) {
        values[i] = (rand() + 1) * (rand() + 1);
    }
    while (1) {
        test_lat(values);
    }
}
------------------------------
Compile with `divps`:
gcc -O3 ./main.c -msse4.2

Compile with `rcpps` + newton:
gcc -Ofast ./main.c -msse4.2

With `divps` it's about 25% faster (tested on an Intel(R) Xeon(R) Platinum 8275CL CPU).

Can this specific optimization be disabled? I mean only the one where the division gets replaced by rcp plus a Newton step. In general GCC's optimizations work very well, so I'd rather not disable anything else.

Also, is there a reason why this optimization is still used? I believe it was faster at some point, but maybe that's not the case anymore? I can also see that icx does not do this optimization.

Thanks a lot