[Bug middle-end/107304] New: internal compiler error: in convert_move, at expr.cc:220 with -march=tigerlake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107304

            Bug ID: 107304
           Summary: internal compiler error: in convert_move, at expr.cc:220 with -march=tigerlake
           Product: gcc
           Version: 12.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: colin.king at intel dot com
  Target Milestone: ---

Compiling stress-ng using CFLAGS=-march=tigerlake:

    git clone https://github.com/ColinIanKing/stress-ng
    cd stress-ng
    make clean
    CFLAGS=-march=tigerlake make -j 8
    make stress-ng VERBOSE=
    make[1]: Entering directory '/home/cking/repos/stress-ng'
    CC stress-vecshuf.c
    during RTL pass: expand
    stress-vecshuf.c: In function 'stress_vecshuf_u128_4.arch_alderlake':
    stress-vecshuf.c:107:39: internal compiler error: in convert_move, at expr.cc:220
      107 | static double TARGET_CLONES OPTIMIZE3 stress_vecshuf_ ## tag ## _ ## elements ( \
          |                                       ^~~
    stress-vecshuf.c:139:1: note: in expansion of macro 'STRESS_VEC_SHUFFLE'
      139 | STRESS_VEC_SHUFFLE(u128, 4)
          | ^~
    0x7f47b0cfb209 __libc_start_call_main
            ../sysdeps/nptl/libc_start_call_main.h:58
    0x7f47b0cfb2bb __libc_start_main_impl
            ../csu/libc-start.c:389
    Please submit a full bug report, with preprocessed source (by using -freport-bug).
    Please include the complete backtrace with any bug report.
    See for instructions.
    make[1]: *** [Makefile:504: stress-vecshuf.o] Error 1
    make[1]: Leaving directory '/home/cking/repos/stress-ng'
    make: *** [Makefile:488: all] Error 2

Without CFLAGS=-march=tigerlake it builds fine.
[Bug middle-end/107304] internal compiler error: in convert_move, at expr.cc:220 with -march=tigerlake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107304

--- Comment #1 from Colin Ian King ---
See: https://github.com/ColinIanKing/stress-ng/issues/235
[Bug middle-end/107304] internal compiler error: in convert_move, at expr.cc:220 with -march=tigerlake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107304

--- Comment #2 from Colin Ian King ---
Created attachment 53724
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53724&action=edit
preprocessed source that can be compiled to show bug

This is the pre-processed output from stress-vecshuf.c; compiling it will trigger the issue:

    gcc-12 -c stress-vecshuf-post-cpp.c -march=tigerlake
[Bug middle-end/107304] internal compiler error: in convert_move, at expr.cc:220 with -march=tigerlake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107304

--- Comment #3 from Colin Ian King ---
Created attachment 53725
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53725&action=edit
Got the issue down to a small reproducer

I ran the code through gcc -E, hacked out the irrelevant code, and got it down to the smallest reproducer:

    gcc-12 -c stress-vecshuf-repro-small.c -march=tigerlake

Notes:

Removing "arch=alderlake" from target clones makes the issue disappear.
Reducing the for loop to a small number of iterations (e.g. 4) makes the issue disappear too.
Compiling without -march=tigerlake makes the issue disappear too.
[Bug c/114987] New: floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987

            Bug ID: 114987
           Summary: floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: colin.king at intel dot com
  Target Milestone: ---

Created attachment 58126
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58126&action=edit
reproducer.c source code

I'm seeing a ~10% performance regression in gcc-14 compared to gcc-13, using gcc on Ubuntu 24.04.

Versions:

    gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4)
    gcc version 14.0.1 20240412 (experimental) [master r14-9935-g67e1433a94f] (Ubuntu 14-20240412-0ubuntu1)

    cking@skylake:~$ CFLAGS="" gcc-13 reproducer.c; ./a.out
    4.92 secs duration, 2130.379 Mfp-ops/sec
    cking@skylake:~$ CFLAGS="" gcc-14 reproducer.c; ./a.out
    5.46 secs duration, 1921.799 Mfp-ops/sec

The original issue appeared when regression testing the stress-ng vecfp stressor [1] using the floating point vector 16 add stressor method. I've managed to extract the attached reproducer (reproducer.c) from the original code.

Salient points to focus on:

1. The issue is dependent on the OPTIMIZE3 macro in the reproducer being __attribute__((optimize("-O3")))
2. The issue is also dependent on the TARGET_CLONES macro being defined as __attribute__((target_clones("mmx,avx,default"))) - the avx target clone seems to be needed to reproduce this problem.

Attached are the reproducer.c C source and disassembled object code. The stress_vecfp_float_add_16.avx code from gcc-13 is significantly different from the gcc-14 code.

References:

[1] https://github.com/ColinIanKing/stress-ng/blob/master/stress-vecfp.c
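For readers without the attachment, the two attribute macros named in the report could be combined roughly as follows. This is only a minimal sketch, not the attached reproducer: the vec_float_16 type, the vecfp_float_add_16 function and its loop body are hypothetical stand-ins, and the macros are assumed to expand to nothing on toolchains other than GCC on x86-64 with glibc.

```c
#include <assert.h>

/* Attribute macros as quoted in the report; assumed empty on other toolchains. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) && defined(__GLIBC__)
#define OPTIMIZE3     __attribute__((optimize("-O3")))
#define TARGET_CLONES __attribute__((target_clones("mmx,avx,default")))
#else
#define OPTIMIZE3
#define TARGET_CLONES
#endif

/* Hypothetical 16-element float vector using the GCC vector extension. */
typedef float vec_float_16 __attribute__((vector_size(16 * sizeof(float))));

/*
 * Hypothetical stand-in for the vector add loop: start with r[i] = i,
 * add 0.5 to every element `loops` times, return the sum of all elements.
 */
float TARGET_CLONES OPTIMIZE3 vecfp_float_add_16(int loops)
{
    vec_float_16 r, add;
    float sum = 0.0f;
    int i;

    assert(loops >= 0);
    for (i = 0; i < 16; i++) {
        r[i] = (float)i;
        add[i] = 0.5f;
    }
    for (i = 0; i < loops; i++)
        r += add;               /* element-wise vector add */
    for (i = 0; i < 16; i++)
        sum += r[i];
    return sum;
}
```

With target_clones in effect, GCC emits one clone per listed target plus a resolver that picks a clone at load time, which is why the disassembly contains a symbol such as stress_vecfp_float_add_16.avx.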
[Bug c/114987] floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987

--- Comment #1 from Colin Ian King ---
Created attachment 58127
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58127&action=edit
gcc-13 disassembly
[Bug c/114987] floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987

--- Comment #2 from Colin Ian King ---
Created attachment 58128
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58128&action=edit
gcc-14 disassembly
[Bug c/114987] floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987

--- Comment #3 from Colin Ian King ---
perf report from gcc-13 of the stress_vecfp_float_add_16.avx compute loop:

    57.93 │200:   vaddps  0xc0(%rsp),%ymm3,%ymm5
    11.11 │       vaddps  0xe0(%rsp),%ymm2,%ymm6
     0.02 │       vmovaps %ymm5,0x60(%rsp)
     2.92 │       mov     0x60(%rsp),%rax
          │       mov     0x68(%rsp),%rdx
     0.37 │       vmovaps %ymm6,0x40(%rsp)
          │       vmovaps %ymm5,0x80(%rsp)
     6.30 │       vmovq   %rax,%xmm1
     4.11 │       mov     0x40(%rsp),%rax
          │       vmovdqa 0x90(%rsp),%xmm5
          │       vmovaps %ymm6,0xa0(%rsp)
     3.27 │       vpinsrq $0x1,%rdx,%xmm1,%xmm1
          │       mov     0x48(%rsp),%rdx
          │       vmovdqa 0xb0(%rsp),%xmm6
     3.22 │       vmovdqa %xmm1,0xc0(%rsp)
     0.42 │       vmovq   %rax,%xmm0
          │       vmovdqa %xmm5,0xd0(%rsp)
     6.80 │       vpinsrq $0x1,%rdx,%xmm0,%xmm0
     3.52 │       vmovdqa %xmm0,0xe0(%rsp)
          │       vmovdqa %xmm6,0xf0(%rsp)
          │       sub     $0x1,%ecx
          │     ↑ jne     200

perf report from gcc-14 of the stress_vecfp_float_add_16.avx compute loop:

    65.79 │200:   vaddps  0xc0(%rsp),%ymm3,%ymm5
     3.26 │       vaddps  0xe0(%rsp),%ymm2,%ymm6
     0.00 │       vmovaps %ymm5,0x60(%rsp)
     9.25 │       mov     0x60(%rsp),%rax
     0.00 │       mov     0x68(%rsp),%rdx
          │       vmovaps %ymm6,0x40(%rsp)
          │       vmovaps %ymm5,0x80(%rsp)
     6.49 │       vmovq   %rax,%xmm1
     0.00 │       mov     0x40(%rsp),%rax
     0.00 │       vmovaps %ymm6,0xa0(%rsp)
     3.02 │       vpinsrq $0x1,%rdx,%xmm1,%xmm1
          │       mov     0x48(%rsp),%rdx
     0.35 │       vmovdqa %xmm1,0xc0(%rsp)
     0.68 │       vmovq   %rax,%xmm0
     0.00 │       vmovdqa 0x90(%rsp),%xmm1
     5.18 │       vpinsrq $0x1,%rdx,%xmm0,%xmm0
     3.00 │       vmovdqa %xmm0,0xe0(%rsp)
          │       vmovdqa 0xb0(%rsp),%xmm0
          │       vmovdqa %xmm1,0xd0(%rsp)
          │       vmovdqa %xmm0,0xf0(%rsp)
          │       sub     $0x1,%ecx
     2.94 │     ↑ jne     200
[Bug c/115002] New: wide integer vector performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115002

            Bug ID: 115002
           Summary: wide integer vector performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: colin.king at intel dot com
  Target Milestone: ---

Created attachment 58138
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58138&action=edit
reproducer source code

I'm seeing a ~1.5% performance regression in gcc-14 compared to gcc-13, using gcc on Ubuntu 24.04.

Versions:

    gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4)
    gcc version 14.0.1 20240412 (experimental) [master r14-9935-g67e1433a94f] (Ubuntu 14-20240412-0ubuntu1)

    cking@skylake:~$ CFLAGS="" gcc-13 reproducer-vecwide.c -O2 -Wall
    cking@skylake:~$ ./a.out
    7615.58 vint8w2048_t ops per sec, duration = 13.13 secs
    cking@skylake:~$ CFLAGS="" gcc-14 reproducer-vecwide.c -O2 -Wall
    cking@skylake:~$ ./a.out
    7489.42 vint8w2048_t ops per sec, duration = 13.35 secs

The original issue appeared when regression testing the stress-ng vecwide stressor [1]. I've managed to extract the attached reproducer from the original code (see attached).

Salient point to focus on:

1. The issue is dependent on the TARGET_CLONES macro being defined as __attribute__((target_clones("avx,default"))) - the avx target clone seems to be needed to reproduce this problem.

Attached are the reproducer C source and disassembled object code.

References:

[1] https://github.com/ColinIanKing/stress-ng/blob/master/stress-vecwide.c
[Bug target/115002] [14/15 regression] wide integer vector performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115002

--- Comment #1 from Colin Ian King ---
Created attachment 58139
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58139&action=edit
gcc-13 disassembly
[Bug target/115002] [14/15 regression] wide integer vector performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115002

--- Comment #2 from Colin Ian King ---
Created attachment 58140
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58140&action=edit
gcc-14 disassembly
[Bug target/115002] [14/15 regression] wide integer vector performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115002

--- Comment #3 from Colin Ian King ---
Created attachment 58141
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58141&action=edit
perf output of stress_vecwide_2048 for gcc-13 compiled code
[Bug target/115002] [14/15 regression] wide integer vector performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115002

--- Comment #4 from Colin Ian King ---
Created attachment 58142
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58142&action=edit
perf output of stress_vecwide_2048 for gcc-14 compiled code
[Bug target/115024] New: 128 bit division performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115024

            Bug ID: 115024
           Summary: 128 bit division performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: colin.king at intel dot com
  Target Milestone: ---

Created attachment 58158
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58158&action=edit
reproducer source code for __int128_t division regression

I'm seeing a 5% performance regression in gcc-14 compared to gcc-13, using gcc on Ubuntu 24.04.

Versions:

    gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4)
    gcc version 14.0.1 20240412 (experimental) [master r14-9935-g67e1433a94f] (Ubuntu 14-20240412-0ubuntu1)

    cking@skylake:~$ CFLAGS="" gcc-13 -O2 reproducer-div128.c
    cking@skylake:~$ ./a.out
    1650.83 div128 ops per sec
    cking@skylake:~$ CFLAGS="" gcc-14 -O2 reproducer-div128.c
    cking@skylake:~$ ./a.out
    1567.48 div128 ops per sec

The original issue appeared when regression testing the stress-ng cpu div128 stressor [1]. I've managed to extract the attached reproducer from the original code (see attached).

Salient point to focus on:

1. The issue is dependent on the TARGET_CLONES macro being defined as __attribute__((target_clones("avx,default"))) - the avx target clone seems to be needed to reproduce this problem.

Attached are the reproducer C source and disassembled object code.

References:

[1] https://github.com/ColinIanKing/stress-ng/blob/master/stress-cpu.c
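The kind of code this report concerns might look like the following minimal sketch: a 128-bit division loop compiled with the TARGET_CLONES definition quoted above. The div128_sum function, its arguments and its loop are hypothetical and are not taken from the attached reproducer; the sketch also assumes a 64-bit target where __uint128_t is available, and the macro is assumed to expand to nothing outside GCC on x86-64 with glibc.

```c
#include <assert.h>
#include <stdint.h>

/* TARGET_CLONES as quoted in the report; assumed empty on other toolchains. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) && defined(__GLIBC__)
#define TARGET_CLONES __attribute__((target_clones("avx,default")))
#else
#define TARGET_CLONES
#endif

/*
 * Hypothetical kernel: divide the 128-bit value (hi:lo) by 1..n, sum the
 * quotients in 128 bits, and fold the result to 64 bits for the caller.
 */
uint64_t TARGET_CLONES div128_sum(uint64_t hi, uint64_t lo, uint32_t n)
{
    const __uint128_t dividend = ((__uint128_t)hi << 64) | lo;
    __uint128_t acc = 0;
    uint64_t d;

    assert(n > 0);
    for (d = 1; d <= n; d++)
        acc += dividend / d;    /* 128-bit by 64-bit unsigned division */
    return (uint64_t)(acc >> 64) + (uint64_t)acc;
}
```

Comparing the disassembly of the avx and default clones of such a function between compiler versions is one way to narrow down where the generated code diverges.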
[Bug target/115024] 128 bit division performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115024

--- Comment #1 from Colin Ian King ---
Created attachment 58159
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58159&action=edit
gcc-13 disassembly
[Bug target/115024] 128 bit division performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115024

--- Comment #2 from Colin Ian King ---
Created attachment 58160
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58160&action=edit
gcc-14 disassembly
[Bug target/115024] 128 bit division performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115024

--- Comment #3 from Colin Ian King ---
Created attachment 58161
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58161&action=edit
perf output for gcc-13 compiled code
[Bug target/115024] 128 bit division performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115024

--- Comment #4 from Colin Ian King ---
Created attachment 58162
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58162&action=edit
perf output for gcc-14 compiled code
[Bug target/115025] New: prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115025

            Bug ID: 115025
           Summary: prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: colin.king at intel dot com
  Target Milestone: ---

Created attachment 58163
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58163&action=edit
reproducer source code

I'm seeing a ~7% performance regression in gcc-14 compared to gcc-13, using gcc on Ubuntu 24.04, when computing prime numbers.

Versions:

    gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4)
    gcc version 14.0.1 20240412 (experimental) [master r14-9935-g67e1433a94f] (Ubuntu 14-20240412-0ubuntu1)

    cking@skylake:~$ CFLAGS="" gcc-13 -O2 reproducer-prime.c -lm
    cking@skylake:~$ ./a.out
    473.04 prime ops per sec
    cking@skylake:~$ CFLAGS="" gcc-14 -O2 reproducer-prime.c -lm
    cking@skylake:~$ ./a.out
    439.86 prime ops per sec

Attached is the reproducer. Note that the use of __attribute__((optimize("-O3"))) and/or __builtin_expect((x), 0) does not affect the performance regression.

The original issue appeared when regression testing the stress-ng cpu prime number stressor [1]. I've managed to extract the attached reproducer from the original code (see attached).

Attached are the reproducer C source and disassembled object code.

References:

[1] https://github.com/ColinIanKing/stress-ng/blob/master/stress-cpu.c
[Bug target/115025] prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115025

--- Comment #1 from Colin Ian King ---
Created attachment 58164
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58164&action=edit
gcc-13 disassembly
[Bug target/115025] prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115025

--- Comment #2 from Colin Ian King ---
Created attachment 58165
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58165&action=edit
gcc-14 disassembly
[Bug target/115025] prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115025

--- Comment #3 from Colin Ian King ---
Created attachment 58166
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58166&action=edit
perf output for gcc-13 compiled code
[Bug target/115025] prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115025

--- Comment #4 from Colin Ian King ---
Created attachment 58167
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58167&action=edit
perf output for gcc-14 compiled code
[Bug target/115029] New: FFT computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115029

            Bug ID: 115029
           Summary: FFT computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: colin.king at intel dot com
  Target Milestone: ---

Created attachment 58172
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58172&action=edit
reproducer source code

I'm seeing a ~0.8-1.4% performance regression in gcc-14 compared to gcc-13, using gcc on Ubuntu 24.04, when computing Fast Fourier Transforms on 4096 values.

Versions:

    gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4)
    gcc version 14.0.1 20240412 (experimental) [master r14-9935-g67e1433a94f] (Ubuntu 14-20240412-0ubuntu1)

    cking@skylake:~$ CFLAGS="" gcc-13 reproducer-fft.c -lm -O2
    cking@skylake:~$ ./a.out
    1927.23 fft ops per sec
    cking@skylake:~$ CFLAGS="" gcc-14 reproducer-fft.c -lm -O2
    cking@skylake:~$ ./a.out
    1906.73 fft ops per sec

I did some analysis on 20 runs each of the gcc-13 and gcc-14 builds. I noted a standard deviation (jitter) of ~0.44% in my results, but it's clear that the gcc-14 build is always 0.8%-1.4% slower on my i7-6700 test machine, so I think this is a significant performance regression that should be reported.

Attached is the reproducer. The original issue appeared when regression testing the stress-ng cpu fft stressor [1]. I've managed to extract the attached reproducer from the original code (see attached).

Attached are the reproducer C source and disassembled object code.

References:

[1] https://github.com/ColinIanKing/stress-ng/blob/master/stress-cpu.c
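The comparison of repeated runs described above can be sketched as follows. This is an illustration only, not the analysis actually used for the report: the helper names run_mean, run_variance and slowdown_percent are hypothetical, and the variance is returned rather than the standard deviation so that the sketch does not need libm (take sqrt() of the result, linking with -lm, to get the standard deviation).

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n benchmark results (e.g. "fft ops per sec" values). */
double run_mean(const double *runs, size_t n)
{
    double sum = 0.0;
    size_t i;

    assert(n > 0);
    for (i = 0; i < n; i++)
        sum += runs[i];
    return sum / (double)n;
}

/* Unbiased sample variance; its square root is the standard deviation. */
double run_variance(const double *runs, size_t n)
{
    double m, acc = 0.0;
    size_t i;

    assert(n > 1);
    m = run_mean(runs, n);
    for (i = 0; i < n; i++)
        acc += (runs[i] - m) * (runs[i] - m);
    return acc / (double)(n - 1);
}

/* Percentage by which the candidate throughput falls below the baseline. */
double slowdown_percent(double baseline, double candidate)
{
    assert(baseline > 0.0);
    return 100.0 * (baseline - candidate) / baseline;
}
```

For the single pair of figures quoted in the report, slowdown_percent(1927.23, 1906.73) comes out at roughly 1.06, which sits inside the stated 0.8%-1.4% range.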
[Bug target/115029] FFT computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115029

--- Comment #1 from Colin Ian King ---
Created attachment 58174
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58174&action=edit
gcc-13 disassembly
[Bug target/115029] FFT computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115029

--- Comment #2 from Colin Ian King ---
Created attachment 58175
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58175&action=edit
gcc-14 disassembly
[Bug target/115069] New: 8 bit integer vector performance regression, x86, between gcc-14 and gcc-13 using avx2 target clones on skylake platform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115069

            Bug ID: 115069
           Summary: 8 bit integer vector performance regression, x86, between gcc-14 and gcc-13 using avx2 target clones on skylake platform
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: colin.king at intel dot com
  Target Milestone: ---

Created attachment 58188
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58188&action=edit
reproducer source code

I'm seeing a ~12-14% performance regression in gcc-14 compared to gcc-13, using gcc on Ubuntu 24.04.

Versions:

    gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4)
    gcc version 14.0.1 20240412 (experimental) [master r14-9935-g67e1433a94f] (Ubuntu 14-20240412-0ubuntu1)

    cking@skylake:~$ gcc-13 reproducer-vecmath.c -O2
    cking@skylake:~$ ./a.out
    13540.16 vec8 ops per sec, duration = 14.77 secs
    cking@skylake:~$ gcc-14 reproducer-vecmath.c -O2
    cking@skylake:~$ ./a.out
    11720.25 vec8 ops per sec, duration = 17.06 secs

The original issue appeared when regression testing the stress-ng vecmath stressor [1]. I've managed to extract the attached reproducer from the original code (see attached).

Salient point to focus on:

1. The issue is dependent on the TARGET_CLONES macro being defined as __attribute__((target_clones("mmx,avx,avx2,default"))) - the avx2 target clone seems to be needed to reproduce this problem. Removing it for gcc-14 reduces the performance regression.

Attached are the reproducer C source and disassembled object code.

References:

[1] https://github.com/ColinIanKing/stress-ng/blob/master/stress-vecmath.c
[Bug target/115069] 8 bit integer vector performance regression, x86, between gcc-14 and gcc-13 using avx2 target clones on skylake platform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115069

--- Comment #1 from Colin Ian King ---
Created attachment 58189
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58189&action=edit
gcc-13 disassembly
[Bug target/115069] 8 bit integer vector performance regression, x86, between gcc-14 and gcc-13 using avx2 target clones on skylake platform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115069

--- Comment #2 from Colin Ian King ---
Created attachment 58190
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58190&action=edit
gcc-14 disassembly
[Bug target/115071] New: performance regression, x86, between gcc-14 and gcc-13 using -O3 and _Pragma("GCC unroll 4") on skylake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115071

            Bug ID: 115071
           Summary: performance regression, x86, between gcc-14 and gcc-13 using -O3 and _Pragma("GCC unroll 4") on skylake
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: colin.king at intel dot com
  Target Milestone: ---

Created attachment 58191
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58191&action=edit
reproducer source code

I'm seeing a ~15% performance regression in gcc-14 compared to gcc-13, using gcc on Ubuntu 24.04.

Versions:

    gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4)
    gcc version 14.0.1 20240412 (experimental) [master r14-9935-g67e1433a94f] (Ubuntu 14-20240412-0ubuntu1)

    cking@skylake:~$ gcc-13 reproducer-bitonicsort.c -O2
    cking@skylake:~$ ./a.out
    duration: 5.71 seconds, count = 1119566602
    cking@skylake:~$ gcc-14 reproducer-bitonicsort.c -O2
    cking@skylake:~$ ./a.out
    duration: 6.56 seconds, count = 1119566602

The original issue appeared when regression testing the stress-ng bitonic sorting stressor [1]. I've managed to extract the attached reproducer from the original code (see attached).

Salient points to focus on:

1. The issue is dependent on the use of _Pragma("GCC unroll 4")
2. The issue is also dependent on the use of __attribute__((optimize("-O3"))) via the OPTIMIZE3 macro in the example.

Attached are the reproducer C source and disassembled object code.

References:

[1] https://github.com/ColinIanKing/stress-ng/blob/master/stress-bitonicsort.c
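As an illustration of how the two ingredients named in the report could be combined, here is a minimal sketch of an iterative bitonic sort with an unroll pragma on its inner loop. It is not the attached reproducer: the bitonic_sort function, the UNROLL4 macro name and the placement of the pragma are assumptions, and both macros are assumed to expand to nothing on compilers other than GCC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* OPTIMIZE3 as quoted in the report; UNROLL4 is a hypothetical wrapper. */
#if defined(__GNUC__) && !defined(__clang__)
#define OPTIMIZE3 __attribute__((optimize("-O3")))
#define UNROLL4   _Pragma("GCC unroll 4")
#else
#define OPTIMIZE3
#define UNROLL4
#endif

/* Sort n values ascending with a bitonic network; n must be a power of two. */
void OPTIMIZE3 bitonic_sort(uint32_t *data, size_t n)
{
    size_t k, j, i;

    assert(n != 0 && (n & (n - 1)) == 0);
    for (k = 2; k <= n; k <<= 1) {          /* size of sequences being merged */
        for (j = k >> 1; j > 0; j >>= 1) {  /* compare distance within a merge */
            UNROLL4
            for (i = 0; i < n; i++) {
                const size_t l = i ^ j;     /* partner element */

                if (l > i) {
                    const int ascending = ((i & k) == 0);

                    if ((ascending && data[i] > data[l]) ||
                        (!ascending && data[i] < data[l])) {
                        const uint32_t tmp = data[i];

                        data[i] = data[l];
                        data[l] = tmp;
                    }
                }
            }
        }
    }
}
```

The pragma applies only to the loop that immediately follows it, so moving UNROLL4 to a different loop, or removing it, is a quick way to check how much the regression depends on it.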
[Bug target/115071] performance regression, x86, between gcc-14 and gcc-13 using -O3 and _Pragma("GCC unroll 4") on skylake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115071

--- Comment #1 from Colin Ian King ---
Created attachment 58192
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58192&action=edit
gcc-13 disassembly
[Bug target/115071] performance regression, x86, between gcc-14 and gcc-13 using -O3 and _Pragma("GCC unroll 4") on skylake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115071

--- Comment #2 from Colin Ian King ---
Created attachment 58193
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58193&action=edit
gcc-14 disassembly