[Bug middle-end/107304] New: internal compiler error: in convert_move, at expr.cc:220 with -march=tigerlake

2022-10-18 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107304

          Bug ID: 107304
         Summary: internal compiler error: in convert_move, at
                  expr.cc:220 with -march=tigerlake
         Product: gcc
         Version: 12.2.0
          Status: UNCONFIRMED
        Severity: normal
        Priority: P3
       Component: middle-end
        Assignee: unassigned at gcc dot gnu.org
        Reporter: colin.king at intel dot com
Target Milestone: ---

Compiling stress-ng using CFLAGS=-march=tigerlake:

git clone https://github.com/ColinIanKing/stress-ng
cd stress-ng
make clean
CFLAGS=-march=tigerlake make -j 8
make stress-ng VERBOSE=
make[1]: Entering directory '/home/cking/repos/stress-ng'
CC stress-vecshuf.c
during RTL pass: expand
stress-vecshuf.c: In function 'stress_vecshuf_u128_4.arch_alderlake':
stress-vecshuf.c:107:39: internal compiler error: in convert_move, at expr.cc:220
  107 | static double TARGET_CLONES OPTIMIZE3 stress_vecshuf_ ## tag ## _ ## elements ( \
      |                                       ^~~
stress-vecshuf.c:139:1: note: in expansion of macro 'STRESS_VEC_SHUFFLE'
  139 | STRESS_VEC_SHUFFLE(u128,  4)
      | ^~
0x7f47b0cfb209 __libc_start_call_main
../sysdeps/nptl/libc_start_call_main.h:58
0x7f47b0cfb2bb __libc_start_main_impl
../csu/libc-start.c:389
Please submit a full bug report, with preprocessed source (by using
-freport-bug).
Please include the complete backtrace with any bug report.
See  for instructions.
make[1]: *** [Makefile:504: stress-vecshuf.o] Error 1
make[1]: Leaving directory '/home/cking/repos/stress-ng'
make: *** [Makefile:488: all] Error 2

Without CFLAGS=-march=tigerlake it builds fine.

[Bug middle-end/107304] internal compiler error: in convert_move, at expr.cc:220 with -march=tigerlake

2022-10-18 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107304

--- Comment #1 from Colin Ian King  ---
See: https://github.com/ColinIanKing/stress-ng/issues/235

[Bug middle-end/107304] internal compiler error: in convert_move, at expr.cc:220 with -march=tigerlake

2022-10-18 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107304

--- Comment #2 from Colin Ian King  ---
Created attachment 53724
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53724&action=edit
preprocessed source that can be compiled to show bug

This is the pre-processed output from stress-vecshuf.c, compiling it will
trigger the issue.

gcc-12 -c stress-vecshuf-post-cpp.c -march=tigerlake

[Bug middle-end/107304] internal compiler error: in convert_move, at expr.cc:220 with -march=tigerlake

2022-10-18 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107304

--- Comment #3 from Colin Ian King  ---
Created attachment 53725
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53725&action=edit
Got the issue down to a small reproducer

Ran code through gcc -E, hacked out the irrelevant code, got it down to the
smallest reproducer.

Notes: gcc-12 -c stress-vecshuf-repro-small.c -march=tigerlake

Removing "arch=alderlake" from target clones makes the issue disappear.
Reducing the for loop to a small number of iterations (e.g. 4) makes the issue
disappear too.
Compiling without -march=tigerlake makes the issue disappear too.
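
For readers without the attachment, the ingredients named above (a target_clones function whose for loop repeatedly shuffles a vector) can be sketched as follows. This is not the attached reproducer: the function name rotate_lanes, the 4 x 32-bit vector type and the "avx2,default" clone list are assumptions made for illustration. The failing build had arch=alderlake in its clone list and was compiled with -march=tigerlake.

```c
#include <stdint.h>

/* Multiversion only where GCC's ifunc-based target_clones is known to work. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) && defined(__GLIBC__)
/* Assumed clone list; the report's list contained arch=alderlake. */
#define TARGET_CLONES __attribute__((target_clones("avx2,default")))
#else
#define TARGET_CLONES
#endif

/* Hypothetical 4-lane vector; the failing function used u128 elements. */
typedef uint32_t vec_u32_4 __attribute__((vector_size(16)));

/* Rotate the lanes left by one, `rounds` times, and return lane 0. */
TARGET_CLONES
uint32_t rotate_lanes(uint32_t seed, int rounds)
{
    vec_u32_4 v = { seed, seed + 1, seed + 2, seed + 3 };

    for (int i = 0; i < rounds; i++) {
#if defined(__GNUC__) && !defined(__clang__)
        const vec_u32_4 mask = { 1, 2, 3, 0 };  /* result[i] = v[mask[i]] */
        v = __builtin_shuffle(v, mask);
#else
        /* Same rotation written out for compilers without __builtin_shuffle. */
        vec_u32_4 t = { v[1], v[2], v[3], v[0] };
        v = t;
#endif
    }
    return v[0];
}
```

Calling rotate_lanes(10, 4) rotates the four lanes back to their starting order, so lane 0 holds the seed again.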

[Bug c/114987] New: floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms

2024-05-08 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987

          Bug ID: 114987
         Summary: floating point vector regression, x86, between gcc 14
                  and gcc-13 using -O3 and target clones on skylake
                  platforms
         Product: gcc
         Version: 14.0
          Status: UNCONFIRMED
        Severity: normal
        Priority: P3
       Component: c
        Assignee: unassigned at gcc dot gnu.org
        Reporter: colin.king at intel dot com
Target Milestone: ---

Created attachment 58126
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58126&action=edit
reproducer.c source code

I'm seeing a ~10% performance regression in gcc-14 compared to gcc-13, using
gcc on Ubuntu 24.04:

Versions:
gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4) 
gcc version 14.0.1 20240412 (experimental) [master r14-9935-g67e1433a94f]
(Ubuntu 14-20240412-0ubuntu1) 

cking@skylake:~$ CFLAGS="" gcc-13 reproducer.c; ./a.out  
4.92 secs duration, 2130.379 Mfp-ops/sec
cking@skylake:~$ CFLAGS="" gcc-14 reproducer.c; ./a.out  
5.46 secs duration, 1921.799 Mfp-ops/sec
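
The ~10% figure follows from the two Mfp-ops/sec readings above. A small helper (the name regression_percent is hypothetical) makes the arithmetic explicit:

```c
/* Throughput lost going from `before` to `after`, as a percentage of `before`. */
double regression_percent(double before, double after)
{
    return 100.0 * (before - after) / before;
}
```

With before = 2130.379 and after = 1921.799 this gives roughly 9.8%.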

The original issue appeared when regression testing stress-ng vecfp stressor
[1] using the floating point vector 16 add stressor method. I've managed to
extract the attached reproducer (reproducer.c) from the original code.

Salient points to focus on:

1. The issue is dependent on the OPTIMIZE3 macro in the reproducer being
__attribute__((optimize("-O3")))
2. The issue is also dependent on the TARGET_CLONES macro being defined as
__attribute__((target_clones("mmx,avx,default"))); the avx target clone
seems to be a factor in reproducing this problem.
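
As a rough illustration of how the two macros combine on a 16-element float vector add, here is a minimal sketch. It is not the attached reproducer.c: the function name, its parameters and the portability guard are assumptions, and only the two attribute strings are taken from the points above.

```c
#include <stdint.h>

/* Apply the attributes only where GCC's ifunc-based multiversioning works. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) && defined(__GLIBC__)
#define TARGET_CLONES __attribute__((target_clones("mmx,avx,default")))
#define OPTIMIZE3     __attribute__((optimize("-O3")))
#else
#define TARGET_CLONES
#define OPTIMIZE3
#endif

/* 16 floats in one GCC vector-extension type (64 bytes). */
typedef float vec_float_16 __attribute__((vector_size(16 * sizeof(float))));

/* Hypothetical stand-in for the float add 16 method: add `step` to every
 * lane `loops` times, then return the sum of all 16 lanes. */
TARGET_CLONES OPTIMIZE3
float vecfp_float_add_16(float start, float step, int loops)
{
    vec_float_16 r, add;
    float sum = 0.0f;

    for (int i = 0; i < 16; i++) {
        r[i] = start;
        add[i] = step;
    }
    for (int i = 0; i < loops; i++)
        r += add;               /* element-wise vector add */
    for (int i = 0; i < 16; i++)
        sum += r[i];
    return sum;
}
```

Dropping "avx" from the clone list, or removing OPTIMIZE3, are the two knobs the points above identify for comparison.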

Attached are the reproducer.c C source and disassembled object code. The
stress_vecfp_float_add_16.avx from gcc-13 is significantly different from the
gcc-14 code.

References: [1]
https://github.com/ColinIanKing/stress-ng/blob/master/stress-vecfp.c

[Bug c/114987] floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms

2024-05-08 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987

--- Comment #1 from Colin Ian King  ---
Created attachment 58127
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58127&action=edit
gcc-13 disassembly

[Bug c/114987] floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms

2024-05-08 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987

--- Comment #2 from Colin Ian King  ---
Created attachment 58128
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58128&action=edit
gcc-14 disassembly

[Bug c/114987] floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms

2024-05-08 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987

--- Comment #3 from Colin Ian King  ---
perf report for the gcc-13 build, compute loop of stress_vecfp_float_add_16.avx:

 57.93 │200:   vaddps   0xc0(%rsp),%ymm3,%ymm5
 11.11 │       vaddps   0xe0(%rsp),%ymm2,%ymm6
  0.02 │       vmovaps  %ymm5,0x60(%rsp)
  2.92 │       mov      0x60(%rsp),%rax
       │       mov      0x68(%rsp),%rdx
  0.37 │       vmovaps  %ymm6,0x40(%rsp)
       │       vmovaps  %ymm5,0x80(%rsp)
  6.30 │       vmovq    %rax,%xmm1
  4.11 │       mov      0x40(%rsp),%rax
       │       vmovdqa  0x90(%rsp),%xmm5
       │       vmovaps  %ymm6,0xa0(%rsp)
  3.27 │       vpinsrq  $0x1,%rdx,%xmm1,%xmm1
       │       mov      0x48(%rsp),%rdx
       │       vmovdqa  0xb0(%rsp),%xmm6
  3.22 │       vmovdqa  %xmm1,0xc0(%rsp)
  0.42 │       vmovq    %rax,%xmm0
       │       vmovdqa  %xmm5,0xd0(%rsp)
  6.80 │       vpinsrq  $0x1,%rdx,%xmm0,%xmm0
  3.52 │       vmovdqa  %xmm0,0xe0(%rsp)
       │       vmovdqa  %xmm6,0xf0(%rsp)
       │       sub      $0x1,%ecx
       │     ↑ jne      200

perf report for the gcc-14 build, compute loop of stress_vecfp_float_add_16.avx:

 65.79 │200:   vaddps   0xc0(%rsp),%ymm3,%ymm5
  3.26 │       vaddps   0xe0(%rsp),%ymm2,%ymm6
  0.00 │       vmovaps  %ymm5,0x60(%rsp)
  9.25 │       mov      0x60(%rsp),%rax
  0.00 │       mov      0x68(%rsp),%rdx
       │       vmovaps  %ymm6,0x40(%rsp)
       │       vmovaps  %ymm5,0x80(%rsp)
  6.49 │       vmovq    %rax,%xmm1
  0.00 │       mov      0x40(%rsp),%rax
  0.00 │       vmovaps  %ymm6,0xa0(%rsp)
  3.02 │       vpinsrq  $0x1,%rdx,%xmm1,%xmm1
       │       mov      0x48(%rsp),%rdx
  0.35 │       vmovdqa  %xmm1,0xc0(%rsp)
  0.68 │       vmovq    %rax,%xmm0
  0.00 │       vmovdqa  0x90(%rsp),%xmm1
  5.18 │       vpinsrq  $0x1,%rdx,%xmm0,%xmm0
  3.00 │       vmovdqa  %xmm0,0xe0(%rsp)
       │       vmovdqa  0xb0(%rsp),%xmm0
       │       vmovdqa  %xmm1,0xd0(%rsp)
       │       vmovdqa  %xmm0,0xf0(%rsp)
       │       sub      $0x1,%ecx
  2.94 │     ↑ jne      200

[Bug c/115002] New: wide integer vector performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform

2024-05-09 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115002

          Bug ID: 115002
         Summary: wide integer vector performance regression, x86,
                  between gcc-14 and gcc-13 using target clones on
                  skylake platform
         Product: gcc
         Version: 14.0
          Status: UNCONFIRMED
        Severity: normal
        Priority: P3
       Component: c
        Assignee: unassigned at gcc dot gnu.org
        Reporter: colin.king at intel dot com
Target Milestone: ---

Created attachment 58138
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58138&action=edit
reproducer source code

I'm seeing a ~1.5% performance regression in gcc-14 compared to gcc-13, using
gcc on Ubuntu 24.04:

Versions:
gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4) 
gcc version 14.0.1 20240412 (experimental) [master r14-9935-g67e1433a94f]
(Ubuntu 14-20240412-0ubuntu1) 

cking@skylake:~$ CFLAGS="" gcc-13 reproducer-vecwide.c -O2 -Wall
cking@skylake:~$ ./a.out 
7615.58 vint8w2048_t ops per sec, duration = 13.13 secs

cking@skylake:~$ CFLAGS="" gcc-14 reproducer-vecwide.c -O2 -Wall
cking@skylake:~$ ./a.out 
7489.42 vint8w2048_t ops per sec, duration = 13.35 secs

The original issue appeared when regression testing stress-ng vecwide stressor
[1]. I've managed to extract the attached reproducer from the original code
(see attached).

Salient point to focus on:

1. The issue is dependent on the TARGET_CLONES macro being defined as
__attribute__((target_clones("avx,default"))); the avx target clone seems to
be a factor in reproducing this problem.
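
To give a feel for the kind of code involved, a 2048-bit vector of 8-bit lanes under the same clone list might look like the sketch below. The type name wide_u8_2048, the function name and the add/xor mix are assumptions (unsigned lanes are used so the arithmetic wraps cleanly). The attached reproducer is the authoritative source.

```c
#include <stdint.h>

/* Multiversion only where GCC's ifunc-based target_clones is known to work. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) && defined(__GLIBC__)
#define TARGET_CLONES __attribute__((target_clones("avx,default")))
#else
#define TARGET_CLONES
#endif

/* 2048 bits = 256 lanes of 8 bits; GCC splits this across hardware registers. */
typedef uint8_t wide_u8_2048 __attribute__((vector_size(2048 / 8)));

/* Hypothetical wide-vector kernel: v = (v + a) ^ b, repeated `loops` times.
 * Returns lane 0 so the result can be checked. */
TARGET_CLONES
uint8_t wide_add_xor(uint8_t a0, uint8_t b0, int loops)
{
    wide_u8_2048 v, a, b;

    for (int i = 0; i < 2048 / 8; i++) {
        v[i] = 0;
        a[i] = a0;
        b[i] = b0;
    }
    for (int i = 0; i < loops; i++) {
        v += a;     /* element-wise add, wraps modulo 256 */
        v ^= b;     /* element-wise xor */
    }
    return v[0];
}
```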

Attached are the reproducer C source and disassembled object code. 

References: [1]
https://github.com/ColinIanKing/stress-ng/blob/master/stress-vecwide.c

[Bug target/115002] [14/15 regression] wide integer vector performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform

2024-05-09 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115002

--- Comment #1 from Colin Ian King  ---
Created attachment 58139
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58139&action=edit
gcc-13 disassembly

[Bug target/115002] [14/15 regression] wide integer vector performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform

2024-05-09 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115002

--- Comment #2 from Colin Ian King  ---
Created attachment 58140
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58140&action=edit
gcc-14 disassembly

[Bug target/115002] [14/15 regression] wide integer vector performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform

2024-05-09 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115002

--- Comment #3 from Colin Ian King  ---
Created attachment 58141
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58141&action=edit
perf output of stress_vecwide_2048 for gcc-13 compiled code

[Bug target/115002] [14/15 regression] wide integer vector performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform

2024-05-09 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115002

--- Comment #4 from Colin Ian King  ---
Created attachment 58142
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58142&action=edit
perf output of stress_vecwide_2048 for gcc-14 compiled code

[Bug target/115024] New: 128 bit division performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform

2024-05-10 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115024

          Bug ID: 115024
         Summary: 128 bit division performance regression, x86, between
                  gcc-14 and gcc-13 using target clones on skylake
                  platform
         Product: gcc
         Version: 14.0
          Status: UNCONFIRMED
        Severity: normal
        Priority: P3
       Component: target
        Assignee: unassigned at gcc dot gnu.org
        Reporter: colin.king at intel dot com
Target Milestone: ---

Created attachment 58158
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58158&action=edit
reproducer source code for __int128_t division regression

I'm seeing a 5% performance regression in gcc-14 compared to gcc-13, using gcc
on Ubuntu 24.04:

Versions:
gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4) 
gcc version 14.0.1 20240412 (experimental) [master r14-9935-g67e1433a94f]
(Ubuntu 14-20240412-0ubuntu1) 

cking@skylake:~$ CFLAGS="" gcc-13 -O2 reproducer-div128.c 
cking@skylake:~$ ./a.out 
1650.83 div128 ops per sec

cking@skylake:~$ CFLAGS="" gcc-14 -O2 reproducer-div128.c 
cking@skylake:~$ ./a.out 
1567.48 div128 ops per sec

The original issue appeared when regression testing stress-ng cpu div128
stressor [1]. I've managed to extract the attached reproducer from the original
code (see attached).

Salient point to focus on:

1. The issue is dependent on the TARGET_CLONES macro being defined as
__attribute__((target_clones("avx,default"))); the avx target clone seems to
be a factor in reproducing this problem.
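
A minimal sketch of a 128-bit division under that clone list is shown below. It assumes a 64-bit GCC target where __int128_t exists. The function name div128_low64 and its interface are hypothetical and not taken from the attached reproducer.

```c
#include <stdint.h>

/* Multiversion only where GCC's ifunc-based target_clones is known to work. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) && defined(__GLIBC__)
#define TARGET_CLONES __attribute__((target_clones("avx,default")))
#else
#define TARGET_CLONES
#endif

/* Divide the 128-bit value (hi * 2^64 + lo) by `divisor` and return the
 * low 64 bits of the quotient. `hi` must stay below 2^63 so the signed
 * shift cannot overflow, and `divisor` must be non-zero. */
TARGET_CLONES
uint64_t div128_low64(uint64_t hi, uint64_t lo, uint64_t divisor)
{
    const __int128_t n = ((__int128_t)hi << 64) + (__int128_t)lo;

    return (uint64_t)(n / (__int128_t)divisor);
}
```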

Attached are the reproducer C source and disassembled object code. 

References: [1]
https://github.com/ColinIanKing/stress-ng/blob/master/stress-cpu.c

[Bug target/115024] 128 bit division performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform

2024-05-10 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115024

--- Comment #1 from Colin Ian King  ---
Created attachment 58159
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58159&action=edit
gcc-13 disassembly

[Bug target/115024] 128 bit division performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform

2024-05-10 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115024

--- Comment #2 from Colin Ian King  ---
Created attachment 58160
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58160&action=edit
gcc-14 disassembly

[Bug target/115024] 128 bit division performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform

2024-05-10 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115024

--- Comment #3 from Colin Ian King  ---
Created attachment 58161
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58161&action=edit
perf output for gcc-13 compiled code

[Bug target/115024] 128 bit division performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform

2024-05-10 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115024

--- Comment #4 from Colin Ian King  ---
Created attachment 58162
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58162&action=edit
perf output for gcc-14 compiled code

[Bug target/115025] New: prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform

2024-05-10 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115025

          Bug ID: 115025
         Summary: prime computation performance regression, x86, between
                  gcc-14 and gcc-13 on skylake platform
         Product: gcc
         Version: 14.0
          Status: UNCONFIRMED
        Severity: normal
        Priority: P3
       Component: target
        Assignee: unassigned at gcc dot gnu.org
        Reporter: colin.king at intel dot com
Target Milestone: ---

Created attachment 58163
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58163&action=edit
reproducer source code

I'm seeing a ~7% performance regression in gcc-14 compared to gcc-13, using gcc
on Ubuntu 24.04 computing prime numbers:

Versions:
gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4) 
gcc version 14.0.1 20240412 (experimental) [master r14-9935-g67e1433a94f]
(Ubuntu 14-20240412-0ubuntu1) 

cking@skylake:~$ CFLAGS="" gcc-13 -O2 reproducer-prime.c -lm
cking@skylake:~$ ./a.out 
473.04 prime ops per sec

cking@skylake:~$ CFLAGS="" gcc-14 -O2 reproducer-prime.c -lm
cking@skylake:~$ ./a.out 
439.86 prime ops per sec

Attached is the reproducer. Note that the use of
__attribute__((optimize("-O3"))) and/or __builtin_expect((x), 0) does not
affect the performance regression.
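
For context, the two hints mentioned above are typically applied as in the sketch below. This is a plain trial-division test written for illustration, not the algorithm in the attached reproducer. The UNLIKELY macro and both function names are assumptions.

```c
#include <stdint.h>

#if defined(__GNUC__) && !defined(__clang__)
#define OPTIMIZE3 __attribute__((optimize("-O3")))
#else
#define OPTIMIZE3
#endif

/* Branch hint: tell the compiler that x is expected to be false. */
#define UNLIKELY(x) __builtin_expect((x), 0)

/* Trial division by 2, 3 and then numbers of the form 6k-1 and 6k+1. */
OPTIMIZE3
int is_prime(uint32_t n)
{
    if (UNLIKELY(n < 2))
        return 0;
    if (UNLIKELY(n < 4))
        return 1;                       /* 2 and 3 */
    if ((n % 2) == 0 || (n % 3) == 0)
        return 0;
    for (uint32_t i = 5; (uint64_t)i * i <= n; i += 6) {
        if ((n % i) == 0 || (n % (i + 2)) == 0)
            return 0;
    }
    return 1;
}

/* Count the primes strictly below `limit`. */
uint32_t count_primes(uint32_t limit)
{
    uint32_t count = 0;

    for (uint32_t n = 0; n < limit; n++)
        count += (uint32_t)is_prime(n);
    return count;
}
```

According to the note above, removing OPTIMIZE3 or the UNLIKELY hints would not be expected to change the gcc-13 versus gcc-14 gap.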

The original issue appeared when regression testing stress-ng cpu prime number
stressor [1]. I've managed to extract the attached reproducer from the original
code (see attached).

Attached are the reproducer C source and disassembled object code. 

References: [1]
https://github.com/ColinIanKing/stress-ng/blob/master/stress-cpu.c

[Bug target/115025] prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform

2024-05-10 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115025

--- Comment #1 from Colin Ian King  ---
Created attachment 58164
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58164&action=edit
gcc-13 disassembly

[Bug target/115025] prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform

2024-05-10 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115025

--- Comment #2 from Colin Ian King  ---
Created attachment 58165
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58165&action=edit
gcc-14 disassembly

[Bug target/115025] prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform

2024-05-10 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115025

--- Comment #3 from Colin Ian King  ---
Created attachment 58166
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58166&action=edit
perf output for gcc-13 compiled code

[Bug target/115025] prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform

2024-05-10 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115025

--- Comment #4 from Colin Ian King  ---
Created attachment 58167
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58167&action=edit
perf output for gcc-14 compiled code

[Bug target/115029] New: FFT computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform

2024-05-10 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115029

          Bug ID: 115029
         Summary: FFT computation performance regression, x86, between
                  gcc-14 and gcc-13 on skylake platform
         Product: gcc
         Version: 14.0
          Status: UNCONFIRMED
        Severity: normal
        Priority: P3
       Component: target
        Assignee: unassigned at gcc dot gnu.org
        Reporter: colin.king at intel dot com
Target Milestone: ---

Created attachment 58172
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58172&action=edit
reproducer source code

I'm seeing a ~0.8-1.4% performance regression in gcc-14 compared to gcc-13,
using gcc on Ubuntu 24.04 computing Fast Fourier Transforms on 4096 values.

Versions:
gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4) 
gcc version 14.0.1 20240412 (experimental) [master r14-9935-g67e1433a94f]
(Ubuntu 14-20240412-0ubuntu1)

cking@skylake:~$ CFLAGS="" gcc-13 reproducer-fft.c -lm -O2
cking@skylake:~$ ./a.out 
1927.23 fft ops per sec

cking@skylake:~$ CFLAGS="" gcc-14 reproducer-fft.c -lm -O2
cking@skylake:~$ ./a.out 
1906.73 fft ops per sec

I analysed 20 runs each of the gcc-13 and gcc-14 builds and noted a standard
deviation (jitter) of ~0.44% in my results, but it's clear that the gcc-14
build is always 0.8%-1.4% slower on my i7-6700 test machine, so I think this is
a performance regression significant enough to report.
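
The reasoning above (a gap between builds that is larger than the run-to-run jitter) can be checked mechanically. The sketch below is one way to do it and is not the analysis actually used for this report. All names are hypothetical and no measurement data is included. It compares squared quantities so that no sqrt() call, and hence no -lm, is needed.

```c
#include <stddef.h>

/* Arithmetic mean of n samples (n must be > 0). */
double mean(const double *x, size_t n)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum / (double)n;
}

/* Unbiased sample variance (n must be > 1). */
double sample_variance(const double *x, size_t n)
{
    const double m = mean(x, n);
    double acc = 0.0;

    for (size_t i = 0; i < n; i++)
        acc += (x[i] - m) * (x[i] - m);
    return acc / (double)(n - 1);
}

/* Returns 1 when the gap between the two means exceeds k standard
 * deviations of the noisier data set, i.e. gap^2 > k^2 * max(variance). */
int gap_exceeds_jitter(const double *a, size_t na,
                       const double *b, size_t nb, double k)
{
    const double gap = mean(a, na) - mean(b, nb);
    const double va = sample_variance(a, na);
    const double vb = sample_variance(b, nb);
    const double worst = (va > vb) ? va : vb;

    return gap * gap > k * k * worst;
}
```

Feeding it the per-run ops/sec values of each compiler with, say, k = 2 gives a quick yes/no on whether the difference stands clear of the noise.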

Attached is the reproducer.

The original issue appeared when regression testing the stress-ng cpu fft
stressor [1]. I've managed to extract the attached reproducer from the original
code (see attached).

Attached are the reproducer C source and disassembled object code. 

References: [1]
https://github.com/ColinIanKing/stress-ng/blob/master/stress-cpu.c

[Bug target/115029] FFT computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform

2024-05-10 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115029

--- Comment #1 from Colin Ian King  ---
Created attachment 58174
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58174&action=edit
gcc-13 disassembly

[Bug target/115029] FFT computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform

2024-05-10 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115029

--- Comment #2 from Colin Ian King  ---
Created attachment 58175
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58175&action=edit
gcc-14 disassembly

[Bug target/115069] New: 8 bit integer vector performance regression, x86, between gcc-14 and gcc-13 using avx2 target clones on skylake platform

2024-05-13 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115069

          Bug ID: 115069
         Summary: 8 bit integer vector performance regression, x86,
                  between gcc-14 and gcc-13 using avx2 target clones on
                  skylake platform
         Product: gcc
         Version: 14.0
          Status: UNCONFIRMED
        Severity: normal
        Priority: P3
       Component: target
        Assignee: unassigned at gcc dot gnu.org
        Reporter: colin.king at intel dot com
Target Milestone: ---

Created attachment 58188
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58188&action=edit
reproducer source code

I'm seeing a ~12-14% performance regression in gcc-14 compared to gcc-13, using
gcc on Ubuntu 24.04:

Versions:
gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4) 
gcc version 14.0.1 20240412 (experimental) [master r14-9935-g67e1433a94f]
(Ubuntu 14-20240412-0ubuntu1) 

cking@skylake:~$ gcc-13 reproducer-vecmath.c -O2
cking@skylake:~$ ./a.out 
13540.16 vec8 ops per sec, duration = 14.77 secs

cking@skylake:~$ gcc-14 reproducer-vecmath.c -O2
cking@skylake:~$ ./a.out 
11720.25 vec8 ops per sec, duration = 17.06 secs

The original issue appeared when regression testing stress-ng vecmath stressor
[1]. I've managed to extract the attached reproducer from the original code
(see attached).

Salient point to focus on:

1. The issue is dependent on the TARGET_CLONES macro being defined as
__attribute__((target_clones("mmx,avx,avx2,default"))); the avx2 target clone
seems to be a factor in reproducing this problem. Remove it for gcc-14 and the
performance regression is reduced.
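
One way to compare builds with and without the avx2 clone is to make the clone list overridable at compile time, as sketched below. The VEC8_CLONES macro, the function name vec8_ops and the add/multiply/xor mix are assumptions for illustration and do not come from the attached reproducer.

```c
#include <stdint.h>

/* Default to the clone list from the report; override to experiment, e.g.
 *   gcc -O2 -DVEC8_CLONES='"mmx,avx,default"' ...                          */
#ifndef VEC8_CLONES
#define VEC8_CLONES "mmx,avx,avx2,default"
#endif

/* Multiversion only where GCC's ifunc-based target_clones is known to work. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) && defined(__GLIBC__)
#define TARGET_CLONES __attribute__((target_clones(VEC8_CLONES)))
#else
#define TARGET_CLONES
#endif

/* 16 lanes of unsigned 8-bit integers; lane arithmetic wraps modulo 256. */
typedef uint8_t vec_u8_16 __attribute__((vector_size(16)));

/* Hypothetical 8-bit vector kernel: c = ((c + a) * b) ^ a per iteration.
 * Returns lane 0 so the result can be checked. */
TARGET_CLONES
uint8_t vec8_ops(uint8_t seed, int loops)
{
    vec_u8_16 a, b, c;

    for (int i = 0; i < 16; i++) {
        a[i] = (uint8_t)(seed + i);
        b[i] = 3;
        c[i] = 0;
    }
    for (int i = 0; i < loops; i++) {
        c += a;
        c *= b;
        c ^= a;
    }
    return c[0];
}
```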

Attached are the reproducer C source and disassembled object code. 

References: [1]
https://github.com/ColinIanKing/stress-ng/blob/master/stress-vecmath.c

[Bug target/115069] 8 bit integer vector performance regression, x86, between gcc-14 and gcc-13 using avx2 target clones on skylake platform

2024-05-13 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115069

--- Comment #1 from Colin Ian King  ---
Created attachment 58189
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58189&action=edit
gcc-13 disassembly

[Bug target/115069] 8 bit integer vector performance regression, x86, between gcc-14 and gcc-13 using avx2 target clones on skylake platform

2024-05-13 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115069

--- Comment #2 from Colin Ian King  ---
Created attachment 58190
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58190&action=edit
gcc-14 disassembly

[Bug target/115071] New: performance regression, x86, between gcc-14 and gcc-13 using -O3 and _Pragma("GCC unroll 4") on skylake

2024-05-13 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115071

          Bug ID: 115071
         Summary: performance regression, x86, between gcc-14 and gcc-13
                  using -O3 and _Pragma("GCC unroll 4") on skylake
         Product: gcc
         Version: 14.0
          Status: UNCONFIRMED
        Severity: normal
        Priority: P3
       Component: target
        Assignee: unassigned at gcc dot gnu.org
        Reporter: colin.king at intel dot com
Target Milestone: ---

Created attachment 58191
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58191&action=edit
reproducer source code

I'm seeing a ~15% performance regression in gcc-14 compared to gcc-13, using
gcc on Ubuntu 24.04:

Versions:
gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4) 
gcc version 14.0.1 20240412 (experimental) [master r14-9935-g67e1433a94f]
(Ubuntu 14-20240412-0ubuntu1) 

cking@skylake:~$ gcc-13 reproducer-bitonicsort.c -O2
cking@skylake:~$ ./a.out 
duration: 5.71 seconds, count = 1119566602

cking@skylake:~$ gcc-14 reproducer-bitonicsort.c -O2
cking@skylake:~$ ./a.out 
duration: 6.56 seconds, count = 1119566602

The original issue appeared when regression testing stress-ng bitonic sorting
stressor [1]. I've managed to extract the attached reproducer from the original
code (see attached).

Salient points to focus on:

1. The issue is dependent on the use of _Pragma("GCC unroll 4")
2. The issue is also dependent on the use of __attribute__((optimize("-O3")))
via the OPTIMIZE3 macro in the example.
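
To show where the two constructs sit in a bitonic sort, here is a generic iterative version. It is a textbook formulation written for illustration, not the code in the attached reproducer. The UNROLL4 macro name and the choice of which loop to unroll are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) && !defined(__clang__)
#define OPTIMIZE3 __attribute__((optimize("-O3")))
#define UNROLL4   _Pragma("GCC unroll 4")
#else
#define OPTIMIZE3
#define UNROLL4
#endif

/* Sort data[0..n) into ascending order; n must be a power of two. */
OPTIMIZE3
void bitonic_sort(uint32_t *data, size_t n)
{
    for (size_t k = 2; k <= n; k <<= 1) {           /* size of merged runs */
        for (size_t j = k >> 1; j > 0; j >>= 1) {   /* compare distance */
            UNROLL4
            for (size_t i = 0; i < n; i++) {
                const size_t l = i ^ j;             /* partner index */

                if (l > i) {
                    const int ascending = ((i & k) == 0);
                    const int out_of_order = ascending
                        ? (data[i] > data[l])
                        : (data[i] < data[l]);

                    if (out_of_order) {
                        const uint32_t tmp = data[i];
                        data[i] = data[l];
                        data[l] = tmp;
                    }
                }
            }
        }
    }
}
```

Removing UNROLL4 or OPTIMIZE3 gives the two variants that the points above identify as worth comparing between gcc-13 and gcc-14.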

Attached are the reproducer C source and disassembled object code. 

References:
[1] https://github.com/ColinIanKing/stress-ng/blob/master/stress-bitonicsort.c

[Bug target/115071] performance regression, x86, between gcc-14 and gcc-13 using -O3 and _Pragma("GCC unroll 4") on skylake

2024-05-13 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115071

--- Comment #1 from Colin Ian King  ---
Created attachment 58192
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58192&action=edit
gcc-13 disassembly

[Bug target/115071] performance regression, x86, between gcc-14 and gcc-13 using -O3 and _Pragma("GCC unroll 4") on skylake

2024-05-13 Thread colin.king at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115071

--- Comment #2 from Colin Ian King  ---
Created attachment 58193
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58193&action=edit
gcc-14 disassembly