[Bug target/82261] x86: missing peephole for SHLD / SHRD
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82261

Michael Clark changed:

           What    |Removed |Added
----------------------------------------------------------------
                CC |        |michaeljclark at mac dot com

--- Comment #2 from Michael Clark ---

Just refreshing this issue. I found it while testing some code-gen on Godbolt:

- https://godbolt.org/z/uXGxZ9

I noticed that Haswell code-gen uses SHRX/SHLX, but I think -Os and pre-Haswell targets would benefit from this peephole if it is not complex to add. Note that Clang prefers SHLD/SHRD over the SHRX+SHLX pair regardless of the -march flavor.
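For context, a sketch of the kind of double-shift (funnel shift) pattern this peephole targets. The function name `shld32` is mine, not from the report; I am assuming the Godbolt link contains something similar:

```c
#include <stdint.h>

/* Double-precision left shift: shift the 64-bit pair hi:lo left by n
   and return the new high 32 bits. On x86 this can lower to a single
   SHLD; Haswell code-gen instead emits an SHLX+SHRX+OR sequence. */
uint32_t shld32(uint32_t hi, uint32_t lo, unsigned n)
{
    n &= 31;                    /* match SHLD's modulo-32 shift count */
    if (n == 0)
        return hi;              /* avoid lo >> 32, which is undefined in C */
    return (hi << n) | (lo >> (32 - n));
}
```

The `n == 0` guard keeps the C well-defined; the hardware instruction handles a zero count natively.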
[Bug target/95251] New: x86 code size expansion inserting field into a union
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95251

           Bug ID: 95251
          Summary: x86 code size expansion inserting field into a union
          Product: gcc
          Version: 10.1.1
           Status: UNCONFIRMED
         Severity: normal
         Priority: P3
        Component: target
         Assignee: unassigned at gcc dot gnu.org
         Reporter: michaeljclark at mac dot com
 Target Milestone: ---

Testing code on Godbolt, I came across some pathological code amplification when SSE is enabled for field insertion into a structure containing a union. Here is the Godbolt link:

https://godbolt.org/z/z_RpFt

Compiler flags: gcc -Os --save-temps -march=ivybridge -c x7b00.c

The function `x7b00` inserts into the structure via char fields and has a voluminous translation (30 instructions). The functionally equivalent `xyb87` inserts into the structure via a 64-bit integer and translates simply (5 instructions). `x`, `a7x` and `x7bcd` are for comparison. Omitting -march=ivybridge improves the code size, but it is still nowhere near optimal; `xyb87` serves as a reference for a near-optimal translation.

It seemed worth filing a bug due to the observed code amplification factor (6X). Can the backend choose the non-SSE code generation when it is more efficient?
--- CODE SNIPPET BEGINS ---

typedef unsigned long long u64;
typedef char u8;

typedef struct mr {
    union {
        u64 y;
        struct { u8 a, b, c, d; } i;
    } u;
    u64 x;
} mr;

u64 x(mr mr) { return mr.x; }
mr a7x(u64 x) { return (mr) { .u = { .i = { 7,0,0,0 } }, .x = x }; }
mr x7bcd(u64 x, u8 b, u8 c, u8 d) { return (mr) { .u = { .i = { 7,b,c,d } }, .x = x }; }
mr xyb87(u64 x, u8 b) { return (mr) { .u = { .y = (u64)b << 8 | 7 }, .x = x }; }
mr x7b00(u64 x, u8 b) { return (mr) { .u = { .i = { 7,b,0,0 } }, .x = x }; }

--- EXPECTED OUTPUT ---

        .cfi_startproc
        endbr64
        movsbq  %sil, %rax
        movq    %rdi, %rdx
        salq    $8, %rax
        orq     $7, %rax
        ret
        .cfi_endproc

--- OBSERVED OUTPUT ---

        .cfi_startproc
        endbr64
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rdi, %r8
        xorl    %eax, %eax
        movl    $6, %ecx
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        andq    $-32, %rsp
        leaq    -32(%rsp), %rdi
        rep stosb
        movq    $0, -48(%rsp)
        movabsq $281474976710655, %rax
        movq    $0, -40(%rsp)
        movq    -48(%rsp), %rdx
        andq    -32(%rsp), %rax
        movzwl  %dx, %edx
        salq    $16, %rax
        orq     %rax, %rdx
        movq    %rdx, -48(%rsp)
        movb    $7, -48(%rsp)
        vmovdqa -48(%rsp), %xmm1
        vpinsrb $1, %esi, %xmm1, %xmm0
        vmovaps %xmm0, -48(%rsp)
        movq    -48(%rsp), %rax
        movq    %r8, -40(%rsp)
        movq    -40(%rsp), %rdx
        leave
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
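A minimal sketch of the integer-packing trick that makes `xyb87` fast, pulled out as a standalone helper. The name `pack_7b00` is mine; I use unsigned char here to avoid sign-extension surprises that `typedef char u8` can introduce on platforms where char is signed:

```c
typedef unsigned long long u64;

/* Assemble the byte fields {7, b, 0, 0} into one 64-bit value, to be
   stored through the union's u64 member rather than the char fields.
   This mirrors what xyb87 does and sidesteps the SSE insert path. */
static u64 pack_7b00(unsigned char b)
{
    return (u64)7 | ((u64)b << 8);
}
```

Storing the whole union through its widest member gives the backend a single 64-bit move instead of a sequence of sub-word inserts.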
[Bug target/96201] New: x86 movsd/movsq string instructions and alignment inference
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96201

           Bug ID: 96201
          Summary: x86 movsd/movsq string instructions and alignment
                   inference
          Product: gcc
          Version: 10.1.1
           Status: UNCONFIRMED
         Severity: normal
         Priority: P3
        Component: target
         Assignee: unassigned at gcc dot gnu.org
         Reporter: michaeljclark at mac dot com
 Target Milestone: ---

Taking the time to record some observations and extract minimal test code for alignment inference and x86 string instruction selection. GCC9 and GCC10 are not generating x86 string instructions in some cases, apparently because the compiler believes the addresses are not aligned. GCC10 appears to have an additional issue whereby x86 string instructions are not selected unless the address is aligned to twice the natural alignment.

Two observations:

* (GCC9/10) integer alignment is not inferred from expressions, i.e. x & ~3
* (GCC10) __builtin_assume_aligned appears to require double the alignment

The double-alignment issue was observed with both int/movsd and long/movsq: GCC10 will only generate movsd or movsq if the stated alignment is double the type's natural alignment. The test case here is for int.
--- BEGIN SAMPLE CODE ---

void f1(long d, long s, unsigned n)
{
    int *sn = (int*)( (long)(s) & ~3l );
    int *dn = (int*)( (long)(d) & ~3l );
    int *de = (int*)( (long)(d + n) & ~3l );
    while (dn < de) *dn++ = *sn++;
}

void f2(long d, long s, unsigned n)
{
    int *sn = (int*)( (long)(s) & ~7l );
    int *dn = (int*)( (long)(d) & ~7l );
    int *de = (int*)( (long)(d + n) & ~7l );
    while (dn < de) *dn++ = *sn++;
}

void f3(long d, long s, unsigned n)
{
    int *sn = __builtin_assume_aligned( (int*)( (long)(s) & ~3l ), 4 );
    int *dn = __builtin_assume_aligned( (int*)( (long)(d) & ~3l ), 4 );
    int *de = __builtin_assume_aligned( (int*)( (long)(d + n) & ~3l ), 4 );
    while (dn < de) *dn++ = *sn++;
}

void f4(long d, long s, unsigned n)
{
    int *sn = __builtin_assume_aligned( (int*)( (long)(s) & ~3l ), 8 );
    int *dn = __builtin_assume_aligned( (int*)( (long)(d) & ~3l ), 8 );
    int *de = __builtin_assume_aligned( (int*)( (long)(d + n) & ~3l ), 8 );
    while (dn < de) *dn++ = *sn++;
}

--- END SAMPLE CODE ---

GCC9 generates this for f1, f2 and GCC10 generates this for f1, f2, f3:

.Ln:
        leaq    (%rax,%rsi), %rcx
        movq    %rax, %rdx
        addq    $4, %rax
        movl    (%rcx), %ecx
        movl    %ecx, (%rdx)
        cmpq    %rax, %rdi
        ja      .Ln

GCC9 generates this for f3, f4 and GCC10 generates this only for f4:

.Ln:
        movsl
        cmpq    %rdi, %rdx
        ja      .Ln
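The inference GCC misses in f1 and f2 is a simple arithmetic fact: clearing the low bits with a mask makes the result a multiple of the corresponding power of two. A sketch of the property (helper name is mine, for illustration only):

```c
/* Any value of the form x & ~3l has its low two bits clear, so it is
   always a multiple of 4 -- exactly the guarantee f1 establishes for
   its pointers, without needing __builtin_assume_aligned. */
int masked_is_4_aligned(long x)
{
    long masked = x & ~3l;
    return (masked & 3) == 0;   /* true for every x */
}
```

Since the masked value is provably 4-aligned, f1 should not need the explicit builtin that f3 adds, and f3's stated alignment of 4 (the type's natural alignment) should already be enough for movsd.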
[Bug target/100077] New: x86: by-value floating point array in struct - xmm regs spilling to stack
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100077

           Bug ID: 100077
          Summary: x86: by-value floating point array in struct - xmm regs
                   spilling to stack
          Product: gcc
          Version: 10.3.0
           Status: UNCONFIRMED
         Severity: normal
         Priority: P3
        Component: target
         Assignee: unassigned at gcc dot gnu.org
         Reporter: michaeljclark at mac dot com
 Target Milestone: ---

Hi, I am compiling a vec3 cross product using by-value structs with MSVC, Clang and GCC. GCC is going through memory on the stack; the operands are by-value so I can't use restrict. The same happens with -O2 and -Os. I vaguely remember seeing this a couple of times, but I searched to see if I had reported it and couldn't find a duplicate report.

Link with the 3 compilers here: https://godbolt.org/z/YWWfYxbM3

MSVC:  /O2 /fp:fast /arch:AVX2
Clang: -Os -mavx -x c
GCC:   -Os -mavx -x c

--- BEGIN EXAMPLE ---

struct vec3a { float v[3]; };
typedef struct vec3a vec3a;

vec3a vec3f_cross_0(vec3a v1, vec3a v2)
{
    vec3a dest = { v1.v[1]*v2.v[2] - v1.v[2]*v2.v[1],
                   v1.v[2]*v2.v[0] - v1.v[0]*v2.v[2],
                   v1.v[0]*v2.v[1] - v1.v[1]*v2.v[0] };
    return dest;
}

struct vec3f { float x, y, z; };
typedef struct vec3f vec3f;

vec3f vec3f_cross_1(vec3f v1, vec3f v2)
{
    vec3f dest = { v1.y*v2.z - v1.z*v2.y,
                   v1.z*v2.x - v1.x*v2.z,
                   v1.x*v2.y - v1.y*v2.x };
    return dest;
}

void vec3f_cross_2(float dest[3], float v1[3], float v2[3])
{
    dest[0] = v1[1]*v2[2] - v1[2]*v2[1];
    dest[1] = v1[2]*v2[0] - v1[0]*v2[2];
    dest[2] = v1[0]*v2[1] - v1[1]*v2[0];
}

--- END EXAMPLE ---
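For the pointer form (`vec3f_cross_2`), restrict is available even though it is not for the by-value forms. A workaround sketch, not from the report (the name `vec3f_cross_r` is mine), which tells the compiler the output does not alias the inputs:

```c
/* Same computation as vec3f_cross_2, but with restrict-qualified
   pointers so the compiler may keep operands in registers instead of
   reloading from memory after each store to dest. */
void vec3f_cross_r(float *restrict dest,
                   const float *restrict v1,
                   const float *restrict v2)
{
    dest[0] = v1[1]*v2[2] - v1[2]*v2[1];
    dest[1] = v1[2]*v2[0] - v1[0]*v2[2];
    dest[2] = v1[0]*v2[1] - v1[1]*v2[0];
}
```

This does not address the by-value cases, where the spilling reported above has no such aliasing excuse.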
[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

Michael Clark changed:

           What    |Removed |Added
----------------------------------------------------------------
                CC |        |michaeljclark at mac dot com

--- Comment #10 from Michael Clark ---

Another data point: I am seeing something similar on x86-64. The SysV x86-64 ABI specifies that _Decimal128 is passed in xmm registers, so I believe the stack stores here are redundant.

; cat > dec1.c << EOF
_Decimal128 add_d(_Decimal128 a, _Decimal128 b) { return a + b; }
EOF
; gcc -O2 -S -masm=intel dec1.c
; cat dec1.s

add_d:
.LFB0:
        .cfi_startproc
        endbr64
        sub     rsp, 40
        .cfi_def_cfa_offset 48
        movaps  XMMWORD PTR [rsp], xmm0
        movaps  XMMWORD PTR 16[rsp], xmm1
        call    __bid_addtd3@PLT
        movaps  XMMWORD PTR [rsp], xmm0
        add     rsp, 40
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc