[Bug d/94496] [D] Use aggressive optimizations in release mode
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94496 Witold Baryluk changed: What|Removed |Added CC| |witold.baryluk+gcc at gmail dot com --- Comment #1 from Witold Baryluk --- We are close to making 'in' mean 'scope const'; it is already available as a preview in dmd 2.092: https://dlang.org/changelog/2.092.0.html#preview-in
[Bug d/95120] New: [D] Incorrectly allows fqdn access to imported symbols when doing selective imports.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95120 Bug ID: 95120 Summary: [D] Incorrectly allows fqdn access to imported symbols when doing selective imports. Product: gcc Version: 10.1.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: d Assignee: ibuclaw at gdcproject dot org Reporter: witold.baryluk+gcc at gmail dot com Target Milestone: ---

Created attachment 48529 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48529&action=edit
Example of incorrectly accepted d source by gdc-10

gdc violates the D language spec: https://dlang.org/spec/module.html#selective_imports

4.7 Selective Imports

Specific symbols can be exclusively imported from a module and bound into the current namespace:

import std.stdio : writeln, foo = write;

void main()
{
    std.stdio.writeln("hello!"); // error, std is undefined
    writeln("hello!");           // ok, writeln bound into current namespace
    write("world");              // error, write is undefined
    foo("world");                // ok, calls std.stdio.write()
    fwritefln(stdout, "abc");    // error, fwritefln undefined
}

=

I found that in some odd situations gdc-10 behaves differently from dmd and ldc2. Here are the versions I used:

$ dmd --version
DMD64 D Compiler v2.092.0
Copyright (C) 1999-2020 by The D Language Foundation, All Rights Reserved written by Walter Bright
$ ldc2 --version
LDC - the LLVM D compiler (1.20.1):
  based on DMD v2.090.1 and LLVM 9.0.1
  built with LDC - the LLVM D compiler (1.20.1)
  Default target: x86_64-pc-linux-gnu
  Host CPU: znver1
$ gdc-10 --version
gdc-10 (Debian 10.1.0-1) 10.1.0
$

All on Debian testing/unstable, amd64.
badimport.d

=

void main() {
  import std.stdio;
  import std.algorithm.comparison : min;

  static struct S {
    int min_;

    // int min() { return min_; }

    void opOpAssign(string op)(const S other) if (op == "+") {
      min_ = std.algorithm.comparison.min(min_, other.min_);
    }
  }

  S x = {3};
  x += x;
}

=

(The intention was to use the fqdn here, so as not to reference the struct member function min; using `.min(min_, other.min_)` is another option, but it actually shouldn't work either, for other reasons.)

Anyway:

$ gdc-10 badimport.d # Compiles.
$
$ ldc2 badimport.d # Correct error.
badimport.d(11): Error: undefined identifier algorithm in package std, perhaps add static import std.algorithm;
badimport.d(16): Error: template instance badimport.main.S.opOpAssign!"+" error instantiating
$
$ dmd badimport.d # Correct error.
badimport.d(11): Error: undefined identifier algorithm in package std, perhaps add static import std.algorithm;
badimport.d(16): Error: template instance badimport.main.S.opOpAssign!"+" error instantiating
$

The code produced by gdc-10 does work correctly. However, it shouldn't compile at all.

From what I can see, it is some kind of interaction with preceding imports, that is, the `import std.stdio;`. Removing `import std.stdio;` makes gdc-10 correctly report the error and stop compilation.

The test case can be minimized further, and the minimized version is attached to the bug. Same behaviour:

==

import std.stdio;
import std.algorithm.comparison : min;

struct S {
  int min_;
  void add(const S other) {
    min_ = std.algorithm.comparison.min(min_, other.min_);
  }
}

void main() {
  S x = {3};
  x.add(x);
}

==
[Bug d/95120] [D] Incorrectly allows fqdn access to imported symbols when doing selective imports.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95120 --- Comment #1 from Witold Baryluk ---

Further minimized:

==

import std.stdio;
import std.algorithm.comparison : min;

int main() {
  return std.algorithm.comparison.min(3, 2);
}

==

Removing `import std.stdio;` results in the same error messages in gdc-10, dmd and ldc2.

$ gdc badimport.d
badimport.d:5:10: error: undefined identifier ‘std’
    5 |   return std.algorithm.comparison.min(3, 2);
      |          ^
$
$ ldc2 badimport.d
badimport.d(5): Error: undefined identifier std
$ dmd badimport.d
badimport.d(5): Error: undefined identifier std
$

In that case all three complain that `std` is undefined. When I use `import std.stdio;` at the start, dmd and ldc complain that `algorithm` is undefined in package `std`. I am not sure whether this is caused by something in the `std.stdio` package.
[Bug d/95120] [D] Incorrectly allows fqdn access to imported symbols when doing selective imports.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95120 --- Comment #2 from Witold Baryluk --- Created attachment 48530 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48530&action=edit Minimized example
[Bug d/94496] [D] Use aggressive optimizations in release mode
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94496 --- Comment #3 from Witold Baryluk ---

Also, about 'nothrow' and Errors: I would really welcome a compiler flag that simply terminates all threads immediately, at the throw location, whenever any Error is thrown. Errors aren't really recoverable. The only options are to catch them really high in the stack or to terminate the program (in D this will, I think, unwind the whole stack and destroy scoped structs, run a full GC collection, optionally call all class destructors, and call module destructors). But in many cases terminating the program on the spot would be preferable (_exit(2) or _Exit(2), from glibc (not the kernel), to terminate all threads via exit_group).

As for 'nothrow' itself: I believe it doesn't mean there are 'no thrown exceptions in the call tree'. I think it means there are no 'uncaught exceptions possibly thrown by a call to this function'.

```d
extern int g(int x); // not nothrow

int f(int x) nothrow {
  try {
    return g(x);
    throw new MyException("ble");
  } catch (Exception e) {
    return 1;
  }
  return 0;
}
```

https://gcc.godbolt.org/z/Y3vNQr

As for asm pure: considering there is asm volatile, wouldn't it make sense to not allow 'asm pure volatile' in the source in the first place?

Strict aliasing should be enabled for dynamic arrays, static arrays and normal pointers to other types. I.e.

```d
void f(int* x, float[] y); // x, y, y.ptr should not alias.
```

```d
void f(int* x, int[] y); // x, y and y.ptr can alias.
```

Also, how about using `restrict` automatically for transitively const types? I.e.

```d
void f(const scope int[] a, int *b); // can't alias. if b aliases a, then it is UB.
```
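For readers more familiar with C, the promise that an automatic `restrict` would encode can be sketched as follows. This is only an analogy under the assumption that D would adopt C-like semantics here; the function name `scale_into` is made up for illustration and is not part of any proposal.

```c
#include <stddef.h>

/* Sketch only: `restrict` is the caller's promise that dst and src do not
   overlap. The compiler may then keep src values in registers across the
   stores to dst. Calling this with overlapping arrays is undefined. */
void scale_into(float *restrict dst, const float *restrict src,
                size_t n, float k) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = k * src[i];
    }
}
```

Called with two distinct arrays this is well defined; the suggestion in the comment is that a `const scope` parameter in D could carry the same promise without the programmer writing it.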
[Bug d/95173] New: [D] ICE on some architecture targets when trying to use unknown attribute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95173 Bug ID: 95173 Summary: [D] ICE on some architecture targets when trying to use unknown attribute Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: d Assignee: ibuclaw at gdcproject dot org Reporter: witold.baryluk+gcc at gmail dot com Target Milestone: ---

https://explore.dgnu.org/z/bseyKQ

```
import gcc.attribute;

@attribute("foo") void f() {}
```

This code crashes the compiler when compiling for alpha-linux-gnu, sparc64-elf, hppa-linux-gnu, mmix-knuth-mmixware, pdp11-aout, lm32-elf and possibly more, with variations. (But sparc64-sun-solaris2.11, for example, works fine.)

The code compiles correctly for known attributes.

Other targets compile without a crash, with just a warning about the unknown attribute. That is the correct behaviour.
[Bug d/95174] New: [D] Incorrect compiled functions involving const fixed size arrays
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95174 Bug ID: 95174 Summary: [D] Incorrect compiled functions involving const fixed size arrays Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: d Assignee: ibuclaw at gdcproject dot org Reporter: witold.baryluk+gcc at gmail dot com Target Milestone: ---

https://explore.dgnu.org/z/LppySp

```
void f(immutable(float[64]) x, float[64] o) {
  o[] = x[] * 2.0f;
}
```

and

```
void f(immutable(float[64]) x, float[64] o) {
  foreach (i; 0 .. 64) {
    o[i] = x[i] * 2.0f;
  }
}
```

and

```
void f(immutable(float[64]) x, float[64] o) {
  o[1] = x[5] + x[7];
}
```

are each incorrectly compiled to 'nop; ret'.

It appears that DMD (v2.092) essentially does the same and does not perform any computation in the function.

LDC2 (1.20.1, based on DMD v2.090.1) does generate correct code in some cases (fully unrolled and fully vectorized in this specific case), but in others it also does nothing and simply emits 'ret' in the function.

As a bonus:

```
void f(immutable(float[4]) x, float[4] o) {
  o[2] = x[1] + x[3];
}

import std.stdio : writeln;

void main() {
  immutable(float[4]) k = [7.0f, 5.3f, 1.2f, 3.2f];
  float[4] o;
  f(k, o);
  writeln(o);
}
```

prints '[nan, nan, nan, nan]', but it should print '[nan, nan, 8.5, nan]'.

I got the same results using my local gdc version 10.1.0 on amd64.
[Bug d/95174] [D] Incorrect compiled functions involving const fixed size arrays
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95174 --- Comment #2 from Witold Baryluk ---

Doh. Of course. My bad. Sorry.

Static arrays are value types; dynamic arrays are reference types.

Changing the signature to:

```
void f(immutable(float[64]) x, float[] o);
```

solves the problem.
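The same value-versus-reference distinction can be sketched in C. Plain C arrays decay to pointers when passed, so this sketch assumes a hypothetical struct wrapper `Vec4` to get the by-value copy that a D static array parameter implies; all names here are illustrative only.

```c
#include <stddef.h>

/* Hypothetical wrapper: an array inside a struct is copied on every call,
   much like a D static array parameter (float[4]). */
typedef struct { float v[4]; } Vec4;

/* The write lands in the callee's private copy of o and is lost on return,
   which is why an optimizer may reduce the whole body to a bare `ret`. */
void fill_by_value(Vec4 x, Vec4 o) {
    o.v[2] = x.v[1] + x.v[3];
}

/* The write goes through the pointer to the caller's storage, much like
   a D slice parameter (float[]). */
void fill_by_pointer(Vec4 x, float *o, size_t n) {
    if (n > 2) {
        o[2] = x.v[1] + x.v[3];
    }
}
```

Only `fill_by_pointer` has an effect the caller can observe, which matches the fix of changing the parameter to `float[] o`.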
[Bug d/95198] New: [D] extern(C) private final functions should use 'local' linker attribute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95198 Bug ID: 95198 Summary: [D] extern(C) private final functions should use 'local' linker attribute Product: gcc Version: 10.1.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: d Assignee: ibuclaw at gdcproject dot org Reporter: witold.baryluk+gcc at gmail dot com Target Milestone: ---

```
module t1;

extern(C) private final int f() { return 5; }

pragma(msg, f.mangleof);
```

`gdc -c t1.d -o t1.o` results in an object with these symbols:

D _D2t111__moduleRefZ
D _D2t112__ModuleInfoZ
U _d_dso_registry
T f
W gdc.dso_ctor
W gdc.dso_dtor
u gdc.dso_initialized
u gdc.dso_slot
0016 t _GLOBAL__D_2t1
000b t _GLOBAL__I_2t1
U _GLOBAL_OFFSET_TABLE_
U __start_minfo
U __stop_minfo

The symbol ' T f' should instead be ' t f'.

Additionally, when using optimizations, I would expect f not to be emitted at all (unless the compiler decides not to inline it, or its address is taken and passed around), but it is still there, even with `gdc -O3`.

gcc for C does use LOCAL for static functions and variables in a translation unit. The same probably holds for C++ symbols in anonymous namespaces.

Example of linking issues:

t1.d:

```
module t1;

extern(C) private final int f() { return 5; }
```

t2.d:

```
module t2;

extern(C) private final int f() { return 10; }
```

tm.d:

```
module tm;

void main() { }
```

$ gdc -O0 -c t1.d -o t1.o
$ gdc -O0 -c t2.d -o t2.o
$ gdc t1.o t2.o tm.d -o t12
/usr/bin/ld: t2.o: in function `f':
t2.d:(.text+0x0): multiple definition of `f'; t1.o:t1.d:(.text+0x0): first defined here
collect2: error: ld returned 1 exit status
$

This code should link, like the equivalent code in C.

The use case is a local function that is passed, in some other module function or method (or a static module constructor, for example), to C libraries or other modules as a callback or, in the case of variables, as a return value.
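For comparison, here is a minimal sketch of the C pattern the report refers to: a `static` function has internal linkage, so another translation unit can define its own `f` without a clash, while the function can still be handed out as a callback. The names `callback_t`, `call_twice` and `t1_run` are made up for this sketch.

```c
/* Internal linkage: a second .c file may define its own
   `static int f(void)` and both objects still link together. */
static int f(void) { return 5; }

/* Hypothetical consumer standing in for a C library that takes a callback. */
typedef int (*callback_t)(void);

int call_twice(callback_t cb) {
    return cb() + cb();
}

/* The externally visible entry point; f itself stays local to this file,
   even though its address is passed on as a callback. */
int t1_run(void) {
    return call_twice(f);
}
```

Compiling this with `gcc -O0 -c` and running `nm` on the object would be expected to list `f` with a lowercase `t`, which is the behaviour requested for the D case.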
[Bug d/95198] [D] extern(C) private final functions should use 'local' linker attribute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95198 --- Comment #1 from Witold Baryluk --- BTW. Using: ``` extern(C) private final static int f() { ... } ``` doesn't change anything.
[Bug d/95198] [D] extern(C) private final functions should use 'local' linker attribute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95198 --- Comment #3 from Witold Baryluk ---

> The main example to demonstrate the current behaviour is correct would be the
> following:

```
extern(C) private final int f() { return 5; }

auto pubf()() { return f(); }
```

I see, I guess you are right. I don't know how one would go about fixing this so that it works correctly with existing linkers and does not break other code.

Thanks for the clarifications.
[Bug d/95250] New: [D] ICE instead of error when trying to use bad template type inside template
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95250 Bug ID: 95250 Summary: [D] ICE instead of error when trying to use bad template type inside template Product: gcc Version: 10.1.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: d Assignee: ibuclaw at gdcproject dot org Reporter: witold.baryluk+gcc at gmail dot com Target Milestone: --- https://godbolt.org/z/xWrXP5 Minimized version ``` module m; import std.traits : Unsigned; void* f(T)(T a, T b) { alias UnsignedVoid = Unsigned!(T); return cast(T)(cast(T)(cast(UnsignedVoid)(a-b) / 2)); } //static assert(is(typeof(f(null, null)) == void*)); // ICE static assert(is(typeof(f!(void*)(null, null)) == void*)); // ICE ``` The code is not correct, but on DMD v2.092.0 and LDC 1.20.1 (LLVM 9.0.1) it does say static assert is false (which is also incorrect), and doesn't crash. Instead it should say, something like this: /usr/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/std/traits.d:7163:13: error: static assert "Type void* does not have an Unsigned counterpart" 7163 | static assert(false, "Type " ~ T.stringof ~ | ^ Here is a local run, on Linux, amd64. $ gdc gdc_ice.d d21: internal compiler error: Segmentation fault 0xbd63ef crash_signal ../../src/gcc/toplev.c:328 0x7f31b746c7ff ??? 
./signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0 0x71ed1e isAggregate(Type*) ../../src/gcc/d/dmd/opover.c:161 0x71ed1e visit ../../src/gcc/d/dmd/opover.c:586 0x71e935 op_overload(Expression*, Scope*) ../../src/gcc/d/dmd/opover.c:1385 0x6d0b88 Expression::op_overload(Scope*) ../../src/gcc/d/dmd/expression.h:213 0x6d0b88 ExpressionSemanticVisitor::visit(DivExp*) ../../src/gcc/d/dmd/expressionsem.c:6891 0x6cd7b4 semantic(Expression*, Scope*) ../../src/gcc/d/dmd/expressionsem.c:8214 0x6cd7b4 unaSemantic(UnaExp*, Scope*) ../../src/gcc/d/dmd/expressionsem.c:8164 0x6cd7b4 ExpressionSemanticVisitor::visit(CastExp*) ../../src/gcc/d/dmd/expressionsem.c:4203 0x6cd7b4 semantic(Expression*, Scope*) ../../src/gcc/d/dmd/expressionsem.c:8214 0x6cd7b4 unaSemantic(UnaExp*, Scope*) ../../src/gcc/d/dmd/expressionsem.c:8164 0x6cd7b4 ExpressionSemanticVisitor::visit(CastExp*) ../../src/gcc/d/dmd/expressionsem.c:4203 0x6c5a45 semantic(Expression*, Scope*) ../../src/gcc/d/dmd/expressionsem.c:8214 0x74795f StatementSemanticVisitor::visit(ReturnStatement*) ../../src/gcc/d/dmd/statementsem.c:2757 0x74a949 semantic(Statement*, Scope*) ../../src/gcc/d/dmd/statementsem.c:3782 0x74a949 StatementSemanticVisitor::visit(CompoundStatement*) ../../src/gcc/d/dmd/statementsem.c:142 0x743755 semantic(Statement*, Scope*) ../../src/gcc/d/dmd/statementsem.c:3782 0x6e8ba9 FuncDeclaration::semantic3(Scope*) ../../src/gcc/d/dmd/func.c:1711 0x6e8ba9 FuncDeclaration::semantic3(Scope*) ../../src/gcc/d/dmd/func.c:1354 $ $ gdc -v Using built-in specs. 
COLLECT_GCC=gdc COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/10/lto-wrapper OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa:hsa OFFLOAD_TARGET_DEFAULT=1 Target: x86_64-linux-gnu Configured with: ../src/configure -v --with-pkgversion='Debian 10.1.0-1' --with-bugurl=file:///usr/share/doc/gcc-10/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-10 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none,amdgcn-amdhsa,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu Thread model: posix Supported LTO compression algorithms: zlib zstd gcc version 10.1.0 (Debian 10.1.0-1) $
[Bug c/96275] New: Vectorizer doesn't take into account bitmask condition from branch conditions.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96275 Bug ID: 96275 Summary: Vectorizer doesn't take into account bitmask condition from branch conditions. Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: witold.baryluk+gcc at gmail dot com Target Milestone: ---

https://godbolt.org/z/Gfebjd

With gcc trunk 20200720

If the loop to be vectorized is inside an if condition that checks the loop count, or is preceded by an assert / function return on such a condition, gcc seems to forget about it and does not take it into account in the optimizer / vectorizer. It still emits the backup scalar code to take care of stragglers, despite that being dead code.

#include "assert.h"

void fillArray(const unsigned int N, float * restrict a, const float* restrict b, const float* restrict c) {
  //assert(N >= 1024);
  for (int i = 0; i < (N & ~31u); i++) {
    a[i] = b[0] * c[i];
  }
}

produces:

fillArray:
        and     edi, -32
        je      .L8
        shr     edi, 3
        vbroadcastss    ymm1, DWORD PTR [rdx]
        xor     eax, eax
        mov     edx, edi
        sal     rdx, 5
.L3:
        vmulps  ymm0, ymm1, YMMWORD PTR [rcx+rax]
        vmovups YMMWORD PTR [rsi+rax], ymm0
        add     rax, 32
        cmp     rax, rdx
        jne     .L3
        vzeroupper
.L8:
        ret

but:

#include "assert.h"

void fillArray(const unsigned int N, float * restrict a, const float* restrict b, const float* restrict c) {
  //assert(N >= 1024);
  if ((N & 31u) == 0) {
    for (int i = 0; i < N; i++) {
      a[i] = b[0] * c[i];
    }
  }
}

produces this sub-optimal code:

fillArray:
        mov     eax, edi
        and     eax, 31
        jne     .L14
        test    edi, edi
        je      .L14
        lea     r8d, [rdi-1]
        vmovss  xmm1, DWORD PTR [rdx]
        cmp     r8d, 6
        jbe     .L8
        mov     edx, edi
        vbroadcastss    ymm2, xmm1
        xor     eax, eax
        shr     edx, 3
        sal     rdx, 5
.L4:
        vmulps  ymm0, ymm2, YMMWORD PTR [rcx+rax]
        vmovups YMMWORD PTR [rsi+rax], ymm0
        add     rax, 32
        cmp     rdx, rax
        jne     .L4
        mov     eax, edi
        and     eax, -8
        mov     edx, eax
        cmp     edi, eax
        je      .L16
        vzeroupper
.L3:
        mov     r9d, edi
        sub     r8d, eax
        sub     r9d, eax
        cmp     r8d, 2
        jbe     .L6
        mov     eax, eax
        vshufps xmm0, xmm1, xmm1, 0
        vmulps  xmm0, xmm0, XMMWORD PTR [rcx+rax*4]
        vmovups XMMWORD PTR [rsi+rax*4], xmm0
        mov     eax, r9d
        and     eax, -4
        add     edx, eax
        cmp     r9d, eax
        je      .L14
.L6:
        movsx   rax, edx
        vmulss  xmm0, xmm1, DWORD PTR [rcx+rax*4]
        vmovss  DWORD PTR [rsi+rax*4], xmm0
        lea     eax, [rdx+1]
        cmp     edi, eax
        jbe     .L14
        cdqe
        add     edx, 2
        vmulss  xmm0, xmm1, DWORD PTR [rcx+rax*4]
        vmovss  DWORD PTR [rsi+rax*4], xmm0
        cmp     edi, edx
        jbe     .L14
        movsx   rdx, edx
        vmulss  xmm1, xmm1, DWORD PTR [rcx+rdx*4]
        vmovss  DWORD PTR [rsi+rdx*4], xmm1
.L14:
        ret
.L16:
        vzeroupper
        ret
.L8:
        xor     edx, edx
        jmp     .L3

Adding `assert(N == (N & ~31u));` doesn't help.
[Bug c/96275] Vectorizer doesn't take into account bitmask condition from branch conditions.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96275 --- Comment #1 from Witold Baryluk ---

FYI. clang trunk 12 / 76a0c0ee6ffa9c38485776921948d8f930109674 doesn't do that either:

fillArray:                              # @fillArray
        test    dil, 31
        jne     .LBB0_8
        test    edi, edi
        je      .LBB0_8
        vmovss  xmm0, dword ptr [rdx]   # xmm0 = mem[0],zero,zero,zero
        mov     eax, edi
        cmp     edi, 32
        jae     .LBB0_4
        xor     edx, edx
        jmp     .LBB0_7
.LBB0_4:
        vbroadcastss    ymm1, xmm0
        mov     edx, eax
        xor     edi, edi
        and     edx, -32
.LBB0_5:                                # =>This Inner Loop Header: Depth=1
        vmulps  ymm2, ymm1, ymmword ptr [rcx + 4*rdi]
        vmulps  ymm3, ymm1, ymmword ptr [rcx + 4*rdi + 32]
        vmulps  ymm4, ymm1, ymmword ptr [rcx + 4*rdi + 64]
        vmulps  ymm5, ymm1, ymmword ptr [rcx + 4*rdi + 96]
        vmovups ymmword ptr [rsi + 4*rdi], ymm2
        vmovups ymmword ptr [rsi + 4*rdi + 32], ymm3
        vmovups ymmword ptr [rsi + 4*rdi + 64], ymm4
        vmovups ymmword ptr [rsi + 4*rdi + 96], ymm5
        add     rdi, 32
        cmp     rdx, rdi
        jne     .LBB0_5
        cmp     rdx, rax
        je      .LBB0_8
.LBB0_7:                                # =>This Inner Loop Header: Depth=1
        vmulss  xmm1, xmm0, dword ptr [rcx + 4*rdx]
        vmovss  dword ptr [rsi + 4*rdx], xmm1
        inc     rdx
        cmp     rax, rdx
        jne     .LBB0_7
.LBB0_8:
        vzeroupper
        ret

The main inner loop is unrolled / pipelined more aggressively, and the fallback code is simpler (it just handles the leftover elements one scalar at a time), which is unrelated. But the fallback code is still there.

Changing to different variations of the condition, like `if ((N/32)*32 == N) {`, `if ((N % 32) == 0) {`, `if ((N & ~31u) == N) {`, `if ((N >> 5) << 5 == N) {`, doesn't make any difference.

I tried with signed int and unsigned int. Same effect.

Reassigning to N (after removing constness), i.e. `N = N & ~31u` or `N = (N >> 5) << 5`, does appear to do something, but if it is inside the condition it is already too late.
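The two shapes being compared can be put side by side as a small test harness. This is only a sketch: the function names `fill_checked` and `fill_masked` are made up here, and whether a particular compiler version drops the scalar tail for either form has to be checked in the generated assembly, not assumed.

```c
/* Shape from the report: the multiple-of-32 fact lives only in the branch. */
void fill_checked(unsigned int n, float *restrict a,
                  const float *restrict b, const float *restrict c) {
    if ((n & 31u) == 0) {
        for (unsigned int i = 0; i < n; i++) {
            a[i] = b[0] * c[i];
        }
    }
}

/* Same behaviour, but the mask is repeated in the loop bound itself, as in
   the first example of the report. After the early return, (n & ~31u) == n. */
void fill_masked(unsigned int n, float *restrict a,
                 const float *restrict b, const float *restrict c) {
    if ((n & 31u) != 0) {
        return;
    }
    for (unsigned int i = 0; i < (n & ~31u); i++) {
        a[i] = b[0] * c[i];
    }
}
```

Both functions write the same values for every `n`, so any difference in the emitted code comes down to how much the optimizer learns from the loop bound versus the branch condition.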
[Bug tree-optimization/92130] Missed vectorization for iteration dependent loads and simple multiplicative accumulators
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130 --- Comment #9 from Witold Baryluk --- Indeed, passing -fno-tree-pre in the first example does make it be vectorized. In the mesh_simple.c this corresponds to ONTHEFLY_CONSTANTS being defined, but USE_LOOP_CONSTANTS being not. The SIMPLIFIED can be defined or not, it vectorizes now in both cases. Targeting -march=knm. This is with #define OCTAVES 12, a compile time constant, so compiler fully unrolls the most inner loop. Without -fno-tree-pre: 1230 : 1230: 41 57 push %r15 1232: 62 a1 7d 40 ef c0 vpxord %zmm16,%zmm16,%zmm16 1238: 49 ba 53 ec 85 1a femovabs $0xc4ceb9fe1a85ec53,%r10 123f: b9 ce c4 1242: 41 56 push %r14 1244: c5 7a 10 0d f8 0d 00vmovss 0xdf8(%rip),%xmm9# 2044 <_IO_stdin_used+0x44> 124b: 00 124c: 62 31 7c 48 28 d0 vmovaps %zmm16,%zmm10 1252: 41 55 push %r13 1254: c5 7a 10 3d ec 0d 00vmovss 0xdec(%rip),%xmm15# 2048 <_IO_stdin_used+0x48> 125b: 00 125c: 62 a1 7c 48 28 d0 vmovaps %zmm16,%zmm18 1262: 41 54 push %r12 1264: c5 7a 10 35 e0 0d 00vmovss 0xde0(%rip),%xmm14# 204c <_IO_stdin_used+0x4c> 126b: 00 126c: 49 b9 cd 8c 55 ed d7movabs $0xff51afd7ed558ccd,%r9 1273: af 51 ff 1276: 55 push %rbp 1277: c5 7a 10 2d d1 0d 00vmovss 0xdd1(%rip),%xmm13# 2050 <_IO_stdin_used+0x50> 127e: 00 127f: 49 be 68 66 ac 6a bfmovabs $0xfa8d7ebf6aac6668,%r14 1286: 7e 8d fa 1289: 53 push %rbx 128a: c5 7a 10 25 c2 0d 00vmovss 0xdc2(%rip),%xmm12# 2054 <_IO_stdin_used+0x54> 1291: 00 1292: 48 89 7c 24 f8 mov%rdi,-0x8(%rsp) 1297: c7 44 24 f0 00 00 00movl $0x0,-0x10(%rsp) 129e: 00 129f: c7 44 24 f4 00 00 00movl $0x0,-0xc(%rsp) 12a6: 00 12a7: c5 7a 10 1d a9 0d 00vmovss 0xda9(%rip),%xmm11# 2058 <_IO_stdin_used+0x58> 12ae: 00 12af: 62 e1 7e 08 10 0d a3vmovss 0xda3(%rip),%xmm17# 205c <_IO_stdin_used+0x5c> 12b6: 0d 00 00 12b9: 0f 1f 80 00 00 00 00nopl 0x0(%rax) 12c0: 48 8b 6c 24 f8 mov-0x8(%rsp),%rbp 12c5: 31 f6 xor%esi,%esi 12c7: 31 db xor%ebx,%ebx 12c9: 62 31 7c 48 28 c2 vmovaps %zmm18,%zmm8 12cf: 90 nop 12d0: 8b 54 24 f0 
mov-0x10(%rsp),%edx 12d4: 45 31 e4xor%r12d,%r12d 12d7: 62 b1 7c 48 28 f8 vmovaps %zmm16,%zmm7 12dd: 62 c1 7c 48 28 d9 vmovaps %zmm9,%zmm19 12e3: c5 32 11 cc vmovss %xmm9,%xmm9,%xmm4 12e7: eb 26 jmp130f 12e9: 0f 1f 80 00 00 00 00nopl 0x0(%rax) 12f0: c5 ba 59 c4 vmulss %xmm4,%xmm8,%xmm0 12f4: 62 f3 7d 08 0a c0 09vrndscaless $0x9,%xmm0,%xmm0,%xmm0 12fb: c5 fa 2c f0 vcvttss2si %xmm0,%esi 12ff: c4 c1 5a 59 c2 vmulss %xmm10,%xmm4,%xmm0 1304: 62 f3 7d 08 0a c0 09vrndscaless $0x9,%xmm0,%xmm0,%xmm0 130b: c5 fa 2c d0 vcvttss2si %xmm0,%edx 130f: 4c 89 e1mov%r12,%rcx 1312: 62 c1 7c 48 28 e8 vmovaps %zmm8,%zmm21 1318: 48 c1 e9 21 shr$0x21,%rcx 131c: 62 e1 7c 48 28 e4 vmovaps %zmm4,%zmm20 1322: c5 d2 2a ea vcvtsi2ss %edx,%xmm5,%xmm5 1326: 4c 31 e1xor%r12,%rcx 1329: 49 0f af ca imul %r10,%rcx 132d: 48 63 d2movslq %edx,%rdx 1330: c5 e2 2a de vcvtsi2ss %esi,%xmm3,%xmm3 1334: 4f 8d 24 0c lea(%r12,%r9,1),%r12 1338: 48 69 d2 53 42 41 4eimul $0x4e414253,%rdx,%rdx 133f: 62 c2 55 08 9b e2 vfmsub132ss %xmm10,%xmm5,%xmm20 1345: c4 c1 52 58 e9 vaddss %xmm9,%xmm5,%xmm5 134a: 48 8d 01lea(%rcx),%rax 134d: 48 c1 e8 21 shr$0x21,%rax 1351: 62 e2 65 08 9b ec vfmsub132ss %xmm4,%xmm3,%xmm21 1357: 48 31 c1xor%rax,%rcx 135a: 4c 8d ba 53 42 41 4elea0x4e414253(%rdx),%r15 1361: 48 89 cfmov%rcx,%rdi 1364: 48 89 c8mov%rcx,%rax 1367: 48 81 f7 70 46 ab 58xor$0x58ab4670,%rdi 136e: c4 c1 62 58 d9 vaddss %xmm9,%xmm3,%xmm3 1373: 48 c1 e8 21 shr$0x21,%
[Bug c/83584] "ISO C forbids conversion of object pointer to function pointer type" -- no, not really
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83584 Witold Baryluk changed: What|Removed |Added CC| |witold.baryluk+gcc at gmail dot com --- Comment #19 from Witold Baryluk ---

This is still happening, even when using -std=c11 with gcc 9.2.1.

C11 does state in annex J.5.7: http://port70.net/~nsz/c/c11/n1570.html#J.5.7p1

"""
J.5.7 Function pointer casts

1 A pointer to an object or to void may be cast to a pointer to a function, allowing data to be invoked as a function (6.5.4).
"""

I am not sure how else I am supposed to use `dlsym(3)`.

Maybe it would help if there were a version of dlsym that, instead of returning (void*), returned (void(*)()) or (void (*)(void)).

Maybe it is a POSIX bug then?

This issue, however, is not a duplicate of bug 11234. And the error message is incorrect anyway.

Sorry if this was mentioned before.
[Bug c/83584] "ISO C forbids conversion of object pointer to function pointer type" -- no, not really
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83584 --- Comment #20 from Witold Baryluk ---

FYI.

http://austingroupbugs.net/view.php?id=74#c205 says:

Note that conversion from a void * pointer to a function pointer as in:

fptr = (int (*)(int))dlsym(handle, "my_function");

is not defined by the ISO C Standard. This standard requires this conversion to work correctly on conforming implementations.

This is now published as IEEE Std 1003.1-2017, aka POSIX.1-2017: https://pubs.opengroup.org/onlinepubs/9699919799/functions/dlsym.html

The POSIX standard is free to do so.
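Two ways of getting from a `void *` to a callable pointer can be sketched as below. To keep the sketch free of any dynamic-loading setup it does not call `dlsym` at all: `lookup_symbol` is a hypothetical stand-in that returns a local function as `void *`, and `get_by_cast` / `get_by_copy` are made-up names. Both idioms assume a platform where function pointers and `void *` have the same size and representation, which is what POSIX systems provide.

```c
#include <string.h>

static int my_function(int x) { return x + 1; }

/* Hypothetical stand-in for dlsym(): hands back the symbol as void *.
   The function-pointer-to-void* cast relies on the same POSIX guarantee. */
void *lookup_symbol(const char *name) {
    if (strcmp(name, "my_function") == 0) {
        return (void *)my_function;
    }
    return NULL;
}

typedef int (*int_fn)(int);

/* Idiom 1: the direct cast from the POSIX text. This is the form that
   triggers the pedantic "ISO C forbids conversion" diagnostic. */
int_fn get_by_cast(const char *name) {
    return (int_fn)lookup_symbol(name);
}

/* Idiom 2: copy the pointer representation, so no cast between object and
   function pointer types appears in the source. */
int_fn get_by_copy(const char *name) {
    void *p = lookup_symbol(name);
    int_fn fp;
    memcpy(&fp, &p, sizeof fp);
    return fp;
}
```

The second form only sidesteps the diagnostic; it depends on the same platform guarantee as the cast.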
[Bug tree-optimization/63945] Missing vectorization optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63945 Witold Baryluk changed: What|Removed |Added CC||witold.baryluk+gcc at gmail dot co ||m --- Comment #1 from Witold Baryluk --- It does vectorize for me on gcc 9.2.1: -march=skylake-avx512 aa.cpp:34:29: optimized: loop vectorized using 32 byte vectors aa.cpp:25:27: optimized: loop vectorized using 32 byte vectors if (val<100.) 1279: c5 fb 10 0b vmovsd (%rbx),%xmm1 127d: c5 fb 10 05 8b 0d 00vmovsd 0xd8b(%rip),%xmm0# 2010 <_IO_stdin_used+0x10> 1284: 00 1285: c5 f9 2f c1 vcomisd %xmm1,%xmm0 1289: 76 2b jbe12b6 <_ZN4TEST4testEv+0xc6> 128b: c4 e2 7d 19 c9 vbroadcastsd %xmm1,%ymm1 1290: 31 c0 xor%eax,%eax 1292: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) c[i] = val*a[i]+b[i]; 1298: c4 c1 7d 10 04 04 vmovupd (%r12,%rax,1),%ymm0 129e: c4 c2 f5 a8 44 05 00vfmadd213pd 0x0(%r13,%rax,1),%ymm1,%ymm0 12a5: c5 fd 11 04 07 vmovupd %ymm0,(%rdi,%rax,1) for (unsigned int i=0; i ::operator delete(__p); 12b6: c5 f8 77vzeroupper Similarly: -march=knm aa.cpp:34:29: optimized: loop vectorized using 64 byte vectors aa.cpp:25:27: optimized: loop vectorized using 64 byte vectors if (val<100.) 15bc: 31 c0 xor%eax,%eax 15be: 66 90 xchg %ax,%ax c[i] = val*a[i]+b[i]; 15c0: 62 f1 fd 48 28 04 01vmovapd (%rcx,%rax,1),%zmm0 15c7: 62 f2 ed 48 a8 04 06vfmadd213pd (%rsi,%rax,1),%zmm2,%zmm0 15ce: 62 d1 fd 48 11 04 01vmovupd %zmm0,(%r9,%rax,1) for (unsigned int i=0; i (plus a lot of handling for unaligned stack). -march=znver2 aa.cpp:34:29: optimized: loop vectorized using 32 byte vectors aa.cpp:25:27: optimized: loop vectorized using 32 byte vectors if (val<100.) 
1279: c5 fb 10 0b vmovsd (%rbx),%xmm1 127d: c5 fb 10 05 8b 0d 00vmovsd 0xd8b(%rip),%xmm0# 2010 <_IO_stdin_used+0x10> 1284: 00 1285: c5 f9 2f c1 vcomisd %xmm1,%xmm0 1289: 76 33 jbe12be <_ZN4TEST4testEv+0xce> 128b: c4 e2 7d 19 c9 vbroadcastsd %xmm1,%ymm1 1290: 31 c0 xor%eax,%eax 1292: 66 66 2e 0f 1f 84 00data16 nopw %cs:0x0(%rax,%rax,1) 1299: 00 00 00 00 129d: 0f 1f 00nopl (%rax) c[i] = val*a[i]+b[i]; 12a0: c4 c1 7d 10 04 04 vmovupd (%r12,%rax,1),%ymm0 12a6: c4 c2 f5 a8 44 05 00vfmadd213pd 0x0(%r13,%rax,1),%ymm1,%ymm0 12ad: c5 fd 11 04 07 vmovupd %ymm0,(%rdi,%rax,1) for (unsigned int i=0; i -march=core2 aa.cpp:34:29: optimized: loop vectorized using 16 byte vectors aa.cpp:25:27: optimized: loop vectorized using 16 byte vectors if (val<100.) 1276: f2 0f 10 13 movsd (%rbx),%xmm2 127a: f2 0f 10 05 8e 0d 00movsd 0xd8e(%rip),%xmm0# 2010 <_IO_stdin_used+0x10> 1281: 00 1282: 66 0f 2f c2 comisd %xmm2,%xmm0 1286: 76 40 jbe12c8 <_ZN4TEST4testEv+0xd8> 1288: 31 c0 xor%eax,%eax 128a: 66 0f 14 d2 unpcklpd %xmm2,%xmm2 128e: 66 90 xchg %ax,%ax c[i] = val*a[i]+b[i]; 1290: f3 0f 7e 44 05 00 movq 0x0(%rbp,%rax,1),%xmm0 1296: f3 41 0f 7e 0c 04 movq (%r12,%rax,1),%xmm1 129c: 66 0f 16 44 05 08 movhpd 0x8(%rbp,%rax,1),%xmm0 12a2: 66 0f 59 c2 mulpd %xmm2,%xmm0 12a6: 66 41 0f 16 4c 04 08movhpd 0x8(%r12,%rax,1),%xmm1 12ad: 66 0f 58 c1 addpd %xmm1,%xmm0 12b1: 66 0f 13 04 07 movlpd %xmm0,(%rdi,%rax,1) 12b6: 66 0f 17 44 07 08 movhpd %xmm0,0x8(%rdi,%rax,1) for (unsigned int i=0; i Looks all pretty optimally vectorized to me. The code can be made even better, if you ensure proper alignment of std::vector arrrays, which they might not be at the moment.
[Bug tree-optimization/92130] New: Missed vectorization for iteration dependent loads and simple multiplicative accumulators
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130 Bug ID: 92130 Summary: Missed vectorization for iteration dependent loads and simple multiplicative accumulators Product: gcc Version: 9.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: witold.baryluk+gcc at gmail dot com Target Milestone: ---

Created attachment 47051 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47051&action=edit
Perlin2D noise mesh generation

I have a fairly complex multi-level loop spread across many functions. All of it can be vectorized, but under certain scenarios gcc 9.2.1 does not vectorize it.

I am attaching somewhat simplified code with a few defines inside to play with. The configuration enabled by default presents the biggest challenge to gcc, even though I am able to vectorize it manually.

I tested this on SSE2, AVX2 (cascadelake and znver2), AVX512 (-march=knm and -march=skylake-avx512) and ARM SVE, all with the same effect.

I am using associative math and other flags mentioned at the top of the source file.

The high-level overview is like this:

input: A, F, W, maxO, sufficiently aligned d.

foreach y:
  foreach x:
    float v = 0.0
    float a = 1.0
    float f = 1.0
    foreach o in [0, maxO):
      v += a * g(f * x, f * y, o, h(o, p))
      a *= A
      f *= F
    d[y*W + x] = v

where both g and h are pure functions (relatively complex, though) with no control flow or data-dependent flow.

In some situations, if a and f are replaced by a precomputed table of coefficients for every o, which is then used as v += a[o] * g(f[o] * x, f[o] * y, h(o, p)), the loop does vectorize, but not always.

h(o, p) could also be precomputed, but I didn't bother, as it does not appear to have any bad effect on the vectorizer.

The vectorizer should vectorize along the 'foreach x' loop and compute multiple x values, one per lane, completely independently.

It is true that when updating a and f, the value needs to be duplicated into each lane, but that can be done by computing it as a scalar and then broadcasting, or by repeating the same constant updates in each lane.
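The difference between the running accumulators and the precomputed-table variant can be sketched in one dimension as follows. This is a simplified illustration, not the attached code: `g` is a cheap polynomial stand-in for the report's pure function (so no math library is needed), `MAX_O` plays the role of maxO, and all function names are made up.

```c
#define MAX_O 8

/* Stand-in for the report's pure function g: no control flow, no state. */
static float g(float fx, int o) {
    return fx * fx * 0.5f + (float)o;
}

/* Running accumulators: a and f are carried from one octave to the next. */
float noise_running(float x, float A, float F) {
    float v = 0.0f;
    float a = 1.0f;
    float f = 1.0f;
    for (int o = 0; o < MAX_O; o++) {
        v += a * g(f * x, o);
        a *= A;
        f *= F;
    }
    return v;
}

/* Table variant, step 1: compute the per-octave coefficients once. */
void make_tables(float A, float F, float *amp, float *freq) {
    float a = 1.0f;
    float f = 1.0f;
    for (int o = 0; o < MAX_O; o++) {
        amp[o] = a;
        freq[o] = f;
        a *= A;
        f *= F;
    }
}

/* Table variant, step 2: the inner loop only reads amp[o] and freq[o],
   so nothing but v is carried between iterations. */
float noise_tabled(float x, const float *amp, const float *freq) {
    float v = 0.0f;
    for (int o = 0; o < MAX_O; o++) {
        v += amp[o] * g(freq[o] * x, o);
    }
    return v;
}
```

Both variants compute the same sum; the table form just moves the `a *= A` and `f *= F` updates out of the loop that the vectorizer has to analyse for each x.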
[Bug tree-optimization/92130] Missed vectorization for iteration dependent loads and simple multiplicative accumulators
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130 --- Comment #1 from Witold Baryluk --- Created attachment 47052 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47052&action=edit Minimized test case
[Bug tree-optimization/92130] Missed vectorization for iteration dependent loads and simple multiplicative accumulators
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130 --- Comment #2 from Witold Baryluk --- Added a minimized test case that has only one outer loop, and f and h are removed for simple inlined replacement. Example diagnostic: $ gcc -std=c17 -march=knm -O3 -ffast-math -fassociative-math -ftree-vectorizer-verbose=2 -fopt-info-vec-all -ggdb -Wall mesh_minimal.c -o mesh_minimal_knm -lm mesh_minimal.c:34:3: missed: couldn't vectorize loop mesh_minimal.c:34:3: missed: not vectorized: latch block not empty. mesh_minimal.c:33:13: note: vectorized 0 loops in function.
[Bug tree-optimization/92130] Missed vectorization for iteration dependent loads and simple multiplicative accumulators
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130 --- Comment #3 from Witold Baryluk --- If only the frequency is updated in the inner loop:

    frequency *= 2.131f;

function fill_data is vectorized:

    mesh_minimal.c:34:3: optimized: loop vectorized using 64 byte vectors
    mesh_minimal.c:33:13: note: vectorized 1 loops in function.

However, if amplitude is updated in the inner loop:

    amplitude *= 0.781f;

function fill_data is NOT vectorized:

    mesh_minimal.c:34:3: missed: couldn't vectorize loop
    mesh_minimal.c:34:3: missed: not vectorized: latch block not empty.
    mesh_minimal.c:33:13: note: vectorized 0 loops in function.

Here for reference:

    /* line 20 */ static float perlin1d(float x) {
      float accum = 0.0;
      float frequency = 1.0;
      float amplitude = 1.0;
      for (int i = 0; i < 8; i++) {
        accum += amplitude * (sinf(x * frequency + (float)i));
        frequency *= 2.131f;
        amplitude *= 0.781f;
      }
      return accum;
    }

    __attribute__((noinline))
    /* line 33 */ static void fill_data(int width, float * __restrict__ height_data, float scale) {
    /* line 34 */   for (int i = 0; i < width; i++) {
        height_data[i] = perlin1d(i);
      }
    }
[Bug tree-optimization/92130] Missed vectorization for iteration dependent loads and simple multiplicative accumulators
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130 --- Comment #4 from Witold Baryluk --- If I reduce the minimized test case even further:

only frequency update: VECTORIZED:

    static float perlin1d(float x) {
      float accum = 0.0f;
      float amplitude = 1.0f;
      float frequency = 1.0f;
      for (int i = 0; i < 8; i++) {
        accum += amplitude * sinf(x * frequency);
        frequency *= 2.131f;
      }
      return accum;
    }

    __attribute__((noinline))
    static void fill_data(int width, float * __restrict__ height_data, float scale) {
      for (int i = 0; i < width; i++) {
        height_data[i] = perlin1d(i);
      }
    }

only amplitude update: VECTORIZED:

    static float perlin1d(float x) {
      float accum = 0.0f;
      float amplitude = 1.0f;
      float frequency = 1.0f;
      for (int i = 0; i < 8; i++) {
        accum += amplitude * sinf(x * frequency);
        amplitude *= 0.781f;
      }
      return accum;
    }

    __attribute__((noinline))
    static void fill_data(int width, float * __restrict__ height_data, float scale) {
      for (int i = 0; i < width; i++) {
        height_data[i] = perlin1d(i);
      }
    }

both frequency and amplitude update: NOT VECTORIZED:

    static float perlin1d(float x) {
      float accum = 0.0f;
      float amplitude = 1.0f;
      float frequency = 1.0f;
      for (int i = 0; i < 8; i++) {
        accum += amplitude * sinf(x * frequency);
        amplitude *= 0.781f;
        frequency *= 2.131f;
      }
      return accum;
    }

    __attribute__((noinline))
    static void fill_data(int width, float * __restrict__ height_data, float scale) {
      for (int i = 0; i < width; i++) {
        height_data[i] = perlin1d(i);
      }
    }
[Bug tree-optimization/92130] Missed vectorization for iteration dependent loads and simple multiplicative accumulators
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130 --- Comment #5 from Witold Baryluk --- As a bonus:

    static float perlin1d(float x) {
      float accum = 0.0f;
      for (int i = 0; i < 8; i++) {
        accum += powf(0.781f, i) * sinf(x * powf(2.131f, i));
      }
      return accum;
    }

claims to be vectorized, but really isn't: it still contains calls to sinf and expf_finite that are neither inlined nor lowered.
[Bug tree-optimization/92130] Missed vectorization for iteration dependent loads and simple multiplicative accumulators
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130 --- Comment #6 from Witold Baryluk --- I also tested clang with LLVM 10~svn374655 and it does vectorize the loop properly, even when both frequency and amplitude variables are updated every loop. It still doesn't inline calls to sinf, even if I set -fno-math-errno and other things from -ffast-math. My random guess is that it is because there is no hardware support for vectorized sinf, and there is no vectorized variant of sinf software implementation either. If I provide my own version of sinf using simple Taylor expansion, clang fully vectorized the code: 401320: 62 e1 7d 58 fe 3d 56vpaddd 0xd56(%rip){1to16},%zmm0,%zmm23 # 402080 <_IO_stdin_used+0x80> 401327: 0d 00 00 40132a: 62 61 7c 48 5b c0 vcvtdq2ps %zmm0,%zmm24 401330: 62 a1 7c 48 5b ff vcvtdq2ps %zmm23,%zmm23 401336: 62 f1 7c 48 10 4c 24vmovups 0x140(%rsp),%zmm1 40133d: 05 40133e: 62 61 3c 40 59 d1 vmulps %zmm1,%zmm24,%zmm26 401344: 62 61 44 40 59 f9 vmulps %zmm1,%zmm23,%zmm31 40134a: 62 f1 7c 48 10 4c 24vmovups 0x100(%rsp),%zmm1 401351: 04 401352: 62 61 3c 40 59 d9 vmulps %zmm1,%zmm24,%zmm27 401358: 62 f1 44 40 59 c9 vmulps %zmm1,%zmm23,%zmm1 40135e: 62 01 2c 40 59 ca vmulps %zmm26,%zmm26,%zmm25 401364: 62 f1 7c 48 10 54 24vmovups 0x80(%rsp),%zmm2 40136b: 02 40136c: 62 61 3c 40 59 e2 vmulps %zmm2,%zmm24,%zmm28 401372: 62 f1 44 40 59 d2 vmulps %zmm2,%zmm23,%zmm2 401378: 62 02 25 40 ac ca vfnmadd213ps %zmm26,%zmm27,%zmm25 40137e: 62 f1 7c 48 10 5c 24vmovups 0x40(%rsp),%zmm3 401385: 01 401386: 62 61 3c 40 59 eb vmulps %zmm3,%zmm24,%zmm29 40138c: 62 f1 44 40 59 db vmulps %zmm3,%zmm23,%zmm3 401392: 62 01 1c 40 59 d4 vmulps %zmm28,%zmm28,%zmm26 401398: 62 01 04 40 59 df vmulps %zmm31,%zmm31,%zmm27 40139e: 62 02 15 40 ac d4 vfnmadd213ps %zmm28,%zmm29,%zmm26 4013a4: 62 f1 7c 48 10 6c 24vmovups -0x40(%rsp),%zmm5 4013ab: ff 4013ac: 62 f1 3c 40 59 e5 vmulps %zmm5,%zmm24,%zmm4 4013b2: 62 f1 44 40 59 ed vmulps %zmm5,%zmm23,%zmm5 4013b8: 62 61 6c 48 59 e2 vmulps 
%zmm2,%zmm2,%zmm28 4013be: 62 f1 7c 48 10 7c 24vmovups -0x80(%rsp),%zmm7 4013c5: fe 4013c6: 62 f1 3c 40 59 f7 vmulps %zmm7,%zmm24,%zmm6 4013cc: 62 f1 44 40 59 ff vmulps %zmm7,%zmm23,%zmm7 4013d2: 62 61 5c 48 59 ec vmulps %zmm4,%zmm4,%zmm29 4013d8: 62 61 54 48 59 f5 vmulps %zmm5,%zmm5,%zmm30 4013de: 62 62 4d 48 ac ec vfnmadd213ps %zmm4,%zmm6,%zmm29 4013e4: 62 d1 3c 40 59 e3 vmulps %zmm11,%zmm24,%zmm4 4013ea: 62 d1 44 40 59 f3 vmulps %zmm11,%zmm23,%zmm6 4013f0: 62 02 75 48 ac df vfnmadd213ps %zmm31,%zmm1,%zmm27 4013f6: 62 d1 3c 40 59 cc vmulps %zmm12,%zmm24,%zmm1 4013fc: 62 41 44 40 59 fc vmulps %zmm12,%zmm23,%zmm31 401402: 62 71 5c 48 59 c4 vmulps %zmm4,%zmm4,%zmm8 401408: 62 62 65 48 ac e2 vfnmadd213ps %zmm2,%zmm3,%zmm28 40140e: 62 72 75 48 ac c4 vfnmadd213ps %zmm4,%zmm1,%zmm8 401414: 62 d1 3c 40 59 ce vmulps %zmm14,%zmm24,%zmm1 40141a: 62 d1 44 40 59 d6 vmulps %zmm14,%zmm23,%zmm2 401420: 62 62 45 48 ac f5 vfnmadd213ps %zmm5,%zmm7,%zmm30 401426: 62 d1 3c 40 59 df vmulps %zmm15,%zmm24,%zmm3 40142c: 62 d1 44 40 59 e7 vmulps %zmm15,%zmm23,%zmm4 401432: 62 f1 74 48 59 e9 vmulps %zmm1,%zmm1,%zmm5 401438: 62 f1 4c 48 59 fe vmulps %zmm6,%zmm6,%zmm7 40143e: 62 71 6c 48 59 ca vmulps %zmm2,%zmm2,%zmm9 401444: 62 f2 65 48 ac e9 vfnmadd213ps %zmm1,%zmm3,%zmm5 40144a: 62 b1 3c 40 59 c9 vmulps %zmm17,%zmm24,%zmm1 401450: 62 f2 05 40 ac fe vfnmadd213ps %zmm6,%zmm31,%zmm7 401456: 62 b1 44 40 59 d9 vmulps %zmm17,%zmm23,%zmm3 40145c: 62 b1 3c 40 59 f2 vmulps %zmm18,%zmm24,%zmm6 401462: 62 21 44 40 59 fa vmulps %zmm18,%zmm23,%zmm31 401468: 62 72 5d 48 ac ca vfnmadd213ps %zmm2,%zmm4,%zmm9 40146e: 62 f1 74 48 59 d1 vmulps %zmm1,%zmm1,%zmm2 401474: 62 f1 64 48 59 e3 vmulps %zmm3,%zmm3,%zmm4 40147a: 62 f2 4d 48 ac d1 vfnmadd213ps %zmm1,%zmm6,%zmm2 401480: 62 f2 05 40 ac e3 vfnmadd213ps %zmm3,%zmm31,%zmm4 401486: 62 b1 3c 40 59 cc vmulps %zmm20,%zmm24,%zmm1 40148c: 62 b1 3c 40 59 dd vmulps %zmm21,%zmm24,%zmm3 401492: 62 f1 74 48 59 f1 vmulps %zmm1,%zmm1,%zmm6 401498: 62 21 44 40 59 fc 
vmulps %zmm20,%zmm23,%zmm31 40149e: 62 f2 65 48 ac f1 vfnmadd213ps %zmm1,%zmm3,%z
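For reference, a hand-written sinf replacement of the kind mentioned in this comment could look like the sketch below. The actual replacement used in the experiment is not shown in the report, so `taylor_sinf`, `perlin1d_taylor` and `fill_data_taylor` are hypothetical names, and the series length is an assumption. There is no range reduction, so the results are only accurate for small arguments; the point is only that the body is branch-free arithmetic that a vectorizer can handle without a library call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical replacement for sinf: truncated Taylor series around 0,
   x - x^3/3! + x^5/5! - x^7/7!, evaluated in Horner form.
   No range reduction, so it is only accurate for small |x|. */
static inline float taylor_sinf(float x)
{
    const float x2 = x * x;
    return x * (1.0f + x2 * (-1.0f / 6.0f
                     + x2 * (1.0f / 120.0f
                     + x2 * (-1.0f / 5040.0f))));
}

/* Same shape as the report's perlin1d, with both accumulators updated. */
static float perlin1d_taylor(float x)
{
    float accum = 0.0f, amplitude = 1.0f, frequency = 1.0f;
    for (int i = 0; i < 8; i++) {
        accum += amplitude * taylor_sinf(x * frequency);
        amplitude *= 0.781f;
        frequency *= 2.131f;
    }
    return accum;
}

static void fill_data_taylor(int width, float *restrict height_data)
{
    assert(width <= 0 || height_data != NULL);
    for (int i = 0; i < width; i++)
        height_data[i] = perlin1d_taylor((float)i);
}
```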
[Bug tree-optimization/92130] Missed vectorization for iteration dependent loads and simple multiplicative accumulators
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130 --- Comment #7 from Witold Baryluk --- Online examples: https://gcc.godbolt.org/z/Nyjty3
[Bug c/100257] New: poor codegen with vcvtph2ps / stride of 6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100257 Bug ID: 100257 Summary: poor codegen with vcvtph2ps / stride of 6 Product: gcc Version: 12.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: witold.baryluk+gcc at gmail dot com Target Milestone: --- gcc (Compiler-Explorer-Build) 12.0.0 20210424 (experimental) https://godbolt.org/z/n6ooMdnz8

This C code:

```
#include <stdint.h>
#include <string.h>
#include <immintrin.h>

struct float3 {
  float f1;
  float f2;
  float f3;
};

struct util_format_r16g16b16_float {
  uint16_t r;
  uint16_t g;
  uint16_t b;
};

static inline struct float3
_mesa_half3_to_float3(uint16_t val_0, uint16_t val_1, uint16_t val_2)
{
#if defined(__F16C__)
  //const __m128i in = {val_0, val_1, val_2};
  //__m128 out;
  //__asm volatile("vcvtph2ps %1, %0" : "=v"(out) : "v"(in));
  const __m128i in = _mm_setr_epi16(val_0, val_1, val_2, 0, 0, 0, 0, 0);
  const __m128 out = _mm_cvtph_ps(in);
  const struct float3 r = {out[0], out[1], out[2]};
  return r;
#endif
}

void
util_format_r16g16b16_float_unpack_rgba_float(void *restrict dst_row,
                                              const uint8_t *restrict src,
                                              unsigned width)
{
  float *dst = dst_row;
  for (unsigned x = 0; x < width; x += 1) {
    const struct util_format_r16g16b16_float pixel;
    memcpy(&pixel, src, sizeof pixel);
    struct float3 r = _mesa_half3_to_float3(pixel.r, pixel.g, pixel.b);
    dst[0] = r.f1; /* r */
    dst[1] = r.f2; /* g */
    dst[2] = r.f3; /* b */
    dst[3] = 1; /* a */
    src += 6;
    dst += 4;
  }
}
```

This is compiled "poorly" by gcc, and even worse on i386 (with -mf16c enabled) when using -fPIE.
Example: gcc -O3 -m32 -march=znver2 -mfpmath=sse -fPIE util_format_r16g16b16_float_unpack_rgba_float: pushebp pushedi pushesi pushebx sub esp, 28 mov ecx, DWORD PTR 56[esp] mov edx, DWORD PTR 48[esp] call__x86.get_pc_thunk.ax add eax, OFFSET FLAT:_GLOBAL_OFFSET_TABLE_ mov ebx, DWORD PTR 52[esp] testecx, ecx je .L8 vmovss xmm3, DWORD PTR .LC0@GOTOFF[eax] xor esi, esi xor ebp, ebp vpxor xmm2, xmm2, xmm2 .L3: mov eax, DWORD PTR [ebx] vmovss DWORD PTR 12[edx], xmm3 add ebx, 6 add edx, 16 inc esi mov ecx, eax vmovd xmm0, eax shr ecx, 16 mov edi, ecx movzx ecx, WORD PTR -2[ebx] vpinsrw xmm0, xmm0, edi, 1 vmovd xmm1, ecx vpinsrw xmm1, xmm1, ebp, 1 vpunpckldq xmm0, xmm0, xmm1 vpunpcklqdq xmm0, xmm0, xmm2 vcvtph2ps xmm0, xmm0 vmovss DWORD PTR -16[edx], xmm0 vextractps DWORD PTR -12[edx], xmm0, 1 vextractps DWORD PTR -8[edx], xmm0, 2 cmp DWORD PTR 56[esp], esi jne .L3 .L8: add esp, 28 pop ebx pop esi pop edi pop ebp ret .LC0: .long 1065353216 __x86.get_pc_thunk.ax: mov eax, DWORD PTR [esp] ret clang: util_format_r16g16b16_float_unpack_rgba_float: # @util_format_r16g16b16_float_unpack_rgba_float mov eax, dword ptr [esp + 12] testeax, eax je .LBB0_3 mov ecx, dword ptr [esp + 8] mov edx, dword ptr [esp + 4] .LBB0_2:# =>This Inner Loop Header: Depth=1 vmovd xmm0, dword ptr [ecx] # xmm0 = mem[0],zero,zero,zero vpinsrw xmm0, xmm0, word ptr [ecx + 4], 2 add ecx, 6 vcvtph2ps xmm0, xmm0 vmovss dword ptr [edx], xmm0 vextractps dword ptr [edx + 4], xmm0, 1 vextractps dword ptr [edx + 8], xmm0, 2 mov dword ptr [edx + 12], 1065353216 add edx, 16 dec eax jne .LBB0_2 .LBB0_3: ret clang code is essentially optimal. The issue persist if I use `vcvtph2ps` directly via asm, or via intrinsics. The issue might be the src stride, of 6, instead 8, that is confusing gcc. 
Additionally, the constant 1065353216 (which is weird, I would expect it to be 0) is stored in the data section instead of being inlined as an immediate. This actually makes the code larger, requires extra pointer trickery in PIE mode, and on -m32 even requires calling an extra function. Even without -fPIE, the main loop has poor codegen on x86-64 / amd64 compared to clang or to what I would consider good code.

gcc -m64 -O3 -march=native

    util_format_r16g16b16_float_unpack_rgba_float:
        test edx, edx
        je .L8
        mov edx, edx
        sal rdx, 4
        vmovss xmm3,
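A note on the constant itself: the loop stores `dst[3] = 1` into a float, and 1065353216 is 0x3F800000, the IEEE-754 single-precision bit pattern of 1.0f, so a nonzero value is expected here. The check below is a small sketch that assumes `float` is IEEE-754 binary32; `float_bits` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the raw bit pattern of a float (assumes IEEE-754 binary32). */
static uint32_t float_bits(float f)
{
    uint32_t u;
    static_assert(sizeof u == sizeof f, "float must be 32 bits wide");
    memcpy(&u, &f, sizeof u); /* well-defined way to reinterpret the bytes */
    return u;
}
```

With this helper, `float_bits(1.0f)` evaluates to 1065353216, which matches both the `.long 1065353216` emitted by gcc and the immediate stored by clang.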
[Bug tree-optimization/96275] Vectorizer doesn't take into account bitmask condition from branch conditions.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96275 --- Comment #3 from Witold Baryluk --- Thanks for looking into that. I just wanted to note that this is still suboptimal in current gcc trunk 20201226, while clang produces superior code.
[Bug d/98457] New: [d] writef!"%s" doesn't work with MonoTime / SysTick
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98457 Bug ID: 98457 Summary: [d] writef!"%s" doesn't work with MonoTime / SysTick Product: gcc Version: 10.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: d Assignee: ibuclaw at gdcproject dot org Reporter: witold.baryluk+gcc at gmail dot com Target Milestone: --- void main() { import std.stdio; import core.time : MonoTime; writef!"%s"(MonoTime.currTime()); } Doesn't compile with gdc 10.2.1: $ gdc test_monotime.d /usr/lib/gcc/x86_64-linux-gnu/10/include/d/core/time.d:2405:16: error: static variable _ticksPerSecond cannot be read at compile time 2405 | return _ticksPerSecond[_clockIdx]; |^ /usr/lib/gcc/x86_64-linux-gnu/10/include/d/core/time.d:2418:99: note: called from here: ticksPerSecond() 2418 | return "MonoTime(" ~ signedToTempString(_ticks, 10) ~ " ticks, " ~ signedToTempString(ticksPerSecond, 10) ~ " ticks per second)"; | ^ /usr/lib/gcc/x86_64-linux-gnu/10/include/d/core/time.d:2418:98: note: called from here: signedToTempString(ticksPerSecond(), 10u) 2418 | return "MonoTime(" ~ signedToTempString(_ticks, 10) ~ " ticks, " ~ signedToTempString(ticksPerSecond, 10) ~ " ticks per second)"; | ^ /usr/lib/gcc/x86_64-linux-gnu/10/include/d/std/format.d:3353:28: note: called from here: val.toString() 3353 | put(w, val.toString()); |^ /usr/lib/gcc/x86_64-linux-gnu/10/include/d/std/format.d:3353:12: note: called from here: put(w, val.toString()) 3353 | put(w, val.toString()); |^ /usr/lib/gcc/x86_64-linux-gnu/10/include/d/std/format.d:3672:21: note: called from here: formatObject(w, val, f) 3672 | formatObject(w, val, f); | ^ /usr/lib/gcc/x86_64-linux-gnu/10/include/d/std/format.d:568:28: note: called from here: formatValue(w, _param_2, spec) 568 | formatValue(w, args[i], spec); |^ /usr/lib/gcc/x86_64-linux-gnu/10/include/d/std/format.d:5767:28: note: called from here: formattedWrite(w, fmt, _param_1) 5767 | auto n = formattedWrite(w, fmt, args); |^ 
/usr/lib/gcc/x86_64-linux-gnu/10/include/d/std/format.d:5729:16: note: called from here: format("%s", MonoTimeImpl(0L)) 5729 | .format(fmt, Args.init); |^ /usr/lib/gcc/x86_64-linux-gnu/10/include/d/std/format.d:5733:2: note: called from here: (*function () => null)() 5733 | }(); | ^ (null):0: confused by earlier errors, bailing out Adding manually .toString() makes it work (at the expense of possible extra allocation). No issues in ldc2 1.24.0 or dmd2 2.095.0-beta.1 It doesn't look like issue in phobos, but something deeper.
[Bug d/98457] [d] writef!"%s" doesn't work with MonoTime / SysTick
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98457 --- Comment #1 from Witold Baryluk --- Godbolt link: https://godbolt.org/z/q3bzhP with gcc trunk 20201217 and a bit more diagnostic /opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/core/time.d:2405:16: error: static variable _ticksPerSecond cannot be read at compile time 2405 | return _ticksPerSecond[_clockIdx]; |^ /opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/core/time.d:2418:99: note: called from here: ticksPerSecond() 2418 | return "MonoTime(" ~ signedToTempString(_ticks, 10) ~ " ticks, " ~ signedToTempString(ticksPerSecond, 10) ~ " ticks per second)"; | ^ /opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/core/time.d:2418:98: note: called from here: signedToTempString(ticksPerSecond(), 10u) 2418 | return "MonoTime(" ~ signedToTempString(_ticks, 10) ~ " ticks, " ~ signedToTempString(ticksPerSecond, 10) ~ " ticks per second)"; | ^ /opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/std/format.d:3353:28: note: called from here: val.toString() 3353 | put(w, val.toString()); |^ /opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/std/format.d:3353:12: note: called from here: put(w, val.toString()) 3353 | put(w, val.toString()); |^ /opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/std/format.d:3672:21: note: called from here: formatObject(w, val, f) 3672 | formatObject(w, val, f); | ^ /opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/std/format.d:568:28: note: called from here: formatValue(w, _param_2, spec) 568 | formatValue(w, args[i], spec); |^ /opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/std/format.d:5767:28: note: called from here: formattedWrite(w, fmt, _param_1) 5767 | auto n = formattedWrite(w, fmt, args); |^ 
/opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/std/format.d:5729:16: note: called from here: format("%s", MonoTimeImpl(0L)) 5729 | .format(fmt, Args.init); |^ /opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/std/format.d:5733:2: note: called from here: (*function () => null)() 5733 | }(); | ^ /opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/std/stdio.d:3754:15: error: template instance std.format.checkFormatException!("�}�", MonoTimeImpl!cast(ClockType)0) error instantiating 3754 | alias e = checkFormatException!(fmt, A); | ^ :4:14: note: instantiated from here: writef!("%s", MonoTimeImpl!cast(ClockType)0) 4 | writef!"%s"(MonoTime.currTime()); | ^ /opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/std/stdio.d:3755:5: note: while evaluating: static assert(!e) 3755 | static assert(!e, e.msg); | ^ Compiler returned: 1
[Bug d/98494] New: libphobos: std.process Config.stderrPassThrough missing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98494 Bug ID: 98494 Summary: libphobos: std.process Config.stderrPassThrough missing Product: gcc Version: 10.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: d Assignee: ibuclaw at gdcproject dot org Reporter: witold.baryluk+gcc at gmail dot com Target Milestone: --- It appears that gdc version of libphobos is somehow lagging in some aspects behind upstream. One of the things I see missing, is `Config.stderrPassThrough` in std.process. I see it was added upstream about 12 months ago: enum Config { ... /** By default, the $(LREF execute) and $(LREF executeShell) functions will capture child processes' both stdout and stderr. This can be undesirable if the standard output is to be processed or otherwise used by the invoking program, as `execute`'s result would then contain a mix of output and warning/error messages. Specify this flag when calling `execute` or `executeShell` to cause invoked processes' stderr stream to be sent to $(REF stderr, std,stdio), and only capture and return standard output. This flag has no effect on $(LREF spawnProcess) or $(LREF spawnShell). */ stderrPassThrough = 128, } The implementation usage of this is relatively small and easy to backport: in executeImpl: -auto p = pipeFunc(commandLine, Redirect.stdout | Redirect.stderrToStdout, - env, config, workDir, extraArgs); +auto redirect = (config & Config.stderrPassThrough) +? Redirect.stdout +: Redirect.stdout | Redirect.stderrToStdout; + +auto p = pipeFunc(commandLine, redirect, + env, config, workDir, extraArgs); There are some other minor changes there, but nothing functionally significant. Mostly unittests and minor signature changes (adding `scope` to many input parameters). Thank you.
[Bug d/100769] New: [D] memcmp() == 0 for small constant strings not folded
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100769 Bug ID: 100769 Summary: [D] memcmp() == 0 for small constant strings not folded Product: gcc Version: 10.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: d Assignee: ibuclaw at gdcproject dot org Reporter: witold.baryluk+gcc at gmail dot com Target Milestone: ---

I expect this D code to compile to near-optimal code, but it doesn't.

```
extern(C) int memcmp(const void *s1, const void *s2, size_t n);

int recognize3(const char* s) {
    return memcmp(s, "stract class", 12) == 0;
}
```

https://godbolt.org/z/vx17WK9rs

It produces a call to memcmp instead of inlining and specializing the code for this specific case.

    int example.recognize3(const(char*)):
        sub rsp, 8
        mov edx, 12
        mov esi, OFFSET FLAT:.LC0
        call memcmp
        test eax, eax
        sete al
        add rsp, 8
        movzx eax, al
        ret

ldc2 1.24.0 (for D), clang 11.0.1-2 (for C and C++), and gcc 10.2.1 (for C and C++) produce close to optimal code. Similarly ldc2 1.26.0 (for D) and gcc 11.1 (for C and C++):

    int example.recognize3(const(char*)):
        movabs rcx, 7142836979195081843
        xor rcx, qword ptr [rdi]
        mov edx, dword ptr [rdi + 8]
        xor rdx, 1936941420
        xor eax, eax
        or rdx, rcx
        sete al
        ret

and

    recognize3:
        movabs rax, 7142836979195081843
        cmp QWORD PTR [rdi], rax
        je .L6
    .L2:
        mov eax, 1
        xor eax, 1
        ret
    .L6:
        xor eax, eax
        cmp DWORD PTR [rdi+8], 1936941420
        jne .L2
        xor eax, 1
        ret

Notice how gcc, clang and ldc2 all compare the first 8 bytes of the input, then the next 4 bytes. clang and ldc2 just xor/or the results and return, with no conditional jumps. gcc does a bit worse, with more conditionals and more jumps, but the result is still pretty good and follows the same idea. gdc, however, calls the generic memcmp, which loops and has about 12 jumps and/or 13 exits.
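The folded form that clang and ldc2 produce can be written out by hand in C roughly as below. This is only an illustrative sketch of the 8-byte plus 4-byte comparison, with `recognize3_folded` as a hypothetical name; it derives the comparison words from the literal at run time rather than hard-coding the constants from the listings, so it does not depend on byte order. On a little-endian target the 4-byte word is 0x7373616c, that is 1936941420, the bytes "lass".

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hand-written equivalent of memcmp(s, "stract class", 12) == 0:
   one 8-byte and one 4-byte comparison, combined without branches.
   The caller must make at least 12 bytes readable at s. */
static int recognize3_folded(const char *s)
{
    const char *lit = "stract class"; /* 12 characters */
    uint64_t s_lo, l_lo;
    uint32_t s_hi, l_hi;

    assert(s != NULL);
    memcpy(&s_lo, s, 8);       /* first 8 bytes of the input */
    memcpy(&s_hi, s + 8, 4);   /* remaining 4 bytes */
    memcpy(&l_lo, lit, 8);
    memcpy(&l_hi, lit + 8, 4);

    /* xor yields 0 for equal words; or-ing merges both results. */
    return ((s_lo ^ l_lo) | (uint64_t)(s_hi ^ l_hi)) == 0;
}
```

An optimizing compiler is expected to turn the fixed-size `memcpy` calls into plain loads, which gives the xor/or/sete shape shown in the ldc2 listing.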
[Bug d/100769] [D] memcmp() == 0 for small constant strings not folded
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100769 --- Comment #1 from Witold Baryluk --- A typo in the example (godbolt is good), I forgot the `.ptr`: extern(C) int memcmp(const void *s1, const void *s2, size_t n); int recognize3(const char* s) { return memcmp(s, "stract class".ptr, 12) == 0; } casting to ubyte*, or void*, doesn't change anything really. options: -O3 -frelease -fno-semantic-interposition tested on amd64, Debian / Linux.
[Bug d/100769] [D] memcmp() == 0 for small constant strings not folded
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100769 --- Comment #2 from Witold Baryluk --- Hmm. It appears that using `import core.stdc.string : memcmp;` actually resolves the problem. It looks like my manual declaration of memcmp for some reason disabled the optimisations for memcmp.
[Bug d/100769] [D] memcmp() == 0 for small constant strings not folded
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100769 Witold Baryluk changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #4 from Witold Baryluk --- Ok. That makes sense. Thanks.
[Bug d/100769] [D] memcmp() == 0 for small constant strings not folded
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100769 Witold Baryluk changed: What|Removed |Added Resolution|FIXED |INVALID
[Bug d/105360] New: Inlined lazy parameters / delegate literals, still emitted
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105360 Bug ID: 105360 Summary: Inlined lazy parameters / delegate literals, still emitted Product: gcc Version: 12.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: d Assignee: ibuclaw at gdcproject dot org Reporter: witold.baryluk+gcc at gmail dot com Target Milestone: ---

```
extern bool g();
extern void f(int n);

void log(lazy int num) {
  if (g()) {
    const n = num();
    f(n);
  }
}

void p(int n) {
  log(n * 137);
}
```

This should emit the same (or close to the same) code as a version with no `lazy` on the `log` function (and the num reference changed accordingly), because the compiler knows that `num` is called once, has no side effects, is moderately expensive, etc. And the code for p is exactly the same: log and `n * 137` are fully inlined.

However, the anonymous dgliteral code is still emitted, despite not being referenced anywhere:

```
pure nothrow @nogc @safe int example.p(int).__dgliteral2(): # < This should not be in object file
        imul eax, DWORD PTR [rdi], 137
        ret
```

Rest of the object file is correct and optimal:

```
void example.log(lazy int):
        push rbp
        push rbx
        mov rbp, rdi
        mov rbx, rsi
        sub rsp, 8
        call bool example.g()
        test al, al
        je .L3
        mov rdi, rbp
        call rbx
        add rsp, 8
        pop rbx
        pop rbp
        mov edi, eax
        jmp void example.f(int)
.L3:
        add rsp, 8
        pop rbx
        pop rbp
        ret
void example.p(int):
        push rbx
        mov ebx, edi
        call bool example.g()
        test al, al
        je .L6
        imul edi, ebx, 137
        pop rbx
        jmp void example.f(int)
.L6:
        pop rbx
        ret
```

gdc (Compiler-Explorer-Build-gcc-748d46cd049c89a799f99f14547267ebae915af6-binutils-2.36.1) 12.0.1 20220421 (experimental) via godbolt.org

For code passing reasonably big literals, this can lead to code duplication in the object file. ldc2 shows no such problem.
[Bug d/105360] Inlined lazy parameters / delegate literals, still emitted
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105360 --- Comment #1 from Witold Baryluk --- https://godbolt.org/z/c8oT6E4cf
[Bug d/105413] New: gdc extended assembler cannot constraints r8 - r15
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105413 Bug ID: 105413 Summary: gdc extended assembler cannot constraints r8 - r15 Product: gcc Version: 12.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: d Assignee: ibuclaw at gdcproject dot org Reporter: witold.baryluk+gcc at gmail dot com Target Milestone: --- gcc in C does not support directly register constraints for x86_64 registers r8 - r15. In C this can be done however using local register variables and asm attributes. https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html There is no way to use this in GDC extended assembler. version (linux) { version (GNU) { enum SYSCALL { OPENAT = 56, } @nogc: nothrow: size_t syscall(SYSCALL ident)(size_t arg1, size_t arg2, size_t arg3, size_t arg4) { version (X86_64) { asm @nogc nothrow { "syscall" // output: : "=a" (arg1) // inputs: : "a" (ident), // rax - syscall number "D" (arg1), // rdi - arg1 "S" (arg2), // rsi - arg2 "d" (arg3), // rdx - arg3 "r10" (arg4), // r10 - arg4 "m"( *cast(ubyte*)arg1) // "dummy" input instead of full memory clobber // clobers : "c", "r11"; // Clobers rax, and rcx and r11. } return arg1; } else { static assert(false, "This platform/architecture is not supported when using GDC compiler"); } } } private int openatdummy() @nogc nothrow { return cast(int)syscall!(SYSCALL.OPENAT)(0, 0, 0, 0); } } myio.d: In function ‘syscall’: myio.d:232:10: error: matching constraint references invalid operand number 232 | ; https://godbolt.org/z/xGzxa6orc
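For comparison, the C approach from the linked GCC documentation (a local register variable pinned to r10, passed through a plain "r" constraint) looks roughly like the sketch below. `raw_syscall4` is a hypothetical name; the inline-assembly path assumes GCC or a compatible compiler on x86-64 Linux, and other targets fall back to the libc `syscall` wrapper so the sketch still builds there.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Raw 4-argument Linux system call. On x86-64 the 4th argument travels in
   r10, which has no constraint letter of its own, so a local register
   variable pins the value to r10 and the asm uses a generic "r". */
static long raw_syscall4(long nr, long a1, long a2, long a3, long a4)
{
    assert(nr >= 0);
#if defined(__x86_64__) && defined(__linux__) && defined(__GNUC__)
    long ret;
    register long r10 __asm__("r10") = a4;
    __asm__ volatile("syscall"
                     : "=a"(ret)                  /* rax: return value */
                     : "a"(nr),                   /* rax: syscall number */
                       "D"(a1), "S"(a2), "d"(a3), /* rdi, rsi, rdx */
                       "r"(r10)                   /* r10 via register variable */
                     : "rcx", "r11", "memory");   /* clobbered by syscall */
    return ret;
#else
    /* Assumption: on other targets, let libc place the arguments. */
    return syscall(nr, a1, a2, a3, a4);
#endif
}
```

For example, `raw_syscall4(SYS_getpid, 0, 0, 0, 0)` should return the same value as `getpid()`. The sketch uses a full "memory" clobber for simplicity, where the report uses a dummy "m" input instead.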
[Bug d/105413] gdc extended assembler cannot constraints r8 - r15
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105413 --- Comment #3 from Witold Baryluk --- It works. Thank you. Any chance this will be in gcc 12.x? I work a lot on Debian Linux, and I doubt I will have gcc trunk or gcc 13 available any time soon. Also weirdly gcc does not inline this function, unless I add @attribute("always_inline") on syscall, or @attribute("flatten") on openatdummy.
[Bug d/107241] New: std.bitmanip.bigEndianToNative et al not inlined
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107241 Bug ID: 107241 Summary: std.bitmanip.bigEndianToNative et al not inlined Product: gcc Version: 12.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: d Assignee: ibuclaw at gdcproject dot org Reporter: witold.baryluk+gcc at gmail dot com Target Milestone: --- gdc fails to inline number of small functions that should fully inline and end in single instruction. on amd64 / x86, for example std.bitmanip.bigEndianToNative causes a chain of calls / jumps, even with @attribute("flatten") import std.bitmanip; import gcc.attributes; @attribute("flatten") size_t f(char[] b) { return std.bitmanip.bigEndianToNative!(size_t, 8)(cast(ubyte[8])(b[2..10])); } gcc -O3 -march=znver2 -frelease pure nothrow @nogc @safe ulong std.bitmanip.swapEndian!(ulong).swapEndian(const(ulong)): mov rax, rdi bswap rax ret pure nothrow @nogc @safe ulong std.bitmanip.endianToNativeImpl!(true, ulong, 8uL).endianToNativeImpl(ubyte[8]): jmp pure nothrow @nogc @safe ulong std.bitmanip.swapEndian!(ulong).swapEndian(const(ulong)) pure nothrow @nogc @safe ulong std.bitmanip.bigEndianToNative!(ulong, 8uL).bigEndianToNative(ubyte[8]): jmp pure nothrow @nogc @safe ulong std.bitmanip.endianToNativeImpl!(true, ulong, 8uL).endianToNativeImpl(ubyte[8]) ulong example.f(char[]): mov rdi, QWORD PTR [rsi+2] jmp pure nothrow @nogc @safe ulong std.bitmanip.bigEndianToNative!(ulong, 8uL).bigEndianToNative(ubyte[8]) No issues with LDC. ulong example.f(char[]): mov rax, qword ptr [rsi + 2] bswap rax ret godbolt: https://godbolt.org/z/Pj3f7oGso
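In C terms, the operation that `f` performs amounts to the sketch below: read 8 bytes starting at offset 2 and interpret them as a big-endian integer. `load_be64` and `f_sketch` are hypothetical names used only for illustration. Compilers typically recognize the shift-and-or loop and emit a single load plus bswap on x86-64, which is the two-instruction body shown in the LDC output, though that depends on the compiler and optimization level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read 8 bytes as a big-endian 64-bit integer, independent of host order. */
static uint64_t load_be64(const unsigned char *p)
{
    assert(p != NULL);
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i]; /* most significant byte comes first */
    return v;
}

/* Counterpart of the report's f(): bytes b[2..10] as a big-endian value.
   The caller must provide at least 10 readable bytes. */
static uint64_t f_sketch(const char *b)
{
    return load_be64((const unsigned char *)b + 2);
}
```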
[Bug c++/103966] New: std::atomic relaxed load, inc, store sub-optimal codegen
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103966 Bug ID: 103966 Summary: std::atomic relaxed load, inc, store sub-optimal codegen Product: gcc Version: 12.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: witold.baryluk+gcc at gmail dot com Target Milestone: ---

Both functions below should compile to the same assembly on x86:

    #include <atomic>
    #include <cstdint>

    uint64_t x;
    void inc_a() {
      x++;
    }

    std::atomic<uint64_t> y;
    void inc_b_non_atomic() {
      y.store(y.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed);
    }

and they do so in clang. They do not in gcc 12 (and earlier). https://godbolt.org/z/GcM67xz8T

This pattern is very popular in approximate statistical counters / metrics, where the flow of information is unidirectional (i.e. from one thread that does updates to another thread that only reads the counters), and its performance is critical in many codebases.
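The same single-writer counter pattern can be sketched in C11 with `<stdatomic.h>`. This is a minimal illustration with hypothetical names (`counter`, `counter_inc`, `counter_read`), not code from the report; it relies on exactly one thread ever calling `counter_inc`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t counter; /* hypothetical single-writer statistic */

/* Writer thread only. This is deliberately not an atomic read-modify-write:
   it is a relaxed load followed by a relaxed store, as in inc_b_non_atomic(). */
static void counter_inc(void)
{
    uint64_t v = atomic_load_explicit(&counter, memory_order_relaxed);
    assert(v != UINT64_MAX); /* the sketch does not handle wraparound */
    atomic_store_explicit(&counter, v + 1, memory_order_relaxed);
}

/* Reader threads: may see a slightly stale value, but never a torn one. */
static uint64_t counter_read(void)
{
    return atomic_load_explicit(&counter, memory_order_relaxed);
}
```

Because only one thread writes, no update can be lost, so a locked instruction is unnecessary; this is why the report expects the load, increment and store to fold into one memory-operand instruction.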
[Bug c++/103966] std::atomic relaxed load, inc, store sub-optimal codegen
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103966 --- Comment #1 from Witold Baryluk --- Current codegen on gcc 12 on 64-bit x86:

    inc_a():
        inc QWORD PTR x[rip]
        ret
    inc_b_non_atomic():
        mov rax, QWORD PTR y[rip]
        inc rax
        mov QWORD PTR y[rip], rax
        ret
    y:
        .zero 8
    x:
        .zero 8
[Bug c++/103966] std::atomic relaxed load, inc, store sub-optimal codegen
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103966 --- Comment #2 from Witold Baryluk --- Similarly, dec, add and sub are affected, as well as mul. Example:

    #include <atomic>
    #include <cstdint>

    uint64_t x;
    void add_a() {
      x += 5;
    }

    std::atomic<uint64_t> y;
    void add_b_non_atomic() {
      y.store(y.load(std::memory_order_relaxed) + 5, std::memory_order_relaxed);
    }

Producing:

    add_a():
        add QWORD PTR x[rip], 5
        ret
    add_b_non_atomic():
        mov rax, QWORD PTR y[rip]
        add rax, 5
        mov QWORD PTR y[rip], rax
        ret
    y:
        .zero 8
    x:
        .zero 8
[Bug middle-end/35560] Missing CSE/PRE for memory operations involved in virtual call.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=35560 Witold Baryluk changed: What|Removed |Added CC||witold.baryluk+gcc at gmail dot co ||m --- Comment #15 from Witold Baryluk --- I know this is a pretty old bug, but I was exploring some assembly of gcc and clang on godbolt, and also stumbled into the same issue. https://godbolt.org/z/qPzMhWse1

    class A {
     public:
      virtual int f7(int x) const;
    };

    int g(const A * const a, int x) {
      int r = 0;
      for (int i = 0; i < 1; i++)
        r += a->f7(x);
      return r;
    }

(the same happens without the loop, when just calling a->f7 multiple times)

    g(A const*, int):
        push r13
        mov r13d, esi
        push r12
        xor r12d, r12d
        push rbp
        mov rbp, rdi
        push rbx
        mov ebx, 1
        sub rsp, 8
    .L2:
        mov rax, QWORD PTR [rbp+0] # a vtable deref
        mov esi, r13d
        mov rdi, rbp
        call [QWORD PTR [rax]] # f7 indirect call
        add r12d, eax
        dec ebx
        jne .L2
        add rsp, 8
        pop rbx
        pop rbp
        mov eax, r12d
        pop r12
        pop r13
        ret

I was expecting mov rax, QWORD PTR [rbp+0] and call [QWORD PTR [rax]] to be hoisted out of the loop (call converted to lea, and call register). A bit sad. Is there some recent work done on this optimization? Are there at least some cases where it is valid to do CSE, or to change the code so it is moved out of the loop?
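The transformation asked for here can be illustrated in C with an explicit vtable. C has no virtual functions, so this is an emulation under stated assumptions: `struct A_vtable`, `g_reload`, `g_hoisted`, `times_two` and `demo_vtable` are all hypothetical names, and the hoisted form is only equivalent when no call changes `a->vptr`, which is the property the compiler would have to prove.

```c
#include <assert.h>
#include <stddef.h>

struct A;
struct A_vtable { int (*f7)(const struct A *self, int x); };
struct A { const struct A_vtable *vptr; };

/* What the emitted loop does: reload vptr and the slot on every iteration. */
static int g_reload(const struct A *a, int x, int n)
{
    assert(a != NULL && a->vptr != NULL);
    int r = 0;
    for (int i = 0; i < n; i++)
        r += a->vptr->f7(a, x);
    return r;
}

/* Manually hoisted form: load the function pointer once before the loop.
   Only equivalent if no call changes a->vptr. */
static int g_hoisted(const struct A *a, int x, int n)
{
    assert(a != NULL && a->vptr != NULL);
    int (*const f7)(const struct A *, int) = a->vptr->f7;
    int r = 0;
    for (int i = 0; i < n; i++)
        r += f7(a, x);
    return r;
}

/* Hypothetical implementation used only to exercise the two variants. */
static int times_two(const struct A *self, int x)
{
    (void)self;
    return 2 * x;
}

static const struct A_vtable demo_vtable = { times_two };
```

With `struct A a = { &demo_vtable };`, both `g_reload(&a, 3, 4)` and `g_hoisted(&a, 3, 4)` return 24; the difference is only in how often the pointer is loaded.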
[Bug c/108255] New: Repeated address-of (lea) not optimized for size.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108255

Bug ID: 108255
Summary: Repeated address-of (lea) not optimized for size.
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: witold.baryluk+gcc at gmail dot com
Target Milestone: ---

https://godbolt.org/z/q5sx9e49j

    void f(int *);

    int g(int of) {
        int x = 13;
        f(&x);
        f(&x);
        f(&x);
        f(&x);
        f(&x);
        f(&x);
        f(&x);
        f(&x);
        return 0;
    }

Got:

    g(int):
        sub     rsp, 24
        lea     rdi, [rsp+12]
        mov     DWORD PTR [rsp+12], 13
        call    f(int*)
        lea     rdi, [rsp+12]   # compute, 5 bytes
        call    f(int*)
        lea     rdi, [rsp+12]   # recompute, 5 bytes
        call    f(int*)
        lea     rdi, [rsp+12]   # recompute, 5 bytes
        call    f(int*)
        lea     rdi, [rsp+12]
        call    f(int*)
        lea     rdi, [rsp+12]
        call    f(int*)
        lea     rdi, [rsp+12]
        call    f(int*)
        lea     rdi, [rsp+12]
        call    f(int*)
        xor     eax, eax
        add     rsp, 24
        ret

But note that lea is 5 bytes.

Expected (generated by clang 3.0 - 15.0):

    g(int):                             # @g(int)
        push    rbx                     # extra, but just 1 byte
        sub     rsp, 16
        mov     dword ptr [rsp + 12], 13
        # CSE temp
        lea     rbx, [rsp + 12]
        mov     rdi, rbx                # use
        call    f(int*)@PLT
        mov     rdi, rbx                # reuse, 3 bytes
        call    f(int*)@PLT
        mov     rdi, rbx                # reuse, 3 bytes
        call    f(int*)@PLT
        mov     rdi, rbx
        call    f(int*)@PLT
        mov     rdi, rbx
        call    f(int*)@PLT
        mov     rdi, rbx
        call    f(int*)@PLT
        mov     rdi, rbx
        call    f(int*)@PLT
        mov     rdi, rbx
        call    f(int*)@PLT
        xor     eax, eax
        add     rsp, 16
        pop     rbx                     # extra, but just 1 byte
        ret

Technically this is more instructions. But mov rdi, rbx is 3 bytes, which is shorter than the 5 bytes of lea. This comes at the minor expense of needing to save and restore rbx.

PS. The same happens when using a temporary `int *const y = &x;`

The same also happens when optimizing for size (`-Os`).

It looks like gcc 4.8.5 produced the expected code, but gcc 4.9.0 does not.

It is possible that the code produced by gcc 4.9.0 is faster, but it is also likely that it contributes quite a bit to binary size.

clang uses CSE even if there are just two uses of `&x` in the above example. It is likely that a slightly higher threshold (3 or 4) is actually optimal (this can be calculated from the encoding sizes).

Weirdly though, gcc -m32 does this:

    g():
        push    ebp
        mov     ebp, esp
        push    ebx
        lea     ebx, [ebp-12]
        sub     esp, 32
        mov     DWORD PTR [ebp-12], 13
        push    ebx
        call    f(int*)
        mov     DWORD PTR [esp], ebx
        call    f(int*)
        mov     DWORD PTR [esp], ebx
        call    f(int*)
        mov     ebx, DWORD PTR [ebp-4]
        xor     eax, eax
        leave
        ret

Here it does compute the address and store it in a temporary, but it does so on the stack instead of in a register (my guess is that there is no free register to hold it, so it is spilled). In fact lea would likely be faster here: `mov DWORD PTR [esp], ebx` requires a memory/cache access, while lea is 5 bytes but does not require a memory access.
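The break-even point mentioned above can be worked out from the byte counts quoted in the report. The following is only a rough C sketch under stated assumptions: the function and constant names are hypothetical, the sizes are the ones given in the report (5-byte lea, 3-byte mov, 1-byte push and pop), and it assumes rbx is not already being saved for another reason and that the stack adjustment costs the same in both variants.

```c
#include <assert.h>

/* Instruction sizes in bytes, as quoted in the report (x86-64). */
enum {
    LEA_BYTES      = 5, /* lea rdi, [rsp+12] or lea rbx, [rsp+12] */
    MOV_BYTES      = 3, /* mov rdi, rbx */
    PUSH_POP_BYTES = 1  /* push rbx, or pop rbx */
};

/* Bytes spent materializing &x when every call recomputes it with lea. */
static int bytes_recompute(int uses)
{
    assert(uses >= 0);
    return uses * LEA_BYTES;
}

/* Bytes spent when &x is computed once into rbx and copied per call. */
static int bytes_cse(int uses)
{
    assert(uses >= 0);
    return 2 * PUSH_POP_BYTES + LEA_BYTES + uses * MOV_BYTES;
}

/* Smallest number of uses at which the CSE form is strictly smaller. */
static int cse_break_even(void)
{
    int uses = 1;
    while (bytes_cse(uses) >= bytes_recompute(uses))
        uses++;
    return uses;
}
```

Under these assumptions `cse_break_even()` returns 4, and for the 8 calls in the test case the two forms cost 40 and 31 bytes respectively. If rbx were already saved for other reasons, the push/pop term would drop out and the threshold would be lower.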
[Bug d/109221] New: std.math.floor, core.math.ldexp, std.math.poly poor inlining
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109221

Bug ID: 109221
Summary: std.math.floor, core.math.ldexp, std.math.poly poor inlining
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: d
Assignee: ibuclaw at gdcproject dot org
Reporter: witold.baryluk+gcc at gmail dot com
Target Milestone: ---

Example:

    static float sRGB_case4(float x) {
        // import std.math : exp;
        return 1.055f * expImpl(x) - 0.055f;  // expImpl not inlined by default
        // (inlined when using pragma(inline, true), but that fails to inline in DMD)
    }

    // pragma(inline, true)
    // This is borrowed from phobos/exponential.d to help gcc inline it fully.
    // Only T == float case is here (as some traits are private to phobos).
    // Also isNaN and range checks are removed, as sRGB performs own checks.
    static private T expImpl(T)(T x) @safe pure nothrow @nogc {
        //import std.math : floatTraits, RealFormat;
        //import std.math.traits : isNaN;
        //import std.math.rounding : floor;
        //import std.math.algebraic : poly;
        //import std.math.constants : LOG2E;
        import std.math;
        import core.math;

        static immutable T[6] P = [
            5.001201E-1,
            1.665459E-1,
            4.1665795894E-2,
            8.3334519073E-3,
            1.3981999507E-3,
            1.9875691500E-4,
        ];

        enum T C1 = 0.693359375;
        enum T C2 = -2.12194440e-4;

        // Overflow and Underflow limits.
        enum T OF = 88.72283905206835;
        enum T UF = -103.278929903431851103; // ln(2^-149)

        // Special cases.
        //if (isNaN(x))
        //return x;
        //if (x > OF)
        //return real.infinity;
        //if (x < UF)
        //return 0.0;

        // Express: e^^x = e^^g * 2^^n
        // = e^^g * e^^(n * LOG2E)
        // = e^^(g + n * LOG2E)
        T xx = floor((cast(T) LOG2E) * x + cast(T) 0.5);  // NOT INLINED!
        const int n = cast(int) xx;
        x -= xx * C1;
        x -= xx * C2;

        xx = x * x;
        x = poly(x, P) * xx + x + 1.0f;  // poly is generated optimally, but not inlined

        // Scale by power of 2.
        x = core.math.ldexp(x, n);  // NOT INLINED

        return x;
    }

gdc
gdc (Compiler-Explorer-Build-gcc-454a4d5041f53cd1f7d902f6c0017b7ce95b36df-binutils-2.38) 13.0.1 20230318 (experimental)
gdc -O3 -march=znver2 -frelease -fbounds-check=off

    pure nothrow @nogc @safe float std.math.algebraic.poly!(float, float, 6).poly(float, ref const(float[6])):
        vmovss  xmm1, DWORD PTR [rdi+20]
        vfmadd213ss     xmm1, xmm0, DWORD PTR [rdi+16]
        vfmadd213ss     xmm1, xmm0, DWORD PTR [rdi+12]
        vfmadd213ss     xmm1, xmm0, DWORD PTR [rdi+8]
        vfmadd213ss     xmm1, xmm0, DWORD PTR [rdi+4]
        vfmadd213ss     xmm0, xmm1, DWORD PTR [rdi]
        ret
    pure nothrow @nogc @safe float example.expImpl!(float).expImpl(float):
        push    rbx
        vmovaps xmm1, xmm0
        sub     rsp, 16
        vmovss  xmm0, DWORD PTR .LC0[rip]
        vfmadd213ss     xmm0, xmm1, DWORD PTR .LC1[rip]
        vmovss  DWORD PTR [rsp+8], xmm1
        call    pure nothrow @nogc @trusted float std.math.rounding.floor(float)
        vmovss  xmm1, DWORD PTR [rsp+8]
        mov     edi, OFFSET FLAT:immutable(float[6]) example.expImpl!(float).expImpl(float).P
        vfnmadd231ss    xmm1, xmm0, DWORD PTR .LC2[rip]
        vmovss  DWORD PTR [rsp+12], xmm0
        vfnmadd231ss    xmm1, xmm0, DWORD PTR .LC3[rip]
        vmulss  xmm3, xmm1, xmm1
        vmovaps xmm0, xmm1
        vmovss  DWORD PTR [rsp+8], xmm1
        vmovd   ebx, xmm3
        call    pure nothrow @nogc @safe float std.math.algebraic.poly!(float, float, 6).poly(float, ref const(float[6]))
        vmovss  xmm1, DWORD PTR [rsp+8]
        vmovd   xmm4, ebx
        vmovss  xmm2, DWORD PTR [rsp+12]
        vfmadd132ss     xmm0, xmm1, xmm4
        vaddss  xmm0, xmm0, DWORD PTR .LC4[rip]
        add     rsp, 16
        pop     rbx
        vcvttss2si      edi, xmm2
        jmp     ldexpf
    float example.sRGB_case4(float):
        sub     rsp, 8
        call    pure nothrow @nogc @safe float example.expImpl!(float).expImpl(float)
        vmovss  xmm1, DWORD PTR .LC6[rip]
        vfmadd132ss     xmm0, xmm1, DWORD PTR .LC5[rip]
        add     rsp, 8
        ret

https://godbolt.org/z/YMoMPdjn5

Additionally, std.math.exp itself is never inlined by gcc. This is important, as some early checks (isNaN, OF, UF checks) in exp could be removed by proper inlining.
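For readers not familiar with `poly`: the chain of fused multiply-adds in the `poly!(float, float, 6)` listing above corresponds to Horner's rule, starting from the highest coefficient (`[rdi+20]`) and working down to `[rdi]`. A minimal C sketch of that evaluation order follows; `horner` is a hypothetical name used for illustration, not a Phobos or GCC function, and this is not the Phobos implementation itself.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Horner's rule: evaluates c[0] + c[1]*x + ... + c[n-1]*x^(n-1)
 * using one multiply-add per remaining coefficient, highest degree first.
 */
static float horner(float x, const float *c, size_t n)
{
    assert(c != NULL && n > 0);
    float acc = c[n - 1];
    /* Walk the remaining coefficients from c[n-2] down to c[0]. */
    for (size_t i = n - 1; i-- > 0; )
        acc = acc * x + c[i];
    return acc;
}
```

With the six-element table `P` from the report, `horner(x, P, 6)` would perform five multiply-adds, which is what the compiler emitted for the out-of-line `poly` instance. Once inlined with a constant table, those memory operands could become direct constant loads.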
[Bug d/109221] std.math.floor, core.math.ldexp, std.math.poly poor inlining
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109221

--- Comment #1 from Witold Baryluk ---
PS. LDC 1.23.0 - 1.32.0 produce optimal code. LDC 1.22.0 is a bit worse (due to its use of x87 codegen), and 1.21 and older fail to inline `ldexp`, but still inline `poly` and `floor` perfectly.
[Bug d/109221] std.math.floor, core.math.ldexp, std.math.poly poor inlining
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109221

--- Comment #2 from Witold Baryluk ---
Interestingly enough, GDC 10.2 does inline the `poly` instantiation with all the constants.
[Bug d/110113] New: gdc -fpreview=dip1021 crash in d/dmd/root/aav.d:127 dmd_aaGetRvalue from DsymbolTable::lookup(Identifier const*)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110113

Bug ID: 110113
Summary: gdc -fpreview=dip1021 crash in d/dmd/root/aav.d:127 dmd_aaGetRvalue from DsymbolTable::lookup(Identifier const*)
Product: gcc
Version: 13.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: d
Assignee: ibuclaw at gdcproject dot org
Reporter: witold.baryluk+gcc at gmail dot com
Target Milestone: ---

Created attachment 55254
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55254&action=edit
Minimized test case with dustmite

Debian Linux amd64, experimental gcc-13, gdc 13.1.0-3

The crash is not deterministic. Run the command a few times to trigger it.

```
user@debian:~$ cat lup.d
class LUBench { }
float lup(ulong , ulong , int , int = 1)
{
    double[] solution;
    new LUBench;
    return solution[0] ;
}
float lup_3200(ulong iters, ulong flops) {
    return lup(iters, flops, 3200);
}
float raytrace() {
    struct V {
        float x, y, z;
        auto normalize() { }
        import std;
        auto cross() { }
        auto norm2() { }
        auto norm() { }
        auto opBinary(){ }
    }
}
user@debian:~$ gdc-13 -c -fpreview=dip1021 lup.d
lup.d:11:7: error: function ‘lup.raytrace’ has no ‘return’ statement, but is expected to return a value of type ‘float’
   11 | float raytrace() {
      |       ^
user@debian:~$ gdc-13 -c -fpreview=dip1021 lup.d
lup.d:11:7: error: function ‘lup.raytrace’ has no ‘return’ statement, but is expected to return a value of type ‘float’
   11 | float raytrace() {
      |       ^
user@debian:~$ gdc-13 -c -fpreview=dip1021 lup.d
lup.d:11:7: error: function ‘lup.raytrace’ has no ‘return’ statement, but is expected to return a value of type ‘float’
   11 | float raytrace() {
      |       ^
user@debian:~$ gdc-13 -c -fpreview=dip1021 lup.d
lup.d:11:7: error: function ‘lup.raytrace’ has no ‘return’ statement, but is expected to return a value of type ‘float’
   11 | float raytrace() {
      |       ^
user@debian:~$ gdc-13 -c -fpreview=dip1021 lup.d
/usr/lib/gcc/x86_64-linux-gnu/13/include/d/std/math/algebraic.d:968:47: internal compiler error: Segmentation fault
  968 |     return cast(Unqual!T) (T(1) << bsr(val) + type);
      |                                               ^
0xd32f86 crash_signal
        ../../src/gcc/toplev.cc:314
0x7f53b651cf8f ???
        ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
0x17f7d10 _D3dmd4root3aav15dmd_aaGetRvalueFNaNbNiPSQBnQBmQBk2AAPvZQd
        ../../src/gcc/d/dmd/root/aav.d:127
0x1706b25 DsymbolTable::lookup(Identifier const*)
        ../../src/gcc/d/dmd/dsymbol.d:2408
0x1706b25 ScopeDsymbol::search(Loc const&, Identifier*, int)
        ../../src/gcc/d/dmd/dsymbol.d:1470
0x17ef5b3 _D3dmd6opover15search_functionFCQBe7dsymbol12ScopeDsymbolCQCe10identifier10IdentifierZCQDhQCd7Dsymbol
        ../../src/gcc/d/dmd/opover.d:1435
0x1701fe0 search_toString(StructDeclaration*)
        ../../src/gcc/d/dmd/dstruct.d:51
0x180310a semanticTypeInfoMembers(StructDeclaration*)
        ../../src/gcc/d/dmd/semantic3.d:1650
0x1803394 Semantic3Visitor::visit(AggregateDeclaration*)
        ../../src/gcc/d/dmd/semantic3.d:1590
0x17fef19 semantic3(Dsymbol*, Scope*)
        ../../src/gcc/d/dmd/semantic3.d:83
0x175dc89 ExpressionSemanticVisitor::visit(DeclarationExp*)
        ../../src/gcc/d/dmd/expressionsem.d:5572
0x175dc89 ExpressionSemanticVisitor::visit(DeclarationExp*)
        ../../src/gcc/d/dmd/expressionsem.d:5407
0x175eb82 expressionSemantic(Expression*, Scope*)
        ../../src/gcc/d/dmd/expressionsem.d:12706
0x18096fa StatementSemanticVisitor::visit(ExpStatement*)
        ../../src/gcc/d/dmd/statementsem.d:207
0x18228c1 statementSemantic(Statement*, Scope*)
        ../../src/gcc/d/dmd/statementsem.d:149
0x18228c1 StatementSemanticVisitor::visit(CompoundStatement*)
        ../../src/gcc/d/dmd/statementsem.d:270
0x1809112 statementSemantic(Statement*, Scope*)
        ../../src/gcc/d/dmd/statementsem.d:149
0x18002a1 Semantic3Visitor::visit(FuncDeclaration*)
        ../../src/gcc/d/dmd/semantic3.d:598
0x17feae4 semantic3(Dsymbol*, Scope*)
        ../../src/gcc/d/dmd/semantic3.d:83
0x17feae4 Semantic3Visitor::visit(Module*)
        ../../src/gcc/d/dmd/semantic3.d:205
Please submit a full bug report, with preprocessed source (by using -freport-bug).
Please include the complete backtrace with any bug report.
See for instructions.
user@debian:~$
```

I could not reduce it further, as the crash is sensitive to identifiers, and because it is non-deterministic, testing each reduction requires many repetitions.
[Bug d/110113] gdc -fpreview=dip1021 crash in d/dmd/root/aav.d:127 dmd_aaGetRvalue from DsymbolTable::lookup(Identifier const*)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110113

--- Comment #1 from Witold Baryluk ---
BTW, adding a return statement to `raytrace` does not change anything:

```
user@debian:~$ gdc-13 -c -fpreview=dip1021 lup.d
user@debian:~$ gdc-13 -c -fpreview=dip1021 lup.d
user@debian:~$ gdc-13 -c -fpreview=dip1021 lup.d
user@debian:~$ gdc-13 -c -fpreview=dip1021 lup.d
user@debian:~$ gdc-13 -c -fpreview=dip1021 lup.d
/usr/lib/gcc/x86_64-linux-gnu/13/include/d/std/math/algebraic.d:968:47: internal compiler error: Segmentation fault
  968 |     return cast(Unqual!T) (T(1) << bsr(val) + type);
      |                                               ^
0xd32f86 crash_signal
        ../../src/gcc/toplev.cc:314
0x7f7144273f8f ???
        ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
0x17f7d10 _D3dmd4root3aav15dmd_aaGetRvalueFNaNbNiPSQBnQBmQBk2AAPvZQd
        ../../src/gcc/d/dmd/root/aav.d:127
0x1706b25 DsymbolTable::lookup(Identifier const*)
        ../../src/gcc/d/dmd/dsymbol.d:2408
0x1706b25 ScopeDsymbol::search(Loc const&, Identifier*, int)
        ../../src/gcc/d/dmd/dsymbol.d:1470
...
...
```
[Bug d/110113] gdc -fpreview=dip1021 crash in d/dmd/root/aav.d:127 dmd_aaGetRvalue from DsymbolTable::lookup(Identifier const*)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110113 --- Comment #2 from Witold Baryluk --- Also FYI, I was not able to trigger this on DMD64 D Compiler v2.104.0
[Bug d/110113] gdc -fpreview=dip1021 crash in d/dmd/root/aav.d:127 dmd_aaGetRvalue from DsymbolTable::lookup(Identifier const*)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110113

--- Comment #10 from Witold Baryluk ---
Thank you Iain. Amazing debugging skills.

BTW, `import std;` is there because dustmite reduced the original import to just that. The original import was `import std.math.algebraic : sqrt;`

But you already figured this out without even using Phobos.
[Bug d/110516] New: core.volatile.volatileLoad is broken
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110516

Bug ID: 110516
Summary: core.volatile.volatileLoad is broken
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: d
Assignee: ibuclaw at gdcproject dot org
Reporter: witold.baryluk+gcc at gmail dot com
Target Milestone: ---

Tested with gcc 12.2.0 (from Debian stable) and gcc trunk 14.0.0 (in godbolt).

core.volatile.volatileLoad simply does not work.

1) It merges loads.
2) It removes unused loads at -O1 and higher.

Example:

    void actualRun(ubyte* ptr1) {
        import core.volatile : volatileLoad;
        volatileLoad(ptr1);
        volatileLoad(ptr1);
        volatileLoad(ptr1);
        volatileLoad(ptr1);
    }

Without optimisations:

    void example.actualRun(ubyte*):
        push    rbp
        mov     rbp, rsp
        mov     QWORD PTR [rbp-8], rdi
        nop
        pop     rbp
        ret

Incorrect.

With optimisations:

    void example.actualRun(ubyte*):
        ret

Incorrect.

Expected:

    void example.actualRun(ubyte*):
        movzx   eax, byte ptr [rdi]
        movzx   eax, byte ptr [rdi]
        movzx   eax, byte ptr [rdi]
        movzx   eax, byte ptr [rdi]
        ret

dmd and ldc behave properly.

It looks like it never worked properly.

It would be good to have a test case for this, so it does not regress later.

I did not test volatileStore, but I would not be surprised if it is also broken.
[Bug d/110516] core.volatile.volatileLoad discarded if result is unused
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110516

--- Comment #8 from Witold Baryluk ---
I see. Point 1 is definitely incorrect. I misread the assembler:

    void example.actualRun(ubyte*):
        push    rbp
        mov     rbp, rsp
        mov     QWORD PTR [rbp-8], rdi
        nop
        pop     rbp
        ret

The move there is just some stack manipulation; it has nothing to do with volatileLoad.

You are right about the side effect visibility and volatileStore.

Still, there should be a way to express a real memory read whose result is not stored anywhere in the program (just written to a register, then discarded). This has some (not very common) uses in memory-mapped IO, i.e. in drivers for devices where the read itself could indicate something (this of course usually also requires setting proper page table attributes to disable caching or other optimizations, etc., not just a volatile load in machine code). I do not have specific examples at hand, but AFAIK I have seen some in the past (mostly on older architectures), as well as some watchdog chips that reset their timer on read.

Another use is memory and cache read benchmarking and profiling. We want to issue a read (to a register) from some memory location, but we do not need the value for anything else.

A more esoteric use might be memory probing. Some low-level software (a kernel or bootloader) might not know the memory layout, and resort to just doing reads and relying on CPU fault handlers to report invalid reads.

And some people might use a load without a destination as a prefetch hint, or to prefault some memory pages.
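For comparison, here is how a discarded read of the kind described above is commonly written in C. This is only a sketch of the C analogue, not of the D fix: `touch4` is a hypothetical name, and it relies on GCC's documented behaviour for C that evaluating a volatile-qualified lvalue in a void context still performs the read.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Read one byte from p four times and discard every result.
 * Because the pointee is volatile-qualified, a C compiler is expected to
 * keep all four loads, even though none of the values is used.
 */
static void touch4(const volatile unsigned char *p)
{
    assert(p != NULL);
    (void)*p;
    (void)*p;
    (void)*p;
    (void)*p;
}
```

Compiling a function like this with optimizations enabled and inspecting the assembly is a quick way to check that four separate byte loads survive, which is the behaviour the report expected from four `volatileLoad` calls.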
[Bug d/110516] core.volatile.volatileLoad discarded if result is unused
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110516 --- Comment #9 from Witold Baryluk --- Thank you for a quick fix Iain!
[Bug d/113125] New: [D] internal compiler error: in make_import, at d/imports.cc:48
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113125

Bug ID: 113125
Summary: [D] internal compiler error: in make_import, at d/imports.cc:48
Product: gcc
Version: 13.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: d
Assignee: ibuclaw at gdcproject dot org
Reporter: witold.baryluk+gcc at gmail dot com
Target Milestone: ---

Debian testing, amd64, gcc version 13.2.0 (Debian 13.2.0-7)

meta.d:
```
module objc.meta;

struct A;
```

runtime.d:
```
module objc.runtime;

public import meta : A;
```

gdc -v -c -I. runtime.d
```
$ gdc -v -c -I. runtime.d
Using built-in specs.
COLLECT_GCC=gdc
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 13.2.0-7' --with-bugurl=file:///usr/share/doc/gcc-13/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-13 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/libexec --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/reproducible-path/gcc-13-13.2.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/reproducible-path/gcc-13-13.2.0/debian/tmp-gcn/usr --enable-offload-defaulted --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-serialization=3
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 13.2.0 (Debian 13.2.0-7)
COLLECT_GCC_OPTIONS='-v' '-c' '-I' '.' '-o' 'runtime.o' '-shared-libgcc' '-mtune=generic' '-march=x86-64'
 /usr/libexec/gcc/x86_64-linux-gnu/13/d21 runtime.d -quiet -dumpbase runtime.d -dumpbase-ext .d -mtune=generic -march=x86-64 -version -imultiarch x86_64-linux-gnu -I . -v -o /tmp/ccPyiN0m.s
GNU D (Debian 13.2.0-7) version 13.2.0 (x86_64-linux-gnu)
        compiled by GNU C version 13.2.0, GMP version 6.3.0, MPFR version 4.2.1, MPC version 1.3.1, isl version isl-0.26-GMP
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
binary    /usr/libexec/gcc/x86_64-linux-gnu/13/d21
version   v2.103.1
predefs   GNU D_Version2 LittleEndian GNU_DWARF2_Exceptions GNU_StackGrowsDown GNU_InlineAsm D_LP64 D_PIC D_PIE assert D_PreConditions D_PostConditions D_Invariants D_ModuleInfo D_Exceptions D_TypeInfo all X86_64 D_HardFloat Posix linux CRuntime_Glibc CppRuntime_Gcc
parse     runtime
importall runtime
import    meta (meta.d)
import    object (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/object.d)
import    core.attribute (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/attribute.d)
import    gcc.attributes (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/gcc/attributes.d)
import    core.internal.hash (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/hash.d)
import    core.internal.traits (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/traits.d)
import    core.internal.entrypoint (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/entrypoint.d)
import    core.internal.array.appending (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/array/appending.d)
import    core.internal.array.comparison (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/array/comparison.d)
import    core.internal.array.equality (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/array/equality.d)
import    core.internal.array.casting (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/array/casting.d)
import    core.internal.array.concatenation (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/array/concatenation.d)
import    core.internal.array.construction (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/array/construction.d)
import    core.internal.array.arrayassign (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/array/arrayassign.d)
import    core.internal.array.capacity (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/array/capacity.d)
import    core.internal.dassert (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/dassert.d)
import    core.atomic (/usr/lib/gcc/x86_64-linu