[Bug c++/80561] New: Missed optimization: std::array data is aligned if array is aligned
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80561

            Bug ID: 80561
           Summary: Missed optimization: std::array data is aligned if array is aligned
           Product: gcc
           Version: 6.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jzwinck at gmail dot com
  Target Milestone: ---

In the following code, GCC fails to recognize that the data inside a std::array has the same alignment guarantees as the array itself. As a result, using std::array instead of a C-style array carries a significant runtime penalty: the alignment is checked unnecessarily, and code is generated for the unaligned case, which should never be used.

I tested this using:

    g++ -std=c++14 -O3 -march=haswell

GCC 6.1, 6.3 and 7 all fail to optimize this. Clang 3.7 through 4.0 optimizes it as expected.

In the code below, you can swap the comment on the two typedefs to confirm that GCC properly optimizes the C-style array.

The optimal code is 4 vmovupd, 2 vaddpd, and 1 vzeroupper. The suboptimal code is 73 instructions including 7 branches.

This was discussed on Stack Overflow: http://stackoverflow.com/questions/43651923

---
#include <array>

static constexpr size_t my_elements = 8;

typedef std::array<double, my_elements> Vec __attribute__((aligned(32)));
// typedef double Vec[my_elements] __attribute__((aligned(32)));

void func(Vec& __restrict__ v1, const Vec& v2) {
    for (unsigned i = 0; i < my_elements; ++i) {
        v1[i] += v2[i];
    }
}
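A possible source-level workaround, sketched here under assumptions rather than taken from the report: pass the alignment to the compiler explicitly with GCC's `__builtin_assume_aligned` on the pointer returned by `data()`. The function name `func_hinted` is hypothetical, and whether this actually removes the unaligned path on a given GCC version would need to be checked in the generated assembly.

```cpp
#include <array>
#include <cstddef>

static constexpr std::size_t my_elements = 8;

// Same over-aligned typedef as in the report.
typedef std::array<double, my_elements> Vec __attribute__((aligned(32)));

// Hypothetical variant of func(): the 32-byte alignment is asserted by hand
// instead of being inferred from the type of the std::array.
void func_hinted(Vec& __restrict__ v1, const Vec& v2) {
    double* a =
        static_cast<double*>(__builtin_assume_aligned(v1.data(), 32));
    const double* b =
        static_cast<const double*>(__builtin_assume_aligned(v2.data(), 32));
    for (std::size_t i = 0; i < my_elements; ++i) {
        a[i] += b[i];
    }
}
```

The hint is only valid when the caller really passes 32-byte-aligned objects, which objects declared with the `Vec` typedef are.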
[Bug target/80556] [8 Regression] bootstrap failure for Ada compiler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80556

Eric Botcazou changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-apple-darwin16
             Status|NEW                         |WAITING
                 CC|                            |ebotcazou at gcc dot gnu.org
          Component|ada                         |target
               Host|                            |x86_64-apple-darwin16
            Summary|[8 Regression] Ada breaks   |[8 Regression] bootstrap
                   |bootstrap on                |failure for Ada compiler
                   |x86_64-apple-darwin16       |
              Build|                            |x86_64-apple-darwin16

--- Comment #2 from Eric Botcazou ---
Other native platforms seem fine, so please post a backtrace.
[Bug tree-optimization/79697] unused realloc(0, n) not eliminated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79697 --- Comment #8 from prathamesh3492 at gcc dot gnu.org --- Author: prathamesh3492 Date: Sat Apr 29 10:05:13 2017 New Revision: 247407 URL: https://gcc.gnu.org/viewcvs?rev=247407&root=gcc&view=rev Log: 2017-04-29 Prathamesh Kulkarni PR tree-optimization/79697 * tree-ssa-dce.c (mark_stmt_if_obviously_necessary): Check if callee is BUILT_IN_STRDUP, BUILT_IN_STRNDUP, BUILT_IN_REALLOC. (propagate_necessity): Check if def_callee is BUILT_IN_STRDUP or BUILT_IN_STRNDUP. * gimple-fold.c (gimple_fold_builtin_realloc): New function. (gimple_fold_builtin): Call gimple_fold_builtin_realloc. testsuite/ * gcc.dg/tree-ssa/pr79697.c: New test. Added: trunk/gcc/testsuite/gcc.dg/tree-ssa/pr79697.c Modified: trunk/gcc/ChangeLog trunk/gcc/gimple-fold.c trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-ssa-dce.c
[Bug c++/80562] New: ICE using if constexpr with nonconstant expression in function template
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80562 Bug ID: 80562 Summary: ICE using if constexpr with nonconstant expression in function template Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: g...@arne-mertz.de Target Milestone: --- Build: GCC v8.0.0 (built from source 20170429) The following code: struct T { constexpr auto foo() { return false; } }; template constexpr auto bf(T t) { if constexpr(t.foo()) { return false; } return true; } Yields the following error, and ends with mmap() failing to allocate memory: : In function 'constexpr auto bf(T)': :7:25: internal compiler error: in cxx_eval_constant_expression, at cp/constexpr.c:4312 if constexpr(t.foo()) { ^ mmap: Cannot allocate memory Please submit a full bug report, with preprocessed source if appropriate. See <https://gcc.gnu.org/bugs/> for instructions. Compiler exited with result code 1 see https://godbolt.org/g/tTpkeD
[Bug fortran/80563] New: [cleanup] handle allocatable DT intent(out) arguments in init_intent_out_dt instead of gfc_conv_procedure_call
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80563

            Bug ID: 80563
           Summary: [cleanup] handle allocatable DT intent(out) arguments in init_intent_out_dt instead of gfc_conv_procedure_call
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: fortran
          Assignee: unassigned at gcc dot gnu.org
          Reporter: janus at gcc dot gnu.org
  Target Milestone: ---

Carry-over from PR 80121 comment 7:

> > In trans-decl.c there is a function called 'init_intent_out_dt', which takes
> > care of deallocating the allocatable components of intent(out) derived-type
> > dummies. However, it has a comment saying:
> >
> > /* Note: Allocatables are excluded as they are already handled
> >    by the caller. */
> >
> Apparently 'gfc_conv_procedure_call' in trans-expr.c does that.

My feeling is that it would be a good idea to handle allocatable derived types inside of the callee as well. I can see at least two advantages:

* It would avoid code duplication if the procedure is called several times.
* It would take some complexity out of gfc_conv_procedure_call, which is quite a monster.

From the technical side a treatment in the callee should be possible AFAICS. I wonder why it is being done in the caller at all?
[Bug libstdc++/80564] New: bind on SFINAE unfriendly generic lambda
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80564

            Bug ID: 80564
           Summary: bind on SFINAE unfriendly generic lambda
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: colu...@gmx-topmail.de
  Target Milestone: ---

Related to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49058

---
#include <functional>

int main() {
    int i;
    std::bind([] (auto& x) {x = 1;}, i)();
}
---

This is rejected because, during overload resolution, _Bind::operator() const's default template argument is spuriously instantiated.
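One way to sidestep the problem, offered as a sketch and not as the fix proposed in the report: give the generic lambda an explicit return type. The lambda body then no longer has to be instantiated just to compute the result type of the `const` overload, so the ill-formed assignment to a `const int&` is never compiled. The names `calls` and `run_bound` are hypothetical and exist only to make the behaviour observable.

```cpp
#include <functional>

int calls = 0;  // hypothetical counter, only used to observe the call

int run_bound() {
    int i = 0;
    // "-> void" makes the lambda SFINAE-friendly: its return type is known
    // without instantiating the body for a const-qualified argument.
    auto bound = std::bind([](auto& x) -> void { x = 1; ++calls; }, i);
    bound();
    // std::bind stores a copy of i, so the caller's i is left unchanged.
    return i;
}
```

Note that, as in the original test case, the lambda modifies the copy held inside the bind object, not the variable `i` itself.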
[Bug target/80556] [8 Regression] bootstrap failure for Ada compiler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80556

--- Comment #3 from Dominique d'Humieres ---
> Other native platforms seem fine, so please post a backtrace.

The best I can do without further directives:

[Book15] ada/rts% lldb /opt/gcc/build_a/gcc/gnat1
(lldb) run -O2 g-exptty.adb
Process 95815 launched: '/opt/gcc/build_a/gcc/gnat1' (x86_64)
Process 95815 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x7fffa7e8fd42 libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
->  0x7fffa7e8fd42 <+10>: jae    0x7fffa7e8fd4c            ; <+20>
    0x7fffa7e8fd44 <+12>: movq   %rax, %rdi
    0x7fffa7e8fd47 <+15>: jmp    0x7fffa7e88caf            ; cerror_nocancel
    0x7fffa7e8fd4c <+20>: retq
(lldb) bt
error: gnat1 {0x00179120}: unhandled type tag 0x0021 (DW_TAG_subrange_type), please file a bug and attach the file at the start of this error message
... a bunch of similar errors
...
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x7fffa7e8fd42 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x7fffa7f7d5bf libsystem_pthread.dylib`pthread_kill + 90
    frame #2: 0x7fffa7df5420 libsystem_c.dylib`abort + 129
    frame #3: 0x000100ff88c1 gnat1`uw_init_context_1(context=, outer_cfa=, outer_ra=) at unwind-dw2.c:1579
    frame #4: 0x000100ff8f2e gnat1`_Unwind_RaiseException(exc=0x000144a022a0) at unwind.inc:88
    frame #5: 0x00010006663f gnat1`ada__exceptions__exception_propagation__propagate_gcc_exceptionXn(gcc_exception=0x000144a022a0) at a-exexpr.adb:322
    frame #6: 0x000100066683 gnat1`ada__exceptions__exception_propagation__propagate_exceptionXn(excep=) at a-exexpr.adb:354
    frame #7: 0x000100066af9 gnat1`ada__exceptions__complete_and_propagate_occurrence(x=) at a-except.adb:937
    frame #8: 0x000100066b2e gnat1`__gnat_raise_exception(e=, message=) at a-except.adb:978
    frame #9: 0x0001001fbf9a gnat1`rtsfind__load_fail(s=const string___XUP @ 0x7fe1c0cf6f50, u_id=, id=) at rtsfind.adb:851
    frame #10: 0x0001001fc316 gnat1`rtsfind__load_rtu(u_id=, id=, use_setting=) at rtsfind.adb:987
    frame #11: 0x0001001fc74e gnat1`rtsfind__rte at rtsfind.adb:1380
    frame #12: 0x0001001fcab8 gnat1`rtsfind__rte_available(e=) at rtsfind.adb:1462
    frame #13: 0x00010011d4ad gnat1`exp_ch9__expand_n_delay_relative_statement(n=) at exp_ch9.adb:8068
    frame #14: 0x00010017078f gnat1`expander__expand(n=) at expander.adb:214
    frame #15: 0x0001002124d8 gnat1`sem__analyze(n=) at sem.adb:753
    frame #16: 0x00010029d347 gnat1`sem_ch5__analyze_statements(l=) at sem_ch5.adb:3613
    frame #17: 0x00010029f06e gnat1`sem_ch5__analyze_if_statement(n=) at sem_ch5.adb:1665
    frame #18: 0x000100212bf0 gnat1`sem__analyze(n=) at sem.adb:306
    frame #19: 0x00010029d347 gnat1`sem_ch5__analyze_statements(l=) at sem_ch5.adb:3613
    frame #20: 0x0001002396ee gnat1`sem_ch11__analyze_handled_statements(n=) at sem_ch11.adb:426
    frame #21: 0x000100212882 gnat1`sem__analyze(n=) at sem.adb:297
    frame #22: 0x0001002ad694 gnat1`sem_ch6__analyze_subprogram_body(n=) at sem_ch6.adb:4245
    frame #23: 0x000100212ace gnat1`sem__analyze(n=) at sem.adb:547
    frame #24: 0x00010027767b gnat1`sem_ch3__analyze_declarations(l=) at sem_ch3.adb:2608
    frame #25: 0x0001002b2dbe gnat1`sem_ch7__analyze_package_body(n=) at sem_ch7.adb:786
    frame #26: 0x000100212ada gnat1`sem__analyze(n=) at sem.adb:444
    frame #27: 0x000100236c22 gnat1`sem_ch10__analyze_compilation_unit(n=) at sem_ch10.adb:897
    frame #28: 0x000100212713 gnat1`sem__analyze(n=) at sem.adb:180
    frame #29: 0x000100213863 gnat1`sem__semantics at sem.adb:1338
    frame #30: 0x0001002137e6 gnat1`sem__semantics
    frame #31: 0x000100182fd4 gnat1`_ada_frontend at frontend.adb:407
    frame #32: 0x00010037a6b1 gnat1`_ada_gnat1drv at gnat1drv.adb:1127
    frame #33: 0x00010001daff gnat1`::gnat_parse_file() at misc.c:122
    frame #34: 0x000100c583ca gnat1`::compile_file() at toplev.c:467
    frame #35: 0x000100ffd717 gnat1`toplev::main(int, char**) at toplev.c:2003
    frame #36: 0x000100ffd227 gnat1`toplev::main(this=0x7fff5fbff2fe, argc=, argv=)
    frame #37: 0x000100fff2fe gnat1`main(argc=3, argv=0x7fff5fbff330) at main.c:39
    frame #38: 0x7fffa7d61235 libdyld.dylib`start + 1
[Bug tree-optimization/80487] redundant memset/strncpy calls not eliminated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80487 --- Comment #3 from Marc Glisse --- Author: glisse Date: Sat Apr 29 14:39:25 2017 New Revision: 247408 URL: https://gcc.gnu.org/viewcvs?rev=247408&root=gcc&view=rev Log: Add st[pr]ncpy to stmt_kills_ref_p 2017-04-29 Marc Glisse PR tree-optimization/80487 gcc/ * tree-ssa-alias.c (stmt_kills_ref_p): Handle stpncpy and strncpy. gcc/testsuite/ * gcc.dg/tree-ssa/strncpy-1.c: New file. Added: trunk/gcc/testsuite/gcc.dg/tree-ssa/strncpy-1.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-ssa-alias.c
[Bug tree-optimization/80487] redundant memset/strncpy calls not eliminated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80487

Marc Glisse changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #4 from Marc Glisse ---
.
[Bug rtl-optimization/80491] [6/7/8 Regression] Compiler regression for long-add case.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80491 --- Comment #9 from Jakub Jelinek --- Author: jakub Date: Sat Apr 29 16:17:13 2017 New Revision: 247409 URL: https://gcc.gnu.org/viewcvs?rev=247409&root=gcc&view=rev Log: PR rtl-optimization/80491 * alias.c (memory_modified_in_insn_p): Return true for CALL_INSNs. Modified: trunk/gcc/ChangeLog trunk/gcc/alias.c
[Bug bootstrap/80565] New: ICE at -O2 and -O3 in 32-bit mode (not 64-bit) on x86_64-linux-gnu (in edge_badness, at ipa-inline.c:1028)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80565 Bug ID: 80565 Summary: ICE at -O2 and -O3 in 32-bit mode (not 64-bit) on x86_64-linux-gnu (in edge_badness, at ipa-inline.c:1028) Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: bootstrap Assignee: unassigned at gcc dot gnu.org Reporter: chengniansun at gmail dot com Target Milestone: --- $ gcc-trunk -v Using built-in specs. COLLECT_GCC=gcc-trunk COLLECT_LTO_WRAPPER=/usr/local/gcc-trunk/libexec/gcc/x86_64-pc-linux-gnu/8.0.0/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: ../gcc-source-trunk/configure --enable-languages=c,c++,lto --prefix=/usr/local/gcc-trunk --disable-bootstrap Thread model: posix gcc version 8.0.0 20170429 (experimental) [trunk revision 247405] (GCC) $ gcc-trunk -m32 -O2 small.c small.c: In function ‘fn2’: small.c:4:14: warning: type of ‘p1’ defaults to ‘int’ [-Wimplicit-int] static short fn2(p1) { ^~~ small.c: At top level: small.c:39:1: internal compiler error: in edge_badness, at ipa-inline.c:1028 } ^ 0x139f133 edge_badness ../../gcc-source-trunk/gcc/ipa-inline.c:1028 0x13a037b update_edge_key ../../gcc-source-trunk/gcc/ipa-inline.c:1224 0x13a08da update_caller_keys ../../gcc-source-trunk/gcc/ipa-inline.c:1351 0x13a269f inline_small_functions ../../gcc-source-trunk/gcc/ipa-inline.c:2045 0x13a269f ipa_inline ../../gcc-source-trunk/gcc/ipa-inline.c:2438 0x13a269f execute ../../gcc-source-trunk/gcc/ipa-inline.c:2849 Please submit a full bug report, with preprocessed source if appropriate. Please include the complete backtrace with any bug report. See <https://gcc.gnu.org/bugs/> for instructions. 
$ cat small.c int a, b, c, e, h, j; char d; short f, g; static short fn2(p1) { for (;;) for (; g; g++) if (p1) break; } static short fn3(); static char fn4(char p1) { int i; for (; d;) f = 8; for (; f; f = 0) for (; i; i++) { j = 0; for (; j; j++) ; } } static short fn1(short p1) { fn2(b || fn3()); } short fn3() { if (c) { fn4(e); h = 0; for (;; h++) ; } } int main() { for (; a;) fn1(c); return 0; }
[Bug rtl-optimization/80491] [6/7/8 Regression] Compiler regression for long-add case.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80491 --- Comment #10 from Jakub Jelinek --- Author: jakub Date: Sat Apr 29 16:18:11 2017 New Revision: 247410 URL: https://gcc.gnu.org/viewcvs?rev=247410&root=gcc&view=rev Log: PR rtl-optimization/80491 * ifcvt.c (noce_process_if_block): When looking for x setter with missing else_bb, don't check only the insn right before cond_earliest, but look for the last insn that x is modified in within the same bb. Modified: trunk/gcc/ChangeLog trunk/gcc/ifcvt.c
[Bug target/80566] New: no use of avx vmovups on ymm registry in set and copy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80566 Bug ID: 80566 Summary: no use of avx vmovups on ymm registry in set and copy Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- in this example #include int * foo() { int * p = new int[16]; memset(p,0,16*sizeof(int)); return p; } int * foo(int * q) { int * p = new int[16]; memcpy(q,p,16*sizeof(int)); return p; } gcc does not make use of vmovups on ymm registry ( c++ -O3 -Wall -march=haswell -S) while (according to gcc.godbolt.org) clang 4.0 does https://godbolt.org/g/qnX975
[Bug target/80556] [8 Regression] bootstrap failure for Ada compiler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80556 Eric Botcazou changed: What|Removed |Added Status|WAITING |NEW CC||gingold at gcc dot gnu.org --- Comment #4 from Eric Botcazou --- > a bunch of similar errors > ... > * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT > * frame #0: 0x7fffa7e8fd42 libsystem_kernel.dylib`__pthread_kill + 10 > frame #1: 0x7fffa7f7d5bf libsystem_pthread.dylib`pthread_kill + 90 > frame #2: 0x7fffa7df5420 libsystem_c.dylib`abort + 129 > frame #3: 0x000100ff88c1 > gnat1`uw_init_context_1(context=, outer_cfa=, > outer_ra=) at unwind-dw2.c:1579 > frame #4: 0x000100ff8f2e > gnat1`_Unwind_RaiseException(exc=0x000144a022a0) at unwind.inc:88 > frame #5: 0x00010006663f > gnat1`ada__exceptions__exception_propagation__propagate_gcc_exceptionXn(gcc_e > xception=0x000144a022a0) at a-exexpr.adb:322 > frame #6: 0x000100066683 > gnat1`ada__exceptions__exception_propagation__propagate_exceptionXn(excep= available>) at a-exexpr.adb:354 > frame #7: 0x000100066af9 > gnat1`ada__exceptions__complete_and_propagate_occurrence(x=) at > a-except.adb:937 > frame #8: 0x000100066b2e > gnat1`__gnat_raise_exception(e=, message=) at > a-except.adb:978 > frame #9: 0x0001001fbf9a gnat1`rtsfind__load_fail(s=const > string___XUP @ 0x7fe1c0cf6f50, u_id=, id=) at > rtsfind.adb:851 OK, thanks, there is a problem in exception propagation on the Darwin host. I'm not exactly a specialist here, so CCing Tristan.
[Bug c++/80567] New: bogus fixit hint for undeclared memset: else
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80567

            Bug ID: 80567
           Summary: bogus fixit hint for undeclared memset: else
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: msebor at gcc dot gnu.org
  Target Milestone: ---

For the test case below where the program uses memset without first including <string.h> G++ suggests as an alternative the "else" keyword.

$ cat z.C && gcc -O2 -Wall -Wextra -Wpedantic z.C
void f (void *p)
{
  memset (p, 0, 4);
}
z.C: In function ‘void f(void*)’:
z.C:3:3: error: ‘memset’ was not declared in this scope
   memset (p, 0, 4);
   ^~
z.C:3:3: note: suggested alternative: ‘else’
   memset (p, 0, 4);
   ^~
   else

In C mode, GCC prints the far more helpful (though not perfect):

z.C:3:3: warning: implicit declaration of function ‘memcpy’ [-Wimplicit-function-declaration]
   memcpy (p, "1234", 4);
   ^~
z.C:3:3: warning: incompatible implicit declaration of built-in function ‘memcpy’
z.C:3:3: note: include ‘<string.h>’ or provide a declaration of ‘memcpy’

Although in C++ it's possible to declare one's own overloads of memset and other library functions, I think it would be helpful to have G++ issue a hint similar to the GCC note (i.e., in the absence of any other memset, assume that the name refers to the standard library function and suggest the user #include <string.h>). Ditto for any other C standard library functions.

In any case, it seems that to avoid obviously incorrect suggestions like the one above, G++ needs to consider more of the context in which an undeclared identifier is used.
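For reference, a minimal sketch of what the test case looks like once the missing declaration is supplied, which is the correction a useful hint would point the user toward. Using `<cstring>` with `std::memset` is one choice made here for illustration; including `<string.h>` and calling plain `memset` would work equally well.

```cpp
#include <cstring>  // declares std::memset

void f(void* p) {
    // With the header included, the call resolves to the library function
    // and no diagnostic about an undeclared identifier is needed.
    std::memset(p, 0, 4);
}
```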
[Bug c++/80560] warn on undefined memory operations involving non-trivial types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80560

Martin Sebor changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |patch

--- Comment #2 from Martin Sebor ---
Patch posted for review: https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01571.html
[Bug target/80568] New: x86 -mavx256-split-unaligned-load (and store) is affecting AVX2 code, but probably shouldn't be.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568 Bug ID: 80568 Summary: x86 -mavx256-split-unaligned-load (and store) is affecting AVX2 code, but probably shouldn't be. Product: gcc Version: 7.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Created attachment 41285 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41285&action=edit bswap16.cc gcc7 (or at least the gcc8 snapshot on https://godbolt.org/g/ZafCE0) is now splitting unaligned loads/stores even for AVX2 integer, where gcc6.3 didn't. I think this is undesirable by default, because some projects probably build with -mavx2 but fail to use -mtune=haswell (or broadwell or skylake). For now, Intel CPUs that do well with 32B unaligned loads are probably the most common AVX2-supporting CPUs. IDK what's optimal for Excavator or Zen. Was this an intentional change to make those tune options work better for those CPUs? I would suggest that -mavx2 should imply -mno-avx256-split-unaligned-load (and -store) for -mtune=generic. Or if that's too ugly (having insn set selection affect tuning), then maybe just revert to the previous behaviour of having integer loads/store not be split the way FP loads/stores are. The conventional wisdom is that unaligned loads are just as fast as aligned when the data does happen to be aligned at run-time. Splitting this way badly breaks that assumption. It's inconvenient/impossible to portably communicate to the compiler that it should optimize for the case where the data is aligned, even if that's not guaranteed, so loadu / storeu are probably used in lots of code that normally runs on aligned data. Also, gcc doesn't always figure out that a hand-written scalar prologue does leave the pointer aligned for a vector loop. (And since programmers expect it not to matter, they may still use `_mm256_loadu_si256`). 
I reduced some real existing code that a colleague wrote into a test-case for this: https://godbolt.org/g/ZafCE0, also attached. If using potentially-overlapping first/last vectors instead of scalar loops, it might use loadu just to avoid duplicating a helper function.

For an example of affected code, consider an endian-swap function that uses this (inline) function in its inner loop. The code inside the loop matches what we get for compiling it stand-alone, so I'll just show that:

#include <immintrin.h>
// static inline
void swap256(char* addr, __m256i mask) {
    __m256i vec = _mm256_loadu_si256((__m256i*)addr);
    vec = _mm256_shuffle_epi8(vec, mask);
    _mm256_storeu_si256((__m256i*)addr, vec);
}

gcc6.3 -O3 -mavx2:
        vmovdqu (%rdi), %ymm1
        vpshufb %ymm0, %ymm1, %ymm0
        vmovdqu %ymm0, (%rdi)

g++ (GCC-Explorer-Build) 8.0.0 20170429 (experimental) -O3 -mavx2
        vmovdqu (%rdi), %xmm1
        vinserti128     $0x1, 16(%rdi), %ymm1, %ymm1
        vpshufb %ymm0, %ymm1, %ymm0
        vmovups %xmm0, (%rdi)
        vextracti128    $0x1, %ymm0, 16(%rdi)
[Bug target/79964] Cortex A53 codegen still not optimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79964 --- Comment #2 from PeteVine --- I can confirm the first part of the issue gets fixed with this patch: https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01415.html but there's a regression in gcc8 concerning the second part. (or rather the workarounds don't work any more) http://openbenchmarking.org/result/1704298-RI-CRAYREGRE13 ("basic flags" didn't deactivate -mfix-cortex-a53-843419, hence the difference)
[Bug target/80569] New: i686: "shrx" instruction generated in 16-bit mode
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80569 Bug ID: 80569 Summary: i686: "shrx" instruction generated in 16-bit mode Product: gcc Version: 6.3.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: davmac at davmac dot org Target Milestone: --- The following code, compiled with -m16 -O2 -c, fails at assembly: --- begin --- void load_kernel(void *setup_addr) { unsigned int seg = (unsigned int)setup_addr >> 4; asm("movl %0, %%es" : : "r"(seg)); } --- end --- $ gcc -m16 -O2 -c shrxdtestcase.i /tmp/ccGS34WK.s: Assembler messages: /tmp/ccGS34WK.s:11: Error: instruction `shrx' isn't supported in 16-bit mode.
[Bug target/80569] i686: "shrx" instruction generated in 16-bit mode
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80569 --- Comment #1 from Davin McCall --- (Prevents building Qemu).
[Bug bootstrap/80565] [8 Regression] ICE at -O2 and -O3 in 32-bit mode (not 64-bit) on x86_64-linux-gnu (in edge_badness, at ipa-inline.c:1028)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80565

Martin Liška changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
      Known to work|                            |7.0.1
           Keywords|                            |ice-on-valid-code
   Last reconfirmed|                            |2017-04-30
                 CC|                            |hubicka at ucw dot cz,
                   |                            |marxin at gcc dot gnu.org
     Ever confirmed|0                           |1
            Summary|ICE at -O2 and -O3 in       |[8 Regression] ICE at -O2
                   |32-bit mode (not 64-bit) on |and -O3 in 32-bit mode (not
                   |x86_64-linux-gnu (in        |64-bit) on x86_64-linux-gnu
                   |edge_badness, at            |(in edge_badness, at
                   |ipa-inline.c:1028)          |ipa-inline.c:1028)
   Target Milestone|---                         |8.0
      Known to fail|                            |8.0

--- Comment #1 from Martin Liška ---
Confirmed, started with r247380.
[Bug target/80570] New: auto-vectorizing int->double conversion should use half-width memory operands to avoid shuffles, instead of load+extract
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80570 Bug ID: 80570 Summary: auto-vectorizing int->double conversion should use half-width memory operands to avoid shuffles, instead of load+extract Product: gcc Version: 8.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* When auto-vectorizing int->double conversion, gcc loads a full-width vector into a register and then unpacks the upper half to feed (v)cvtdq2pd. e.g. with AVX, we get a 256b load and then vextracti128. It's even worse with an unaligned src pointer with -mavx256-split-unaligned-load, where it does vinsertf128 -> vextractf128, without ever doing anything with the full 256b vector! On Intel SnB-family CPUs, this will bottleneck the loop on port5 throughput, because VCVTDQ2PD reg -> reg needs a port5 uop as well as a port1 uop. (And vextracti128 can only run on the shuffle unit on port5). VCVTDQ2PD with a memory source operand doesn't need the shuffle port at all on Intel Haswell and later, just the FP-add unit and a load, so it's a much better choice. (Throughput of one per clock on Sandybridge and Haswell, 2 per clock on Skylake). It's still 2 fused-domain uops, though, so I guess it can't micro-fuse the load according to Agner Fog's testing. (Or 3 on SnB). I'm pretty sure using twice as many half-width memory operands is not worse on other AVX CPUs either (AMD BD-family or Zen, or KNL), vs. max-width loads and extracting the high half. 
void cvti32f64_loop(double *dp, int *ip) {
    // ICC avoids the mistake when it doesn't emit a prologue to align the pointers
#ifdef __GNUC__
    dp = __builtin_assume_aligned(dp, 64);
    ip = __builtin_assume_aligned(ip, 64);
#endif
    for (int i=0; i<10000 ; i++) {
        double tmp = ip[i];
        dp[i] = tmp;
    }
}

https://godbolt.org/g/329C3P

gcc.godbolt.org's "gcc7" snapshot: g++ (GCC-Explorer-Build) 8.0.0 20170429 (experimental)
gcc -O3 -march=sandybridge

cvti32f64_loop:
        xorl    %eax, %eax
.L2:
        vmovdqa (%rsi,%rax), %ymm0
        vcvtdq2pd       %xmm0, %ymm1
        vextractf128    $0x1, %ymm0, %xmm0
        vmovapd %ymm1, (%rdi,%rax,2)
        vcvtdq2pd       %xmm0, %ymm0
        vmovapd %ymm0, 32(%rdi,%rax,2)
        addq    $32, %rax
        cmpq    $40000, %rax
        jne     .L2
        vzeroupper
        ret

gcc does the same thing for -march=haswell, but uses vextracti128. This is obviously really silly.

For comparison, clang 4.0 -O3 -march=sandybridge -fno-unroll-loops emits:

        xorl    %eax, %eax
.LBB0_1:
        vcvtdq2pd       (%rsi,%rax,4), %ymm0
        vmovaps %ymm0, (%rdi,%rax,8)
        addq    $4, %rax
        cmpq    $10000, %rax            # imm = 0x2710
        jne     .LBB0_1
        vzeroupper
        retq

This should come close to one 256b store per clock (on Haswell), even with unrolling disabled.

With -march=nehalem, gcc gets away with it for this simple not-unrolled loop (without hurting throughput I think), but only because this strategy effectively unrolls the loop (doing two stores per add + cmp/jne), and Nehalem can run shuffles on two execution ports (so the pshufd can run on port1, while the cvtdq2pd can run on ports 1+5). So it's 10 fused-domain uops per 2 stores instead of 5 per 1 store. Depending on how the loop buffer handles non-multiple-of-4 uop counts, this might be a wash. (Of course, with any other work in the loop, or with unrolling, the memory-operand strategy is much better).

CVTDQ2PD's memory operand is only 64 bits, so even the non-AVX version doesn't fault if misaligned.
--

It's even more horrible without aligned pointers, when the sandybridge version (which splits unaligned 256b loads/stores) uses vinsertf128 to emulate a 256b load, and then does vextractf128 right away:

inner_loop:   # gcc8 -march=sandybridge without __builtin_assume_aligned
        vmovdqu (%r8,%rax), %xmm0
        vinsertf128     $0x1, 16(%r8,%rax), %ymm0, %ymm0
        vcvtdq2pd       %xmm0, %ymm1
        vextractf128    $0x1, %ymm0, %xmm0
        vmovapd %ymm1, (%rcx,%rax,2)
        vcvtdq2pd       %xmm0, %ymm0
        vmovapd %ymm0, 32(%rcx,%rax,2)

This is obviously really really bad, and should probably be checked for and avoided in case there are things other than int->double autovec that could lead to doing this.

---

With -march=skylake-avx512, gcc does the AVX512 version of the same thing: zmm load and then extract the upper 256b