[Bug c++/80561] New: Missed optimization: std::array data is aligned if array is aligned

2017-04-29 Thread jzwinck at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80561

Bug ID: 80561
   Summary: Missed optimization: std::array data is aligned if
array is aligned
   Product: gcc
   Version: 6.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jzwinck at gmail dot com
  Target Milestone: ---

In the following code, GCC fails to recognize that the data inside a std::array
has the same alignment guarantees as the array itself.  The result is that
using std::array instead of a C-style array carries a significant runtime
penalty, as the alignment is checked unnecessarily and code is generated for
the unaligned case which should never be used.  I tested this using:

g++ -std=c++14 -O3 -march=haswell

GCC 6.1, 6.3 and 7 all fail to optimize this.  Clang 3.7 through 4.0 optimizes
it as expected.

In the code below, you can swap the comment on the two typedefs to confirm that
GCC properly optimizes the C-style array.

The optimal code is 4 vmovupd, 2 vaddpd, and 1 vzeroupper.  The suboptimal code
is 73 instructions including 7 branches.

This was discussed on Stack Overflow:
http://stackoverflow.com/questions/43651923

---

#include <array>

static constexpr size_t my_elements = 8;

typedef std::array<double, my_elements> Vec __attribute__((aligned(32)));
// typedef double Vec[my_elements] __attribute__((aligned(32)));

void func(Vec& __restrict__ v1, const Vec& v2)
{
    for (unsigned i = 0; i < my_elements; ++i)
    {
        v1[i] += v2[i];
    }
}
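
A possible workaround, sketched here under assumptions rather than taken from the report: state the alignment explicitly with __builtin_assume_aligned (a GCC/Clang builtin) on the pointers returned by data(). The name func_hinted and the debug assertions are illustrative only, and whether this yields the optimal instruction sequence on a given GCC version would need to be checked in the generated assembly.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t my_elements = 8;
using Vec = std::array<double, my_elements>;

// Hypothetical workaround, not part of the report: the caller promises that
// both arrays are 32-byte aligned, and that promise is handed to the
// optimizer through __builtin_assume_aligned.
void func_hinted(Vec& __restrict__ v1, const Vec& v2)
{
    // Debug-build check of the promise; compiled out with -DNDEBUG.
    assert(reinterpret_cast<std::uintptr_t>(v1.data()) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(v2.data()) % 32 == 0);

    double* a = static_cast<double*>(__builtin_assume_aligned(v1.data(), 32));
    const double* b =
        static_cast<const double*>(__builtin_assume_aligned(v2.data(), 32));

    for (std::size_t i = 0; i < my_elements; ++i)
        a[i] += b[i];
}
```

Callers would declare their objects with the required alignment, for example `alignas(32) Vec a;`, since passing a less aligned array would break the assumption.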

[Bug target/80556] [8 Regression] bootstrap failure for Ada compiler

2017-04-29 Thread ebotcazou at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80556

Eric Botcazou  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-apple-darwin16
             Status|NEW                         |WAITING
                 CC|                            |ebotcazou at gcc dot gnu.org
          Component|ada                         |target
               Host|                            |x86_64-apple-darwin16
            Summary|[8 Regression] Ada breaks   |[8 Regression] bootstrap
                   |bootstrap on                |failure for Ada compiler
                   |x86_64-apple-darwin16       |
              Build|                            |x86_64-apple-darwin16

--- Comment #2 from Eric Botcazou  ---
Other native platforms seem fine, so please post a backtrace.

[Bug tree-optimization/79697] unused realloc(0, n) not eliminated

2017-04-29 Thread prathamesh3492 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79697

--- Comment #8 from prathamesh3492 at gcc dot gnu.org ---
Author: prathamesh3492
Date: Sat Apr 29 10:05:13 2017
New Revision: 247407

URL: https://gcc.gnu.org/viewcvs?rev=247407&root=gcc&view=rev
Log:
2017-04-29  Prathamesh Kulkarni  

PR tree-optimization/79697
* tree-ssa-dce.c (mark_stmt_if_obviously_necessary): Check if callee
is BUILT_IN_STRDUP, BUILT_IN_STRNDUP, BUILT_IN_REALLOC.
(propagate_necessity): Check if def_callee is BUILT_IN_STRDUP or
BUILT_IN_STRNDUP.
* gimple-fold.c (gimple_fold_builtin_realloc): New function.
(gimple_fold_builtin): Call gimple_fold_builtin_realloc.

testsuite/
* gcc.dg/tree-ssa/pr79697.c: New test.

Added:
trunk/gcc/testsuite/gcc.dg/tree-ssa/pr79697.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/gimple-fold.c
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-ssa-dce.c

[Bug c++/80562] New: ICE using if constexpr with nonconstant expression in function template

2017-04-29 Thread g...@arne-mertz.de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80562

Bug ID: 80562
   Summary: ICE using if constexpr with nonconstant expression in
function template
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: g...@arne-mertz.de
  Target Milestone: ---

Build: GCC v8.0.0 (built from source 20170429)

The following code:


struct T {
  constexpr auto foo() { return false; }
};

template 
constexpr auto bf(T t) {
if constexpr(t.foo()) {
return false;
}
return true;
}


Yields the following error, and ends with mmap() failing to allocate memory:

: In function 'constexpr auto bf(T)':
:7:25: internal compiler error: in cxx_eval_constant_expression, at
cp/constexpr.c:4312
 if constexpr(t.foo()) {
 ^
mmap: Cannot allocate memory
Please submit a full bug report,
with preprocessed source if appropriate.
See <https://gcc.gnu.org/bugs/> for instructions.
Compiler exited with result code 1


see https://godbolt.org/g/tTpkeD

[Bug fortran/80563] New: [cleanup] handle allocatable DT intent(out) arguments in init_intent_out_dt instead of gfc_conv_procedure_call

2017-04-29 Thread janus at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80563

Bug ID: 80563
   Summary: [cleanup] handle allocatable DT intent(out) arguments
in init_intent_out_dt instead of
gfc_conv_procedure_call
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: fortran
  Assignee: unassigned at gcc dot gnu.org
  Reporter: janus at gcc dot gnu.org
  Target Milestone: ---

Carry-over from PR 80121 comment 7:

> > In trans-decl.c there is a function called 'init_intent_out_dt', which takes
> > care of deallocating the allocatable components of intent(out) derived-type
> > dummies. However, it has a comment saying:
> > 
> > /* Note: Allocatables are excluded as they are already handled
> >by the caller.  */
> 
> 
> Apparently 'gfc_conv_procedure_call' in trans-expr.c does that.

My feeling is that it would be a good idea to handle allocatable derived types
inside of the callee as well. I can see at least two advantages:
* It would avoid code duplication if the procedure is called several times.
* It would take some complexity out of gfc_conv_procedure_call, which is quite
a monster.

From the technical side a treatment in the callee should be possible AFAICS. I
wonder why it is being done in the caller at all?

[Bug libstdc++/80564] New: bind on SFINAE unfriendly generic lambda

2017-04-29 Thread colu...@gmx-topmail.de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80564

Bug ID: 80564
   Summary: bind on SFINAE unfriendly generic lambda
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: colu...@gmx-topmail.de
  Target Milestone: ---

Related to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49058

---
#include <functional>

int main() {
    int i;
    std::bind([] (auto& x) {x = 1;}, i)();
}

---

This is rejected because, during overload resolution, _Bind::operator() const's
default template argument is spuriously instantiated.
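
Until this is resolved, one way to sidestep the problem is to avoid std::bind and let a lambda capture the argument. The sketch below is illustrative only (set_with_lambda is a hypothetical name) and is not semantically identical to the test case: std::bind stores a copy of i, so the original call assigns to that copy, whereas this version captures i by reference to make the effect observable.

```cpp
// Hypothetical alternative to the std::bind call in the test case.
// Returns the value of i after the call so the effect can be checked.
int set_with_lambda()
{
    int i = 0;
    auto set_i = [&i] { i = 1; };  // no call wrapper, no const overload probed
    set_i();
    return i;
}
```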

[Bug target/80556] [8 Regression] bootstrap failure for Ada compiler

2017-04-29 Thread dominiq at lps dot ens.fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80556

--- Comment #3 from Dominique d'Humieres  ---
> Other native platforms seem fine, so please post a backtrace.

The best I can do without further directives:

[Book15] ada/rts% lldb /opt/gcc/build_a/gcc/gnat1
(lldb) run -O2 g-exptty.adb
Process 95815 launched: '/opt/gcc/build_a/gcc/gnat1' (x86_64)
Process 95815 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
frame #0: 0x7fffa7e8fd42 libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
->  0x7fffa7e8fd42 <+10>: jae    0x7fffa7e8fd4c            ; <+20>
    0x7fffa7e8fd44 <+12>: movq   %rax, %rdi
    0x7fffa7e8fd47 <+15>: jmp    0x7fffa7e88caf            ; cerror_nocancel
    0x7fffa7e8fd4c <+20>: retq
(lldb) bt
error: gnat1 {0x00179120}: unhandled type tag 0x0021 (DW_TAG_subrange_type),
please file a bug and attach the file at the start of this error message
...
a bunch of similar errors
...
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x7fffa7e8fd42 libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x7fffa7f7d5bf libsystem_pthread.dylib`pthread_kill + 90
frame #2: 0x7fffa7df5420 libsystem_c.dylib`abort + 129
frame #3: 0x000100ff88c1 gnat1`uw_init_context_1(context=,
outer_cfa=, outer_ra=) at unwind-dw2.c:1579
frame #4: 0x000100ff8f2e
gnat1`_Unwind_RaiseException(exc=0x000144a022a0) at unwind.inc:88
frame #5: 0x00010006663f
gnat1`ada__exceptions__exception_propagation__propagate_gcc_exceptionXn(gcc_exception=0x000144a022a0)
at a-exexpr.adb:322
frame #6: 0x000100066683
gnat1`ada__exceptions__exception_propagation__propagate_exceptionXn(excep=)
at a-exexpr.adb:354
frame #7: 0x000100066af9
gnat1`ada__exceptions__complete_and_propagate_occurrence(x=) at
a-except.adb:937
frame #8: 0x000100066b2e gnat1`__gnat_raise_exception(e=,
message=) at a-except.adb:978
frame #9: 0x0001001fbf9a gnat1`rtsfind__load_fail(s=const string___XUP
@ 0x7fe1c0cf6f50, u_id=, id=) at rtsfind.adb:851
frame #10: 0x0001001fc316 gnat1`rtsfind__load_rtu(u_id=,
id=, use_setting=) at rtsfind.adb:987
frame #11: 0x0001001fc74e gnat1`rtsfind__rte at rtsfind.adb:1380
frame #12: 0x0001001fcab8 gnat1`rtsfind__rte_available(e=)
at rtsfind.adb:1462
frame #13: 0x00010011d4ad
gnat1`exp_ch9__expand_n_delay_relative_statement(n=) at
exp_ch9.adb:8068
frame #14: 0x00010017078f gnat1`expander__expand(n=) at expander.adb:214
frame #15: 0x0001002124d8 gnat1`sem__analyze(n=)
at sem.adb:753
frame #16: 0x00010029d347 gnat1`sem_ch5__analyze_statements(l=) at sem_ch5.adb:3613
frame #17: 0x00010029f06e gnat1`sem_ch5__analyze_if_statement(n=) at sem_ch5.adb:1665
frame #18: 0x000100212bf0 gnat1`sem__analyze(n=)
at sem.adb:306
frame #19: 0x00010029d347 gnat1`sem_ch5__analyze_statements(l=) at sem_ch5.adb:3613
frame #20: 0x0001002396ee
gnat1`sem_ch11__analyze_handled_statements(n=) at
sem_ch11.adb:426
frame #21: 0x000100212882 gnat1`sem__analyze(n=)
at sem.adb:297
frame #22: 0x0001002ad694
gnat1`sem_ch6__analyze_subprogram_body(n=) at sem_ch6.adb:4245
frame #23: 0x000100212ace gnat1`sem__analyze(n=)
at sem.adb:547
frame #24: 0x00010027767b gnat1`sem_ch3__analyze_declarations(l=) at sem_ch3.adb:2608
frame #25: 0x0001002b2dbe gnat1`sem_ch7__analyze_package_body(n=) at sem_ch7.adb:786
frame #26: 0x000100212ada gnat1`sem__analyze(n=)
at sem.adb:444
frame #27: 0x000100236c22
gnat1`sem_ch10__analyze_compilation_unit(n=) at
sem_ch10.adb:897
frame #28: 0x000100212713 gnat1`sem__analyze(n=)
at sem.adb:180
frame #29: 0x000100213863 gnat1`sem__semantics at sem.adb:1338
frame #30: 0x0001002137e6 gnat1`sem__semantics
frame #31: 0x000100182fd4 gnat1`_ada_frontend at frontend.adb:407
frame #32: 0x00010037a6b1 gnat1`_ada_gnat1drv at gnat1drv.adb:1127
frame #33: 0x00010001daff gnat1`::gnat_parse_file() at misc.c:122
frame #34: 0x000100c583ca gnat1`::compile_file() at toplev.c:467
frame #35: 0x000100ffd717 gnat1`toplev::main(int, char**) at
toplev.c:2003
frame #36: 0x000100ffd227 gnat1`toplev::main(this=0x7fff5fbff2fe,
argc=, argv=)
frame #37: 0x000100fff2fe gnat1`main(argc=3, argv=0x7fff5fbff330)
at main.c:39
frame #38: 0x7fffa7d61235 libdyld.dylib`start + 1

[Bug tree-optimization/80487] redundant memset/strncpy calls not eliminated

2017-04-29 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80487

--- Comment #3 from Marc Glisse  ---
Author: glisse
Date: Sat Apr 29 14:39:25 2017
New Revision: 247408

URL: https://gcc.gnu.org/viewcvs?rev=247408&root=gcc&view=rev
Log:
Add st[pr]ncpy to stmt_kills_ref_p

2017-04-29  Marc Glisse  

PR tree-optimization/80487
gcc/
* tree-ssa-alias.c (stmt_kills_ref_p): Handle stpncpy and strncpy.

gcc/testsuite/
* gcc.dg/tree-ssa/strncpy-1.c: New file.

Added:
trunk/gcc/testsuite/gcc.dg/tree-ssa/strncpy-1.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-ssa-alias.c

[Bug tree-optimization/80487] redundant memset/strncpy calls not eliminated

2017-04-29 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80487

Marc Glisse  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #4 from Marc Glisse  ---
.

[Bug rtl-optimization/80491] [6/7/8 Regression] Compiler regression for long-add case.

2017-04-29 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80491

--- Comment #9 from Jakub Jelinek  ---
Author: jakub
Date: Sat Apr 29 16:17:13 2017
New Revision: 247409

URL: https://gcc.gnu.org/viewcvs?rev=247409&root=gcc&view=rev
Log:
PR rtl-optimization/80491
* alias.c (memory_modified_in_insn_p): Return true for CALL_INSNs.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/alias.c

[Bug bootstrap/80565] New: ICE at -O2 and -O3 in 32-bit mode (not 64-bit) on x86_64-linux-gnu (in edge_badness, at ipa-inline.c:1028)

2017-04-29 Thread chengniansun at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80565

Bug ID: 80565
   Summary: ICE at -O2 and -O3 in 32-bit mode (not 64-bit) on
x86_64-linux-gnu (in edge_badness, at
ipa-inline.c:1028)
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: bootstrap
  Assignee: unassigned at gcc dot gnu.org
  Reporter: chengniansun at gmail dot com
  Target Milestone: ---

$ gcc-trunk -v
Using built-in specs.
COLLECT_GCC=gcc-trunk
COLLECT_LTO_WRAPPER=/usr/local/gcc-trunk/libexec/gcc/x86_64-pc-linux-gnu/8.0.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-source-trunk/configure --enable-languages=c,c++,lto
--prefix=/usr/local/gcc-trunk --disable-bootstrap
Thread model: posix
gcc version 8.0.0 20170429 (experimental) [trunk revision 247405] (GCC) 
$ gcc-trunk -m32 -O2 small.c
small.c: In function ‘fn2’:
small.c:4:14: warning: type of ‘p1’ defaults to ‘int’ [-Wimplicit-int]
 static short fn2(p1) {
  ^~~
small.c: At top level:
small.c:39:1: internal compiler error: in edge_badness, at ipa-inline.c:1028
 }
 ^
0x139f133 edge_badness
../../gcc-source-trunk/gcc/ipa-inline.c:1028
0x13a037b update_edge_key
../../gcc-source-trunk/gcc/ipa-inline.c:1224
0x13a08da update_caller_keys
../../gcc-source-trunk/gcc/ipa-inline.c:1351
0x13a269f inline_small_functions
../../gcc-source-trunk/gcc/ipa-inline.c:2045
0x13a269f ipa_inline
../../gcc-source-trunk/gcc/ipa-inline.c:2438
0x13a269f execute
../../gcc-source-trunk/gcc/ipa-inline.c:2849
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.
$ cat small.c
int a, b, c, e, h, j;
char d;
short f, g;
static short fn2(p1) {
  for (;;)
for (; g; g++)
  if (p1)
break;
}

static short fn3();
static char fn4(char p1) {
  int i;
  for (; d;)
f = 8;
  for (; f; f = 0)
for (; i; i++) {
  j = 0;
  for (; j; j++)
;
}
}

static short fn1(short p1) { fn2(b || fn3()); }

short fn3() {
  if (c) {
fn4(e);
h = 0;
for (;; h++)
  ;
  }
}

int main() {
  for (; a;)
fn1(c);
  return 0;
}

[Bug rtl-optimization/80491] [6/7/8 Regression] Compiler regression for long-add case.

2017-04-29 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80491

--- Comment #10 from Jakub Jelinek  ---
Author: jakub
Date: Sat Apr 29 16:18:11 2017
New Revision: 247410

URL: https://gcc.gnu.org/viewcvs?rev=247410&root=gcc&view=rev
Log:
PR rtl-optimization/80491
* ifcvt.c (noce_process_if_block): When looking for x setter
with missing else_bb, don't check only the insn right before
cond_earliest, but look for the last insn that x is modified in
within the same bb.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/ifcvt.c

[Bug target/80566] New: no use of avx vmovups on ymm registry in set and copy

2017-04-29 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80566

Bug ID: 80566
   Summary: no use of avx vmovups on ymm registry in set and copy
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

in this example
#include <cstring>
int * foo() {
  int * p = new int[16];
  memset(p,0,16*sizeof(int));
  return p;
}
int * foo(int * q) {
  int * p = new int[16];
  memcpy(q,p,16*sizeof(int));
  return p;
}

gcc does not use vmovups on ymm registers
(c++ -O3 -Wall -march=haswell -S),
while (according to gcc.godbolt.org) clang 4.0 does:
https://godbolt.org/g/qnX975

[Bug target/80556] [8 Regression] bootstrap failure for Ada compiler

2017-04-29 Thread ebotcazou at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80556

Eric Botcazou  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |NEW
                 CC|                            |gingold at gcc dot gnu.org

--- Comment #4 from Eric Botcazou  ---
> a bunch of similar errors
> ...
> * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
>   * frame #0: 0x7fffa7e8fd42 libsystem_kernel.dylib`__pthread_kill + 10
> frame #1: 0x7fffa7f7d5bf libsystem_pthread.dylib`pthread_kill + 90
> frame #2: 0x7fffa7df5420 libsystem_c.dylib`abort + 129
> frame #3: 0x000100ff88c1
> gnat1`uw_init_context_1(context=, outer_cfa=,
> outer_ra=) at unwind-dw2.c:1579
> frame #4: 0x000100ff8f2e
> gnat1`_Unwind_RaiseException(exc=0x000144a022a0) at unwind.inc:88
> frame #5: 0x00010006663f
> gnat1`ada__exceptions__exception_propagation__propagate_gcc_exceptionXn(gcc_e
> xception=0x000144a022a0) at a-exexpr.adb:322
> frame #6: 0x000100066683
> gnat1`ada__exceptions__exception_propagation__propagate_exceptionXn(excep= available>) at a-exexpr.adb:354
> frame #7: 0x000100066af9
> gnat1`ada__exceptions__complete_and_propagate_occurrence(x=) at
> a-except.adb:937
> frame #8: 0x000100066b2e
> gnat1`__gnat_raise_exception(e=, message=) at
> a-except.adb:978
> frame #9: 0x0001001fbf9a gnat1`rtsfind__load_fail(s=const
> string___XUP @ 0x7fe1c0cf6f50, u_id=, id=) at
> rtsfind.adb:851

OK, thanks, there is a problem in exception propagation on the Darwin host.
I'm not exactly a specialist here, so CCing Tristan.

[Bug c++/80567] New: bogus fixit hint for undeclared memset: else

2017-04-29 Thread msebor at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80567

Bug ID: 80567
   Summary: bogus fixit hint for undeclared memset: else
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: msebor at gcc dot gnu.org
  Target Milestone: ---

For the test case below where the program uses memset without first including
<string.h>, G++ suggests as an alternative the "else" keyword.

$ cat z.C && gcc -O2 -Wall -Wextra -Wpedantic z.C
void f (void *p)
{
  memset (p, 0, 4);
}
z.C: In function ‘void f(void*)’:
z.C:3:3: error: ‘memset’ was not declared in this scope
   memset (p, 0, 4);
   ^~
z.C:3:3: note: suggested alternative: ‘else’
   memset (p, 0, 4);
   ^~
   else


In C mode, GCC prints the far more helpful (though not perfect):

z.C:3:3: warning: implicit declaration of function ‘memcpy’
[-Wimplicit-function-declaration]
   memcpy (p, "1234", 4);
   ^~
z.C:3:3: warning: incompatible implicit declaration of built-in function
‘memcpy’
z.C:3:3: note: include ‘<string.h>’ or provide a declaration of ‘memcpy’


Although in C++ it's possible to declare one's own overloads of memset and
other library functions, I think it would be helpful to have G++ issue a hint
similar to the GCC note (i.e., in the absence of any other memset, assume that
the name refers to the standard library function and suggest the user #include
<string.h>).  Ditto for any other C standard library functions.

In any case, it seems that to avoid obviously incorrect suggestions like the
one above, G++ needs to consider more of the context in which an undeclared
identifier is used.
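
For reference, this is the corrected test case such a hint would lead the user to. It is a minimal sketch, assuming the C++ header <cstring> is used (the C header <string.h> would serve equally well).

```cpp
#include <cstring>  // declares std::memset

void f(void* p)
{
    std::memset(p, 0, 4);  // zero the first four bytes at p
}
```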

[Bug c++/80560] warn on undefined memory operations involving non-trivial types

2017-04-29 Thread msebor at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80560

Martin Sebor  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |patch

--- Comment #2 from Martin Sebor  ---
Patch posted for review:
https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01571.html


[Bug target/80568] New: x86 -mavx256-split-unaligned-load (and store) is affecting AVX2 code, but probably shouldn't be.

2017-04-29 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568

Bug ID: 80568
   Summary: x86 -mavx256-split-unaligned-load (and store) is
affecting AVX2 code, but probably shouldn't be.
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

Created attachment 41285
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41285&action=edit
bswap16.cc

gcc7 (or at least the gcc8 snapshot on https://godbolt.org/g/ZafCE0) is now
splitting unaligned loads/stores even for AVX2 integer, where gcc6.3 didn't.

I think this is undesirable by default, because some projects probably build
with -mavx2 but fail to use -mtune=haswell (or broadwell or skylake).  For now,
Intel CPUs that do well with 32B unaligned loads are probably the most common
AVX2-supporting CPUs.

IDK what's optimal for Excavator or Zen.  Was this an intentional change to
make those tune options work better for those CPUs?

I would suggest that -mavx2 should imply -mno-avx256-split-unaligned-load (and
-store) for -mtune=generic.  Or if that's too ugly (having insn set selection
affect tuning), then maybe just revert to the previous behaviour of having
integer loads/store not be split the way FP loads/stores are.

 The conventional wisdom is that unaligned loads are just as fast as aligned
when the data does happen to be aligned at run-time.  Splitting this way badly
breaks that assumption.  It's inconvenient/impossible to portably communicate
to the compiler that it should optimize for the case where the data is aligned,
even if that's not guaranteed, so loadu / storeu are probably used in lots of
code that normally runs on aligned data.

Also, gcc doesn't always figure out that a hand-written scalar prologue does
leave the pointer aligned for a vector loop.  (And since programmers expect it
not to matter, they may still use `_mm256_loadu_si256`).  I reduced some real
existing code that a colleague wrote into a test-case for this:
https://godbolt.org/g/ZafCE0, also attached.  If using
potentially-overlapping first/last vectors instead of scalar loops, it might
use loadu just to avoid duplicating a helper function.
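
To make the scalar-prologue pattern concrete, here is a minimal portable sketch. It is not the attached bswap16.cc: the function name bswap16_buffer is hypothetical, no intrinsics are used, and the explicit __builtin_assume_aligned (a GCC/Clang builtin) stands in for the alignment fact that the compiler does not always derive from the prologue by itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: swap the two bytes of each 16-bit element in place.
void bswap16_buffer(std::uint16_t* p, std::size_t n)
{
    // Scalar prologue: handle elements one by one until p is 32-byte
    // aligned or the data runs out.
    while (n > 0 && reinterpret_cast<std::uintptr_t>(p) % 32 != 0) {
        *p = static_cast<std::uint16_t>((*p << 8) | (*p >> 8));
        ++p;
        --n;
    }
    if (n == 0)
        return;

    // The prologue established the alignment; state it explicitly so the
    // vectorizer is free to use aligned 32-byte accesses.
    assert(reinterpret_cast<std::uintptr_t>(p) % 32 == 0);
    std::uint16_t* ap =
        static_cast<std::uint16_t*>(__builtin_assume_aligned(p, 32));

    for (std::size_t i = 0; i < n; ++i)
        ap[i] = static_cast<std::uint16_t>((ap[i] << 8) | (ap[i] >> 8));
}
```

Code written with loadu/storeu intrinsics has no such way to express "probably aligned", which is why splitting unaligned accesses penalizes it.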




For an example of affected code, consider an endian-swap function that uses
this (inline) function in its inner loop.  The code inside the loop matches
what we get for compiling it stand-alone, so I'll just show that:

#include <immintrin.h>
// static inline
void swap256(char* addr, __m256i mask) {
    __m256i vec = _mm256_loadu_si256((__m256i*)addr);
    vec = _mm256_shuffle_epi8(vec, mask);
    _mm256_storeu_si256((__m256i*)addr, vec);
}


gcc6.3 -O3 -mavx2:
vmovdqu (%rdi), %ymm1
vpshufb %ymm0, %ymm1, %ymm0
vmovdqu %ymm0, (%rdi)

g++ (GCC-Explorer-Build) 8.0.0 20170429 (experimental)  -O3 -mavx2
vmovdqu (%rdi), %xmm1
vinserti128 $0x1, 16(%rdi), %ymm1, %ymm1
vpshufb %ymm0, %ymm1, %ymm0
vmovups %xmm0, (%rdi)
vextracti128 $0x1, %ymm0, 16(%rdi)

[Bug target/79964] Cortex A53 codegen still not optimal

2017-04-29 Thread tulipawn at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79964

--- Comment #2 from PeteVine  ---
I can confirm the first part of the issue gets fixed with this patch:

https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01415.html

but there's a regression in gcc8 concerning the second part. (or rather the
workarounds don't work any more) 

http://openbenchmarking.org/result/1704298-RI-CRAYREGRE13

("basic flags" didn't deactivate -mfix-cortex-a53-843419, hence the difference)

[Bug target/80569] New: i686: "shrx" instruction generated in 16-bit mode

2017-04-29 Thread davmac at davmac dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80569

Bug ID: 80569
   Summary: i686: "shrx" instruction generated in 16-bit mode
   Product: gcc
   Version: 6.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: davmac at davmac dot org
  Target Milestone: ---

The following code, compiled with -m16 -O2 -c, fails at assembly:

--- begin ---
void load_kernel(void *setup_addr)
{
unsigned int seg = (unsigned int)setup_addr >> 4;
asm("movl %0, %%es" : : "r"(seg));
}
--- end ---

$ gcc -m16 -O2 -c shrxdtestcase.i 
/tmp/ccGS34WK.s: Assembler messages:
/tmp/ccGS34WK.s:11: Error: instruction `shrx' isn't supported in 16-bit mode.

[Bug target/80569] i686: "shrx" instruction generated in 16-bit mode

2017-04-29 Thread davmac at davmac dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80569

--- Comment #1 from Davin McCall  ---
(Prevents building Qemu).

[Bug bootstrap/80565] [8 Regression] ICE at -O2 and -O3 in 32-bit mode (not 64-bit) on x86_64-linux-gnu (in edge_badness, at ipa-inline.c:1028)

2017-04-29 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80565

Martin Liška  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
      Known to work|                            |7.0.1
           Keywords|                            |ice-on-valid-code
   Last reconfirmed|                            |2017-04-30
                 CC|                            |hubicka at ucw dot cz,
                   |                            |marxin at gcc dot gnu.org
     Ever confirmed|0                           |1
            Summary|ICE at -O2 and -O3 in       |[8 Regression] ICE at -O2
                   |32-bit mode (not 64-bit) on |and -O3 in 32-bit mode (not
                   |x86_64-linux-gnu (in        |64-bit) on x86_64-linux-gnu
                   |edge_badness, at            |(in edge_badness, at
                   |ipa-inline.c:1028)          |ipa-inline.c:1028)
   Target Milestone|---                         |8.0
      Known to fail|                            |8.0

--- Comment #1 from Martin Liška  ---
Confirmed, started with r247380.

[Bug target/80570] New: auto-vectorizing int->double conversion should use half-width memory operands to avoid shuffles, instead of load+extract

2017-04-29 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80570

Bug ID: 80570
   Summary: auto-vectorizing int->double conversion should use
half-width memory operands to avoid shuffles, instead
of load+extract
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

When auto-vectorizing int->double conversion, gcc loads a full-width vector
into a register and then unpacks the upper half to feed (v)cvtdq2pd.  e.g. with
AVX, we get a 256b load and then vextracti128.

It's even worse with an unaligned src pointer with
-mavx256-split-unaligned-load, where it does vinsertf128 -> vextractf128,
without ever doing anything with the full 256b vector!

On Intel SnB-family CPUs, this will bottleneck the loop on port5 throughput,
because VCVTDQ2PD reg -> reg needs a port5 uop as well as a port1 uop.  (And
vextracti128 can only run on the shuffle unit on port5).

VCVTDQ2PD with a memory source operand doesn't need the shuffle port at all on
Intel Haswell and later, just the FP-add unit and a load, so it's a much better
choice.  (Throughput of one per clock on Sandybridge and Haswell, 2 per clock
on Skylake).  It's still 2 fused-domain uops, though, so I guess it can't
micro-fuse the load according to Agner Fog's testing.  (Or 3 on SnB).

I'm pretty sure using twice as many half-width memory operands is not worse on
other AVX CPUs either (AMD BD-family or Zen, or KNL), vs. max-width loads and
extracting the high half.


void cvti32f64_loop(double *dp, int *ip) {
    // ICC avoids the mistake when it doesn't emit a prologue to align the pointers
#ifdef __GNUC__
    dp = __builtin_assume_aligned(dp, 64);
    ip = __builtin_assume_aligned(ip, 64);
#endif
    for (int i=0; i<10000 ; i++) {
        double tmp = ip[i];
        dp[i] = tmp;
    }
}

https://godbolt.org/g/329C3P
gcc.godbolt.org's "gcc7" snapshot: g++ (GCC-Explorer-Build) 8.0.0 20170429
(experimental)

gcc -O3 -march=sandybridge
cvti32f64_loop:
        xorl    %eax, %eax
.L2:
        vmovdqa (%rsi,%rax), %ymm0
        vcvtdq2pd       %xmm0, %ymm1
        vextractf128    $0x1, %ymm0, %xmm0
        vmovapd %ymm1, (%rdi,%rax,2)
        vcvtdq2pd       %xmm0, %ymm0
        vmovapd %ymm0, 32(%rdi,%rax,2)
        addq    $32, %rax
        cmpq    $40000, %rax
        jne     .L2
        vzeroupper
        ret

gcc does the same thing for -march=haswell, but uses vextracti128.  This is
obviously really silly.

For comparison, clang 4.0 -O3 -march=sandybridge -fno-unroll-loops emits:
        xorl    %eax, %eax
.LBB0_1:
        vcvtdq2pd       (%rsi,%rax,4), %ymm0
        vmovaps %ymm0, (%rdi,%rax,8)
        addq    $4, %rax
        cmpq    $10000, %rax            # imm = 0x2710
        jne     .LBB0_1
        vzeroupper
        retq

This should come close to one 256b store per clock (on Haswell), even with
unrolling disabled.



With -march=nehalem, gcc gets away with it for this simple not-unrolled loop
(without hurting throughput I think), but only because this strategy
effectively unrolls the loop (doing two stores per add + cmp/jne), and Nehalem
can run shuffles on two execution ports (so the pshufd can run on port1, while
the cvtdq2pd can run on ports 1+5).  So it's 10 fused-domain uops per 2 stores
instead of 5 per 1 store.  Depending on how the loop buffer handles
non-multiple-of-4 uop counts, this might be a wash.  (Of course, with any other
work in the loop, or with unrolling, the memory-operand strategy is much
better).

CVTDQ2PD's memory operand is only 64 bits, so even the non-AVX version doesn't
fault if misaligned.

--

It's even more horrible without aligned pointers, when the sandybridge version
(which splits unaligned 256b loads/stores) uses vinsertf128 to emulate a 256b
load, and then does vextractf128 right away:

 inner_loop:   # gcc8 -march=sandybridge without __builtin_assume_aligned
vmovdqu (%r8,%rax), %xmm0
vinsertf128 $0x1, 16(%r8,%rax), %ymm0, %ymm0
vcvtdq2pd   %xmm0, %ymm1
vextractf128 $0x1, %ymm0, %xmm0
vmovapd %ymm1, (%rcx,%rax,2)
vcvtdq2pd   %xmm0, %ymm0
vmovapd %ymm0, 32(%rcx,%rax,2)

This is obviously really really bad, and should probably be checked for and
avoided in case there are things other than int->double autovec that could lead
to doing this.

---

With -march=skylake-avx512, gcc does the AVX512 version of the same thing: zmm
load and then extract the upper 256b