[Bug c/84261] New: gcc fails to call a simd-vectorized function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84261

            Bug ID: 84261
           Summary: gcc fails to call a simd-vectorized function
           Product: gcc
           Version: 7.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: marcin.krotkiewski at gmail dot com
  Target Milestone: ---

Consider the following vectorized functions (omp declare simd):

#include <math.h>

#pragma omp declare simd simdlen(4)
double test1(double v1)
{
  for(int i=0; i<4; i++){
    v1 = exp(v1);
  }
  return v1;
}

#pragma omp declare simd simdlen(4)
double test2(double v1)
{
  v1 = exp(v1);
  v1 = exp(v1);
  v1 = exp(v1);
  v1 = exp(v1);
  return v1;
}

I used GCC versions up to 7.3. The code was compiled with
-fopenmp -O3 -ffast-math -march=haswell.

In the test1 case GCC generates scalar calls to exp, while in the second
case (test2) the calls to exp are vectorized, as expected. Is there a reason
for this behaviour? It looks like something is wrong. Is there a flag that
would convince gcc to vectorize test1 as well?
[Bug c/84261] gcc fails to vectorize a function call
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84261

--- Comment #2 from Marcin Krotkiewski ---
(In reply to Richard Biener from comment #1)
> This is because reduction operations like 'exp' are not supported. Also
> vectorization of loops with using vectors isn't supported.
>
> So not sure what you are expecting.

Both functions are vectorized in the OpenMP sense, i.e., vector versions are
generated that take vector arguments, e.g.,

0820 T _ZGVeN4v_test1
0320 T _ZGVeN4v_test2

While test2 uses the exp implementation from libmvec to compute on simd
vectors:

        call    _ZGVdN4v___exp_finite

test1 uses a non-vectorized exp call in a loop:

        call    __exp_finite

Both cases do the same computation. Since there is no data dependence between
the simd lanes, gcc should also generate a call to _ZGVdN4v___exp_finite
instead of 4 calls to __exp_finite in the first case.
[Bug c/84261] gcc fails to vectorize a function call
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84261

--- Comment #3 from Marcin Krotkiewski ---
(In reply to Richard Biener from comment #1)
> Also vectorization of loops with using vectors isn't supported.

Not sure what you mean. If instead of a function call I do some floating
point computations in test1, e.g.,

#pragma omp declare simd simdlen(4)
double test1(double v1)
{
  for(int i=0; i<4; i++){
    v1 = v1*1.1;
  }
  return v1;
}

then the loop is 'vectorized' as expected, i.e., executed for short vector
arguments, and not for scalars.
[Bug tree-optimization/84261] gcc fails to vectorize a function call
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84261

--- Comment #5 from Marcin Krotkiewski ---
(In reply to Jakub Jelinek from comment #4)
> The declare simd on the functions is essentially an implicit loop around the
> whole body, so the function in this cases is passed a V4DFmode argument and
> V4DFmode result and for each of the elements performs the body of the loop.
> Guess we don't vectorize test1 because we fail to unroll the loop for some
> reason.

Not sure unrolling is the answer. The number of iterations might be
arbitrary - how deep would you unroll it? In either case, each iteration of
the loop should call a vector version of the function, even if no unrolling
is done.

It seems that if functions are called inside loops nested within enclosing
vectorized functions, gcc never tries to search for suitable vector
candidates.
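
The "implicit loop around the whole body" reading from comment #4, and the claim that each iteration of the user's loop could instead issue one vector call, can be sketched in plain C. This is only an illustrative model, not what GCC emits: the names LANES, elem_fn, body_scalar, clone_lanes_outer and clone_lanes_inner are hypothetical, and elem_fn is an arbitrary pure stand-in for exp so that the sketch needs no math library.

```c
#define LANES 4  /* matches simdlen(4) in the report */

/* Hypothetical stand-in for exp(): any pure scalar function works here. */
static double elem_fn(double x)
{
    return 0.5 * x + 1.0;
}

/* Scalar body shaped like test1 in the report. */
static double body_scalar(double v1)
{
    for (int i = 0; i < 4; i++)
        v1 = elem_fn(v1);
    return v1;
}

/* Model 1: the implicit per-lane loop is outermost, so the user's loop
 * and its scalar calls are repeated once per lane. */
static void clone_lanes_outer(const double in[LANES], double out[LANES])
{
    for (int lane = 0; lane < LANES; lane++)
        out[lane] = body_scalar(in[lane]);
}

/* Model 2: the loops are interchanged. The lanes are independent, so each
 * iteration of the user's loop applies elem_fn to all lanes at once; the
 * inner lane loop is the spot where a single vector call would fit. */
static void clone_lanes_inner(const double in[LANES], double out[LANES])
{
    double v[LANES];
    for (int lane = 0; lane < LANES; lane++)
        v[lane] = in[lane];
    for (int i = 0; i < 4; i++)
        for (int lane = 0; lane < LANES; lane++)
            v[lane] = elem_fn(v[lane]);
    for (int lane = 0; lane < LANES; lane++)
        out[lane] = v[lane];
}
```

Both models compute the same values per lane, which is the sense in which no unrolling of the user's loop is required for the call to be vectorized.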
[Bug c/84903] New: internal compiler error: in convert_move, at expr.c:229
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84903

            Bug ID: 84903
           Summary: internal compiler error: in convert_move, at expr.c:229
           Product: gcc
           Version: 7.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: marcin.krotkiewski at gmail dot com
  Target Milestone: ---

Created attachment 43680
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43680&action=edit
pre-processed source that triggers the error

The attached code fails to compile using gcc 7.2.0 (ubuntu 16.04 and centos
7.4.1708):

gcc -Wimplicit-fallthrough=0 -Wall -Wextra -save-temps -fopenmp -O3 -ftree-vectorize -ffast-math -c cop.c -o cop.o
cop.c: In function ‘HE1OP.simdclone.6’:
cop.c:21:8: internal compiler error: in convert_move, at expr.c:229
 double HE1OP(double XHE1, double FREQ, double FREQLG, double TKEV, double TLOG)
        ^
0x760b50 convert_move(rtx_def*, rtx_def*, int)
        ../../src/gcc/expr.c:229
0x7672b3 store_expr_with_bounds(tree_node*, rtx_def*, int, bool, bool, tree_node*)
        ../../src/gcc/expr.c:5629
0x76775e expand_assignment(tree_node*, tree_node*, bool)
        ../../src/gcc/expr.c:5321
0x67a841 expand_call_stmt
        ../../src/gcc/cfgexpand.c:2656
0x67a841 expand_gimple_stmt_1
        ../../src/gcc/cfgexpand.c:3571
0x67a841 expand_gimple_stmt
        ../../src/gcc/cfgexpand.c:3737
0x67ba1f expand_gimple_basic_block
        ../../src/gcc/cfgexpand.c:5744
0x680ba6 execute
        ../../src/gcc/cfgexpand.c:6357
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See for instructions.

I also tried gcc 6.4 on centos, and it fails in the same way.

The error disappears if I do not try to generate an OpenMP-vectorized
function, i.e., if I remove -fopenmp, -ffast-math, or the #pragma omp simd.

The attachment is the simplest case I managed to reduce the production code
to.
[Bug c/84903] internal compiler error: in convert_move, at expr.c:229
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84903

--- Comment #2 from Marcin Krotkiewski ---
Great, trunk works for both the test and the production code. Thanks!
[Bug tree-optimization/85050] New: Vectorized function - suboptimal gather
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85050

            Bug ID: 85050
           Summary: Vectorized function - suboptimal gather
           Product: gcc
           Version: 7.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: marcin.krotkiewski at gmail dot com
  Target Milestone: ---

I compile the following function with gcc 7.2 and 8.0.1, with
-march=broadwell -O3 -ftree-vectorize -ffast-math -fopenmp

#pragma omp declare simd notinbranch
double testfun(double arg)
{
  static const double A[11] =
    {5.53,5.49,5.46,5.43,5.40,5.25,5.00,4.69,4.48,4.16,3.85};
  int iidx = int(arg);
  return A[iidx];
}

Also here: https://godbolt.org/g/wo7pdv

For some reason GCC's 4-wide vectorized function contains asm that works on
two 2-wide vectors instead of a single 4-wide vector:

_ZGVdN4v__Z7testfund:
        [...]
        vmovapd %ymm0, -32(%rsp)
        vmovapd .LC0(%rip), %xmm2
        vinsertf128     $0x1, -16(%rsp), %ymm0, %ymm0
        vmovapd %xmm2, %xmm4
        vcvttpd2dqy     %ymm0, %xmm0
        vgatherdpd      %xmm4, (%rax,%xmm0,8), %xmm3
        vpshufd $238, %xmm0, %xmm0
        vgatherdpd      %xmm2, (%rax,%xmm0,8), %xmm1
        vmovaps %xmm3, -64(%rsp)
        vmovaps %xmm1, -48(%rsp)
        vmovapd -64(%rsp), %ymm0
        [...]

Code generated using ICC looks as expected:

_ZGVYN4v__Z7testfund:
        vcvttpd2dq xmm1, ymm0                                   #11.18
        vpcmpeqd  ymm2, ymm2, ymm2                              #12.10
        vxorpd    ymm0, ymm0, ymm0                              #12.10
        vgatherdpd ymm0, QWORD PTR [A.5.0.1+xmm1*8], ymm2       #12.10

I don't see anything wrong with my compiler options. Is this behaviour in GCC
expected, and a result of a different vectorization cost model?
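
For reference, the computation that a 4-wide clone of testfun has to perform can be modelled lane by lane in plain C. This is a rough sketch under stated assumptions, not compiler output: LANES and testfun_x4 are hypothetical names, the table is moved to file scope, and the cast is spelled in C syntax. The two inner loops correspond to the two instructions in the ICC listing: one packed truncating conversion, then one gather of four table elements.

```c
#define LANES 4  /* width of the vector clone discussed in the report */

/* Lookup table from the report, moved to file scope for this sketch. */
static const double A[11] = {5.53, 5.49, 5.46, 5.43, 5.40, 5.25,
                             5.00, 4.69, 4.48, 4.16, 3.85};

/* Scalar function from the report, with the cast written in C syntax. */
static double testfun(double arg)
{
    int iidx = (int)arg;
    return A[iidx];
}

/* Hypothetical lane-wise model of a 4-wide clone. */
static void testfun_x4(const double arg[LANES], double out[LANES])
{
    int idx[LANES];

    /* Step 1: truncate all four doubles to ints (the vcvttpd2dq step). */
    for (int lane = 0; lane < LANES; lane++)
        idx[lane] = (int)arg[lane];

    /* Step 2: load A[idx[lane]] for all four lanes (the gather step). */
    for (int lane = 0; lane < LANES; lane++)
        out[lane] = A[idx[lane]];
}
```

The GCC listing in the report performs step 2 as two 2-element gathers followed by a store and reload through the stack, which is the difference the report asks about.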
[Bug c++/85232] New: gcc fails to vectorize a nested simd function call
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85232

            Bug ID: 85232
           Summary: gcc fails to vectorize a nested simd function call
           Product: gcc
           Version: 8.0.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: marcin.krotkiewski at gmail dot com
  Target Milestone: ---

I'm compiling the following code using gcc version 8.0.1 20180319
(experimental) (also earlier versions), with flags
"-march=haswell -O3 -ftree-vectorize -fopt-info-vec -ffast-math -fopenmp":

#pragma omp declare simd notinbranch
double fun1(double arg) __attribute__ ((noinline));
// double fun1(double arg)
// {
//   return 2.0*arg;
// }

#pragma omp declare simd notinbranch
// double fun2(double arg) __attribute__ ((noinline));
double fun2(double arg)
{
  // if statement is the cause of trouble
  if(arg < 0) return 0.;
  return 2.0*arg;
}

#pragma omp declare simd notinbranch
double test(double arg)
{
  double H = 0;
  H = 0;
  H += fun1(arg);
  H += fun2(arg);
  return H;
}

Also here: https://godbolt.org/g/ZJmVzJ

Function test calls two other omp simd vectorized functions. fun1 is only
declared here and should be defined in a different compilation unit. fun2 is
defined here, should be inlined, and contains an if statement on the
argument.

As implemented above, gcc does not vectorize the function call to fun1.
Either of the following changes results in a vectorized call:

1. Move the definition of fun1 to the same compilation unit (uncomment it in
   the code above). fun1 is still noinline, so it is not inlined and a
   correct vectorized call is issued.

2. Declare fun2 with attribute noinline. Then both the fun1 and fun2 calls
   are vectorized, but fun2 is not inlined.

The source of all the trouble seems to be the if statement in fun2: if I
remove it, everything works as expected.

What is the reason for this behaviour? Can I do something to avoid the
problem? I would like fun1 to be defined elsewhere, and fun2 to be inlined.
[Bug rtl-optimization/60086] New: suboptimal asm generated for a loop (store/load false aliasing)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60086

            Bug ID: 60086
           Summary: suboptimal asm generated for a loop (store/load false
                    aliasing)
           Product: gcc
           Version: 4.7.3
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: marcin.krotkiewski at gmail dot com

Created attachment 32060
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=32060&action=edit
source code that compiles

Hello,

I am seeing suboptimal performance of the following loop compiled with gcc
4.7.3 (but also 4.4.7, Ubuntu, full test code attached):

for(i=0; i
[Bug rtl-optimization/60086] suboptimal asm generated for a loop (store/load false aliasing)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60086

--- Comment #2 from Marcin Krotkiewski ---
Jakub, thank you for your comments.

> GCC right now only handles __restrict on function parameters, so in this
> case the aliasing info isn't known. While the loop is versioned for
> aliasing at runtime, the info about that is only known during the
> vectorizer, therefore e.g. scheduler can hardly know it.

Does this mean that __restrict is not necessary in order to have a vectorized
code path? If I compile your modified test.c, the loop is vectorized
regardless of whether I use __restrict or not (runtime versioning). On the
other hand, using __restrict causes gcc to invoke memset for initialization,
while leaving it out results in two paths with a loop.

On the interesting side: your test.c does indeed work if compiled with the
additional -fschedule-insns flag. However, if I now remove the __restrict
keyword from the function arguments, I do see a vectorized path, but the flag
has no effect and the instructions are again not reordered.

> The pointers to
> overaligned memory is something you should generally avoid,
> __builtin_assume_aligned is what can be used to tell the compiler about the
> alignment instead, overaligned types often actually hurt generated code
> instead of improving it.

Thanks. Could you suggest the preferred way to use it in a portable manner,
e.g., to make it suitable for icc, which has an __assume_aligned builtin?
Should I wrap it in a macro?

> And the way you are calling posix_memalign is IMHO
> a strict aliasing violation.

Could be; gcc does not show a warning with -Wall. Thanks for pointing it out.
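
One possible answer to the macro question above is sketched below in C. It is an assumption-laden illustration, not a recommendation from the GCC developers: ASSUME_ALIGNED, alloc_doubles and vec_add are hypothetical names, the icc branch relies on the __assume_aligned builtin mentioned in the comment, and __INTEL_COMPILER is tested first because icc also defines __GNUC__. The allocation helper passes a real void * to posix_memalign instead of casting the address of a double *, which is presumably the kind of call the strict-aliasing remark refers to.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* requests the posix_memalign declaration */
#endif
#include <stdlib.h>

/* Hypothetical portable alignment hint. */
#if defined(__INTEL_COMPILER)
#  define ASSUME_ALIGNED(ptr, n) __assume_aligned(ptr, n)
#elif defined(__GNUC__)
#  define ASSUME_ALIGNED(ptr, n) ((ptr) = __builtin_assume_aligned((ptr), (n)))
#else
#  define ASSUME_ALIGNED(ptr, n) ((void)0)  /* no hint available */
#endif

/* Hypothetical helper: allocate n doubles aligned to `align` bytes.
 * A void * temporary receives the result, so no double * object is ever
 * accessed through a void ** lvalue. Returns NULL on failure. */
static double *alloc_doubles(size_t n, size_t align)
{
    void *raw = NULL;
    if (posix_memalign(&raw, align, n * sizeof(double)) != 0)
        return NULL;
    return raw;  /* implicit void * to double * conversion (valid in C) */
}

/* a = a + b; both arrays are assumed to be 64-byte aligned by the caller. */
static void vec_add(double *restrict a, const double *restrict b, int size)
{
    ASSUME_ALIGNED(a, 64);
    ASSUME_ALIGNED(b, 64);
    for (int i = 0; i < size; i++)
        a[i] += b[i];
}
```

A caller would obtain both arrays from alloc_doubles(n, 64), check for NULL, call vec_add, and release the memory with free. The hint is a promise to the compiler: passing a pointer that is not actually 64-byte aligned is undefined behaviour.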
[Bug rtl-optimization/60086] suboptimal asm generated for a loop (store/load false aliasing)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60086

--- Comment #8 from Marcin Krotkiewski ---
(In reply to Andrey Belevantsev from comment #5)
> At this point insn 461 is dead but we do not notice, and it doesn't look
> easy. I think there was some suggestion in the original research for
> killing dead insn copies left after renaming but I don't remember offhand.

Following Alexander's suggestion, I compiled the test code with

-mavx -O3 -fselective-scheduling2 -frename-registers

This seems to get rid of the dead instructions and yields the desired
scheduling:

.L5:
        vmovapd (%rbx,%rdi), %ymm0
        addq    $1, %rsi
        vmovapd (%r12,%rdi), %ymm3
        vaddpd  0(%r13,%rdi), %ymm0, %ymm2
        vaddpd  (%r14,%rdi), %ymm3, %ymm4
        vmovapd %ymm2, (%rbx,%rdi)
        vmovapd %ymm4, (%r12,%rdi)
        addq    $32, %rdi
        cmpq    %rsi, %rdx
        ja      .L5

Alexander, I should maybe clarify that the 'good' code was prepared by hand,
by modifying the 'bad' asm I got from gcc 4.7. The asm generated by gcc 4.4
was the same, if that is what you were referring to.

I am a bit confused now. It seems that all is fine and the desired asm can be
generated, so there is no real bug. But why is the original code compiled
with -O3 -mavx bad then? Is -fschedule-insns not enabled at -O2?
[Bug c/67167] New: cilkplus vectorization problems
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67167

            Bug ID: 67167
           Summary: cilkplus vectorization problems
           Product: gcc
           Version: 5.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: marcin.krotkiewski at gmail dot com
  Target Milestone: ---

I think there is a problem with vectorization of arithmetic operations in the
cilkplus implementation in gcc. I have inspected the generated asm of the
following two implementations of vector addition (a = a + b). The code is
compiled with
'gcc -O3 -mavx -ftree-vectorize -fopt-info-vec -fcilkplus test.c'.

// ICC compatibility - alignment hint
#ifdef __GNUC__
#define __assume_aligned(lvalueptr, align) lvalueptr = __builtin_assume_aligned (lvalueptr, align)
#endif

#define RESTRICT __restrict__
typedef double Double;

void test(Double * RESTRICT a, Double * RESTRICT b, int size)
{
  int i;
  __assume_aligned(a, 64);
  __assume_aligned(b, 64);

  for(i=0; i