from:"marcin.krotkiewski at gmail dot com"

[Bug c/84261] New: gcc fails to call a simd-vectorized function

2018-02-07 Thread marcin.krotkiewski at gmail dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84261

Bug ID: 84261
   Summary: gcc fails to call a simd-vectorized function
   Product: gcc
   Version: 7.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: marcin.krotkiewski at gmail dot com
  Target Milestone: ---

Consider the following vectorized functions (omp declare simd):

#include <math.h>

#pragma omp declare simd simdlen(4)
double test1(double v1)
{
for(int i=0; i<4; i++){
v1 = exp(v1);
}
return v1;
}

#pragma omp declare simd simdlen(4)
double test2(double v1)
{
v1 = exp(v1);
v1 = exp(v1);
v1 = exp(v1);
v1 = exp(v1);
return v1;
}

I tried GCC versions up to 7.3, compiling with -fopenmp -O3 -ffast-math
-march=haswell.

In the test1 case GCC generates scalar calls to exp, while in the second case
(test2) calls to exp are vectorized, as expected. 

Is there a reason for such behaviour? Because it sure looks like something is
wrong. Would some flag convince gcc to vectorize?

[Bug c/84261] gcc fails to vectorize a function call

2018-02-07 Thread marcin.krotkiewski at gmail dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84261

--- Comment #2 from Marcin Krotkiewski  ---
(In reply to Richard Biener from comment #1)
> This is because reduction operations like 'exp' are not supported.  Also
> vectorization of loops with using vectors isn't supported.
> 
> So not sure what you are expecting.

Both functions are vectorized in the OpenMP sense, i.e., vector versions are
generated that take vector arguments, e.g.,

0820 T _ZGVeN4v_test1
0320 T _ZGVeN4v_test2

While test2 uses exp implementation from libmvec to compute on simd vectors:

call    _ZGVdN4v___exp_finite

test1 uses a non-vectorized exp call in a loop:

call    __exp_finite

Both cases do the same computation. Since there is no data dependence between
the simd lanes, gcc should also generate a call to _ZGVdN4v___exp_finite
instead of 4 calls to __exp_finite in the first case.

[Bug c/84261] gcc fails to vectorize a function call

2018-02-07 Thread marcin.krotkiewski at gmail dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84261

--- Comment #3 from Marcin Krotkiewski  ---
(In reply to Richard Biener from comment #1)
> Also vectorization of loops with using vectors isn't supported.

Not sure what you mean. If instead of a function call I do some floating point
computations in test1, e.g., 


#pragma omp declare simd simdlen(4)
double test1(double v1)
{
for(int i=0; i<4; i++){
v1 = v1*1.1;
}
return v1;
}


then the loop is 'vectorized' as expected, i.e., executed for short vector
arguments, and not for scalars.

[Bug tree-optimization/84261] gcc fails to vectorize a function call

2018-02-08 Thread marcin.krotkiewski at gmail dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84261

--- Comment #5 from Marcin Krotkiewski  ---
(In reply to Jakub Jelinek from comment #4)
> The declare simd on the functions is essentially an implicit loop around the
> whole body, so the function in this cases is passed a V4DFmode argument and
> V4DFmode result and for each of the elements performs the body of the loop.
> Guess we don't vectorize test1 because we fail to unroll the loop for some
> reason.

Not sure unrolling is the answer. The number of iterations might be arbitrary -
how deep would you unroll it? In either case, each iteration of the loop should
call a vector version of the function, even if no unrolling is done.

It seems that if functions are called inside loops nested within enclosing
vectorized functions, gcc never tries to search for suitable vector candidates.
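
The behaviour argued for above can be pictured in plain C. This is only a
sketch under stated assumptions: LANES, step, step_vec4, test1_lanes_outer and
test1_iters_outer are hypothetical names, and step is a stand-in multiply
rather than exp, so the example needs no math library. It illustrates that,
because the lanes are independent, the implicit lane loop can sit inside the
iteration loop, giving one 4-wide call per iteration without any unrolling.

```c
#define LANES 4   /* matches simdlen(4); hypothetical name */

/* Scalar stand-in for exp(); hypothetical. */
static double step(double v) { return v * 1.1; }

/* Stand-in for a 4-wide library routine such as a vector exp. */
static void step_vec4(double v[LANES])
{
    for (int l = 0; l < LANES; l++)
        v[l] = step(v[l]);
}

/* Lane loop outermost: each lane runs the whole scalar body, so the
   inner call is scalar (roughly the shape reported for test1). */
void test1_lanes_outer(double v[LANES], int iters)
{
    for (int l = 0; l < LANES; l++)
        for (int i = 0; i < iters; i++)
            v[l] = step(v[l]);
}

/* Iteration loop outermost: one 4-wide call per iteration, for any
   trip count. */
void test1_iters_outer(double v[LANES], int iters)
{
    for (int i = 0; i < iters; i++)
        step_vec4(v);
}
```

Both functions apply the same sequence of operations to every lane, so they
produce identical results; only the loop nesting differs.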

[Bug c/84903] New: internal compiler error: in convert_move, at expr.c:229

2018-03-16 Thread marcin.krotkiewski at gmail dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84903

Bug ID: 84903
   Summary: internal compiler error: in convert_move, at
expr.c:229
   Product: gcc
   Version: 7.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: marcin.krotkiewski at gmail dot com
  Target Milestone: ---

Created attachment 43680
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43680&action=edit
pre-processed source that triggers the error

The attached code fails to compile using gcc 7.2.0 (ubuntu 16.04 and centos
7.4.1708):

gcc -Wimplicit-fallthrough=0 -Wall -Wextra -save-temps -fopenmp -O3
-ftree-vectorize -ffast-math -c cop.c -o cop.o
cop.c: In function ‘HE1OP.simdclone.6’:
cop.c:21:8: internal compiler error: in convert_move, at expr.c:229
 double  HE1OP(double XHE1, double FREQ, double FREQLG, double TKEV, double
TLOG)
^
0x760b50 convert_move(rtx_def*, rtx_def*, int)
../../src/gcc/expr.c:229
0x7672b3 store_expr_with_bounds(tree_node*, rtx_def*, int, bool, bool,
tree_node*)
../../src/gcc/expr.c:5629
0x76775e expand_assignment(tree_node*, tree_node*, bool)
../../src/gcc/expr.c:5321
0x67a841 expand_call_stmt
../../src/gcc/cfgexpand.c:2656
0x67a841 expand_gimple_stmt_1
../../src/gcc/cfgexpand.c:3571
0x67a841 expand_gimple_stmt
../../src/gcc/cfgexpand.c:3737
0x67ba1f expand_gimple_basic_block
../../src/gcc/cfgexpand.c:5744
0x680ba6 execute
../../src/gcc/cfgexpand.c:6357
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See  for instructions.


I tried on centos with gcc 6.4, and it also fails in the same way. The error is
not there if I do not try to generate an OpenMP-vectorized function, i.e., I
remove -fopenmp, -ffast-math, or the #pragma omp simd.

The attachment is the simplest test case I managed to reduce the production
code I'm looking at to.

[Bug c/84903] internal compiler error: in convert_move, at expr.c:229

2018-03-20 Thread marcin.krotkiewski at gmail dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84903

--- Comment #2 from Marcin Krotkiewski  ---
Great, trunk works for both the test and the production code. Thanks!

[Bug tree-optimization/85050] New: Vectorized function - suboptimal gather

2018-03-23 Thread marcin.krotkiewski at gmail dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85050

Bug ID: 85050
   Summary: Vectorized function - suboptimal gather
   Product: gcc
   Version: 7.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: marcin.krotkiewski at gmail dot com
  Target Milestone: ---

I compile the following function with gcc 7.2 and 8.0.1, with -march=broadwell
-O3 -ftree-vectorize -ffast-math -fopenmp

#pragma omp declare simd notinbranch
double testfun(double arg)  
{   
  static const double A[11] =
{5.53,5.49,5.46,5.43,5.40,5.25,5.00,4.69,4.48,4.16,3.85};   
  int iidx = int(arg);  
  return A[iidx];   
}   

Also here https://godbolt.org/g/wo7pdv

For some reason GCC's 4-wide vectorized function contains asm that works on two
2-wide vectors instead of a single 4-wide vector:

_ZGVdN4v__Z7testfund:
[...]
vmovapd %ymm0, -32(%rsp)
vmovapd .LC0(%rip), %xmm2
vinsertf128 $0x1, -16(%rsp), %ymm0, %ymm0
vmovapd %xmm2, %xmm4
vcvttpd2dqy %ymm0, %xmm0
vgatherdpd  %xmm4, (%rax,%xmm0,8), %xmm3
vpshufd $238, %xmm0, %xmm0
vgatherdpd  %xmm2, (%rax,%xmm0,8), %xmm1
vmovaps %xmm3, -64(%rsp)
vmovaps %xmm1, -48(%rsp)
vmovapd -64(%rsp), %ymm0
[...]

Code generated using ICC looks as expected:

_ZGVYN4v__Z7testfund:
  vcvttpd2dq xmm1, ymm0 #11.18
  vpcmpeqd ymm2, ymm2, ymm2 #12.10
  vxorpd ymm0, ymm0, ymm0 #12.10
  vgatherdpd ymm0, QWORD PTR [A.5.0.1+xmm1*8], ymm2 #12.10

I don't see anything wrong with my compiler options. Is this behaviour in GCC
expected, and a result of a different vectorization cost model?
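
For reference, the work the 4-wide clone has to do can be written as a plain C
lane loop. This is a scalar model, not what either compiler emits: testfun_x4
and LANES are hypothetical names, and the C cast (int) replaces the C++
int(arg) of the report. Each of the two loops corresponds to one 4-wide
instruction in the ICC output above: a truncating double-to-int conversion and
a single gather.

```c
#define LANES 4   /* hypothetical name for the vector width */

static const double A[11] =
    {5.53, 5.49, 5.46, 5.43, 5.40, 5.25, 5.00, 4.69, 4.48, 4.16, 3.85};

/* Scalar model of the 4-lane clone: in[] and out[] each stand for one
   256-bit vector of doubles. Indices must lie in 0..10, as in the
   original function. */
void testfun_x4(const double in[LANES], double out[LANES])
{
    int idx[LANES];

    for (int l = 0; l < LANES; l++)
        idx[l] = (int)in[l];        /* vcvttpd2dq: 4 doubles -> 4 ints */

    for (int l = 0; l < LANES; l++)
        out[l] = A[idx[l]];         /* vgatherdpd: 4 indexed loads */
}
```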

[Bug c++/85232] New: gcc fails to vectorize a nested simd function call

2018-04-05 Thread marcin.krotkiewski at gmail dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85232

Bug ID: 85232
   Summary: gcc fails to vectorize a nested simd function call
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: marcin.krotkiewski at gmail dot com
  Target Milestone: ---

I'm compiling the following code using gcc version 8.0.1 20180319
(experimental) (also earlier versions), with flags "-march=haswell -O3
-ftree-vectorize -fopt-info-vec -ffast-math -fopenmp":

#pragma omp declare simd notinbranch
double fun1(double arg) __attribute__ ((noinline));
// double fun1(double arg)
// {
//   return 2.0*arg;
// }

#pragma omp declare simd notinbranch
// double fun2(double arg) __attribute__ ((noinline));
double fun2(double arg)
{
  // if statement is the cause of trouble
  if(arg < 0) return 0.;
  return 2.0*arg;
}

#pragma omp declare simd notinbranch
double test(double arg)
{
  double H = 0;
  H = 0;
  H += fun1(arg);
  H += fun2(arg); 
  return H;
}

Also here: https://godbolt.org/g/ZJmVzJ

Function test calls two other omp simd vectorized functions. fun1 is declared
here, and should be defined in a different compilation unit. fun2 is defined
here, should be inline'd, and contains an if statement on the argument.

As implemented above, gcc does not vectorize the function call to fun1. There
are a few things I have to do to arrive at a vectorized call:

1. move the definition of fun1 to the same compilation unit (uncomment in the
code above). fun1 is still noinline, so it will not be inline'd and a correct
vectorized call is issued

2. declare fun2 with attribute noinline. Then both fun1 and fun2 calls are
vectorized, but fun2 is not inlined.

The source of all trouble seems to be the if statement in fun2: if I remove it,
all works as expected.

What is the reason for this behaviour? Can I do something to avoid the problem?
I would like fun1 to be defined elsewhere, and fun2 to be inlined.

[Bug rtl-optimization/60086] New: suboptimal asm generated for a loop (store/load false aliasing)

2014-02-05 Thread marcin.krotkiewski at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60086

Bug ID: 60086
   Summary: suboptimal asm generated for a loop (store/load false
aliasing)
   Product: gcc
   Version: 4.7.3
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: marcin.krotkiewski at gmail dot com

Created attachment 32060
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=32060&action=edit
source code that compiles

Hello,

I am seeing suboptimal performance of the following loop compiled with
gcc 4.7.3 (but also 4.4.7, Ubuntu, full test code attached):

for(i=0; i

[Bug rtl-optimization/60086] suboptimal asm generated for a loop (store/load false aliasing)

2014-02-06 Thread marcin.krotkiewski at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60086

--- Comment #2 from Marcin Krotkiewski  ---
Jakub, thank you for your comments.

> GCC right now only handles __restrict on function parameters, so in this
> case the aliasing info isn't known.  While the loop is versioned for
> aliasing at runtime, the info about that is only known during the
> vectorizer, therefore e.g. scheduler can hardly know it. 

Does it mean that __restrict is not necessary in order to have a vectorized
code path? I see that if I compile your modified test.c, the loop is vectorized
regardless of whether I use __restrict, or not (runtime versioning). On the
other hand, using __restrict causes gcc to invoke memset for initialization,
while leaving it out results in two paths with a loop.

Interestingly, your test.c does indeed work if compiled with the additional
-fschedule-insns flag. However, if I now remove the __restrict keyword from the
function arguments, I do see a vectorized path, but the flag has no effect and
instructions are again not reordered.
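
The runtime versioning mentioned in the quoted comment can be pictured roughly
as below. This is a hand-written sketch, not GCC's actual output: the function
names and the a[i] += b[i] loop body are assumptions made for illustration.
The overlap test selects between a copy of the loop in which the compiler may
assume no aliasing and a conservative copy.

```c
#include <stddef.h>
#include <stdint.h>

/* Fast version: restrict promises that a and b do not overlap. */
static void add_noalias(double *restrict a, const double *restrict b,
                        size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}

/* Conservative version: correct even if the arrays overlap. */
static void add_mayalias(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}

/* Hypothetical dispatcher mimicking a runtime alias check. */
void add_versioned(double *a, const double *b, size_t n)
{
    uintptr_t pa = (uintptr_t)a, pb = (uintptr_t)b;
    size_t bytes = n * sizeof(double);

    if (pa + bytes <= pb || pb + bytes <= pa)
        add_noalias(a, b, n);   /* ranges are disjoint */
    else
        add_mayalias(a, b, n);  /* possible overlap */
}
```

In this picture the no-alias information exists only inside the fast copy,
which fits the observation that later passes do not benefit from it.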

> The pointers to
> overaligned memory is something you should generally avoid,
> __builtin_assume_aligned is what can be used to tell the compiler about the
> alignment instead, overaligned types often actually hurt generated code
> instead of improving it.  

Thanks. Could you suggest what is the preferred way to use it in a portable
manner? e.g. make it suitable for icc, which has a __assume_aligned builtin?
Should I wrap it in a macro?

> And the way you are calling posix_memalign is IMHO
> a strict aliasing violation.

Could be; gcc does not show a warning with -Wall. Thanks for pointing it out.
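
One possible shape for the portable alignment hint asked about above is
sketched here. It is a sketch under assumptions rather than an established
idiom: ASSUME_ALIGNED and vec_add are hypothetical names, and the Intel branch
relies on the __assume_aligned builtin mentioned in the question. On Linux, ICC
normally also defines __GNUC__, so __INTEL_COMPILER is tested first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__INTEL_COMPILER)
#  define ASSUME_ALIGNED(ptr, align) __assume_aligned(ptr, align)
#elif defined(__GNUC__)
#  define ASSUME_ALIGNED(ptr, align) \
       ((ptr) = __builtin_assume_aligned((ptr), (align)))
#else
#  define ASSUME_ALIGNED(ptr, align) ((void)0)   /* no hint available */
#endif

void vec_add(double *restrict a, const double *restrict b, size_t n)
{
    /* The hint is a promise; a false promise is undefined behaviour,
       so check it in debug builds. */
    assert((uintptr_t)a % 32 == 0 && (uintptr_t)b % 32 == 0);

    ASSUME_ALIGNED(a, 32);
    ASSUME_ALIGNED(b, 32);

    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}
```

The macro is applied to plain double pointers, so no over-aligned pointer
types are needed.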

[Bug rtl-optimization/60086] suboptimal asm generated for a loop (store/load false aliasing)

2014-02-07 Thread marcin.krotkiewski at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60086

--- Comment #8 from Marcin Krotkiewski  ---
(In reply to Andrey Belevantsev from comment #5)
> At this point insn 461 is dead but we do not notice, and it doesn't look
> easy.  I think there was some suggestion in the original research for
> killing dead insn copies left after renaming but I don't remember offhand.

Following Alexander's suggestion, I compiled the test code with -mavx -O3
-fselective-scheduling2 -frename-registers. This seems to get rid of the dead
instructions and yields the desired scheduling:

.L5:
        vmovapd (%rbx,%rdi), %ymm0
        addq    $1, %rsi
        vmovapd (%r12,%rdi), %ymm3
        vaddpd  0(%r13,%rdi), %ymm0, %ymm2
        vaddpd  (%r14,%rdi), %ymm3, %ymm4
        vmovapd %ymm2, (%rbx,%rdi)
        vmovapd %ymm4, (%r12,%rdi)
        addq    $32, %rdi
        cmpq    %rsi, %rdx
        ja      .L5

Alexander, I should maybe clarify that the 'good' code was prepared by hand, by
modifying the 'bad' asm I got from gcc 4.7 (the asm generated by gcc 4.4 was
the same), if that is what you were referring to.

I am a bit confused now. It seems that all is fine and the desired asm can be
generated, so there is no real bug. But why is the original code compiled with
-O3 -mavx bad then? Is -fschedule-insns not enabled at -O2?

[Bug c/67167] New: cilkplus vectorization problems

2015-08-10 Thread marcin.krotkiewski at gmail dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67167

Bug ID: 67167
   Summary: cilkplus vectorization problems
   Product: gcc
   Version: 5.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: marcin.krotkiewski at gmail dot com
  Target Milestone: ---

I think there is a problem with vectorization of arithmetic operations in the
cilkplus implementation in gcc. I have inspected generated asm of the following
two implementations of vector addition (a = a + b). The code is compiled with
'gcc -O3 -mavx -ftree-vectorize -fopt-info-vec -fcilkplus test.c'.


// ICC compatibility - alignment hint
#ifdef __GNUC__

#define __assume_aligned(lvalueptr, align) lvalueptr = __builtin_assume_aligned(lvalueptr, align)

#endif
#define RESTRICT __restrict__

typedef double Double;

void test(Double * RESTRICT a, Double * RESTRICT b, int size)
{
  int i;

  __assume_aligned(a, 64);
  __assume_aligned(b, 64);

  for(i=0; i
