[Bug c++/62080] New: Suboptimal code generation with eigen library

2014-08-10 Thread beschindler at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62080

    Bug ID: 62080
   Summary: Suboptimal code generation with eigen library
   Product: gcc
   Version: 4.8.3
    Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: beschindler at gmail dot com

Created attachment 33281
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33281&action=edit
Source code used to get the provided assembly

I'm currently optimizing some code that uses the Eigen library and have
stumbled over an interesting problem.
I have a function that I wrote in two different ways (the attributes serve as
optimization barriers; dimEigen is a member variable of the containing
class):


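// GCC rejects attributes placed after the declarator of a function
// definition, so they are written before the return type here.
__attribute__((noinline, noclone))
void eigenClamp(Eigen::Vector4i& vec)
{
    // Single expression: min and max are chained.
    vec = vec.array().min(dimEigen.array()).max(Eigen::Array4i::Zero());
}

__attribute__((noinline, noclone))
void eigenClamp2(Eigen::Vector4i& vec)
{
    // Two statements: the intermediate result is assigned back to vec.
    vec = vec.array().min(dimEigen.array());
    vec = vec.array().max(Eigen::Array4i::Zero());
}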

I'm compiling this on a Core i7 920 using -O2 -fno-exceptions -fno-rtti
-std=c++11 -march=native.

The first function generates this assembly, which looks great: 

movdqu  (%rsi), %xmm1
movdqu  (%rdi), %xmm0
pminsd  %xmm1, %xmm0
pxor    %xmm1, %xmm1
pmaxsd  %xmm1, %xmm0
movdqa  %xmm0, (%rsi)

The second version does this: 

movdqa  (%rsi), %xmm0
pminsd  (%rdi), %xmm0
movdqa  %xmm0, (%rsi)   <--
pxor    %xmm0, %xmm0
movdqu  (%rsi), %xmm1   <--
pmaxsd  %xmm1, %xmm0
movdqa  %xmm0, (%rsi)

It seems that, because the original source has two statements, the result of
the first expression is written to memory and then read back from memory two
instructions later. In my measurements this makes the function almost 50%
slower. As I find the latter code much easier to read than the former, it
would be great if the same assembly were generated for both.
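
To make the expectation concrete, here is a minimal scalar sketch of why the
two forms should be interchangeable. It does not use Eigen: Vec4i,
clampOneExpr and clampTwoStmts are hypothetical stand-ins for
Eigen::Vector4i, eigenClamp and eigenClamp2, and dim plays the role of
dimEigen. It only illustrates that both forms compute the same values; it is
not claimed to reproduce the code generation problem.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for Eigen::Vector4i (assumption, not Eigen's type).
using Vec4i = std::array<int, 4>;

// One expression: min and max are chained per element.
inline void clampOneExpr(Vec4i& vec, const Vec4i& dim) {
    for (std::size_t i = 0; i < vec.size(); ++i)
        vec[i] = std::max(std::min(vec[i], dim[i]), 0);
}

// Two statements: the result of min is stored to vec, then max is applied.
inline void clampTwoStmts(Vec4i& vec, const Vec4i& dim) {
    for (std::size_t i = 0; i < vec.size(); ++i)
        vec[i] = std::min(vec[i], dim[i]);
    for (std::size_t i = 0; i < vec.size(); ++i)
        vec[i] = std::max(vec[i], 0);
}

// Small self-check: both forms clamp every element into [0, dim[i]].
inline void checkEquivalence() {
    const Vec4i dim{{10, 10, 10, 10}};
    const Vec4i expected{{0, 3, 10, 10}};
    Vec4i a{{-5, 3, 10, 42}};
    Vec4i b = a;
    clampOneExpr(a, dim);
    clampTwoStmts(b, dim);
    assert(a == b);
    assert(a == expected);
}
```

Since the intermediate value is only ever read back by the same function, the
store between the two statements is not observable here, which is why I would
expect the compiler to be free to keep it in a register.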

Also, I note that in the second version pminsd takes its operand directly
from memory, while in the first version the operand is first loaded into a
register and only then passed to pminsd. Thus, I'd love to see this code:

movdqu  (%rsi), %xmm0
pminsd  (%rdi), %xmm0
pxor    %xmm1, %xmm1
pmaxsd  %xmm1, %xmm0
movdqa  %xmm0, (%rsi)

As a reference, I'm attaching the complete source code and the generated
assembly.


[Bug c++/62080] Suboptimal code generation with eigen library

2014-08-10 Thread beschindler at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62080

--- Comment #1 from Benjamin Schindler ---
Created attachment 33282
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33282&action=edit
Generated assembly in full


[Bug c++/62080] Suboptimal code generation with eigen library

2014-08-10 Thread beschindler at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62080

--- Comment #3 from Benjamin Schindler ---
I just looked at what gcc-4.9.1 does, and the output differs. First version:

movdqu  (%rsi), %xmm1
movdqu  (%rdi), %xmm0   <--
pminsd  %xmm1, %xmm0    <--
pxor    %xmm1, %xmm1
pmaxsd  %xmm1, %xmm0
movaps  %xmm0, (%rsi)

So, the first version still has a needless movdqu (I don't know how much it
hurts). Second version:

movdqa  (%rsi), %xmm0
pminsd  (%rdi), %xmm0   <-- good
pxor    %xmm1, %xmm1
movdqu  %xmm0, %xmm0    <-- bad?
pmaxsd  %xmm1, %xmm0
movaps  %xmm0, (%rsi)

So, gcc-4.9 fares better in that it no longer goes through memory, but it
emits an odd mov instruction. Maybe this is a separate issue?


[Bug c++/62080] Suboptimal code generation with eigen library

2014-08-11 Thread beschindler at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62080

--- Comment #4 from Benjamin Schindler ---
(In reply to Marc Glisse from comment #2)
> (note that a minimal, self-contained testcase would be much better and
> shouldn't be hard to produce)


I don't mind doing so, but I don't quite know what is required to trigger this
issue.

After chatting with a friend, I realized yet another issue with the generated
assembly: it makes a lot of use of unaligned loads (movdqu) as opposed to
aligned ones (movdqa). Eigen types are aligned by design, so it should be
possible to use the (from what I've been told) faster aligned loads.
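
For illustration, this is a minimal sketch of the alignment property I mean,
written without Eigen. AlignedVec4i and isAligned are hypothetical names, and
the 16-byte figure is the alignment that movdqa requires of its memory
operand; I am assuming here that the Eigen type in question is aligned the
same way.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a 16-byte-aligned fixed-size vector type
// (assumption: Eigen::Vector4i has the same size and alignment).
struct alignas(16) AlignedVec4i {
    int data[4];
};

// The alignment is part of the type, so it is known at compile time.
static_assert(alignof(AlignedVec4i) == 16, "type must be 16-byte aligned");
static_assert(sizeof(AlignedVec4i) == 16, "type must hold four 32-bit ints");

// Returns true if p is a multiple of `alignment` bytes.
inline bool isAligned(const void* p, std::uintptr_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Run-time check that an object of the type really sits on a 16-byte
// boundary, as an aligned load would need.
inline void checkAlignment(const AlignedVec4i& v) {
    assert(isAligned(&v, 16));
    assert(isAligned(v.data, 16));
}
```

If the alignment is visible to the compiler through the type like this, I
would hope it can pick the aligned instructions on its own.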

Cheers