Re: SLP-based reduction vectorization

2019-01-24 Thread Richard Biener
On Mon, Jan 21, 2019 at 2:20 PM Anton Youdkevitch
 wrote:
>
> Here is the prototype for doing vectorized reduction
> using SLP approach. I would appreciate feedback if this
> is a feasible approach and if overall the direction is
> right.
>
> The idea is to vectorize a reduction like this
>
> S = A[0]+A[1]+...+A[N];
>
> into
>
> Sv = Av[0]+Av[1]+...+Av[N/VL];
>
>
> So that, for instance, the following code:
>
> typedef double T;
> T sum;
>
> void foo (T* __restrict__ a)
> {
>   sum = a[0] + a[1] + a[2] + a[3] + a[4] + a[5] + a[6] + a[7];
> }
>
>
> instead of:
>
> foo:
> .LFB23:
> .cfi_startproc
> movsd   (%rdi), %xmm0
> movsd   16(%rdi), %xmm1
> addsd   8(%rdi), %xmm0
> addsd   24(%rdi), %xmm1
> addsd   %xmm1, %xmm0
> movsd   32(%rdi), %xmm1
> addsd   40(%rdi), %xmm1
> addsd   %xmm1, %xmm0
> movsd   48(%rdi), %xmm1
> addsd   56(%rdi), %xmm1
> addsd   %xmm1, %xmm0
> movsd   %xmm0, sum(%rip)
> ret
> .cfi_endproc
>
>
> be compiled into:
>
> foo:
> .LFB11:
> .cfi_startproc
> movupd  32(%rdi), %xmm0
> movupd  48(%rdi), %xmm3
> movupd  (%rdi), %xmm1
> movupd  16(%rdi), %xmm2
> addpd   %xmm3, %xmm0
> addpd   %xmm2, %xmm1
> addpd   %xmm1, %xmm0
> haddpd  %xmm0, %xmm0
> movlpd  %xmm0, sum(%rip)
> ret
> .cfi_endproc
>
>
> As this is a very crude prototype there are some things
> to consider.
>
> 1. As the current SLP framework assumes the presence of
> grouped stores I cannot use it directly, since a reduction
> does not require grouped stores (or even stores at all);
> so I'm partially using the existing functionality, but
> sometimes I have to create a stripped-down version
> of it for my own needs;
>
> 2. The current version considers only PLUS reduction
> as it is encountered most often and therefore is the
> most practical;
>
> 3. While normally an SLP transformation should operate
> inside a single basic block, this requirement greatly
> restricts its practical application: in sufficiently
> complex code there will be vectorizable subexpressions
> defined in basic block(s) different from the one where the
> reduction result resides. However, for the sake of
> simplicity only single uses in the same block are
> considered now;
>
> 4. For the same reason the current version does not deal
> with partial reductions, which would require merging the
> partial sums and carefully removing the scalars that
> participate in the vector part. The latter gets done
> automatically by DCE in the case of full reduction
> vectorization;
>
> 5. There is no cost model yet, for the reasons mentioned
> in paragraphs 3 and 4.

First, sorry for the late response.

No, I don't think your approach of bypassing the "rest"
is OK.  I once started to implement BB reduction
support but somehow got distracted, IIRC.  Your testcase
(and the prototype I worked on) still has a (scalar, non-grouped)
store which can be keyed on in the SLP discovery phase.

You should be able to "re-use" (with a lot of refactoring, I guess)
the reduction-finding code (vect_is_slp_reduction) to see
whether such a store is fed by a reduction chain.  Note that
if you adjust the testcase to have

 sum[0] = a[0] + ... + a[n];
 sum[1] = b[0] + ... + b[n];

then you'll have a grouped store fed by reductions.  You
can also consider

 for (i = ...)
   {
     sum[i] = a[i*4] + a[i*4+1] + a[i*4+2] + a[i*4+3];
   }

which we should be able to handle.

So the whole problem of doing BB reductions boils down
to SLP tree discovery; the rest should be more straightforward
(of course code-gen needs to be adapted for the non-loop case
as well).
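
For the testcase above, code-gen would conceptually have to produce
something like the following (a hand-written sketch using GCC vector
extensions, not actual vectorizer output; it assumes the target
supports unaligned V2DF loads, matching the movupd/addpd/haddpd
sequence shown):

typedef double v2df __attribute__((vector_size (16)));
extern double sum;

void foo_sketch (double *__restrict__ a)
{
  v2df v0, v1, v2, v3;
  /* Four unaligned 16-byte vector loads (the movupd's).  */
  __builtin_memcpy (&v0, a + 0, sizeof v0);
  __builtin_memcpy (&v1, a + 2, sizeof v1);
  __builtin_memcpy (&v2, a + 4, sizeof v2);
  __builtin_memcpy (&v3, a + 6, sizeof v3);
  /* Three vector additions (the addpd's)...  */
  v2df t = (v0 + v1) + (v2 + v3);
  /* ...and a final horizontal reduction (the haddpd).  */
  sum = t[0] + t[1];
}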

It's not the easiest problem to tackle, btw ;)  May
I suggest you become familiar with the code by BB-vectorizing
vector CONSTRUCTORs instead?

typedef int v4si __attribute__((vector_size(16)));

v4si foo (int *i, int *j)
{
  return (v4si) { i[0] + j[0], i[1] + j[1], i[2] + j[2], i[3] + j[3] };
}

It has the same SLP discovery "issue", this time somewhat
easier as a CONSTRUCTOR directly plays the role of the
"grouped store".

Richard.

> Thanks in advance.
>
> --
>   Anton


Re: __builtin_dynamic_object_size

2019-01-24 Thread Richard Biener
On Wed, Jan 23, 2019 at 12:33 PM Jakub Jelinek  wrote:
>
> On Wed, Jan 23, 2019 at 10:40:43AM +, Jonathan Wakely wrote:
> > There's a patch to add __builtin_dynamic_object_size to clang:
> > https://reviews.llvm.org/D56760
> >
> > It was suggested that this could be done via a new flag bit for
> > __builtin_object_size, but only if GCC would support that too
> > (otherwise it would be done as a separate builtin).
> >
> > Is there any interest in adding that as an option to __builtin_object_size?
> >
> > I know Jakub is concerned about arbitrarily complex expressions, when
> > __builtin_object_size is supposed to always be efficient and always
> > evaluate at compile time (which would imply the dynamic behaviour
> > should be a separate builtin, if it exists at all).
>
> The current modes (0-3) certainly must not be changed and must return a
> constant, that is what huge amounts of code in the wild relies on.

I wouldn't overload _bos but only use a new builtin.

> The reason to choose constants only was the desire to make _FORTIFY_SOURCE
> cheap at runtime.  For the dynamically computed expressions, the question
> is how far it should go, how complex expressions it wants to build and how
> much code and runtime can be spent on computing that.
>
> The rationale for __builtin_dynamic_object_size lists only very simple
> cases, where the builtin is just called on result of malloc, so that is
> indeed easy, the argument is already evaluated before the malloc call, so
> you can just save it in a temporary and use later.  Slightly more complex
> is calloc, where you need to multiply two numbers (do you handle overflow
> somehow, or not?).  But in the real world it can be arbitrarily more
> complex: there can be pointer arithmetic with constant or variable
> offsets, there can be conditional adjustments of pointers or PHIs with
> multiple "dynamic" expressions for the sizes (shall the dynamic
> expression evaluate as a max over the expressions for the different
> PHI arguments (that is essentially what is done for the constant
> __builtin_object_size, but for dynamic expressions the max needs to be
> computed at runtime), or shall it reconstruct the conditional or
> remember it and compute whatever ? val1 : val2), loops which adjust
> pointers, etc., and all that can be done many times in between where
> the objects are allocated and where the builtin is used.

Which means I'd like to see a thorough specification of the builtin.
If it is allowed to return "failure" in any event then of what use is
the builtin in practice?

Richard.

> Jakub


How do I contribute to GCC?

2019-01-24 Thread akshay bhapkar
I am very excited to be a participant in GSoC this year.
I want to be a part of GSoC'19. How do I contribute to the
organisations or connect with them before GSoC begins?
Can you point me to some projects which could be of interest
for this year's GSoC, and to a mentor whom I could contact?


Re: SLP-based reduction vectorization

2019-01-24 Thread Anton Youdkevitch

Richard,

Thanks a lot for the response! I will definitely
try the constructor approach.

You mentioned a non-grouped store. Is the handling
of such stores actually there and I just missed
it? It was a big hassle for me, as I didn't manage
to find it and assumed there wasn't one.

I have a question (a lot of them, actually, but this
one is bothering me most). It is paragraph 4 of my
previous letter. In real-world code it can happen
that the loads of the elements and their uses (in a
reduction) are in different BBs (I have seen this
myself). Not only does this complicate things in
general, but for me it breaks some SLP code that
assumes single-BB operation (IIRC, some dataref
analysis phase assumes a single BB). Did anybody
consider this before?

OK, I know I'm starting to look kind of stubborn here,
but what about the case:

foo(A[0]+...+A[n])

There won't be any store here by definition, while
there will be a reduction. Or is this something too
rarely seen?

--
  Thanks,
  Anton


Re: SLP-based reduction vectorization

2019-01-24 Thread Richard Biener
On Thu, Jan 24, 2019 at 1:04 PM Anton Youdkevitch
 wrote:
>
> Richard,
>
> Thanks a lot for the response! I will definitely
> try the constructor approach.
>
> You mentioned a non-grouped store. Is the handling
> of such stores actually there and I just missed
> it? It was a big hassle for me, as I didn't manage
> to find it and assumed there wasn't one.

No, it isn't there.  On a branch I'm working on I'm
just doing something like

+  /* Find SLP sequences starting from single stores.  */
+  data_reference_p dr;
+  FOR_EACH_VEC_ELT (vinfo->shared->datarefs, i, dr)
+    if (DR_IS_WRITE (dr))
+      {
+        stmt_vec_info stmt_info = vinfo->lookup_dr (dr)->stmt;
+        if (STMT_SLP_TYPE (stmt_info))
+          continue;
+        if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
+          continue;
+        vect_analyze_slp_instance (vinfo, stmt_info, max_tree_size);
+      }

Note that this alone won't work, since you have to actually
build an initial set of scalar stmts from the reduction (like
vect_analyze_slp_instance does for groups just by gathering
their elements).  But here you could (and IIRC I did, back when
prototyping reduction BB vect support) hack in some ad-hoc
pattern matching of a series of PLUS, as sketched below.
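
A minimal illustration of such greedy matching (not code from the
branch; the function is made up, and real code would also have to
check for single uses, compatible types and vectorizable operands):

/* Greedily collect the leaf operands of a chain of PLUS_EXPRs
   rooted at VAL, e.g. the loads of a[0]...a[7] feeding the store
   in the original testcase.  */
static void
collect_plus_leaves (tree val, vec<tree> *leaves)
{
  gimple *def;
  if (TREE_CODE (val) == SSA_NAME
      && is_gimple_assign (def = SSA_NAME_DEF_STMT (val))
      && gimple_assign_rhs_code (def) == PLUS_EXPR)
    {
      collect_plus_leaves (gimple_assign_rhs1 (def), leaves);
      collect_plus_leaves (gimple_assign_rhs2 (def), leaves);
    }
  else
    leaves->safe_push (val);
}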

> I have a question (a lot of them, actually, but this
> one is bothering me most). It is paragraph 4 of my
> previous letter. In real-world code it can happen
> that the loads of the elements and their uses (in a
> reduction) are in different BBs (I have seen this
> myself). Not only does this complicate things in
> general, but for me it breaks some SLP code that
> assumes single-BB operation (IIRC, some dataref
> analysis phase assumes a single BB). Did anybody
> consider this before?

Sure, I considered this, but usually restricting oneself
to a single BB works quite well and simplifies dependence
analysis a lot.

> OK, I know I'm starting to look kind of stubborn here,
> but what about the case:
>
> foo(A[0]+...+A[n])
>
> There won't be any store here by definition, while
> there will be a reduction. Or is this something too
> rarely seen?

You are right - in principle a reduction can be "rooted"
at any point.  But you need to come up with an
algorithm with sensible cost (in time and memory)
to detect the reduction group.  The greedy matching
I talked about above can be applied anywhere, not
just at stores.

> --
>   Thanks,
>   Anton

Re: __builtin_dynamic_object_size

2019-01-24 Thread Richard Smith
[Please CC me; I'm not subscribed]

On Thu, 24 Jan 2019 at 11:59, Richard Biener wrote:
> On Wed, Jan 23, 2019 at 12:33 PM Jakub Jelinek  wrote:
> > On Wed, Jan 23, 2019 at 10:40:43AM +, Jonathan Wakely wrote:
> > > There's a patch to add __builtin_dynamic_object_size to clang:
> > > https://reviews.llvm.org/D56760
> > >
> > > It was suggested that this could be done via a new flag bit for
> > > __builtin_object_size, but only if GCC would support that too
> > > (otherwise it would be done as a separate builtin).
> > >
> > > Is there any interest in adding that as an option to 
> > > __builtin_object_size?
> > >
> > > I know Jakub is concerned about arbitrarily complex expressions, when
> > > __builtin_object_size is supposed to always be efficient and always
> > > evaluate at compile time (which would imply the dynamic behaviour
> > > should be a separate builtin, if it exists at all).
> >
> > The current modes (0-3) certainly must not be changed and must return a
> > constant, that is what huge amounts of code in the wild relies on.
>
> I wouldn't overload _bos but only use a new builtin.

Clang provides another useful attribute in this space:
__attribute__((pass_object_size(N))), when applied to a pointer
parameter to a function, causes the compiler to evaluate
__builtin_object_size in the caller, and pass the result to the callee
(see https://clang.llvm.org/docs/AttributeReference.html#pass-object-size).
This allows many of the _FORTIFY_SOURCE checks to be implemented
without forcibly inlining. One reason we'd like to see this added as a
flag bit rather than as a separate builtin is that it would then also
naturally extend to the pass_object_size attribute. But we certainly
don't want to conflict with any potential future GCC extension.
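
For illustration, a minimal sketch of how the attribute is used
(clang-only today; dest_size is a made-up helper, see the clang
docs linked above for the precise semantics):

#include <stddef.h>

/* Callers evaluate __builtin_object_size on their argument and
   implicitly pass the result along with the pointer.  */
static inline size_t
dest_size (char *dst __attribute__((pass_object_size (0))))
{
  /* This evaluates to the size the *caller* computed, so no
     forced inlining of dest_size is needed.  */
  return __builtin_object_size (dst, 0);
}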

If the GCC devs aren't interested in adding similar functionality,
could we reserve a bit for this behavior? E.g., if GCC would promise
never to use the values 4-7 for any other purpose, I expect that'd
work for us, but I'd certainly understand if you don't want to
guarantee that.

> > The reason to choose constants only was the desire to make _FORTIFY_SOURCE
> > cheap at runtime.  For the dynamically computed expressions, the question
> > is how far it should go, how complex expressions it wants to build and how
> > much code and runtime can be spent on computing that.
> >
> > The rationale for __builtin_dynamic_object_size lists only very simple
> > cases, where the builtin is just called on result of malloc, so that is
> > indeed easy, the argument is already evaluated before the malloc call, so
> > you can just save it in a temporary and use later.  Slightly more complex
> > is calloc, where you need to multiply two numbers (do you handle overflow
> > somehow, or not?).  But in the real world it can be arbitrarily more
> > complex: there can be pointer arithmetic with constant or variable
> > offsets, there can be conditional adjustments of pointers or PHIs with
> > multiple "dynamic" expressions for the sizes (shall the dynamic
> > expression evaluate as a max over the expressions for the different
> > PHI arguments (that is essentially what is done for the constant
> > __builtin_object_size, but for dynamic expressions the max needs to be
> > computed at runtime), or shall it reconstruct the conditional or
> > remember it and compute whatever ? val1 : val2), loops which adjust
> > pointers, etc., and all that can be done many times in between where
> > the objects are allocated and where the builtin is used.
>
> Which means I'd like to see a thorough specification of the builtin.

How far you go is a quality of implementation issue, just like with
all the existing __builtin_object_size modes -- you're always
permitted to return a conservatively-correct answer, for any mode.
(Ignoring the new flag bit would be a correct implementation, for
example.) But the intent is for this to be used in cases where users
want the security benefits more than they want a modicum of
performance.

> If it is allowed to return "failure" in any event then of what use is
> the builtin in practice?

The same question can be asked of the existing builtin, which has the
same property that it is allowed to return "failure" in any event.
That's already useful in practice, and it's hopefully also apparent
that catching just the easy non-constant cases (e.g., a parameter
passed to malloc also gets used as the object size of the returned
pointer) would also be useful in practice, as the sketch below
illustrates.
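
A minimal sketch of that easy case (using the builtin name from the
clang review linked above; GCC currently provides no such builtin):

#include <stdlib.h>

size_t f (size_t n)
{
  char *p = malloc (n);
  /* __builtin_object_size (p, 0) must fold to a compile-time
     constant, so with a non-constant n it can only return
     (size_t)-1, i.e. "unknown".  The dynamic variant is allowed
     to save n in a temporary before the call and return it here
     at runtime.  */
  return __builtin_dynamic_object_size (p, 0);
}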


gcc-7-20190124 is now available

2019-01-24 Thread gccadmin
Snapshot gcc-7-20190124 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/7-20190124/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 7 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-7-branch 
revision 268252

You'll find:

 gcc-7-20190124.tar.xz    Complete GCC

  SHA256=9fea5a31d116aaafbf99abcba352640aed69d7712ecc844bb1f8797863eb327f
  SHA1=b2311a408b109c4d75ef07b9aad964f5012ad6d5

Diffs from 7-20190117 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-7
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


Enabling LTO for target libraries (e.g., libgo, libstdc++)

2019-01-24 Thread Nikhil Benesch
I am attempting to convince GCC to build target libraries with link-time
optimizations enabled. I am primarily interested in libgo, but this discussion
seems like it would be applicable to libstdc++, libgfortran, etc. The
benchmarking I've done suggests that LTOing libgo yields a 5-20% speedup on
various Go programs, which is quite substantial.

The trouble is convincing GCC's build system to apply the various LTO flags to
the correct places. Ian Taylor suggested the following to plumb -flto into
libgo compilation:

$ make GOCFLAGS_FOR_TARGET="-g -O3 -flto"

This nearly works, and I believe there are analogous options that would apply to
the other target libraries that GCC builds.

The trouble is that while building libgo, the build system uses ar and ranlib
directly from binutils, without providing them with the LTO plugin that was
built earlier. This means that the LTO information is dropped on the floor, and
attempting to link with the built libgo archive will fail.

I have a simple patch to the top-level configure.ac that resolves the issue by
teaching the build system to use the gcc-ar and gcc-ranlib wrappers which were
built earlier and know how to pass the linker plugin to the underlying ar/ranlib
commands. The patch is small enough that I've included it at the end of this
email.

My question is whether this is a reasonable thing to do. It seems like using
the gcc-ar and gcc-ranlib wrappers strictly improves the situation, and won't
impact compilations that don't specify -flto. But I'm not familiar enough with
the build system to say for sure.

Does anyone have advice to offer? Has anyone tried convincing the build system
to compile some of the other target libraries (like libstdc++ or libgfortran)
with -flto?
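
For reference, the wrappers are thin shims that run the binutils tools
with --plugin pointing at GCC's liblto_plugin, so archives containing
LTO bytecode get usable symbol tables (a hand-written illustration,
not output from the build):

$ gcc-ar rc libgo.a *.o    # invokes ar with --plugin .../liblto_plugin.so
$ gcc-ranlib libgo.a       # likewise for ranlib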

diff --git a/configure.ac b/configure.ac
index 87f2aee05008..1c38ac5979ff 100644
--- a/configure.ac
+++ b/configure.ac
@@ -3400,7 +3400,8 @@ ACX_CHECK_INSTALLED_TARGET_TOOL(WINDMC_FOR_TARGET, windmc)
 
 RAW_CXX_FOR_TARGET="$CXX_FOR_TARGET"
 
-GCC_TARGET_TOOL(ar, AR_FOR_TARGET, AR, [binutils/ar])
+GCC_TARGET_TOOL(ar, AR_FOR_TARGET, AR,
+   [gcc/gcc-ar -B$$r/$(HOST_SUBDIR)/gcc/])
 GCC_TARGET_TOOL(as, AS_FOR_TARGET, AS, [gas/as-new])
 GCC_TARGET_TOOL(cc, CC_FOR_TARGET, CC, [gcc/xgcc -B$$r/$(HOST_SUBDIR)/gcc/])
 dnl see comments for CXX_FOR_TARGET_FLAG_TO_PASS
@@ -3424,7 +3425,8 @@ GCC_TARGET_TOOL(nm, NM_FOR_TARGET, NM, [binutils/nm-new])
 GCC_TARGET_TOOL(objcopy, OBJCOPY_FOR_TARGET, OBJCOPY, [binutils/objcopy])
 GCC_TARGET_TOOL(objdump, OBJDUMP_FOR_TARGET, OBJDUMP, [binutils/objdump])
 GCC_TARGET_TOOL(otool, OTOOL_FOR_TARGET, OTOOL)
-GCC_TARGET_TOOL(ranlib, RANLIB_FOR_TARGET, RANLIB, [binutils/ranlib])
+GCC_TARGET_TOOL(ranlib, RANLIB_FOR_TARGET, RANLIB,
+   [gcc/gcc-ranlib -B$$r/$(HOST_SUBDIR)/gcc/])
 GCC_TARGET_TOOL(readelf, READELF_FOR_TARGET, READELF, [binutils/readelf])
 GCC_TARGET_TOOL(strip, STRIP_FOR_TARGET, STRIP, [binutils/strip-new])
 GCC_TARGET_TOOL(windres, WINDRES_FOR_TARGET, WINDRES, [binutils/windres])
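
One caveat (a workflow assumption on my part, not part of the patch):
the top-level configure script is generated, so after applying this to
configure.ac it has to be regenerated, e.g.:

$ cd $GCC_SRC && autoconf   # GCC_SRC: top-level source dir; GCC expects
                            # a specific autoconf version, see install docs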