vec_ld versus vec_vsx_ld on power8

2015-03-13 Thread Ewart Timothée
Hello all,

I have an issue/question using VMX/VSX on a Power8 processor on a little endian
system.
Using intrinsic functions, if I perform an operation with vec_vsx_ld(…) -
vec_vsx_st(), the compiler will add
a permutation, and then perform the operations (memory correctly aligned)

lxvd2x …
xxpermdi …
operations ….
xxpermdi
stxvd2x …

If I use vec_ld() - vec_st()

lvx
operations …
stvx

Reading the ISA, I do not see a real difference between these 2 instructions
(or I missed it)

So my 3 questions are:

Why do I have permutations?
What is the cost of these permutations?
What is the performance difference between vec_vsx_ld and vec_ld?


Best

Tim



Timothée Ewart, Ph. D. 
http://www.linkedin.com/in/tewart
timothee.ew...@epfl.ch








Re: Undefined behavior due to 6.5.16.1p3

2015-03-13 Thread Vincent Lefevre
On 2015-03-12 13:55:50 -0600, Martin Sebor wrote:
> On 03/12/2015 03:10 AM, Vincent Lefevre wrote:
> >Well, this depends on the interpretation of effective types in the
> >case of a union. For instance, when writing
> >
> >   union { char a[16]; int b; } u;
> >   u.b = 1;
> >
> >you don't set only the member (an int); the whole union object is
> >affected, even bytes that are not part of the int. So, one may see
> >the effective type as being the union type.
> 
> The purpose of the term /effective type/ is to make it possible
> to talk about types of allocated objects (those with no declared
> type). In the example above, u.b is declared to have the type
> int and assigning to it doesn't change the type of other members
> of the union. But because u.a has a character type the value of
> u.b can be accessed via u.a (or any other lvalue of that type).

But if an object is declared with type T, the effective type of this
object is T, right? So, above, the effective type of u is the union
{ char a[16]; int b; } type.

-- 
Vincent Lefèvre  - Web: 
100% accessible validated (X)HTML - Blog: 
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)


Why not implementation of interrupt attribute on IA32/x86-64

2015-03-13 Thread Didier Garcin

Hi,

many OS hobbyist developers would be pleased if GCC implemented the
interrupt or interrupt_handler attribute for the Intel architecture.


Would it be so difficult to implement for this architecture?

Could you plan it ?

Thanks a lot for your answer.
Best regards
Didier


Re: Why not implementation of interrupt attribute on IA32/x86-64

2015-03-13 Thread Andi Kleen
Didier Garcin  writes:

> many OS hobbyist developers would be pleased if GCC implemented the
> interrupt or interrupt_handler attribute for the Intel architecture.
>
> Would it be so difficult to implement for this architecture?

There are lots of different ways to implement interrupts on x86
(e.g. what state to save, what registers to set up). It would
be unlikely that gcc would select a subset that worked for most
people.

You're better off with an assembler wrapper that does exactly the
setup you want.
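
For illustration, a minimal sketch of such a wrapper, written as top-level
asm in a C++ file (all names are hypothetical; it assumes the x86-64 SysV
ABI, an interrupt vector that pushes no error code, and a C handler that
leaves FP/SSE state alone — a sketch, not a drop-in implementation):

  // Hypothetical C handler the wrapper delegates to.
  extern "C" void my_irq_handler(void);

  // Save the SysV caller-saved integer registers, call the C handler,
  // restore, and return with iretq.  With no error code, the 5-qword
  // hardware frame plus these 9 pushes leaves the stack 16-byte
  // aligned at the call, as the ABI requires.
  asm(".globl my_irq_entry\n"
      "my_irq_entry:\n"
      "    push %rax\n"
      "    push %rcx\n"
      "    push %rdx\n"
      "    push %rsi\n"
      "    push %rdi\n"
      "    push %r8\n"
      "    push %r9\n"
      "    push %r10\n"
      "    push %r11\n"
      "    call my_irq_handler\n"
      "    pop  %r11\n"
      "    pop  %r10\n"
      "    pop  %r9\n"
      "    pop  %r8\n"
      "    pop  %rdi\n"
      "    pop  %rsi\n"
      "    pop  %rdx\n"
      "    pop  %rcx\n"
      "    pop  %rax\n"
      "    iretq\n");

Exactly which registers and how much extra state to save is the part that
varies from one OS design to the next, which is why a one-size-fits-all
attribute is hard.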

-Andi


PR65416, alloca on xtensa

2015-03-13 Thread Max Filippov
Hi Sterling,

I've got an issue building gdb for xtensa linux with gcc, reported it here:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65416

Looking at it, I've got two questions; can you help me with them?

1. in the windowed ABI the stack pointer update is always split into two
  opcodes: add and movsp. How are gcc optimization passes supposed to know
  that 'movsp' is related to 'add' and that stack allocation is complete
  only after movsp?

2. alloca seems to add an additional 16 bytes of padding to each stack
  allocation: alloca(1) results in moving sp down by 32 bytes, alloca(17)
  moves it by 48 bytes, etc. This padding looks unnecessary to me: either
  this space is not used (the previous register frame is not spilled), or the
  alloca exception handler will take care of reloading or moving spilled
  registers to a new location. In both cases, after movsp this space is just
  wasted.
  Do you know why this padding may be needed?
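
A minimal reproducer sketch for item 2 (consume is a placeholder to keep
the allocation live; assumes a glibc-style alloca.h and an xtensa
toolchain):

  #include <alloca.h>

  extern void consume(void *p);

  // Compile with e.g. xtensa-linux-gcc -O2 -S and look at the
  // add/movsp pair: alloca(1) moves sp down by 32 bytes and
  // alloca(17) by 48, i.e. 16 bytes more than the 16-byte stack
  // alignment alone would require.
  void f(void)
  {
      consume(alloca(1));
  }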

-- 
Thanks.
-- Max


vec_ld versus vec_vsx_ld on power8

2015-03-13 Thread Bill Schmidt
Hi Tim,

I'll discuss the loads here for simplicity; the situation for stores is
analogous.

There are a couple of differences between lvx and lxvd2x.  The most
important one is that lxvd2x supports unaligned loads, while lvx does
not.  You'll note that lvx will zero out the lower 4 bits of the
effective address in order to force an aligned load.

lxvd2x loads two doublewords into a vector register using big-endian
element order, regardless of whether the processor is running in
big-endian or little-endian mode.  That is, the first doubleword from
memory goes into the high-order bits of the vector register, and the
second doubleword goes into the low-order bits.  This is semantically
incorrect for little-endian, so the xxpermdi swaps the doublewords in
the register to correct for this.

At optimization level -O1 and higher, gcc will remove many of the xxpermdi
instructions that are added to correct for LE semantics.  In many vector
computations, the lanes where the computations are performed do not
matter, so we don't have to perform the swaps.

For unaligned loads where we are unable to remove the swaps, this is
still better than the alternative using lvx.  An unaligned load requires
a four-instruction sequence to load the two aligned quadwords that
contain the desired data, set up a permutation control vector, and
combine the desired pieces of the two aligned quadwords into a vector
register.  This can be pipelined in a loop so that only one load occurs
per loop iteration, but that requires additional vector copies.  The
four-instruction sequence takes longer and increases vector register
pressure more than an lxvd2x/xxpermdi.

When the data is known to be aligned, lvx is equivalent to lxvd2x in
performance if we are able to remove the permutes, and is preferable to
lxvd2x if not.

There are cases where we do not yet use lvx in lieu of lxvd2x when we
could do so and improve performance.  For example, saving and restoring
of vector parameters in a function prologue and epilogue does not yet always
use lvx.  This is a performance opportunity we plan to improve in the
future.

A rule of thumb for your purposes is that if you can guarantee that you
are using aligned data, you should use vec_ld and vec_st, and otherwise
you should use vec_vsx_ld and vec_vsx_st.  Depending on your
application, it may be worthwhile to copy your data into an aligned
buffer before performing vector calculations on it.  GCC provides
attributes that will allow you to specify alignment on a 16-byte
boundary.
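
For example, a sketch of such alignment declarations (the buffer and type
names are illustrative; the attribute itself is the documented GCC one):

  // 16-byte aligned buffers are safe for lvx/stvx via vec_ld/vec_st.
  float in[1024]  __attribute__((aligned(16)));
  float out[1024] __attribute__((aligned(16)));

  // The attribute also works on a type, so every instance is aligned:
  typedef struct { double d[128]; } block16
      __attribute__((aligned(16)));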

Note that the above discussion presumes POWER8, which is the only POWER
hardware that currently supports little-endian distributions and
applications.  Unaligned load/store performance on earlier processors
was less efficient, so the tradeoffs differ.

I hope this is helpful!

Bill Schmidt, Ph.D.
IBM Linux Technology Center

You wrote:

> I have an issue/question using VMX/VSX on a Power8 processor on a little
> endian system.
> Using intrinsic functions, if I perform an operation with vec_vsx_ld(…) -
> vec_vsx_st(), the compiler will add
> a permutation, and then perform the operations (memory correctly aligned)

> lxvd2x …
> xxpermdi …
> operations …
> xxpermdi
> stxvd2x …

> If I use vec_ld() - vec_st()

> lvx
> operations …
> stvx

> Reading the ISA, I do not see a real difference between these 2 instructions
> (or I missed it)

> So my 3 questions are:

> Why do I have permutations?
> What is the cost of these permutations?
> What is the performance difference between vec_vsx_ld and vec_ld?





Re: vec_ld versus vec_vsx_ld on power8

2015-03-13 Thread Ewart Timothée
thank you very much for this answer.
I know my memory is aligned so I will use vec_ld/st only.

best

Tim







Re: PR65416, alloca on xtensa

2015-03-13 Thread augustine.sterl...@gmail.com
On Fri, Mar 13, 2015 at 7:54 AM, Max Filippov  wrote:
> 1. in the windowed ABI the stack pointer update is always split into two
>   opcodes: add and movsp. How are gcc optimization passes supposed to know
>   that 'movsp' is related to 'add' and that stack allocation is complete
>   only after movsp?

The movsp has a data dependency on the add, so that is not the
problem. The real problem (which you identify) is that the optimizer
isn't recognizing that the stack allocation isn't complete until the
movsp is done.

In the bug, the optimizer recognizes that a3 = sp + 32, and so it
uses a3 before it updates sp to get a better schedule, but doesn't
realize that the stack space hasn't been allocated yet.

It looks to me like the optimizer has finally gotten smart enough that
we need to implement one of the save_stack_* patterns instead of what
we have now.


> 2. alloca seems to add an additional 16 bytes of padding to each stack
>   allocation: alloca(1) results in moving sp down by 32 bytes, alloca(17)
>   moves it by 48 bytes, etc. This padding looks unnecessary to me: either
>   this space is not used (the previous register frame is not spilled), or the
>   alloca exception handler will take care of reloading or moving spilled
>   registers to a new location. In both cases, after movsp this space is just
>   wasted.
>   Do you know why this padding may be needed?

Answering this question definitively requires some time with the ABI
manual, which I don't have. You may be right, but I would check what
XCC does in this case. It is far better tested.


Re: vec_ld versus vec_vsx_ld on power8

2015-03-13 Thread Bill Schmidt
Hi Tim,

Actually, I left out another very good reason why you may want to use
vec_vsx_ld/st.  Sorry for forgetting this.

As you saw, vec_ld translates into the lvx instruction.  This
instruction loads a sequence of 16 bytes into a vector register.  For
big endian, the first byte in memory is loaded into the high order byte
of the register.  For little endian, the first byte in memory is loaded
into the low order byte of the register.

This is fine if the data you are loading is arrays of characters, but is
not so fine if you are loading arrays of larger items.  Suppose you are
loading four integers {1, 2, 3, 4} into a register with lvx.  In big
endian you will see:

  00 00 00 01  00 00 00 02  00 00 00 03  00 00 00 04

In little endian you will see:

  04 00 00 00  03 00 00 00  02 00 00 00  01 00 00 00

But for this to be interpreted as a vector of integers ordered for
little endian, what you really want is:

  00 00 00 04  00 00 00 03  00 00 00 02  00 00 00 01

If you use vec_vsx_ld, the compiler will generate an lxvd2x instruction
followed by an xxpermdi that swaps the doublewords.  After the lxvd2x
you will have:

  00 00 00 02  00 00 00 01  00 00 00 04  00 00 00 03

because the two LE doublewords are loaded in BE (reversed) order.
Swapping the two doublewords restores sanity:

  00 00 00 04  00 00 00 03  00 00 00 02  00 00 00 01

So, even if your data is properly aligned, the use of vec_ld = lvx is
only correct if you are loading arrays of bytes.  Arrays of anything
larger must use vec_vsx_ld to avoid errors.
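
For example, a sketch of the portable form (the kernel itself is
illustrative; assumes gcc -maltivec -mvsx on POWER8):

  #include <altivec.h>

  // vec_vsx_ld/vec_vsx_st tolerate unaligned pointers, and on LE the
  // compiler emits the lxvd2x/xxpermdi pairs described above, removing
  // them when lane order doesn't matter.
  void add4(const int *a, const int *b, int *c, int n)
  {
      for (int i = 0; i + 4 <= n; i += 4) {
          vector int va = vec_vsx_ld(0, &a[i]);
          vector int vb = vec_vsx_ld(0, &b[i]);
          vec_vsx_st(vec_add(va, vb), 0, &c[i]);
      }
  }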

Again, sorry for my previous omission!

Thanks,

Bill Schmidt, Ph.D.
IBM Linux Technology Center

On Fri, 2015-03-13 at 15:42 +, Ewart Timothée wrote:
> thank you very much for this answer.
> I know my memory is aligned so I will use vec_ld/st only.
> 
> best
> 
> Tim




RE: PR65416, alloca on xtensa

2015-03-13 Thread Marc Gauthier
augustine.sterl...@gmail.com wrote:
> On Fri, Mar 13, 2015 at 7:54 AM, Max Filippov  wrote:
[...]
> > 2. alloca seems to add an additional 16 bytes of padding to each stack
> >   allocation: alloca(1) results in moving sp down by 32 bytes, alloca(17)
> >   moves it by 48 bytes, etc. This padding looks unnecessary to me: either
> >   this space is not used (the previous register frame is not spilled), or
> >   the alloca exception handler will take care of reloading or moving
> >   spilled registers to a new location. In both cases, after movsp this
> >   space is just wasted.
> >   Do you know why this padding may be needed?
> 
> Answering this question definitively requires some time with the ABI
> manual, which I don't have. You may be right, but I would check what
> XCC does in this case. It is far better tested.

Other than the required 16-byte stack alignment, there's nothing in
the ABI that requires these extra 16 bytes.  Perhaps there was a bad
implementation of the alloca exception handler at some point a long
time ago that prompted the extra 16 bytes?

Today XCC doesn't add the extra 16 bytes.  alloca(n) with n in a2
comes out as this:

   0x6490 <+12>:  movi.n  a8, -16
   0x6492 <+14>:  addi.n  a3, a2, 15
   0x6494 <+16>:  and     a3, a3, a8
   0x6497 <+19>:  sub     a3, a1, a3
   0x649a <+22>:  movsp   a1, a3

which just rounds n up to the next multiple of 16 (a3 = (n + 15) & -16),
with no extra padding.

-Marc


Re: PR65416, alloca on xtensa

2015-03-13 Thread augustine.sterl...@gmail.com
On Fri, Mar 13, 2015 at 10:04 AM, Marc Gauthier  wrote:
> Other than the required 16-byte stack alignment, there's nothing in
> the ABI that requires these extra 16 bytes.  Perhaps there was a bad
> implementation of the alloca exception handler at some point a long
> time ago that prompted the extra 16 bytes?

The alloca handler has been rewritten at least twice since this code
was last updated, so that wouldn't surprise me at all. I would approve
a change to eliminate it.


Re: vec_ld versus vec_vsx_ld on power8

2015-03-13 Thread Ewart Timothée
Hello,

I am super confused now.

scenario 1, what I have in my code:
machine boots in LE.

1) memory: LE
2) I load (vec_ld)
3) register: LE
4) VSU computes in LE
5) I store (vec_st)
6) memory: LE

scenario 2: (I did not test, but it is what I get if I tell gcc to compile
in BE)
machine boots in BE

1) memory: BE
2) I load (vec_vsx_ld)
3) register: BE
4) VSU computes in BE
5) I store (vec_vsx_st)
6) memory: BE

At this point the VSU computes in both orders.

chimera scenario 3, what I understand:

machine boots in LE

1) memory: LE
2) I load (vec_vsx_ld) (the load swaps the elements)
3) register: BE
4) swap: LE
5) VSU computes in LE
6) swap: BE
7) I store (vec_vsx_st) (the store swaps the elements)
8) memory: BE

I understand vec_vsx_ld/st load/store as LE/BE, but as the VSU can compute
in both orders, what should I swap? (Note that I am working with 32/64-bit
floats.)

Best,

Tim

Timothée Ewart, Ph. D. 
http://www.linkedin.com/in/tewart
timothee.ew...@epfl.ch






> Le 13 Mar 2015 à 17:50, Bill Schmidt  a écrit :
> [...]
> So, even if your data is properly aligned, the use of vec_ld = lvx is
> only correct if you are loading arrays of bytes.  Arrays of anything
> larger must use vec_vsx_ld to avoid errors.


Re: vec_ld versus vec_vsx_ld on power8

2015-03-13 Thread Bill Schmidt
Hi Tim,

Sorry to have confused you.  This stuff is a bit boggling the first 200
times you look at it...

For both 32-bit and 64-bit floating-point, you should use vec_vsx_ld on
both BE and LE machines, and the compiler will take care of doing the
right thing for you in both cases.  You do not have to add any swaps
yourself.

When compiling for big-endian, vec_vsx_ld will translate into either
lxvw4x (for 32-bit floating-point) or lxvd2x (for 64-bit
floating-point).  The values will be loaded into the register from
left to right (BE ordering).

When compiling for little-endian, vec_vsx_ld will translate into lxvd2x
followed by xxpermdi for both 32-bit and 64-bit floating-point.  This
does the right thing in both cases.  The values will be loaded into the
register from right to left (LE ordering).
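
For example, a sketch of a 32-bit float kernel that compiles unchanged for
BE and LE (illustrative names; assumes gcc -mvsx; vec_madd with a zero
addend is the classic AltiVec multiply idiom):

  #include <altivec.h>

  // Scale x[0..n) by s.  On BE the load becomes lxvw4x; on LE it
  // becomes lxvd2x + xxpermdi, and the compiler removes the swaps
  // when lane order doesn't matter.  No hand-written swaps needed.
  void scale4(float *x, float s, int n)
  {
      vector float vs   = {s, s, s, s};
      vector float zero = {0.0f, 0.0f, 0.0f, 0.0f};
      for (int i = 0; i + 4 <= n; i += 4) {
          vector float vx = vec_vsx_ld(0, &x[i]);
          vec_vsx_st(vec_madd(vx, vs, zero), 0, &x[i]);
      }
  }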

The vector programming model is set up to allow you to usually code the
same way for both BE and LE.  This is discussed more in Chapter 6 of the
ELFv2 ABI manual, which can be obtained from the OpenPOWER Connect
website (free registration required):

https://www-03.ibm.com/technologyconnect/tgcm/TGCMServlet.wss?alias=OpenPOWER&linkid=1n

Bill


On Fri, 2015-03-13 at 17:11 +, Ewart Timothée wrote:
> [...]
> I understand vec_vsx_ld/st load/store as LE/BE, but as the VSU can
> compute in both orders, what should I swap? (Note that I am working
> with 32/64-bit floats.)



[PATCH] jit docs: Add "Packaging notes" section

2015-03-13 Thread David Malcolm
On Wed, 2015-03-04 at 11:09 -0500, David Malcolm wrote:
> On Tue, 2015-03-03 at 11:49 +0100, Matthias Klose wrote:
> > Both gccjit and gnat now use sphinx to build the documentation.  While not a
> > direct part of the build process, it would be nice to document the 
> > requirements
> > on sphinx, and agree on a common version used to generate that 
> > documentation.
> > 
> > Coming from a distro background where I have to "build from source", I know 
> > that
> > sphinx is a bit less stable than say doxygen and texinfo.  So some kind of
> > version information, about not using sphinx plugins, etc. would be 
> > appreciated.

[...]

> On the subject of packaging: when building libgccjit,
> --enable-host-shared is needed, to get position-independent code, which
> will slow down the regular compiler by a few percent.  Hence when
> packaging gcc with libgccjit, please configure and build twice: once
> without --enable-host-shared for most languages, and once with
> --enable-host-shared for the jit (this is what Jakub's done for Fedora
> packages of gcc 5).  AIUI, one should "make install" both
> configurations, presumably installing the configuration with
> --enable-host-shared, *then* the one without, so that the faster build
> of "cc1" et al overwrites the slower build.
> 
> (assuming all of the above is correct, I'll write it up for the jit
> docs).

I've committed the following to trunk as r221425.

gcc/jit/ChangeLog:
* docs/internals/index.rst (Packaging notes): New section.
* docs/_build/texinfo/libgccjit.texi: Regenerate.
---
 gcc/jit/docs/internals/index.rst | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/gcc/jit/docs/internals/index.rst b/gcc/jit/docs/internals/index.rst
index cf024f3..d0852f9 100644
--- a/gcc/jit/docs/internals/index.rst
+++ b/gcc/jit/docs/internals/index.rst
@@ -236,6 +236,54 @@ variables:
   ./jit-hello-world
   hello world
 
+Packaging notes
+---------------
+The configure-time option :option:`--enable-host-shared` is needed when
+building the jit in order to get position-independent code.  This will
+slow down the regular compiler by a few percent.  Hence when packaging gcc
+with libgccjit, please configure and build twice:
+
+  * once without :option:`--enable-host-shared` for most languages, and
+
+  * once with :option:`--enable-host-shared` for the jit
+
+For example:
+
+.. code-block:: bash
+
+  # Configure and build with --enable-host-shared
+  # for the jit:
+  mkdir configuration-for-jit
+  pushd configuration-for-jit
+  $(SRCDIR)/configure \
+    --enable-host-shared \
+    --enable-languages=jit \
+    --prefix=$(DESTDIR)
+  make
+  popd
+
+  # Configure and build *without* --enable-host-shared
+  # for maximum speed:
+  mkdir standard-configuration
+  pushd standard-configuration
+  $(SRCDIR)/configure \
+    --enable-languages=all \
+    --prefix=$(DESTDIR)
+  make
+  popd
+
+  # Both of the above are configured to install to $(DESTDIR)
+  # Install the configuration with --enable-host-shared first
+  # *then* the one without, so that the faster build
+  # of "cc1" et al overwrites the slower build.
+  pushd configuration-for-jit
+  make install
+  popd
+
+  pushd standard-configuration
+  make install
+  popd
+
 Overview of code structure
 --------------------------
 
-- 
1.8.5.3



Re: PR65416, alloca on xtensa

2015-03-13 Thread Max Filippov
On Fri, Mar 13, 2015 at 8:08 PM, augustine.sterl...@gmail.com
 wrote:
> On Fri, Mar 13, 2015 at 10:04 AM, Marc Gauthier  wrote:
>> Other than the required 16-byte stack alignment, there's nothing in
>> the ABI that requires these extra 16 bytes.  Perhaps there was a bad
>> implementation of the alloca exception handler at some point a long
>> time ago that prompted the extra 16 bytes?
>
> The alloca handler has been rewritten at least twice since this code
> was last updated, so that wouldn't surprise me at all. I would approve
> a change to eliminate it.

Ok, thanks to both of you. I'll try to come up with fixes.

-- 
Thanks.
-- Max


RE: Proposal for adding splay_tree_find (to find elements without updating the nodes).

2015-03-13 Thread Aditya K
---
> Date: Tue, 10 Mar 2015 11:20:07 +0100
> Subject: Re: Proposal for adding splay_tree_find (to find elements without 
> updating the nodes).
> From: richard.guent...@gmail.com
> To: stevenb@gmail.com
> CC: hiradi...@msn.com; gcc@gcc.gnu.org
>
> On Mon, Mar 9, 2015 at 11:59 PM, Steven Bosscher  
> wrote:
>> On Mon, Mar 9, 2015 at 7:59 PM, vax mzn wrote:
>>> w.r.t. https://gcc.gnu.org/wiki/Speedup_areas, where we want to improve
>>> the performance of splay trees.
>>>
>>> The function `splay_tree_node splay_tree_lookup (splay_tree,
>>> splay_tree_key);'
>>> updates the nodes every time a lookup is done.
>>>
>>> IIUC, there are places where we call this function in a loop, i.e., we
>>> look up different elements every time.
>>> e.g.,
>>> In this example we are looking for a different `t' in each iteration.
>>
>>
>> If that's really true, then a splay tree is a horrible choice of data
>> structure. The splay tree will simply degenerate to a linked list. The
>> right thing to do would be, not to "break" one of the key features of
>> splay trees (i.e. the latest lookup is always on top), but to use
>> another data structure.
>
> I agree with Steven here and wanted to say the same. If you don't
> benefit from splay trees LRU scheme then use a different data structure.
>
> Richard.
>
>> Ciao!
>> Steven


Thanks Richard and Steven for the feedback. I tried to replace the use of splay
trees with std::map, and got it to bootstrap.
The compile time on a few runs shows improvement, but I won't trust that data
because it looks too good to be true.
Maybe I don't have enough data points, and this machine runs other things too.

                                                         |  Baseline: 1426175470 (SHA:7386c9d) |  Patch: 1426258562 (SHA:592c06f)
Program                                                  |  CC_Time  CC_Real_Time              |  CC_Time  CC_Real_Time
MultiSource/Applications/ALAC/decode/alacconvert-decode  |   4.2605        4.9840              |   2.6455        3.0826
MultiSource/Applications/ALAC/encode/alacconvert-encode  |   4.3124        4.9600              |   2.7138        3.0725
MultiSource/Applications/Burg/burg                       |   5.6053        6.4204              |   3.5663        4.0837
MultiSource/Applications/JM/ldecod/ldecod                |  33.3773       35.9444              |  20.7180       22.4260
MultiSource/Applications/JM/lencod/lencod                |  74.2836       78.6588              |  47.3016       50.0484
MultiSource/Applications/SIBsim4/SIBsim4                 |   5.5832        5.8932              |   3.0524        3.2456
MultiSource/Applications/SPASS/SPASS                     |  67.4992       72.1056              |  43.2258       45.7996
MultiSource/Applications/aha/aha                         |   0.5019        0.5894              |   0.3860        0.4406
MultiSource/Applications/d/make_dparser                  |  13.4930       14.5575              |   8.5084        9.2331
MultiSource/Applications/hbd/hbd                         |   4.7727        5.9225              |   2.9896        3.8366
MultiSource/Applications/hexxagon/hexxagon               |   3.1735        3.6957              |   1.8297        2.2171
MultiSource/Applications/kimwitu++/kc                    |  29.6117       31.8364              |  18.0744       19.5862
MultiSource/Applications/lambda-0.1.3/lambda             |   4.5274        4.9125              |   2.7136        2.9241


I have attached my patch. Please give feedback/suggestions for improvement.

Thanks
-Aditya



  

splay.patch
Description: Binary data


Re: Proposal for adding splay_tree_find (to find elements without updating the nodes).

2015-03-13 Thread Jonathan Wakely
Are you sure your compare_variables functor is correct?

Subtracting the two values seems very strange for a strict weak ordering.

(Also "compare_variables" is a pretty poor name!)


RE: Proposal for adding splay_tree_find (to find elements without updating the nodes).

2015-03-13 Thread Aditya K
You're right. I'll change this to:

/* A stable comparison functor to sort trees.  */
struct tree_compare_decl_uid {
  bool operator () (const tree &xa, const tree &xb) const
  {
    return DECL_UID (xa) < DECL_UID (xb);
  }
};
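
For illustration, a self-contained sketch of how such a functor drives a
std::map; decl_t and its uid field are hypothetical stand-ins for GCC's
tree and DECL_UID:

  #include <iostream>
  #include <map>
  #include <string>

  // Stand-in for a GCC tree node; 'uid' plays the role of DECL_UID.
  struct decl_t { unsigned uid; std::string name; };

  struct decl_compare_uid {
      bool operator()(const decl_t *a, const decl_t *b) const
      { return a->uid < b->uid; }   // stable: UIDs are unique
  };

  int main()
  {
      decl_t x{42, "x"}, y{7, "y"};
      std::map<const decl_t *, int, decl_compare_uid> m;
      m[&x] = 1;
      m[&y] = 2;
      for (const auto &kv : m)      // visits y (uid 7), then x (uid 42)
          std::cout << kv.first->name << '\n';
  }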

New patch attached.


Thanks,
-Aditya



> Date: Fri, 13 Mar 2015 19:02:11 +
> Subject: Re: Proposal for adding splay_tree_find (to find elements without 
> updating the nodes).
> From: jwakely@gmail.com
> To: hiradi...@msn.com
> CC: richard.guent...@gmail.com; stevenb@gmail.com; gcc@gcc.gnu.org
>
> Are you sure your compare_variables functor is correct?
>
> Subtracting the two values seems very strange for a strict weak ordering.
>
> (Also "compare_variables" is a pretty poor name!)
  

splay.patch
Description: Binary data


Re: PR65416, alloca on xtensa

2015-03-13 Thread Segher Boessenkool
On Fri, Mar 13, 2015 at 05:54:48PM +0300, Max Filippov wrote:
> 2. alloca seems to add an additional 16 bytes of padding to each stack
>   allocation: alloca(1) results in moving sp down by 32 bytes, alloca(17)
>   moves it by 48 bytes, etc.

This sounds like PR 50938, 47353, 34548, maybe more?  Happens on most
targets.


Segher


Re: PR65416, alloca on xtensa

2015-03-13 Thread Max Filippov
On Fri, Mar 13, 2015 at 11:18 PM, Segher Boessenkool
 wrote:
> On Fri, Mar 13, 2015 at 05:54:48PM +0300, Max Filippov wrote:
>> 2. alloca seems to add an additional 16 bytes of padding to each stack
>>   allocation: alloca(1) results in moving sp down by 32 bytes, alloca(17)
>>   moves it by 48 bytes, etc.
>
> This sounds like PR 50938, 47353, 34548, maybe more?  Happens on most
> targets.

Exactly! (And I was wrong about 1 byte: it needs at least 2 for a
32-byte result.)
But...those PRs are marked as duplicates of a fixed PR 34548.
Looks like it's not fixed or there's a regression?

-- 
Thanks.
-- Max


Re: Re: Why not implementation of interrupt attribute on IA32/x86-64

2015-03-13 Thread David Fernandez

Hi,

This is slightly off-topic, but there seem to be lots of different
interrupt attributes in gcc, one for each different processor, which, in
many instances, seem almost the same under different names. Also, gcc
could decide on the attribute behaviour depending on the target it
compiles for.

I wonder if there are plans to revise and clean them up, or, as there
might be code already using them that should not be broken, to define
some unified ones that make more sense.

Just mentioning it, as I've programmed for several hardware platforms and
this is the kind of thing that looks really ugly in gcc.


Greetings
David F.

On 13/03/15 14:08, Andi Kleen wrote:
> Didier Garcin  writes:
>
>> many OS hobbyist developers would be pleased if GCC implemented the
>> interrupt or interrupt_handler attribute for the Intel architecture.
>>
>> Would it be so difficult to implement for this architecture?
>
> There are lots of different ways to implement interrupts on x86
> (e.g. what state to save, what registers to set up). It would
> be unlikely that gcc would select a subset that worked for most
> people.
>
> You're better off with an assembler wrapper that does exactly the
> setup you want.
>
> -Andi




Re: PR65416, alloca on xtensa

2015-03-13 Thread Segher Boessenkool
On Fri, Mar 13, 2015 at 11:36:47PM +0300, Max Filippov wrote:
> >> 2. alloca seems to add an additional 16 bytes of padding to each stack
> >>   allocation: alloca(1) results in moving sp down by 32 bytes, alloca(17)
> >>   moves it by 48 bytes, etc.
> >
> > This sounds like PR 50938, 47353, 34548, maybe more?  Happens on most
> > targets.
> 
> Exactly! (And I was wrong about 1 byte: it needs at least 2 for a
> 32-byte result.)

Yeah, same bug :-)

> But...those PRs are marked as duplicates of a fixed PR 34548.
> Looks like it's not fixed or there's a regression?

It's not fixed.  Removing the STACK_POINTER_OFFSET check from
allocate_dynamic_stack_check fixes it, but is probably not good
for all targets ;-)

50938 is still open.


Segher


Re: PR65416, alloca on xtensa

2015-03-13 Thread Segher Boessenkool
On Fri, Mar 13, 2015 at 03:56:38PM -0500, Segher Boessenkool wrote:
> On Fri, Mar 13, 2015 at 11:36:47PM +0300, Max Filippov wrote:
> > >> 2. alloca seems to add an additional 16 bytes of padding to each stack
> > >>   allocation: alloca(1) results in moving sp down by 32 bytes, alloca(17)
> > >>   moves it by 48 bytes, etc.
> > >
> > > This sounds like PR 50938, 47353, 34548, maybe more?  Happens on most
> > > targets.
> > 
> > Exactly! (And I was wrong about 1 byte: it needs at least 2 for a
> > 32-byte result.)
> 
> Yeah, same bug :-)
> 
> > But...those PRs are marked as duplicates of a fixed PR 34548.
> > Looks like it's not fixed or there's a regression?
> 
> It's not fixed.  Removing the STACK_POINTER_OFFSET check from
> allocate_dynamic_stack_check fixes it, but is probably not good

allocate_dynamic_stack_space I mean.

> for all targets ;-)
> 
> 50938 is still open.
> 
> 
> Segher