Re: Just what are rtx costs?

2011-08-22 Thread Richard Sandiford
Georg-Johann Lay  writes:
>>>IMO a clean approach would be to query the costs of a whole insn (resp. 
>>>its pattern) rather than the cost of an RTX.  COSTS_N_INSNS already 
>>>indicates that the costs are compared to *insn* costs i.e. cost of the 
>>>whole pattern (modulo clobbers).
>> 
>> The problem is that we sometimes want the cost of something that cannot
>> be done using a single instruction.  E.g. some CONST_INTs take several
>> instructions to create on MIPS.  In this case the costs are really
>> measuring the cost of an emit_move_insn sequence, not a single insn.
>> 
>> I suppose we could use emit_move_insn to create a temporary sequence
>> and sum the cost of each individual instruction.  But that's potentially
>> expensive.
>
> No, that complexity is not needed.  For (set (reg) (const_int)) the BE 
> can just return the cost of the expanded sequence because it knows how 
> it will be expanded and how much it will cost.  There's no need to 
> really expand the sequence.

Sorry, I'd misunderstood your suggestion.  I thought you were suggesting
that the rtx costs functions should only be presented with SETs that are
valid instructions.  I hadn't realised that you were still allowing these
SETs to be arbitrary ones that have been cooked up by the optimisers.

So are you saying that we should remove the recursive nature of the
rtx_cost/targetm.rtx_costs interface, and have the backend handle any
recursion itself?  I.e. targetm.rtx_costs only ever sees a complete
(but perhaps invalid) instruction pattern?  Or would you still keep
the current recursion?

Richard


Re: Just what are rtx costs?

2011-08-22 Thread Richard Guenther
On Mon, Aug 22, 2011 at 10:19 AM, Richard Sandiford
 wrote:
> Georg-Johann Lay  writes:
>>>> IMO a clean approach would be to query the costs of a whole insn (resp.
>>>> its pattern) rather than the cost of an RTX.  COSTS_N_INSNS already
>>>> indicates that the costs are compared to *insn* costs i.e. cost of the
>>>> whole pattern (modulo clobbers).
>>>
>>> The problem is that we sometimes want the cost of something that cannot
>>> be done using a single instruction.  E.g. some CONST_INTs take several
>>> instructions to create on MIPS.  In this case the costs are really
>>> measuring the cost of an emit_move_insn sequence, not a single insn.
>>>
>>> I suppose we could use emit_move_insn to create a temporary sequence
>>> and sum the cost of each individual instruction.  But that's potentially
>>> expensive.
>>
>> No, that complexity is not needed.  For (set (reg) (const_int)) the BE
>> can just return the cost of the expanded sequence because it knows how
>> it will be expanded and how much it will cost.  There's no need to
>> really expand the sequence.
>
> Sorry, I'd misunderstood your suggestion.  I thought you were suggesting
> that the rtx costs functions should only be presented with SETs that are
> valid instructions.  I hadn't realised that you were still allowing these
> SETs to be arbitrary ones that have been cooked up by the optimisers.
>
> So are you saying that we should remove the recursive nature of the
> rtx_cost/targetm.rtx_costs interface, and have the backend handle any
> recursion itself?  I.e. targetm.rtx_costs only ever sees a complete
> (but perhaps invalid) instruction pattern?  Or would you still keep
> the current recursion?

I would say yes to that - kill the recursion.

Richard.

> Richard
>


Re: Just what are rtx costs?

2011-08-22 Thread Georg-Johann Lay
Richard Sandiford wrote:
> Georg-Johann Lay  writes:
>>>> IMO a clean approach would be to query the costs of a whole insn (resp. 
>>>> its pattern) rather than the cost of an RTX.  COSTS_N_INSNS already 
>>>> indicates that the costs are compared to *insn* costs i.e. cost of the 
>>>> whole pattern (modulo clobbers).
>>> The problem is that we sometimes want the cost of something that cannot
>>> be done using a single instruction.  E.g. some CONST_INTs take several
>>> instructions to create on MIPS.  In this case the costs are really
>>> measuring the cost of an emit_move_insn sequence, not a single insn.
>>>
>>> I suppose we could use emit_move_insn to create a temporary sequence
>>> and sum the cost of each individual instruction.  But that's potentially
>>> expensive.
>> No, that complexity is not needed.  For (set (reg) (const_int)) the BE 
>> can just return the cost of the expanded sequence because it knows how 
>> it will be expanded and how much it will cost.  There's no need to 
>> really expand the sequence.
> 
> Sorry, I'd misunderstood your suggestion.  I thought you were suggesting
> that the rtx costs functions should only be presented with SETs that are
> valid instructions.  I hadn't realised that you were still allowing these
> SETs to be arbitrary ones that have been cooked up by the optimisers.

RTX costs only make sense if the rtx eventually results in insns.
This can basically happen in two ways:

* expander which transforms an insn-like expression into a sequence of
  insns.  An example is x << y on a backend that cannot do it natively
  and expands it into a loop.  A similar example is x + big_const which
  the target cannot handle, i.e. the insn predicate rejects it.

* cooking up new insns, as insn combine does.  It only makes sense to
  query costs for insns that actually match, i.e. pass recog or
  recog_for_combine or the like.

> So are you saying that we should remove the recursive nature of the
> rtx_cost/targetm.rtx_costs interface, and have the backend handle any
> recursion itself?  I.e. targetm.rtx_costs only ever sees a complete
> (but perhaps invalid) instruction pattern?  Or would you still keep
> the current recursion?

I don't see the benefit of recursion, because every step removes information.

E.g. in the example you gave in
   http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01264.html
which is cost of shifts like
   x << ?
the operand number does not help, because you need the second operand
to determine the cost: a shift by a constant offset in general has a
different cost than a shift by a variable amount.  Thus, only the
complete RTX makes sense for cost computation.

In general it's not possible to decompose the cost function, because
   cost (f(a,b)) != cost (f) + cost (a,0) + cost (b,1)
i.e. you cannot represent costs in that orthogonal way, and such an
ansatz must fail.

There are also cases where costs are paradoxical, i.e. a more complex
expression has lower cost than a simpler one.  An example is bit
extraction, which might be cheaper than a shift plus masking with
AND/OR.

BTW, the avr BE does recursion inside rtx_costs, which is a bad idea,
IMO.  But that's up to the target.

Johann

> Richard



Re: i370 port

2011-08-22 Thread Ulrich Weigand
Paul Edwards wrote:

>   if (operands[1] == const0_rtx)
>   {
> CC_STATUS_INIT;
> mvs_check_page (0, 6, 8);
> return \"MVC%O0(8,%R0),=XL8'00'\";
>   }
>   mvs_check_page (0, 6, 8);
>   return \"MVC%O0(8,%R0),%1\";
> }"
>[(set_attr "length" "8")]
> )
> 
> forces it to use XL8'00' instead of the default F'0' and that
> seems to work.  Does that seem like a proper solution to
> you?

Well, there isn't really anything special about const0_rtx.
*Any* CONST_INT that shows up as second operand to the movdi
pattern must be emitted into an 8 byte literal at this point.

You can do that inline; but the more usual way would be to
define an operand print format that encodes the fact that
a 64-bit operand is requested.

In fact, looking at the i370.h PRINT_OPERAND, there already
seems to be such a format: 'W'.  (Maybe not quite; since 'W'
sign-extends a 32-bit operand to 64-bit.  But since 'W'
doesn't seem to be used anyway, maybe this can be changed.)

Bye,
Ulrich

-- 
  Dr. Ulrich Weigand
  GNU Toolchain for Linux on System z and Cell BE
  ulrich.weig...@de.ibm.com


Re: [named address] ice-on-valid: in postreload.c:reload_cse_simplify_operands

2011-08-22 Thread Ulrich Weigand
Georg-Johann Lay wrote:
> Ulrich Weigand schrieb:
> > Georg-Johann Lay wrote:
> > 
> >>http://gcc.gnu.org/ml/gcc/2011-08/msg00131.html
> >>
> >>Are you going to install that patch? Or maybe you already installed it?
> > 
> > No, it isn't approved yet (in fact, it isn't even posted for approval).
> > Usually, patches that add new target macros, or new arguments to target
> > macros, but do not actually add any *exploiter* of the new features,
> > are frowned upon ...
> 
> I thought about implementing a "hidden" named AS first and not exposing 
> it to user land, e.g. to be able to do optimizations like
> http://gcc.gnu.org/PR49857
> http://gcc.gnu.org/PR43745
> which need named AS to express that some pointers/accesses are different.
> 
> The most prominent drawback of named AS at the moment is that AVR has 
> few address registers, and register allocation often generates unpleasant 
> code or even runs into spill failures.
> 
> The AS in question can only be accessed by means of post-increment 
> addressing via one single hard register.

Well, it doesn't really matter whether you want to expose the AS externally
or just use it internally.  Either way, I'll be happy to propose my patch
for inclusion once you have a patch ready that depends on it ...

Bye,
Ulrich

-- 
  Dr. Ulrich Weigand
  GNU Toolchain for Linux on System z and Cell BE
  ulrich.weig...@de.ibm.com


Re: Just what are rtx costs?

2011-08-22 Thread Joern Rennecke

Quoting Richard Guenther :


So are you saying that we should remove the recursive nature of the
rtx_cost/targetm.rtx_costs interface, and have the backend handle any
recursion itself?  I.e. targetm.rtx_costs only ever sees a complete
(but perhaps invalid) instruction pattern?  Or would you still keep
the current recursion?


I would say yes to that - kill the recursion.


But the recursion is already optional.  If you don't want to use recursion
for your port, just make the rtx_costs hook return true.
There is no need to break ports that are happy to use the recursion in
rtlanal.c, partially or in whole.





Re: Just what are rtx costs?

2011-08-22 Thread David Edelsohn
On Mon, Aug 22, 2011 at 9:08 AM, Joern Rennecke  wrote:
> Quoting Richard Guenther :
>
>>> So are you saying that we should remove the recursive nature of the
>>> rtx_cost/targetm.rtx_costs interface, and have the backend handle any
>>> recursion itself?  I.e. targetm.rtx_costs only ever sees a complete
>>> (but perhaps invalid) instruction pattern?  Or would you still keep
>>> the current recursion?
>>
>> I would say yes to that - kill the recursion.
>
> But the recursion is already optional.  If you don't want to use recursion
> for your port, just make the rtx_costs hook return true.
> There is no need to break ports that are OK to use the recursion in
> rtlanal.c, partially or in whole.

Exactly.  I don't understand the disagreement about recursion.  For
instance, the rs6000 port explicitly returns true or false for
rtx_costs as necessary for its computation.  If a port wants to
compute rtx_costs without recursion, it already has that control.

Thanks, David


Re: Trunk LTO Bootstrap of Sun Aug 21 18:01:01 UTC 2011 (revision 177942) FAILED

2011-08-22 Thread Toon Moene

On 08/21/2011 08:19 PM, Toon Moene wrote:


See:

http://gcc.gnu.org/ml/gcc-testresults/2011-08/msg02361.html

The configure line is:

../gcc/configure \
--prefix=/tmp/lto \
--enable-languages=c++ \
--with-build-config=bootstrap-lto \
--with-gnu-ld \
--disable-multilib \
--disable-nls \
--with-arch=native \
--with-tune=native

on x86_64-unknown-linux-gnu


After studying this a bit more, I'm almost convinced this is due to the 
upgrade of Debian Testing I did at 12:15 UTC, Sunday the 21st of August.


Apparently, the install of libc6-2.13-16 does some evil things to the 
/usr/include/bits directory ...


I'll turn off the daily builds until this problem is solved.

Cheers,

--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news


Re: Trunk LTO Bootstrap of Sun Aug 21 18:01:01 UTC 2011 (revision 177942) FAILED

2011-08-22 Thread Marc Glisse

On Mon, 22 Aug 2011, Toon Moene wrote:

After studying this a bit more, I'm almost convinced this is due to the upgrade 
of Debian Testing I did at 12:15 UTC, Sunday the 21st of August.


Apparently, the install of libc6-2.13-16 does some evil things to the 
/usr/include/bits directory ...


Ah, then I guess this patch will solve it:
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01674.html

--
Marc Glisse


Re: Fwd: C6X fails to build in FSF mainline

2011-08-22 Thread Bernd Schmidt
On 08/18/11 03:45, Andrew Pinski wrote:
> Forwarding this to the gcc list.  Also Adding RTH to the CC since he
> helped Bernd to get the dwarf2 parts working correctly.

>  You probably know this already.  The c6x-elf target fails to build
>  libgcc with the current FSF mainline sources:
> 
>  gcc/libgcc2.c: In function ‘__gnu_mulsc3’:
>  gcc/libgcc2.c:1928:1: internal compiler error: in scan_trace, at
> dwarf2cfi.c:2433
>  Please submit a full bug report,

Thanks Richard for fixing this (I've been on vacation).

There are some testsuite failures at -O3 in another part of dwarf2cfi,
which are caused by computed_jump_p returning 0 for the
indirect_jump_shadow pattern. There isn't really a sensible way to
represent this pattern in RTL, but we can take advantage of the fact
that computed_jump_p returns true for constants. I committed the
following patch.


Bernd
Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 177967)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,8 @@
+2011-08-22  Bernd Schmidt  
+
+   * config/c6x/c6x.md (indirect_jump_shadow): Tweak representation
+   to make computed_jump_p return true.
+
 2011-08-22  Rainer Orth  
 
* configure.ac (GCC_PICFLAG_FOR_TARGET): Call it.
Index: gcc/config/c6x/c6x.md
===
--- gcc/config/c6x/c6x.md   (revision 177952)
+++ gcc/config/c6x/c6x.md   (working copy)
@@ -1427,8 +1427,10 @@ (define_insn "real_ret"
(set_attr "cross" "y,n")
(set_attr "dest_regfile" "b")])
 
+;; computed_jump_p returns true if it finds a constant; so use one in the
+;; unspec.
 (define_insn "indirect_jump_shadow"
-  [(set (pc) (unspec [(pc)] UNSPEC_JUMP_SHADOW))]
+  [(set (pc) (unspec [(const_int 1)] UNSPEC_JUMP_SHADOW))]
   ""
   ";; indirect jump occurs"
   [(set_attr "type" "shadow")])


[GSOC] Optimising GCC, conclusion

2011-08-22 Thread Dimitrios Apostolou

Monday 22nd of August, 2011: pencils down.

Today my GSOC adventure comes to an end.  For anyone who doesn't know, this 
summer I've been trying to make GCC faster, a task that proved much harder 
than I initially thought.


My proposal was about making many small improvements in various parts of 
the compiler, to both CPU and memory utilisation.  All in all I touched 
parts ranging from the back-end and the middle-end to the C frontend, but 
only regarding CPU utilisation.  Unfortunately the improvements were much 
less significant than I expected, and many things I tried turned out 
fruitless.  I also didn't have any time left to profile the C++ frontend, 
which most people really needed; hopefully it will benefit a tiny bit from 
the generic changes I have introduced until I do some actual profiling in 
the future.


No matter the difficulties, the experience has been very positive for me. 
I have certainly learned many things about GCC and how to work with the 
open source community.  I even managed to speed up GCC a little and 
finished with a 3-page-long TODO list of ideas.

Various results were measured after applying all of my final patches and 
making sure the resulting tree (mytrunk) passes all tests on both i386 and 
x86_64.  Anyone who wants to reproduce the tree I used for the final 
measurements should apply all the patches I sent over the last couple of 
days, in particular:


http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01711.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01712.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01713.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01714.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01717.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01719.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01722.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01723.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01729.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01740.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01752.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01782.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01796.html


Time and instruction count measurements:

Example compilation of ext4's super.c, in linux-3.0 on x86_64 (-O2 -g):

trunk:  3.177s  7996.5 M instr
mytrunk:3.059s  7645.0 M instr

Example compilation of tcp_ipv4.c on i386, with flags changed to -O0 and 
no debug symbols:


trunk:  0.622s  1438.4 M instr
mytrunk:0.592s  1368.5 M instr

Compiling the whole linux-3.0 tarball on a ramdrive, using make -j NCPUs+1:

trunk:  7:33s
mytrunk:7:23s


At this point I want to thank Steven Bosscher and Paolo Bonzini for 
mentoring me, together with jakub, richi, djgpp, lxo and others I'm 
probably forgetting for helping me on IRC the strangest hours. :-) Thanks 
also to ICS-FORTH for allowing me to work from the premises of CARV 
laboratory (www.ics.forth.gr/carv), hopefully I'll be working there for 
the rest of the year. Finally special thanks to maraz (CC'd) to whom I now 
owe some bet prize...


But most of all I must thank Google, that gave me the opportunity to get 
paid while working on Open Source.



I will most likely disappear for the following two weeks or so.  I have 
some exams I must study for, plus I should dedicate some time to other 
work I have pending.  Nevertheless, please do send me any comments regarding 
my project and the patches I submitted, since I plan to stay in contact 
with the GCC community; I'll try to address them as soon as possible. 
I also plan to update the performance-related pages on the wiki; hopefully 
other people will find the information on GCC performance useful.


Dimitris


Re: [GSOC] Optimising GCC, conclusion

2011-08-22 Thread Ian Lance Taylor
Dimitrios Apostolou  writes:

> My proposal was about doing many small improvements in various parts
> of the compiler, both in CPU and memory utilisation. All in all I
> touched parts from the back-end and the middle-end, to the C frontend,
> but only regarding CPU utilisation. Unfortunately improvements were
> much less significant than I expected and many things I tried turned
> out fruitless. Also I didn't have any time left to profile C++
> frontend which most people really needed, hopefully it will benefit a
> tiny bit from the generic changes I have introduced, until I do some
> actual profiling in the future.

Thanks for your work on this.

Ian


Re: Just what are rtx costs?

2011-08-22 Thread Peter Bigot
On Sun, Aug 21, 2011 at 12:01 PM, Georg-Johann Lay  wrote:
>
> Richard Sandiford schrieb:
>>
>> Georg-Johann Lay  writes:
>>
>>> Richard Sandiford schrieb:
>>>
 I've been working on some patches to make insn_rtx_cost take account
 of the cost of SET_DESTs as well as SET_SRCs.  But I'm slowly beginning
 to realise that I don't understand what rtx costs are supposed to 
 represent.

 AIUI the rules have historically been:

  1) Registers have zero cost.

  2) Constants have a cost relative to that of registers.  By extension,
    constants have zero cost if they are as cheap as a register.

  3) With an outer code of SET, actual operations have the cost
    of the associated instruction.  E.g. the cost of a PLUS
    is the cost of an addition instruction.

  4) With other outer codes, actual operations have the cost
    of the combined instruction, if available, or the cost of
    a separate instruction otherwise.  E.g. the cost of a NEG
    inside an AND might be zero on targets that support BIC-like
    instructions, and COSTS_N_INSNS (1) on most others.

 [...]

 But that hardly seems clean either.  Perhaps we should instead make
 the SET_SRC always include the cost of the SET, even for registers,
 constants and the like.  Thoughts?
>>>
>>> IMO a clean approach would be to query the costs of a whole insn (resp. 
>>> its pattern) rather than the cost of an RTX.  COSTS_N_INSNS already 
>>> indicates that the costs are compared to *insn* costs i.e. cost of the 
>>> whole pattern (modulo clobbers).
>>
>> The problem is that we sometimes want the cost of something that cannot
>> be done using a single instruction.  E.g. some CONST_INTs take several
>> instructions to create on MIPS.  In this case the costs are really
>> measuring the cost of an emit_move_insn sequence, not a single insn.
>>
>> I suppose we could use emit_move_insn to create a temporary sequence
>> and sum the cost of each individual instruction.  But that's potentially
>> expensive.
>
> No, that complexity is not needed.  For (set (reg) (const_int)) the BE can 
> just return the cost of the expanded sequence because it knows how it will be 
> expanded and how much it will cost.  There's no need to really expand the 
> sequence.
>
> That's the way, e.g. AVR backend works: Shifts/mul/div must be expanded 
> because the hardware does not support them natively.  The rtx_cost for such 
> an expression (which are always interpreted as RHS of a (set (reg) ...)) are 
> the sum over the costs of all insns the expander will produce.

One of my problems with this approach is that the logic that's put
into an expander definition preparation statement (or, in the case of
AVR, the function invoked by the insn output statement) gets
replicated abstractly in rtx_costs: both places have long switch
statements on operand mode and const shift value to determine the
instructions that get emitted (in the former) or how many of them
there are (in the latter).  How likely is it the two are kept
consistent over the years?

I'm working on the (not yet pushed upstream) back-end for the TI
MSP430, which has some historical relationship to AVR from about a
decade ago, and the answer to that question is "not very likely".
I've changed the msp430 back-end so that instead of putting all that
logic in the output statement for the insn, it goes into a preparation
statement for a standard expander.  This way the individual insns that
result in (say) a constant shift of 8 bits using xor and bswap are
available for the optimizer and register allocator to improve.

This works pretty well, but still leaves me with problems when it
comes to computing RTX costs, because there seems to be some strength
reduction optimization for multiplication that asks for the cost of
shifting each integer type by 1 to 15 bits, when in fact no such insn
should ever be produced if real code were being generated.  I think
this is an example of the case Richard's describing.

If, in rtx_costs, I could detect an unexpected insn, deduce the
correct expander function, call it, then recurse on the sequence it
generated, I'd get the right answer---though I'd infinitely prefer not
to be asked to calculate the cost of an unexpected insn.  Doing this
expansion would probably be very expensive, though, and with the side
effects that are part of emit_insn I don't know how to safely call
things that invoke it when what gets emitted isn't part of the actual
stream.

>>
>> Also, any change along these lines is similar to the "tie costs to
>> .md patterns" thing that I mentioned at the end of the message.
>> I don't really have time to work on anything so invasive, so the
>> question is really whether we can sensibly change the costs within
>> the current framework.
>>
>>> E.g. the cost of a CONST_INT is meaningless if you don't know what to do 
>>> with the constant. (set (reg:QI) (const_int 0)) might have 

Re: Performance degradation on g++ 4.6

2011-08-22 Thread Oleg Smolsky

Hey David, these two --param options made no difference to the test.

I've cut the suite down to a single test (attached), which yields the 
following results:


./simple_types_constant_folding_os (gcc 41)
test description   time   operations/s
 0 "int8_t constant add"   1.34 sec   1194.03 M

./simple_types_constant_folding_os (gcc 46)
test description   time   operations/s
 0 "int8_t constant add"   2.84 sec   563.38 M

Both compilers fully inline the templated function and the emitted code 
looks very similar. I am puzzled as to why one of these loops is 
significantly slower than the other. I've attached disassembled listings 
- perhaps someone could have a look please? (the body of the loop starts 
at 00400FD for gcc41 and at 00400D90 for gcc46)


Thanks,
Oleg.


On 2011/8/1 22:48, Xinliang David Li wrote:

Try to isolate the int8_t constant folding test from the rest to see
if the slowdown can be reproduced with the isolated case.  If the
problem disappears, it is likely due to the following inline
parameters:

large-function-insns, large-function-growth, large-unit-insns,
inline-unit-growth. For instance set

--param large-function-insns=1
--param large-unit-insns=2

David

On Mon, Aug 1, 2011 at 11:43 AM, Oleg Smolsky  wrote:

On 2011/7/29 14:07, Xinliang David Li wrote:

Profiling tools are your best friend here. If you don't have access to
any, the least you can do is to build the program with -pg option and
use gprof tool to find out differences.

The test suite has a bunch of very basic C++ tests that are executed an
enormous number of times. I've built one with the obvious performance
degradation and attached the source, output and reports.

Here are some highlights:
v4.1:Total absolute time for int8_t constant folding: 30.42 sec
v4.6:Total absolute time for int8_t constant folding: 43.32 sec

Every one of the tests in this section had degraded... the first half more
than the second. I am not sure how much further I can take this - the
benchmarked code is very short and plain. I can post disassembly for one
(some?) of them if anyone is willing to take a look...

Thanks,
Oleg.



/*
Copyright 2007-2008 Adobe Systems Incorporated
Distributed under the MIT License (see accompanying file LICENSE_1_0_0.txt
or a copy at http://stlab.adobe.com/licenses.html )


Source file for tests shared among several benchmarks
*/

/**/

template <typename T>
inline bool tolerance_equal(T &a, T &b) {
T diff = a - b;
return (abs(diff) < 1.0e-6);
}


template<>
inline bool tolerance_equal(int32_t &a, int32_t &b) {
return (a == b);
}
template<>
inline bool tolerance_equal(uint32_t &a, uint32_t &b) {
return (a == b);
}
template<>
inline bool tolerance_equal(uint64_t &a, uint64_t &b) {
return (a == b);
}
template<>
inline bool tolerance_equal(int64_t &a, int64_t &b) {
return (a == b);
}

template<>
inline bool tolerance_equal(double &a, double &b) {
double diff = a - b;
double reldiff = diff;
if (fabs(a) > 1.0e-8)
reldiff = diff / a;
return (fabs(reldiff) < 1.0e-6);
}

template<>
inline bool tolerance_equal(float &a, float &b) {
float diff = a - b;
double reldiff = diff;
if (fabs(a) > 1.0e-4)
reldiff = diff / a;
return (fabs(reldiff) < 1.0e-3);// single precision divide test is really imprecise
}

/**/

template <typename T, typename Shifter>
inline void check_shifted_sum(T result) {
T temp = (T)SIZE * Shifter::do_shift((T)init_value);
if (!tolerance_equal(result,temp))
printf("test %i failed\n", current_test);
}

template <typename T, typename Shifter>
inline void check_shifted_sum_CSE(T result) {
T temp = (T)0.0;
if (!tolerance_equal(result,temp))
printf("test %i failed\n", current_test);
}

template <typename T, typename Shifter>
inline void check_shifted_variable_sum(T result, T var) {
T temp = (T)SIZE * Shifter::do_shift((T)init_value, var);
if (!tolerance_equal(result,temp))
printf("test %i failed\n", current_test);
}

template <typename T, typename Shifter>
inline void check_shifted_variable_sum(T result, T var1, T var2, T var3, T var4) {
T temp = (T)SIZE * Shifter::do_shift((T)init_value, var1, var2, var3, 
var4);
if (!tolerance_equal(result,temp))
printf("test %i failed\n", current_test);
}

template <typename T, typename Shifter>
inline void check_shifted_variable_sum_CSE(T result, T var) {
T temp = (T)0.0;
if (!tolerance_equal(result,temp))
printf("test %i failed\n", current_test);
}

template <typename T, typename Shifter>
inline void check_shifted_variable_sum_CSE(T result, T var1, T var2, T var3, T var4) {
T temp = (T)0.0;
if (!tolerance_equal(result,temp))
printf("test %i failed\n", current_test);
}


/

Re: Performance degradation on g++ 4.6

2011-08-22 Thread Oleg Smolsky

On 2011/8/22 18:09, Oleg Smolsky wrote:
Both compilers fully inline the templated function and the emitted 
code looks very similar. I am puzzled as to why one of these loops is 
significantly slower than the other. I've attached disassembled 
listings - perhaps someone could have a look please? (the body of the 
loop starts at 00400FD for gcc41 and at 00400D90 for 
gcc46)

The difference, theoretically, should be due to the inner loop:

v4.6:
.text:00400DA0 loc_400DA0:
.text:00400DA0 add eax, 0Ah
.text:00400DA3 add al, [rdx]
.text:00400DA5 add rdx, 1
.text:00400DA9 cmp rdx, 5034E0h
.text:00400DB0 jnz short loc_400DA0

v4.1:
.text:00400FE0 loc_400FE0:
.text:00400FE0 movzx   eax, ds:data8[rdx]
.text:00400FE7 add rdx, 1
.text:00400FEB add eax, 0Ah
.text:00400FEE cmp rdx, 1F40h
.text:00400FF5 lea ecx, [rax+rcx]
.text:00400FF8 jnz short loc_400FE0

However, I cannot see how the first version would be slow... The custom 
templated "shifter" degenerates into "add 0xa", which is the point of 
the test... Hmm...


Oleg.


Re: Performance degradation on g++ 4.6

2011-08-22 Thread Andrew Pinski
On Mon, Aug 22, 2011 at 6:34 PM, Oleg Smolsky  wrote:
> On 2011/8/22 18:09, Oleg Smolsky wrote:
>>
>> Both compilers fully inline the templated function and the emitted code
>> looks very similar. I am puzzled as to why one of these loops is
>> significantly slower than the other. I've attached disassembled listings -
>> perhaps someone could have a look please? (the body of the loop starts at
>> 00400FD for gcc41 and at 00400D90 for gcc46)
>
> The difference, theoretically, should be due to the inner loop:
>
> v4.6:
> .text:00400DA0 loc_400DA0:
> .text:00400DA0                 add     eax, 0Ah
> .text:00400DA3                 add     al, [rdx]
> .text:00400DA5                 add     rdx, 1
> .text:00400DA9                 cmp     rdx, 5034E0h
> .text:00400DB0                 jnz     short loc_400DA0
>
> v4.1:
> .text:00400FE0 loc_400FE0:
> .text:00400FE0                 movzx   eax, ds:data8[rdx]
> .text:00400FE7                 add     rdx, 1
> .text:00400FEB                 add     eax, 0Ah
> .text:00400FEE                 cmp     rdx, 1F40h
> .text:00400FF5                 lea     ecx, [rax+rcx]
> .text:00400FF8                 jnz     short loc_400FE0
>
> However, I cannot see how the first version would be slow... The custom
> templated "shifter" degenerates into "add 0xa", which is the point of the
> test... Hmm...

It is slower because of the subregister dependency between eax and al.

Thanks,
Andrew Pinski