Re: Just what are rtx costs?
Georg-Johann Lay writes:
>>> IMO a clean approach would be to query the costs of a whole insn
>>> (resp. its pattern) rather than the cost of an RTX. COSTS_N_INSNS
>>> already indicates that the costs are compared to *insn* costs,
>>> i.e. the cost of the whole pattern (modulo clobbers).
>>
>> The problem is that we sometimes want the cost of something that cannot
>> be done using a single instruction. E.g. some CONST_INTs take several
>> instructions to create on MIPS. In this case the costs are really
>> measuring the cost of an emit_move_insn sequence, not a single insn.
>>
>> I suppose we could use emit_move_insn to create a temporary sequence
>> and sum the cost of each individual instruction. But that's potentially
>> expensive.
>
> No, that complexity is not needed. For (set (reg) (const_int)) the BE
> can just return the cost of the expanded sequence because it knows how
> it will be expanded and how much it will cost. There's no need to
> really expand the sequence.

Sorry, I'd misunderstood your suggestion. I thought you were suggesting
that the rtx costs functions should only be presented with SETs that are
valid instructions. I hadn't realised that you were still allowing these
SETs to be arbitrary ones that have been cooked up by the optimisers.

So are you saying that we should remove the recursive nature of the
rtx_cost/targetm.rtx_costs interface, and have the backend handle any
recursion itself? I.e. targetm.rtx_costs only ever sees a complete
(but perhaps invalid) instruction pattern? Or would you still keep
the current recursion?

Richard
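[Editorial note: the point being debated above — that a backend can price a
(set (reg) (const_int)) without actually emitting the multi-insn expansion —
can be sketched as a tiny standalone C model. This is invented illustration
code, not GCC source; it mimics a MIPS-like target where small constants load
in one insn (ADDIU), upper-half constants in one (LUI), and everything else
in two (LUI + ORI).]

```c
#include <stdint.h>

/* Hypothetical sketch (not GCC code): estimate how many insns a
   MIPS-like target would need to materialize a 32-bit constant,
   without emitting the sequence.  The backend "knows how it will be
   expanded and how much it will cost", so it just returns the count.  */
static int const_int_cost (int32_t v)
{
  if (v >= -32768 && v <= 32767)
    return 1;                      /* fits a sign-extended imm16: ADDIU */
  if ((v & 0xffff) == 0)
    return 1;                      /* low half zero: a single LUI */
  return 2;                        /* general case: LUI then ORI */
}
```

A cost hook built this way never allocates or emits anything; it only
replays, arithmetically, the decision the move expander would make.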
Re: Just what are rtx costs?
On Mon, Aug 22, 2011 at 10:19 AM, Richard Sandiford wrote:
[...]
> So are you saying that we should remove the recursive nature of the
> rtx_cost/targetm.rtx_costs interface, and have the backend handle any
> recursion itself? I.e. targetm.rtx_costs only ever sees a complete
> (but perhaps invalid) instruction pattern? Or would you still keep
> the current recursion?

I would say yes to that - kill the recursion.

Richard.

> Richard
Re: Just what are rtx costs?
Richard Sandiford wrote:
[...]
> Sorry, I'd misunderstood your suggestion. I thought you were suggesting
> that the rtx costs functions should only be presented with SETs that are
> valid instructions. I hadn't realised that you were still allowing these
> SETs to be arbitrary ones that have been cooked up by the optimisers.

RTX costs only make sense if the rtx eventually results in insns. This
can basically happen in two ways:

* An expander which transforms an insn-like expression into a sequence
  of insns. An example is x << y in a backend that cannot do it natively
  and expands it into a loop. A similar example is x + big_const, which
  cannot be handled by the target, i.e. the insn predicate denies it.

* Cooking up new insns, as in insn combine. It only makes sense to query
  costs for insns that actually match, i.e. pass recog or
  recog_for_combine or whatever.

> So are you saying that we should remove the recursive nature of the
> rtx_cost/targetm.rtx_costs interface, and have the backend handle any
> recursion itself? I.e. targetm.rtx_costs only ever sees a complete
> (but perhaps invalid) instruction pattern? Or would you still keep
> the current recursion?

I don't see the benefit of recursion, because every step removes
information. E.g. in the example you gave in
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01264.html
which is the cost of shifts like x << ?, the operand number does not
help, because you need the second operand to determine the cost: a shift
by a constant offset in general has a different cost than a shift by a
variable. Thus, only the complete RTX makes sense for cost computation.

In general it's not possible to separate the cost function, because
cost (f(a,b)) != cost (f) + cost (a,0) + cost (b,1), resp. you cannot
represent costs in that orthogonal way, and such an ansatz must fail.

There are also cases where costs are paradoxical, i.e. a more complex
expression has lower costs than a simpler one. An example is bit
extraction, which might be cheaper than shifting the mask plus
oring/anding.

BTW, the avr BE does recursion inside rtx_costs, which is a bad idea,
imo. But that's up to the target.

Johann

> Richard
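[Editorial note: Johann's shift example — that the cost of x << n cannot be
derived from either operand alone — can be illustrated with a small
standalone C sketch. All names here are invented; this is not GCC code, just
a model of an AVR-like target where a shift by a known constant is a short
fixed sequence while a shift by a variable amount expands into a loop.]

```c
/* The cost query must see the whole pattern: the same SHIFT opcode has
   a different cost depending on what the *other* operand is.  */
enum op_kind { OP_REG, OP_CONST };

struct operand { enum op_kind kind; long value; };

static int shift_cost (struct operand count)
{
  if (count.kind == OP_CONST)
    return count.value == 1 ? 1 : 2;   /* e.g. one or two lsl insns */
  return 4;                            /* loop: shift, dec, test, branch */
}
```

Handing the hook only "operand 1 of a SHIFT" would force it to guess
between the 1-insn and the 4-insn case, which is exactly the information
loss being objected to.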
Re: i370 port
Paul Edwards wrote:
> if (operands[1] == const0_rtx)
>   {
>     CC_STATUS_INIT;
>     mvs_check_page (0, 6, 8);
>     return \"MVC%O0(8,%R0),=XL8'00'\";
>   }
>   mvs_check_page (0, 6, 8);
>   return \"MVC%O0(8,%R0),%1\";
> }"
> [(set_attr "length" "8")]
> )
>
> forces it to use XL8'00' instead of the default F'0' and that
> seems to work. Does that seem like a proper solution to you?

Well, there isn't really anything special about const0_rtx. *Any*
CONST_INT that shows up as the second operand of the movdi pattern must
be emitted into an 8-byte literal at this point.

You can do that inline, but the more usual way would be to define an
operand print format that encodes the fact that a 64-bit operand is
requested. In fact, looking at the i370.h PRINT_OPERAND, there already
seems to be such a format: 'W'. (Maybe not quite, since 'W'
sign-extends a 32-bit operand to 64-bit. But since 'W' doesn't seem to
be used anyway, maybe this can be changed.)

Bye,
Ulrich

--
Dr. Ulrich Weigand
GNU Toolchain for Linux on System z and Cell BE
ulrich.weig...@de.ibm.com
Re: [named address] ice-on-valid: in postreload.c:reload_cse_simplify_operands
Georg-Johann Lay wrote:
> Ulrich Weigand schrieb:
>> Georg-Johann Lay wrote:
>>
>>> http://gcc.gnu.org/ml/gcc/2011-08/msg00131.html
>>>
>>> Are you going to install that patch? Or maybe you already installed it?
>>
>> No, it isn't approved yet (in fact, it isn't even posted for approval).
>> Usually, patches that add new target macros, or new arguments to target
>> macros, but do not actually add any *exploiter* of the new features,
>> are frowned upon ...
>
> I thought about implementing a "hidden" named AS first and not exposing
> it to user land, e.g. to be able to do optimizations like
> http://gcc.gnu.org/PR49857
> http://gcc.gnu.org/PR43745
> which need named AS to express that some pointers/accesses are different.
>
> The most prominent drawback of named AS at the moment is that AVR has
> few address registers, and register allocation often generates
> unpleasant code or even runs into spill failures.
>
> The AS in question can only be accessed by means of post-increment
> addressing via one single hard register.

Well, it doesn't really matter whether you want to expose the AS
externally or just use it internally. Either way, I'll be happy to
propose my patch for inclusion once you have a patch ready that
depends on it ...

Bye,
Ulrich

--
Dr. Ulrich Weigand
GNU Toolchain for Linux on System z and Cell BE
ulrich.weig...@de.ibm.com
Re: Just what are rtx costs?
Quoting Richard Guenther :

>> So are you saying that we should remove the recursive nature of the
>> rtx_cost/targetm.rtx_costs interface, and have the backend handle any
>> recursion itself? I.e. targetm.rtx_costs only ever sees a complete
>> (but perhaps invalid) instruction pattern? Or would you still keep
>> the current recursion?
>
> I would say yes to that - kill the recursion.

But the recursion is already optional. If you don't want to use
recursion for your port, just make the rtx_costs hook return true.
There is no need to break ports that are OK to use the recursion in
rtlanal.c partially or in whole.
Re: Just what are rtx costs?
On Mon, Aug 22, 2011 at 9:08 AM, Joern Rennecke wrote:
> Quoting Richard Guenther :
>
>>> So are you saying that we should remove the recursive nature of the
>>> rtx_cost/targetm.rtx_costs interface, and have the backend handle any
>>> recursion itself? I.e. targetm.rtx_costs only ever sees a complete
>>> (but perhaps invalid) instruction pattern? Or would you still keep
>>> the current recursion?
>>
>> I would say yes to that - kill the recursion.
>
> But the recursion is already optional. If you don't want to use
> recursion for your port, just make the rtx_costs hook return true.
> There is no need to break ports that are OK to use the recursion in
> rtlanal.c partially or in whole.

Exactly. I don't understand the disagreement about recursion. For
instance, the rs6000 port explicitly returns true or false for
rtx_costs as necessary for its computation. If a port wants to compute
rtx_costs without recursion, it already has that control.

Thanks, David
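[Editorial note: the "return true to stop recursion" protocol that Joern and
David describe can be modeled in a few lines of standalone C. This is an
invented sketch of the contract, not the real rtx_cost/targetm.rtx_costs
code: the generic walker recurses into sub-expressions only when the target
hook declines to handle the whole expression itself.]

```c
/* Miniature model of an optionally-recursive cost walk.  The "hook"
   returns nonzero (and fills in *total) when it has priced the whole
   expression; returning zero asks the generic code to recurse.  */
struct expr { int op_cost; int n_kids; struct expr *kid[2]; };

/* "Target hook": here it claims only leaf expressions, but a port that
   wants full control could claim everything and never recurse.  */
static int hook_costs (const struct expr *e, int *total)
{
  if (e->n_kids == 0)
    {
      *total = e->op_cost;
      return 1;                    /* handled: stop the walk here */
    }
  return 0;                        /* let the generic walker recurse */
}

static int expr_cost (const struct expr *e)
{
  int total;
  if (hook_costs (e, &total))
    return total;
  total = e->op_cost;
  for (int i = 0; i < e->n_kids; i++)
    total += expr_cost (e->kid[i]);
  return total;
}
```

Under this contract "kill the recursion" is a per-port choice, not an
interface change: a port returns 1 at the top level and prices the whole
pattern itself.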
Re: Trunk LTO Bootstrap of Sun Aug 21 18:01:01 UTC 2011 (revision 177942) FAILED
On 08/21/2011 08:19 PM, Toon Moene wrote:
> See: http://gcc.gnu.org/ml/gcc-testresults/2011-08/msg02361.html
>
> The configure line is:
>
> ../gcc/configure \
>   --prefix=/tmp/lto \
>   --enable-languages=c++ \
>   --with-build-config=bootstrap-lto \
>   --with-gnu-ld \
>   --disable-multilib \
>   --disable-nls \
>   --with-arch=native \
>   --with-tune=native
>
> on x86_64-unknown-linux-gnu

After studying this a bit more, I'm almost convinced this is due to the
upgrade of Debian Testing I did at 12:15 UTC, Sunday the 21st of August.
Apparently, the install of libc6-2.13-16 does some evil things to the
/usr/include/bits directory ...

I'll turn off the daily builds until this problem is solved.

Cheers,

--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news
Re: Trunk LTO Bootstrap of Sun Aug 21 18:01:01 UTC 2011 (revision 177942) FAILED
On Mon, 22 Aug 2011, Toon Moene wrote:
> After studying this a bit more, I'm almost convinced this is due to
> the upgrade of Debian Testing I did at 12:15 UTC, Sunday the 21st of
> August. Apparently, the install of libc6-2.13-16 does some evil things
> to the /usr/include/bits directory ...

Ah, then I guess this patch will solve it:
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01674.html

--
Marc Glisse
Re: Fwd: C6X fails to build in FSF mainline
On 08/18/11 03:45, Andrew Pinski wrote:
> Forwarding this to the gcc list. Also adding RTH to the CC since he
> helped Bernd to get the dwarf2 parts working correctly.
>
> You probably know this already. The c6x-elf target fails to build
> libgcc with the current FSF mainline sources:
>
> gcc/libgcc2.c: In function ‘__gnu_mulsc3’:
> gcc/libgcc2.c:1928:1: internal compiler error: in scan_trace, at
> dwarf2cfi.c:2433
> Please submit a full bug report,

Thanks Richard for fixing this (I've been on vacation).

There are some testsuite failures at -O3 in another part of dwarf2cfi,
which are caused by computed_jump_p returning 0 for the
indirect_jump_shadow pattern. There isn't really a sensible way to
represent this pattern in RTL, but we can take advantage of the fact
that computed_jump_p returns true for constants.

I committed the following patch.

Bernd

Index: gcc/ChangeLog
===================================================================
--- gcc/ChangeLog	(revision 177967)
+++ gcc/ChangeLog	(working copy)
@@ -1,3 +1,8 @@
+2011-08-22  Bernd Schmidt
+
+	* config/c6x/c6x.md (indirect_jump_shadow): Tweak representation
+	to make computed_jump_p return true.
+
 2011-08-22  Rainer Orth
 
 	* configure.ac (GCC_PICFLAG_FOR_TARGET): Call it.
Index: gcc/config/c6x/c6x.md
===================================================================
--- gcc/config/c6x/c6x.md	(revision 177952)
+++ gcc/config/c6x/c6x.md	(working copy)
@@ -1427,8 +1427,10 @@ (define_insn "real_ret"
    (set_attr "cross" "y,n")
    (set_attr "dest_regfile" "b")])
 
+;; computed_jump_p returns true if it finds a constant; so use one in the
+;; unspec.
 (define_insn "indirect_jump_shadow"
-  [(set (pc) (unspec [(pc)] UNSPEC_JUMP_SHADOW))]
+  [(set (pc) (unspec [(const_int 1)] UNSPEC_JUMP_SHADOW))]
   ""
   ";; indirect jump occurs"
   [(set_attr "type" "shadow")])
[GSOC] Optimising GCC, conclusion
Monday 22nd of August, 2011: pencils down. Today my GSOC adventure comes
to an end. For whoever doesn't know: this summer I've been trying to
make GCC faster, a task that proved much harder than I initially
thought.

My proposal was about doing many small improvements in various parts of
the compiler, both in CPU and memory utilisation. All in all I touched
parts from the back-end and the middle-end, to the C frontend, but only
regarding CPU utilisation. Unfortunately improvements were much less
significant than I expected, and many things I tried turned out
fruitless. Also I didn't have any time left to profile the C++ frontend,
which most people really needed; hopefully it will benefit a tiny bit
from the generic changes I have introduced, until I do some actual
profiling in the future.

No matter the difficulties, the experience has been very positive for
me. I have certainly learned many things about GCC and how to work with
the open source community. I even managed to speed up GCC a little and
finished with a 3-page long TODO list of ideas.

Various results were measured after applying all of my final patches,
and making sure the resulting tree (mytrunk) passes all tests on both
i386 and x86_64. Anyone who wants to reproduce the tree that I used for
final measurements should apply all the patches I sent over the last
couple of days, in particular:

http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01711.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01712.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01713.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01714.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01717.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01719.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01722.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01723.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01729.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01740.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01752.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01782.html
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01796.html

Time and instruction count measurements:

Example compilation of ext4's super.c, in linux-3.0 on x86_64 (-O2 -g):
  trunk:    3.177s   7996.5 M instr
  mytrunk:  3.059s   7645.0 M instr

Example compilation of tcp_ipv4.c on i386, with flags changed to -O0
and no debug symbols:
  trunk:    0.622s   1438.4 M instr
  mytrunk:  0.592s   1368.5 M instr

Compiling the whole linux-3.0 tarball on a ramdrive, using make -j NCPUs+1:
  trunk:    7:33s
  mytrunk:  7:23s

At this point I want to thank Steven Bosscher and Paolo Bonzini for
mentoring me, together with jakub, richi, djgpp, lxo and others I'm
probably forgetting, for helping me on IRC at the strangest hours. :-)
Thanks also to ICS-FORTH for allowing me to work from the premises of
the CARV laboratory (www.ics.forth.gr/carv); hopefully I'll be working
there for the rest of the year. Finally, special thanks to maraz (CC'd),
to whom I now owe some bet prize... But most of all I must thank Google,
which gave me the opportunity to get paid while working on Open Source.

I will most likely disappear for the following two weeks or so. I have
some exams I must study for, plus I should dedicate some time to other
work I have pending. Nevertheless, please do send me any comments
regarding my project and the patches I submitted, since I plan to stay
in contact with the GCC community, and I'll try to address them as soon
as possible. I also plan to update the performance-related pages on the
wiki; hopefully other people will find the information on GCC
performance useful.

Dimitris
Re: [GSOC] Optimising GCC, conclusion
Dimitrios Apostolou writes:
> My proposal was about doing many small improvements in various parts
> of the compiler, both in CPU and memory utilisation. All in all I
> touched parts from the back-end and the middle-end, to the C frontend,
> but only regarding CPU utilisation. Unfortunately improvements were
> much less significant than I expected and many things I tried turned
> out fruitless. Also I didn't have any time left to profile the C++
> frontend which most people really needed; hopefully it will benefit a
> tiny bit from the generic changes I have introduced, until I do some
> actual profiling in the future.

Thanks for your work on this.

Ian
Re: Just what are rtx costs?
On Sun, Aug 21, 2011 at 12:01 PM, Georg-Johann Lay wrote:
> Richard Sandiford schrieb:
>> Georg-Johann Lay writes:
>>> Richard Sandiford schrieb:
>>>> I've been working on some patches to make insn_rtx_cost take account
>>>> of the cost of SET_DESTs as well as SET_SRCs. But I'm slowly
>>>> beginning to realise that I don't understand what rtx costs are
>>>> supposed to represent. AIUI the rules have historically been:
>>>>
>>>> 1) Registers have zero cost.
>>>>
>>>> 2) Constants have a cost relative to that of registers. By
>>>>    extension, constants have zero cost if they are as cheap as a
>>>>    register.
>>>>
>>>> 3) With an outer code of SET, actual operations have the cost of the
>>>>    associated instruction. E.g. the cost of a PLUS is the cost of an
>>>>    addition instruction.
>>>>
>>>> 4) With other outer codes, actual operations have the cost of the
>>>>    combined instruction, if available, or the cost of a separate
>>>>    instruction otherwise. E.g. the cost of a NEG inside an AND might
>>>>    be zero on targets that support BIC-like instructions, and
>>>>    COSTS_N_INSNS (1) on most others.
>>>>
>>>> [...]
>>>>
>>>> But that hardly seems clean either. Perhaps we should instead make
>>>> the SET_SRC always include the cost of the SET, even for registers,
>>>> constants and the like. Thoughts?
>>>
>>> IMO a clean approach would be to query the costs of a whole insn
>>> (resp. its pattern) rather than the cost of an RTX. COSTS_N_INSNS
>>> already indicates that the costs are compared to *insn* costs,
>>> i.e. the cost of the whole pattern (modulo clobbers).
>>
>> The problem is that we sometimes want the cost of something that
>> cannot be done using a single instruction. E.g. some CONST_INTs take
>> several instructions to create on MIPS. In this case the costs are
>> really measuring the cost of an emit_move_insn sequence, not a single
>> insn.
>>
>> I suppose we could use emit_move_insn to create a temporary sequence
>> and sum the cost of each individual instruction. But that's
>> potentially expensive.
>
> No, that complexity is not needed. For (set (reg) (const_int)) the BE
> can just return the cost of the expanded sequence because it knows how
> it will be expanded and how much it will cost. There's no need to
> really expand the sequence.
>
> That's the way, e.g., the AVR backend works: shifts/mul/div must be
> expanded because the hardware does not support them natively. The
> rtx_cost for such an expression (which is always interpreted as the
> RHS of a (set (reg) ...)) is the sum over the costs of all insns the
> expander will produce.

One of my problems with this approach is that the logic that's put into
an expander definition preparation statement (or, in the case of AVR,
the function invoked by the insn output statement) gets replicated
abstractly in rtx_costs: both places have long switch statements on
operand mode and const shift value to determine the instructions that
get emitted (in the former) or how many of them there are (in the
latter). How likely is it that the two are kept consistent over the
years? I'm working on the (not yet pushed upstream) back-end for the TI
MSP430, which has some historical relationship to AVR from about a
decade ago, and the answer to that question is "not very likely".

I've changed the msp430 back-end so that instead of putting all that
logic in the output statement for the insn, it goes into a preparation
statement for a standard expander. This way the individual insns that
result in (say) a constant shift of 8 bits using xor and bswap are
available for the optimizer and register allocator to improve.

This works pretty well, but still leaves me with problems when it comes
to computing RTX costs, because there seems to be some strength
reduction optimization for multiplication that's asking for the costs
to shift each integer type by 1 to 15 bits, when in fact no such insn
should ever be produced if real code was being generated. I think this
is an example of the case Richard's describing. If, in rtx_costs, I
could detect an unexpected insn, deduce the correct expander function,
call it, then recurse on the sequence it generated, I'd get the right
answer, though I'd infinitely prefer not to be asked to calculate the
cost of an unexpected insn. Doing this expansion would probably be very
expensive, though, and with the side effects that are part of emit_insn
I don't know how to safely call things that invoke it when what gets
emitted isn't part of the actual stream.

>> Also, any change along these lines is similar to the "tie costs to
>> .md patterns" thing that I mentioned at the end of the message.
>> I don't really have time to work on anything so invasive, so the
>> question is really whether we can sensibly change the costs within
>> the current framework.
>>
>>> E.g. the cost of a CONST_INT is meaningless if you don't know what
>>> to do with the constant. (set (reg:QI) (const_int 0)) might have
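[Editorial note: the consistency worry raised above — an expander and a cost
function that encode the same expansion logic twice and drift apart — is
sometimes addressed by deriving both from one table. The sketch below is an
invented standalone illustration of that idea, not anything a real GCC port
does; the table entries and insn names are made up.]

```c
#include <stddef.h>

/* One table describes each special-cased shift expansion; the cost
   function merely counts the insns in the entry, so it can never
   disagree with what the (hypothetical) expander would emit.  */
struct shift_expansion { int bits; const char *insns; };

static const struct shift_expansion shift_table[] = {
  { 1,  "lsl" },          /* single-bit shift: one insn */
  { 8,  "bswap;and" },    /* byte shift via swap + mask */
  { 15, "ror;and" },      /* rotate + mask trick */
};

static int shift_cost (int bits)
{
  for (size_t i = 0; i < sizeof shift_table / sizeof *shift_table; i++)
    if (shift_table[i].bits == bits)
      {
        int n = 1;        /* cost = number of ';'-separated insns */
        for (const char *p = shift_table[i].insns; *p; p++)
          if (*p == ';')
            n++;
        return n;
      }
  return bits;            /* fallback: one single-bit shift per position */
}
```

Whether this pays off depends on how regular the expansions are; for very
irregular targets the table itself becomes the duplicated switch statement.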
Re: Performance degradation on g++ 4.6
Hey David, these two --param options made no difference to the test.

I've cut the suite down to a single test (attached), which yields the
following results:

./simple_types_constant_folding_os (gcc 41)
test                     description    time      operations/s
 0  "int8_t constant add"               1.34 sec  1194.03 M

./simple_types_constant_folding_os (gcc 46)
test                     description    time      operations/s
 0  "int8_t constant add"               2.84 sec   563.38 M

Both compilers fully inline the templated function and the emitted code
looks very similar. I am puzzled as to why one of these loops is
significantly slower than the other. I've attached disassembled
listings - perhaps someone could have a look please? (the body of the
loop starts at 00400FD for gcc41 and at 00400D90 for gcc46)

Thanks,
Oleg.

On 2011/8/1 22:48, Xinliang David Li wrote:
> Try isolating the int8_t constant folding test from the rest to see
> if the slowdown can be reproduced with the isolated case. If the
> problem disappears, it is likely due to the following inline
> parameters: large-function-insns, large-function-growth,
> large-unit-insns, inline-unit-growth. For instance set
> --param large-function-insns=1 --param large-unit-insns=2
>
> David
>
> On Mon, Aug 1, 2011 at 11:43 AM, Oleg Smolsky wrote:
>> On 2011/7/29 14:07, Xinliang David Li wrote:
>>> Profiling tools are your best friend here. If you don't have access
>>> to any, the least you can do is to build the program with the -pg
>>> option and use the gprof tool to find out differences.
>>
>> The test suite has a bunch of very basic C++ tests that are executed
>> an enormous number of times. I've built one with the obvious
>> performance degradation and attached the source, output and reports.
>> Here are some highlights:
>>
>>   v4.1: Total absolute time for int8_t constant folding: 30.42 sec
>>   v4.6: Total absolute time for int8_t constant folding: 43.32 sec
>>
>> Every one of the tests in this section had degraded... the first
>> half more than the second. I am not sure how much further I can take
>> this - the benchmarked code is very short and plain. I can post
>> disassembly for one (some?) of them if anyone is willing to take a
>> look...
>>
>> Thanks,
>> Oleg.

[Attachment; the template parameter lists were stripped by the archive
and have been reconstructed from the surrounding usage:]

/* Copyright 2007-2008 Adobe Systems Incorporated
   Distributed under the MIT License (see accompanying file
   LICENSE_1_0_0.txt or a copy at http://stlab.adobe.com/licenses.html)

   Source file for tests shared among several benchmarks */

/******************************************************************/

template <typename T>
inline bool tolerance_equal(T &a, T &b)
{
    T diff = a - b;
    return (abs(diff) < 1.0e-6);
}

template<>
inline bool tolerance_equal(int32_t &a, int32_t &b)   { return (a == b); }

template<>
inline bool tolerance_equal(uint32_t &a, uint32_t &b) { return (a == b); }

template<>
inline bool tolerance_equal(uint64_t &a, uint64_t &b) { return (a == b); }

template<>
inline bool tolerance_equal(int64_t &a, int64_t &b)   { return (a == b); }

template<>
inline bool tolerance_equal(double &a, double &b)
{
    double diff = a - b;
    double reldiff = diff;
    if (fabs(a) > 1.0e-8)
        reldiff = diff / a;
    return (fabs(reldiff) < 1.0e-6);
}

template<>
inline bool tolerance_equal(float &a, float &b)
{
    float diff = a - b;
    double reldiff = diff;
    if (fabs(a) > 1.0e-4)
        reldiff = diff / a;
    // single precision divide test is really imprecise
    return (fabs(reldiff) < 1.0e-3);
}

/******************************************************************/

template <typename T, typename Shifter>
inline void check_shifted_sum(T result)
{
    T temp = (T)SIZE * Shifter::do_shift((T)init_value);
    if (!tolerance_equal(result, temp))
        printf("test %i failed\n", current_test);
}

template <typename T, typename Shifter>
inline void check_shifted_sum_CSE(T result)
{
    T temp = (T)0.0;
    if (!tolerance_equal(result, temp))
        printf("test %i failed\n", current_test);
}

template <typename T, typename Shifter>
inline void check_shifted_variable_sum(T result, T var)
{
    T temp = (T)SIZE * Shifter::do_shift((T)init_value, var);
    if (!tolerance_equal(result, temp))
        printf("test %i failed\n", current_test);
}

template <typename T, typename Shifter>
inline void check_shifted_variable_sum(T result, T var1, T var2, T var3, T var4)
{
    T temp = (T)SIZE * Shifter::do_shift((T)init_value, var1, var2, var3, var4);
    if (!tolerance_equal(result, temp))
        printf("test %i failed\n", current_test);
}

template <typename T, typename Shifter>
inline void check_shifted_variable_sum_CSE(T result, T var)
{
    T temp = (T)0.0;
    if (!tolerance_equal(result, temp))
        printf("test %i failed\n", current_test);
}

template <typename T, typename Shifter>
inline void check_shifted_variable_sum_CSE(T result, T var1, T var2, T var3, T var4)
{
    T temp = (T)0.0;
    if (!tolerance_equal(result, temp))
        printf("test %i failed\n", current_test);
}

/
Re: Performance degradation on g++ 4.6
On 2011/8/22 18:09, Oleg Smolsky wrote:
> Both compilers fully inline the templated function and the emitted
> code looks very similar. I am puzzled as to why one of these loops is
> significantly slower than the other. I've attached disassembled
> listings - perhaps someone could have a look please? (the body of the
> loop starts at 00400FD for gcc41 and at 00400D90 for gcc46)

The difference, theoretically, should be due to the inner loop:

v4.6:
.text:00400DA0 loc_400DA0:
.text:00400DA0    add     eax, 0Ah
.text:00400DA3    add     al, [rdx]
.text:00400DA5    add     rdx, 1
.text:00400DA9    cmp     rdx, 5034E0h
.text:00400DB0    jnz     short loc_400DA0

v4.1:
.text:00400FE0 loc_400FE0:
.text:00400FE0    movzx   eax, ds:data8[rdx]
.text:00400FE7    add     rdx, 1
.text:00400FEB    add     eax, 0Ah
.text:00400FEE    cmp     rdx, 1F40h
.text:00400FF5    lea     ecx, [rax+rcx]
.text:00400FF8    jnz     short loc_400FE0

However, I cannot see how the first version would be slow... The custom
templated "shifter" degenerates into "add 0xa", which is the point of
the test... Hmm...

Oleg.
Re: Performance degradation on g++ 4.6
On Mon, Aug 22, 2011 at 6:34 PM, Oleg Smolsky wrote:
> The difference, theoretically, should be due to the inner loop:
>
> v4.6:
> .text:00400DA0 loc_400DA0:
> .text:00400DA0    add     eax, 0Ah
> .text:00400DA3    add     al, [rdx]
> .text:00400DA5    add     rdx, 1
> .text:00400DA9    cmp     rdx, 5034E0h
> .text:00400DB0    jnz     short loc_400DA0
>
> v4.1:
> .text:00400FE0 loc_400FE0:
> .text:00400FE0    movzx   eax, ds:data8[rdx]
> .text:00400FE7    add     rdx, 1
> .text:00400FEB    add     eax, 0Ah
> .text:00400FEE    cmp     rdx, 1F40h
> .text:00400FF5    lea     ecx, [rax+rcx]
> .text:00400FF8    jnz     short loc_400FE0
>
> However, I cannot see how the first version would be slow... The
> custom templated "shifter" degenerates into "add 0xa", which is the
> point of the test... Hmm...

It is slower because of the subregister dependency between eax and al.

Thanks,
Andrew Pinski
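[Editorial note: the two code shapes above can be mirrored in portable C.
This invented sketch only reproduces the *semantics* of the two loops — an
8-bit accumulator ("add al, [rdx]") versus a zero-extended 32-bit one
("movzx" + "add eax"); the performance gap itself comes from the hardware
merging the al write back into eax each iteration, which C cannot express.]

```c
#include <stdint.h>

/* ~ the gcc 4.6 loop: accumulate in an 8-bit register (al), so each
   iteration's result depends on merging into the previous eax value.  */
static unsigned sum_narrow (const int8_t *p, int n)
{
  uint8_t acc = 0;
  for (int i = 0; i < n; i++)
    acc = (uint8_t)(acc + 10 + p[i]);     /* truncate every step */
  return acc;
}

/* ~ the gcc 4.1 loop: zero-extend each byte into a full register first
   (movzx), keeping the accumulator in 32 bits and avoiding the partial
   register dependency.  */
static unsigned sum_wide (const int8_t *p, int n)
{
  unsigned acc = 0;
  for (int i = 0; i < n; i++)
    acc += (uint8_t)p[i] + 10u;           /* widen first, add in eax */
  return acc;
}
```

Because addition is modular, the low byte of both results agrees, so the
compiler's choice is purely a code-generation (and hence throughput)
decision, not a correctness one.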