[issue4753] Faster opcode dispatch on gcc

2009-01-26 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: -fno-gcse is controversial. Even if it might avoid jump sharing, the impact of that option has to be measured, since common subexpression elimination allows omitting some recalculations, so disabling global CSE might have a negative

[issue4941] Tell GCC Py_DECREF is unlikely to call the destructor

2009-01-26 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: >Probably #if the definitions of Py_LIKELY and Py_UNLIKELY instead of __builtin_expect so new compilers can easily add their own definitions. This was done in the first version, but with the currently supported compilers it's

[issue4896] Faster why variable manipulation in ceval.c

2009-01-16 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: Given a 10% speedup on some systems, and statistically insignificant changes on other systems, I would still apply the patch, if only because the bitmask part simply makes more sense. I'm not sure about the goto part, but

[issue4941] Tell GCC Py_DECREF is unlikely to call the destructor

2009-01-16 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: Yep, agreed. It could be quite cool to rely on __attribute__((cold)) to mark format_exc_check_arg(), but that only works with newer gcc's. I guess I'll add many likely() by hand - in some cases GCC might already get it right
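A minimal sketch of the `__attribute__((cold))` idea mentioned above, assuming the attribute is available from GCC 4.3 onwards. The names `MY_COLD`, `note_failure`, `checked_div` and `failure_count` are hypothetical and only stand in for an error helper such as format_exc_check_arg(); this is not code from the patch.

```c
/* MY_COLD is a hypothetical macro name. The version check assumes the
 * cold attribute first appeared in GCC 4.3; other compilers get nothing. */
#if defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 3))
#  define MY_COLD __attribute__((cold))
#else
#  define MY_COLD
#endif

static int failure_count = 0;

/* Error helper marked cold: the compiler may treat calls to it as
 * unlikely and move them out of the hot path. */
static MY_COLD void note_failure(void)
{
    failure_count++;
}

/* Returns 0 on success and stores a / b in *out; returns -1 on error. */
static int checked_div(int a, int b, int *out)
{
    if (b == 0) {
        note_failure();   /* rare path */
        return -1;
    }
    *out = a / b;         /* common path */
    return 0;
}
```

With such a marker, branches leading to the cold function need no hand-written likely()/unlikely() hints on compilers that honour the attribute.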

[issue4941] Tell GCC Py_DECREF is unlikely to call the destructor

2009-01-14 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: Also, GCC 2.95 does not support the construct, GCC 2.96 is required. So, I'd suggest defining likely/unlikely unconditionally and using this, which leads to less code overall: # if (__GNUC__ < 2) || ((__GNUC__ == 2) &&
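One way such hints are commonly written is sketched below, assuming (as the comment states) that `__builtin_expect` needs GCC 2.96 or later. This is not the truncated snippet from the comment; `MY_LIKELY`, `MY_UNLIKELY` and the `toy_*` names are hypothetical and only illustrate the Py_DECREF case from the issue title.

```c
/* Hypothetical macro names; on compilers without __builtin_expect the
 * hints expand to the plain condition. */
#if defined(__GNUC__) && (__GNUC__ > 2 || (__GNUC__ == 2 && __GNUC_MINOR__ >= 96))
#  define MY_LIKELY(x)   __builtin_expect(!!(x), 1)
#  define MY_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define MY_LIKELY(x)   (x)
#  define MY_UNLIKELY(x) (x)
#endif

/* Toy reference-counted object, standing in for a PyObject. */
typedef struct {
    long refcnt;
    int destroyed;
} toy_object;

static void toy_dealloc(toy_object *o)
{
    o->destroyed = 1;
}

/* Dropping a reference rarely reaches zero, so the destructor call is
 * marked as the unlikely branch. */
static void toy_decref(toy_object *o)
{
    if (MY_UNLIKELY(--o->refcnt == 0))
        toy_dealloc(o);
}
```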

[issue4715] optimize bytecode for conditional branches

2009-01-14 Thread Paolo 'Blaisorblade' Giarrusso
Changes by Paolo 'Blaisorblade' Giarrusso : -- nosy: +blaisorblade ___ Python tracker <http://bugs.python.org/issue4715> ___ ___ Python-bugs-l

[issue4753] Faster opcode dispatch on gcc

2009-01-13 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: #4715 is interesting, but is not really about superinstructions. Superinstructions are not created because they make sense; any common sequence of opcodes can become a superinstruction, just for the point of saving dispatches. And th
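To illustrate the point about saving dispatches, here is a toy sketch of superinstruction formation as a peephole pass. The opcodes (`T_LOAD`, `T_ADD`, `T_LOAD_ADD`, `T_HALT`) and the function `fuse_pairs` are hypothetical, and the sketch ignores operands and jump targets, which a real implementation would have to handle.

```c
#include <stddef.h>

/* Hypothetical toy opcodes; T_LOAD_ADD is the superinstruction that
 * replaces the sequence T_LOAD, T_ADD. */
enum { T_LOAD, T_ADD, T_LOAD_ADD, T_HALT };

/* Rewrites code[0..n) in place, fusing every adjacent T_LOAD, T_ADD
 * pair into a single T_LOAD_ADD. Returns the new length, which equals
 * the number of dispatches needed to run the straight-line code. */
static size_t fuse_pairs(unsigned char *code, size_t n)
{
    size_t in = 0, out = 0;
    while (in < n) {
        if (in + 1 < n && code[in] == T_LOAD && code[in + 1] == T_ADD) {
            code[out++] = T_LOAD_ADD;   /* one dispatch instead of two */
            in += 2;
        } else {
            code[out++] = code[in++];   /* copy unchanged */
        }
    }
    return out;
}
```

The fused pair does not need to be meaningful on its own; it is chosen only because the sequence is frequent.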

[issue4941] Tell GCC Py_DECREF is unlikely to call the destructor

2009-01-13 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: If speedups on other machines are confirmed, a slowdown of less than 2% should be acceptable (it's also similar to the statistical noise I have on my machine - it's a pity pybench does not calculate the standard deviation o

[issue4753] Faster opcode dispatch on gcc

2009-01-12 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: Ok, then vmgen adds almost just direct threading instead of indirect threading. Since the purpose of superinstructions is to eliminate dispatch overhead, and that's more important when little actual work is done, what about

[issue4896] Faster why variable manipulation in ceval.c

2009-01-12 Thread Paolo 'Blaisorblade' Giarrusso
Changes by Paolo 'Blaisorblade' Giarrusso : -- nosy: +blaisorblade

[issue4753] Faster opcode dispatch on gcc

2009-01-12 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: A couple percent maybe is not worth vmgen-ing. But even if I'm not a vmgen expert, I have read many papers by Ertl about superinstructions and replication, so the expected speedup from vmgen'ing is much bigger. Is there some m

[issue4753] Faster opcode dispatch on gcc

2009-01-10 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: > Same for CPU-specific tuning: I don't think we want to ship Python with compiler flags which depend on the particular CPU being used. I wasn't suggesting this - but since different CPUs have different optimization ru

[issue4753] Faster opcode dispatch on gcc

2009-01-10 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: The standing question is still: can we get ICC to produce the expected output? It looks like we still haven't managed, and since ICC is the best compiler out there, this matters. Some problems with SunCC, even if it doesn't

[issue4753] Faster opcode dispatch on gcc

2009-01-09 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: @ ajaksu2 > Applying your patches makes no difference with gcc 4.2 and gives a > barely noticeable (~2%) slowdown with icc. "Your patches" is something quite unclear :-) Which are the patch sets you are comparing? And

[issue4753] Faster opcode dispatch on gcc

2009-01-07 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: @skip: In simple words, the x86 call: call 0x2000 placed at address 0x1000 becomes: call %rip + 0x1000 RIP holds the instruction pointer, which will be 0x1000 in this case (actually, I'm ignoring the detail that when executin
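The arithmetic behind this can be sketched as follows, assuming the usual x86 near call encoding (opcode E8 followed by a 32-bit displacement, 5 bytes in total) in which the displacement is taken relative to the address of the next instruction. That is presumably the detail the simplified example sets aside. The names `CALL_REL32_LENGTH` and `call_rel32` are hypothetical.

```c
#include <stdint.h>

/* Assumed length of a near call: 1 opcode byte (E8) + 4 displacement bytes. */
#define CALL_REL32_LENGTH 5

/* Displacement stored in a call placed at call_addr that targets
 * target: it is relative to the instruction following the call. */
static int32_t call_rel32(uint64_t call_addr, uint64_t target)
{
    int64_t next = (int64_t)call_addr + CALL_REL32_LENGTH;
    return (int32_t)((int64_t)target - next);
}
```

For the call at 0x1000 targeting 0x2000, the simplified description gives 0x1000, while the encoded displacement under these assumptions is 0x2000 - 0x1005 = 0xFFB.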

[issue4753] Faster opcode dispatch on gcc

2009-01-07 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: I finally implemented my suggestion for the switch elimination. On top of threadedceval5.patch, apply abstract-switch-reduced.diff and then restore-old-oparg-load.diff to test it. This way, only computed goto's are used. I wou
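For readers unfamiliar with the technique, a toy interpreter dispatching only through computed gotos (GCC's "labels as values" extension) might look like the sketch below. It is not taken from any of the patches named above; the opcodes and the function `run` are hypothetical, and a plain switch is kept only as a fallback for compilers without the extension.

```c
/* Hypothetical toy opcodes for a tiny stack machine. */
enum { OP_PUSH1, OP_ADD, OP_DOUBLE, OP_HALT };

/* Runs the bytecode at pc and returns the value on top of the stack
 * when OP_HALT is reached. No bounds checking: this is a sketch. */
static int run(const unsigned char *pc)
{
    int stack[16];
    int sp = 0;                 /* number of values on the stack */

#if defined(__GNUC__)
    /* One label address per opcode, indexed by opcode number. */
    static void *const targets[] = {
        &&do_push1, &&do_add, &&do_double, &&do_halt
    };
    /* Each handler ends with its own indirect jump, so there is no
     * shared dispatch point and no switch. */
#define DISPATCH() goto *targets[*pc++]

    DISPATCH();
do_push1:
    stack[sp++] = 1;
    DISPATCH();
do_add:
    sp--;
    stack[sp - 1] += stack[sp];
    DISPATCH();
do_double:
    stack[sp - 1] *= 2;
    DISPATCH();
do_halt:
    return stack[sp - 1];
#undef DISPATCH
#else
    /* Fallback for compilers without labels as values. */
    for (;;) {
        switch (*pc++) {
        case OP_PUSH1:  stack[sp++] = 1; break;
        case OP_ADD:    sp--; stack[sp - 1] += stack[sp]; break;
        case OP_DOUBLE: stack[sp - 1] *= 2; break;
        case OP_HALT:   return stack[sp - 1];
        default:        return -1;
        }
    }
#endif
}
```

Running the program OP_PUSH1, OP_PUSH1, OP_ADD, OP_DOUBLE, OP_HALT computes (1 + 1) * 2.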

[issue4753] Faster opcode dispatch on gcc

2009-01-07 Thread Paolo 'Blaisorblade' Giarrusso
Changes by Paolo 'Blaisorblade' Giarrusso : Added file: http://bugs.python.org/file12634/restore-old-oparg-load.diff

[issue4753] Faster opcode dispatch on gcc

2009-01-07 Thread Paolo 'Blaisorblade' Giarrusso
Changes by Paolo 'Blaisorblade' Giarrusso : Added file: http://bugs.python.org/file12633/abstract-switch-reduced.diff

[issue4753] Faster opcode dispatch on gcc

2009-01-06 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: @pitrou: Argh, reference counting hinders even that? I just discovered another problem caused by refcounting. Various techniques allow creating binary code from the interpreter binary, by just pasting together the code for

[issue4753] Faster opcode dispatch on gcc

2009-01-04 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: @alexandre: if you add two labels per opcode and two dispatch tables, one before (like now) and one after the parameter fetch (where we have the 'case'), you can keep the same speed. And under the hood we also had two di

[issue4753] Faster opcode dispatch on gcc

2009-01-04 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: @Skip: if one decides to generate binary code, there is no need to use switches. Inline threading (also known as "code copying" in some research papers) is what you are probably looking for: http://blog.mozilla.com/dmandel

[issue4753] Faster opcode dispatch on gcc

2009-01-04 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: @Alexandre: > > So, can you try dropping the switch altogether, always using computed > > goto and seeing how the resulting code gets compiled? > Removing the switch won't be possible unless we change t

[issue4753] Faster opcode dispatch on gcc

2009-01-03 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: Daniel, I forgot to ask for the compilation command lines you used, since they make a lot of difference. Can you post them? Thanks

[issue4753] Faster opcode dispatch on gcc

2009-01-03 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: 1st note: is that code from the threaded version? Note that you need to modify the source to make it also accept ICC to try that. In case you already did that, I guess the patch is not useful at all with ICC since, as far as I can see

[issue4753] Faster opcode dispatch on gcc

2009-01-02 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: > I'm not an expert in this kind of optimizations. Could we gain more speed by making the dispatcher table more dense? Python has less than 128 opcodes (len(opcode.opmap) == 113) so they can be squeezed in a smaller table.

[issue4753] Faster opcode dispatch on gcc

2009-01-02 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: About miscompilations: the current patch is a bit weird for GCC, because you keep both the switch and the computed goto. But actually, there is no case in which the switch is needed, and computed gotos give less room to GCC's ch

[issue4753] Faster opcode dispatch on gcc

2009-01-01 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: > I attached some additional benchmarks on SunOS. So far, it seems the benefits of the proposed optimization are highly compiler-dependent. Well, it would be more correct to say that as you verified for GCC 3.4, "miscompil

[issue4753] Faster opcode dispatch on gcc

2009-01-01 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: > We would have to change opcode.h for this to be truly useful (in order to re-use OPCODE_LIST()). Yep. > I think that should be the subject of a separate bug entry for code reorganization. Agreed, I'll maybe tr

[issue4753] Faster opcode dispatch on gcc

2008-12-31 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: == On the patch itself == Why don't you use the C preprocessor instead of that Python code? Sample code: #define OPCODE_LIST(DEFINE_OPCODE) \ DEFINE_OPCODE(STOP_CODE, 0) \ DEFINE_OP
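The preprocessor pattern hinted at in the truncated sample (often called an X-macro) can be sketched as below. Only `OPCODE_LIST`, `DEFINE_OPCODE` and the `STOP_CODE` entry come from the comment; the other two entries, the helper macros `AS_ENUM` and `AS_NAME`, and the function `opcode_name` are illustrative assumptions, not the author's actual sample.

```c
#include <string.h>

/* Single source of truth for the opcode list. Entries beyond STOP_CODE
 * are illustrative additions. */
#define OPCODE_LIST(DEFINE_OPCODE) \
    DEFINE_OPCODE(STOP_CODE, 0)    \
    DEFINE_OPCODE(POP_TOP, 1)      \
    DEFINE_OPCODE(ROT_TWO, 2)

/* First expansion: an enum of opcode numbers. */
#define AS_ENUM(name, value) name = value,
enum opcode { OPCODE_LIST(AS_ENUM) OPCODE_COUNT };
#undef AS_ENUM

/* Second expansion: a name table indexed by opcode number
 * (C99 designated initializers). */
#define AS_NAME(name, value) [value] = #name,
static const char *const opcode_names[] = { OPCODE_LIST(AS_NAME) };
#undef AS_NAME

/* Returns the opcode's name, or "<unknown>" if out of range. */
static const char *opcode_name(int op)
{
    if (op < 0 || op >= OPCODE_COUNT)
        return "<unknown>";
    return opcode_names[op];
}
```

The same list could be expanded a third time to build a table of dispatch labels, which is what makes the pattern an alternative to generating the table with a Python script.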

[issue4753] Faster opcode dispatch on gcc

2008-12-31 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: Topics 1) About different speedups on 32bits vs 64 bits 2) About PPC slowdown 3) PyPI === About different speedups on 32bits vs 64 bits === An interpreter is very register-hungry, so on x86_64 it spends much less time on regi

[issue4753] Faster opcode dispatch on gcc

2008-12-31 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: > You may want to check out issue1408710 in which a similar patch was > provided, but failed to deliver the desired results. It's not really similar, because you don't duplicate the dispatch code. It took me some

[issue4753] Faster opcode dispatch on gcc

2008-12-31 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: Some other comments. The time savings of indirect threading are also associated with the removal of the range check, but better branch prediction is the main advantage. > Also, the macro USE_THREADED_CODE should be renamed to s

[issue4753] Faster opcode dispatch on gcc

2008-12-31 Thread Paolo 'Blaisorblade' Giarrusso
Paolo 'Blaisorblade' Giarrusso added the comment: Mentioning other versions as well. The patch is so easy that it can be backported to all supported versions, so I'm adding all of them (2.5 is in bugfix-only mode, and as far as I can see this patch cannot be accept

[issue4753] Faster opcode dispatch on gcc

2008-12-31 Thread Paolo 'Blaisorblade' Giarrusso
Changes by Paolo 'Blaisorblade' Giarrusso : -- nosy: +blaisorblade