Paolo 'Blaisorblade' Giarrusso added the comment:
-fno-gcse is controversial.
Even if it might avoid jump sharing, the impact of that option has to
be measured, since common subexpression elimination allows omitting some
recalculations, so disabling global CSE might have a negative
Paolo 'Blaisorblade' Giarrusso added the comment:
>Probably #if the definitions of Py_LIKELY and Py_UNLIKELY instead of
__builtin_expect so new compilers can easily add their own definitions.
This was done in the first version, but with the currently supported
compilers it's
Paolo 'Blaisorblade' Giarrusso added the comment:
Given a 10% speedup on some systems, and statistically insignificant
changes on other systems, I would still apply the patch, if only
because the bitmask part makes more sense.
I'm not sure about the goto part, but
Paolo 'Blaisorblade' Giarrusso added the comment:
Yep, agreed. It could be quite cool to rely on __attribute__((cold)) to
mark format_exc_check_arg(), but that only works with newer gcc's. I
guess I'll add many likely() by hand - in some cases GCC might already
get it right
Paolo 'Blaisorblade' Giarrusso added the comment:
Also, GCC 2.95 does not support the construct; GCC 2.96 is required.
So, I'd suggest defining likely/unlikely unconditionally and using this,
which leads to less code overall:
# if (__GNUC__ < 2) || ((__GNUC__ == 2) &&
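The kind of version guard discussed in this comment can be sketched as follows. This is a minimal illustration, not the patch's actual code: it takes the GCC 2.96 threshold from the comment above, the no-op fallback for other compilers is an assumption, and `checked_div` is a hypothetical helper added only to show a use site.

```c
#include <assert.h>

/* __builtin_expect needs GCC 2.96 or later (threshold taken from the
 * comment above). Other compilers get a no-op fallback (an assumption). */
#if defined(__GNUC__) && \
    (__GNUC__ > 2 || (__GNUC__ == 2 && __GNUC_MINOR__ >= 96))
#  define likely(x)   __builtin_expect(!!(x), 1)
#  define unlikely(x) __builtin_expect(!!(x), 0)
#else
#  define likely(x)   (x)
#  define unlikely(x) (x)
#endif

/* Hypothetical helper: the error path is hinted as the rare one. */
static int checked_div(int a, int b, int *out)
{
    if (unlikely(b == 0))
        return -1;          /* rarely taken error path */
    *out = a / b;
    return 0;
}
```

Because the macros are defined unconditionally, call sites need no `#ifdef` of their own.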
Changes by Paolo 'Blaisorblade' Giarrusso :
--
nosy: +blaisorblade
___
Python tracker
<http://bugs.python.org/issue4715>
___
___
Python-bugs-l
Paolo 'Blaisorblade' Giarrusso added the comment:
#4715 is interesting, but is not really about superinstructions.
Superinstructions are not created because they make sense; any common
sequence of opcodes can become a superinstruction, just for the point of
saving dispatches. And th
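As a rough illustration of the idea of fusing a common opcode sequence into one superinstruction, here is a minimal sketch. The opcodes `OP_A`, `OP_B`, `OP_C` and the fused `OP_A_B` are hypothetical, and the pass assumes opcodes without operands and no jump targets inside the code, which a real bytecode rewriter could not assume.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes; OP_A_B stands for "A immediately followed by B". */
enum { OP_A, OP_B, OP_C, OP_A_B };

/* Rewrites `code` in place, replacing each adjacent (OP_A, OP_B) pair
 * with the single opcode OP_A_B. Returns the new length, which is also
 * the number of dispatches the interpreter will now perform. */
static size_t fuse_pairs(unsigned char *code, size_t n)
{
    size_t in = 0, out = 0;
    while (in < n) {
        if (in + 1 < n && code[in] == OP_A && code[in + 1] == OP_B) {
            code[out++] = OP_A_B;   /* one dispatch instead of two */
            in += 2;
        } else {
            code[out++] = code[in++];
        }
    }
    return out;
}
```

The pair is chosen purely by how often it occurs, not by whether the combination is meaningful on its own.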
Paolo 'Blaisorblade' Giarrusso added the comment:
If speedups on other machines are confirmed, a slowdown of less than 2%
should be acceptable (it's also similar to the statistical noise I have on
my machine - it's a pity pybench does not calculate the standard
deviation o
Paolo 'Blaisorblade' Giarrusso added the comment:
OK, so vmgen adds little more than direct threading in place of indirect
threading.
Since the purpose of superinstructions is to eliminate dispatch
overhead, and that's more important when little actual work is done,
what about
Changes by Paolo 'Blaisorblade' Giarrusso :
--
nosy: +blaisorblade
___
Python tracker
<http://bugs.python.org/issue4896>
Paolo 'Blaisorblade' Giarrusso added the comment:
A couple of percent may not be worth vmgen-ing. But even though I'm not a
vmgen expert, I have read many papers by Ertl about superinstructions and
replication, and the expected speedup from vmgen'ing is much bigger.
Is there some m
Paolo 'Blaisorblade' Giarrusso added the comment:
> Same for CPU-specific tuning: I don't think we want to ship Python
with compiler flags which depend on the particular CPU being used.
I wasn't suggesting this - but since different CPUs have different
optimization ru
Paolo 'Blaisorblade' Giarrusso added the comment:
The standing question is still: can we get ICC to produce the expected
output? It looks like we still haven't managed, and since ICC is the best
compiler out there, this matters.
Some problems with SunCC, even if it doesn't
Paolo 'Blaisorblade' Giarrusso added the comment:
@ ajaksu2
> Applying your patches makes no difference with gcc 4.2 and gives a
> barely noticeable (~2%) slowdown with icc.
"Your patches" is quite unclear :-)
Which patch sets are you comparing?
And
Paolo 'Blaisorblade' Giarrusso added the comment:
@skip:
In simple words, the x86 call:
call 0x2000
placed at address 0x1000 becomes:
call %rip + 0x1000
RIP holds the instruction pointer, which will be 0x1000 in this case
(actually, I'm ignoring the detail that when executin
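The arithmetic can be made concrete with a small sketch. It relies on one fact about the x86 encoding: a near `call` with a 32-bit displacement is 5 bytes long (opcode `0xE8` plus the displacement), and the displacement is measured from the address of the following instruction. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Length of "call rel32": opcode 0xE8 plus a 4-byte displacement. */
enum { CALL_REL32_LEN = 5 };

/* Displacement stored in a call placed at `call_addr` that should
 * reach `target`. */
static int32_t call_rel32(uint64_t call_addr, uint64_t target)
{
    return (int32_t)(target - (call_addr + CALL_REL32_LEN));
}

/* Inverse: where a call at `call_addr` with displacement `rel32` lands. */
static uint64_t call_target(uint64_t call_addr, int32_t rel32)
{
    return call_addr + CALL_REL32_LEN + (uint64_t)(int64_t)rel32;
}
```

With the numbers from the comment, the stored displacement is 0xFFB rather than 0x1000 once the instruction length is counted. The encoded bytes stay the same if caller and callee are moved together, since only their distance is stored.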
Paolo 'Blaisorblade' Giarrusso added the comment:
I finally implemented my suggestion for the switch elimination.
On top of threadedceval5.patch, apply abstract-switch-reduced.diff and
then restore-old-oparg-load.diff to test it.
This way, only computed gotos are used. I wou
Changes by Paolo 'Blaisorblade' Giarrusso :
Added file: http://bugs.python.org/file12634/restore-old-oparg-load.diff
Changes by Paolo 'Blaisorblade' Giarrusso :
Added file: http://bugs.python.org/file12633/abstract-switch-reduced.diff
Paolo 'Blaisorblade' Giarrusso added the comment:
@pitrou:
Argh, reference counting hinders even that?
I just discovered another problem caused by refcounting.
Various techniques allow creating binary code from the interpreter
binary, by just pasting together the code for
Paolo 'Blaisorblade' Giarrusso added the comment:
@alexandre: if you add two labels per opcode and two dispatch tables,
one before (like now) and one after the parameter fetch (where we have
the 'case'), you can keep the same speed.
And under the hood we also had two di
Paolo 'Blaisorblade' Giarrusso added the comment:
@Skip: if one decides to generate binary code, there is no need to use
switches. Inline threading (also known as "code copying" in some
research papers) is what you are probably looking for:
http://blog.mozilla.com/dmandel
Paolo 'Blaisorblade' Giarrusso added the comment:
@Alexandre:
> > So, can you try dropping the switch altogether, using always computed
> > goto and seeing how does the resulting code get compiled?
> Removing the switch won't be possible unless we change t
Paolo 'Blaisorblade' Giarrusso added the comment:
Daniel, I forgot to ask for the compilation command lines you used, since
they make a lot of difference. Can you post them? Thanks.
Paolo 'Blaisorblade' Giarrusso added the comment:
1st note: is that code from the threaded version? Note that to try that,
you need to modify the source so that it also accepts ICC.
In case you already did that, I guess the patch is not useful at all
with ICC since, as far as I can see
Paolo 'Blaisorblade' Giarrusso added the comment:
> I'm not an expert in this kind of optimizations. Could we gain more
speed by making the dispatcher table more dense? Python has less than
128 opcodes (len(opcode.opmap) == 113) so they can be squeezed in a
smaller table.
Paolo 'Blaisorblade' Giarrusso added the comment:
About miscompilations: the current patch is a bit weird for GCC, because
you keep both the switch and the computed goto.
But actually, there is no case in which the switch is needed, and
computed gotos give less room to GCC's ch
Paolo 'Blaisorblade' Giarrusso added the comment:
> I attached some additional benchmarks on SunOS. So far, it seems the
benefits of the proposed optimization are highly compiler-dependent.
Well, it would be more correct to say that as you verified for GCC 3.4,
"miscompil
Paolo 'Blaisorblade' Giarrusso added the comment:
> We would have to change opcode.h for this to be truely useful (in
order to re-use OPCODE_LIST()).
Yep.
> I think that should be the subject of a separate bug entry for code
reorganization.
Agreed, I'll maybe tr
Paolo 'Blaisorblade' Giarrusso added the comment:
== On the patch itself ==
Why don't you use the C preprocessor instead of that Python code?
Sample code:
#define OPCODE_LIST(DEFINE_OPCODE) \
DEFINE_OPCODE(STOP_CODE, 0) \
DEFINE_OP
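For readers unfamiliar with this preprocessor technique (often called an X-macro), here is a minimal self-contained sketch. Only the `STOP_CODE, 0` entry comes from the sample above; the other two entries and the `AS_ENUM`/`AS_NAME` helper macros are illustrative assumptions, not the contents of the actual patch or of `opcode.h`.

```c
#include <assert.h>
#include <string.h>

/* One central list; each use site passes its own DEFINE_OPCODE macro.
 * Only STOP_CODE comes from the sample above; the rest is illustrative. */
#define OPCODE_LIST(DEFINE_OPCODE) \
    DEFINE_OPCODE(STOP_CODE, 0)    \
    DEFINE_OPCODE(POP_TOP, 1)      \
    DEFINE_OPCODE(ROT_TWO, 2)

/* Expansion 1: an enum with the opcode numbers. */
#define AS_ENUM(name, num) name = num,
enum opcode { OPCODE_LIST(AS_ENUM) OPCODE_COUNT };
#undef AS_ENUM

/* Expansion 2: a name table indexed by opcode number (C99 designated
 * initializers). */
#define AS_NAME(name, num) [num] = #name,
static const char *const opcode_names[] = { OPCODE_LIST(AS_NAME) };
#undef AS_NAME
```

The same list could be expanded a third time into a table of label addresses for the computed-goto dispatcher, so that the opcode numbers, names, and jump targets cannot drift apart.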
Paolo 'Blaisorblade' Giarrusso added the comment:
Topics
1) About different speedups on 32 bits vs 64 bits
2) About PPC slowdown
3) PyPI
=== About different speedups on 32 bits vs 64 bits ===
An interpreter is very register-hungry, so on x86_64 it spends much less
time on regi
Paolo 'Blaisorblade' Giarrusso added the comment:
> You may want to check out issue1408710 in which a similar patch was
> provided, but failed to deliver the desired results.
It's not really similar, because you don't duplicate the dispatch code.
It took me some
Paolo 'Blaisorblade' Giarrusso added the comment:
Some other comments.
The time savings of indirect threading are also associated with the
removal of the range check, but better branch prediction is the main
advantage.
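Both effects can be seen in a toy interpreter. The sketch below is an illustration under stated assumptions, not the patch itself: the three opcodes are hypothetical, the stack has no bounds checks, and the threaded path relies on the GNU C "labels as values" extension (`&&label`, `goto *ptr`), with a plain `switch` as the fallback for other compilers.

```c
#include <assert.h>

/* Hypothetical opcodes for a toy stack machine. */
enum { OP_PUSH1, OP_ADD, OP_HALT };

/* Runs `code` and returns the value on top of the stack.
 * No bounds checking: the program is trusted to be well formed. */
static int run(const unsigned char *code)
{
    int stack[16];
    int sp = 0;
    const unsigned char *pc = code;
#if defined(__GNUC__)
    /* Indirect threading: a table of label addresses indexed by opcode.
     * Every handler ends with its own indirect jump, so there is no
     * shared dispatch point and no range check on the opcode. */
    static void *const targets[] = { &&do_push1, &&do_add, &&do_halt };
    #define DISPATCH() goto *targets[*pc++]
    DISPATCH();
do_push1:
    stack[sp++] = 1;
    DISPATCH();
do_add:
    sp--;
    stack[sp - 1] += stack[sp];
    DISPATCH();
do_halt:
    return stack[sp - 1];
    #undef DISPATCH
#else
    /* Portable fallback: one switch, hence one shared indirect jump. */
    for (;;) {
        switch (*pc++) {
        case OP_PUSH1: stack[sp++] = 1; break;
        case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
        case OP_HALT:  return stack[sp - 1];
        }
    }
#endif
}
```

In the threaded path the jump at the end of `do_push1` is a different instruction from the one at the end of `do_add`, so each can be predicted from its own history.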
> Also, the macro USE_THREADED_CODE should be renamed to s
Paolo 'Blaisorblade' Giarrusso added the comment:
Mentioning other versions as well.
The patch is so easy that it can be backported to all supported
versions, so I'm adding all of them (2.5 is in bugfix-only mode, and as
far as I can see this patch cannot be accept
Changes by Paolo 'Blaisorblade' Giarrusso :
--
nosy: +blaisorblade
___
Python tracker
<http://bugs.python.org/issue4753>