[Python-Dev] [ANN] VPython 0.1

2008-10-22 Thread J. Sievers
Hi,

I implemented a variant of the CPython VM on top of Gforth's Vmgen; this made
it fairly straightforward to add direct threaded code and superinstructions for
the various permutations of LOAD_CONST, LOAD_FAST, and most of the two-argument
VM instructions.

Sources:
  http://svirfneblin.org/stuff/VPython-0.1.tar.gz

Pybench output:

Test                          minimum run-time         average run-time
                             this   other    diff     this   other    diff
--------------------------------------------------------------------------
BuiltinFunctionCalls:       104ms   126ms  -17.3%    110ms   128ms  -14.0%
BuiltinMethodLookup:         98ms   136ms  -27.9%    100ms   139ms  -28.1%
CompareFloats:               59ms   121ms  -51.1%     60ms   123ms  -50.7%
CompareFloatsIntegers:       67ms   121ms  -45.0%     73ms   136ms  -46.7%
CompareIntegers:             57ms   157ms  -63.7%     58ms   161ms  -64.1%
CompareInternedStrings:      55ms   143ms  -61.6%     57ms   161ms  -64.5%
CompareLongs:                56ms   122ms  -54.0%     61ms   141ms  -56.7%
CompareStrings:              71ms   126ms  -43.6%     72ms   131ms  -44.7%
CompareUnicode:              82ms   123ms  -33.5%     83ms   131ms  -36.7%
ConcatStrings:              119ms   146ms  -18.4%    129ms   158ms  -18.4%
ConcatUnicode:               90ms   109ms  -17.2%     97ms   121ms  -19.8%
CreateInstances:            116ms   124ms   -6.6%    118ms   127ms   -7.0%
CreateNewInstances:         109ms   119ms   -7.9%    113ms   121ms   -6.6%
CreateStringsWithConcat:     97ms   162ms  -40.3%     99ms   169ms  -41.5%
CreateUnicodeWithConcat:     90ms   116ms  -22.8%     97ms   122ms  -20.8%
DictCreation:                87ms   122ms  -28.6%     91ms   127ms  -28.0%
DictWithFloatKeys:           98ms   139ms  -29.5%    105ms   148ms  -29.3%
DictWithIntegerKeys:         71ms   133ms  -46.7%     74ms   136ms  -46.0%
DictWithStringKeys:          62ms   126ms  -51.0%     64ms   128ms  -50.3%
ForLoops:                    68ms   135ms  -49.2%     69ms   136ms  -49.2%
IfThenElse:                  63ms   130ms  -51.6%     64ms   134ms  -51.9%
ListSlicing:                122ms   123ms   -0.9%    126ms   125ms   +0.8%
NestedForLoops:              89ms   149ms  -40.2%     93ms   152ms  -38.9%
NormalClassAttribute:        88ms   132ms  -33.1%     95ms   134ms  -29.5%
NormalInstanceAttribute:     72ms   116ms  -37.9%     77ms   118ms  -34.8%
PythonFunctionCalls:         90ms   122ms  -26.1%     94ms   125ms  -24.7%
PythonMethodCalls:          117ms   144ms  -18.8%    121ms   147ms  -17.8%
Recursion:                  121ms   180ms  -32.6%    124ms   184ms  -32.4%
SecondImport:               144ms   139ms   +3.5%    150ms   143ms   +4.8%
SecondPackageImport:        151ms   145ms   +3.9%    156ms   149ms   +4.3%
SecondSubmoduleImport:      178ms   168ms   +5.8%    186ms   176ms   +5.4%
SimpleComplexArithmetic:     71ms   112ms  -36.7%     76ms   123ms  -38.3%
SimpleDictManipulation:      77ms   139ms  -44.3%     78ms   140ms  -44.3%
SimpleFloatArithmetic:       61ms   124ms  -50.7%     63ms   126ms  -50.2%
SimpleIntFloatArithmetic:    61ms   121ms  -49.4%     62ms   123ms  -49.5%
SimpleIntegerArithmetic:     61ms   121ms  -49.5%     62ms   123ms  -49.8%
SimpleListManipulation:      58ms   116ms  -50.0%     58ms   117ms  -50.2%
SimpleLongArithmetic:        89ms   121ms  -26.3%     91ms   124ms  -27.0%
SmallLists:                  79ms   116ms  -31.8%     82ms   122ms  -32.6%
SmallTuples:                 91ms   117ms  -22.6%     93ms   122ms  -23.6%
SpecialClassAttribute:       84ms   132ms  -36.4%     93ms   134ms  -30.4%
SpecialInstanceAttribute:   111ms   153ms  -27.6%    114ms   155ms  -26.2%
StringMappings:             102ms   115ms  -11.1%    104ms   117ms  -10.9%
StringPredicates:           100ms   136ms  -26.7%    101ms   137ms  -26.1%
StringSlicing:               79ms   114ms  -30.2%     84ms   119ms  -29.9%
TryExcept:                   68ms   145ms  -53.2%     69ms   148ms  -53.6%
TryRaiseExcept:             106ms   104ms   +2.7%    109ms   106ms   +2.8%
TupleSlicing:               108ms   126ms  -14.4%    113ms   130ms  -13.0%
UnicodeMappings:            150ms   150ms   -0.4%    152ms   154ms   -1.7%
UnicodePredicates:          106ms   130ms  -18.3%    108ms   133ms  -18.3%
UnicodeProperties:           94ms   111ms  -15.2%     97ms   115ms  -15.5%
UnicodeSlicing:             101ms   130ms  -22.5%    105ms   136ms  -22.6%
--------------------------------------------------------------------------
Totals:                    4750ms  6788ms  -30.0%   4929ms  7035ms  -29.9%

``this'' is the patched interpreter; ``other'' is vanilla Python 2.5.2.
For details, see the README, TODO, and PYBENCH files included with the sources.

Re: [Python-Dev] [ANN] VPython 0.1

2008-10-23 Thread J. Sievers
Hey,

I hope you don't mind my replying in digest form.

First off, I guess I should be a little clearer about what VPython is
and what it does.

VPython is essentially a set of patches for CPython (it touches only
three files; the diff -b is about 800 lines IIRC, plus the switch
statement in ceval.c's EvalFrameEx()).


The main change is moving the VM instruction implementations (in
CPython, the blocks of code following each case label) into a separate
file, adding Vmgen stack comments, and removing the explicit stack
manipulation code (plus some minor modifications, like renaming
variables to work with Vmgen's type prefixes and renaming labels to
enable the generation of superinstructions).
Vmgen parses the stack comments and emits the appropriate
stack-handling macros around the provided instruction code
(incidentally, this reduced the 1500-line switch body to about 1000
lines).
Interested parties should consult ceval.vmg and ceval-vm.i.
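
To give a flavor of the format (this is a simplified illustration, not
a verbatim excerpt from ceval.vmg; the instruction name, the prefix
declaration, and the error handling are schematic): an instruction
definition is a Vmgen stack comment followed by plain C, with the
explicit stack manipulation gone.

  binary_add ( a1 a2 -- a )
  /* Vmgen derives the pops of a1 and a2 and the push of a from the
     ( a1 a2 -- a ) stack comment; the ``a'' prefix stands for a
     PyObject* stack item.  The body is ordinary C. */
  a = PyNumber_Add(a1, a2);
  Py_DECREF(a1);
  Py_DECREF(a2);
  if (a == NULL)
      ERROR();   /* stand-in for the real exception path */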

The nice things about this are:

a) It's fairly easy to implement different types of dispatch simply by
changing a few macros (and while I haven't done this, it shouldn't be a
problem to add #ifdefs for switch dispatch on non-GCC platforms).

In particular, direct threaded code leads to less horrible branch
prediction than switch dispatch on many machines (exactly how
pronounced this effect is depends heavily on the specific
architecture).

b) Vmgen can generate superinstructions.
A quick primer:
A sequence of code such as LOAD_CONST LOAD_FAST BINARY_ADD will, in
CPython, push some constant onto the stack, push some local onto the
stack, then pop both off the stack, add them and push the result back
onto the stack.
Turning this into a superinstruction means inlining LOAD_CONST and
LOAD_FAST, modifying them to keep the values they'd otherwise push onto
the stack in local variables, and adding a version of BINARY_ADD which
reads its arguments from those local variables rather than from the
stack (this saves dispatch overhead in addition to the pushes and pops).
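
To make a) and b) concrete, here's a toy, self-contained sketch (GCC
computed gotos; this is not the VPython code, and real threaded code
would carry the const/local indices in the instruction stream rather
than hardcoding them):

  /* Toy illustration only; requires GCC/Clang (labels as values). */
  #include <stdio.h>

  int main(void)
  {
      long consts[] = { 40 };          /* co_consts                    */
      long locals[] = { 2 };           /* fastlocals                   */
      long stack[8], *sp = stack;      /* value stack                  */
      long c, f;                       /* superinstruction operands    */

      /* "Direct threaded code": the program is an array of handler
         addresses.  LOAD_FAST LOAD_CONST BINARY_ADD has been fused
         into a single superinstruction. */
      static void *program[] = { &&fc_binary_add, &&halt };
      void **ip = program;

      goto **ip++;                     /* initial dispatch             */

  fc_binary_add:                       /* load_fast load_const binary_add */
      f = locals[0];                   /* no push...                   */
      c = consts[0];                   /* ...no push...                */
      *sp++ = f + c;                   /* ...one push instead of 2 + 2 */
      goto **ip++;                     /* every handler dispatches     */

  halt:
      printf("%ld\n", sp[-1]);         /* prints 42                    */
      return 0;
  }

Every handler ends in its own indirect jump (which tends to predict
better than one central switch), and the fused handler does one push
instead of two pushes and two pops.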

David Gregg (and friends) recently published a paper comparing
stack-based and register-based VMs for Java and found that
register-based VMs were substantially faster. The main reason for this
appears to be the absence of the various LOAD_ instructions in a
register VM. They looked at mitigating this with superinstructions, but
Java would have required (again, IIRC) about 1,000 of them, which leads
to substantial code growth.

Since Python doesn't have multiple (typed) versions of every
instruction (Java has iadd and so on), far fewer superinstructions are
necessary.

On my system, superinstructions account for about 10% of the 30%
performance gain.


As for limitations: as the README points out, currently 2 cases in
test_doctest fail, as well as 1 case each in test_hotshot, test_inspect,
and test_subprocess, and most of the cases in test_trace.
The reason for this is, I suspect, that I removed the line tracing code
from ceval.c (I didn't want to look at it in detail, and it doesn't seem
to affect anything else). I expect this would be a bit of work to fix,
but I don't see it as a huge problem (in fact, if you don't use
sys.settrace() it shouldn't affect you).


Stack caching: a previous version of VPython supported this, but the
performance gain was minimal (maybe 1-2%, though if done really well
(e.g. using x as the top-of-stack cache), who knows, more may be
possible). Also, it led to some problems with the garbage collector
seeing an out-of-date stack_pointer[-1].
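
For anyone curious what that looks like, here's a toy sketch of the
idea (not the old VPython code): the topmost stack element lives in a C
local, so common operations touch memory less.

  /* Toy sketch of top-of-stack caching. */
  #include <stdio.h>

  int main(void)
  {
      long stack[8];
      long *sp = stack;   /* spilled part of the value stack            */
      long tos = 0;       /* cached top of stack, ideally in a register */

      /* push 40, push 2: each push spills the previously cached value
         (the first spill stores a dummy, since the stack was empty)    */
      *sp++ = tos; tos = 40;
      *sp++ = tos; tos = 2;

      /* BINARY_ADD: one memory access instead of two pops and a push   */
      tos = *--sp + tos;

      printf("%ld\n", tos);   /* prints 42 */
      return 0;
  }

The GC problem mentioned above follows directly from this: the
in-memory stack alone no longer describes the full VM stack, so
anything that inspects it (like the collector looking at
stack_pointer[-1]) has to account for the cached element.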


``Cell'' is, unfortunately, hardcoded into Vmgen. I could either patch
that or run ceval-vm.i through sed or something.


Finally, to the people who pointed out that VPython (the name) is
already taken: Darn! I really should have checked that! Will call it
something else in the future.

Anyway, HTH,
-jakob


Re: [Python-Dev] [ANN] superinstructions (VPython 0.1)

2008-10-24 Thread J. Sievers
Antoine Pitrou <[EMAIL PROTECTED]> writes:

> Hi,
>
> J. Sievers  gmail.com> writes:
>> 
>> A sequence of code such as LOAD_CONST LOAD_FAST BINARY_ADD will, in
>> CPython, push some constant onto the stack, push some local onto the
>> stack, then pop both off the stack, add them and push the result back
>> onto the stack.
>> Turning this into a superinstruction means inlining LOAD_CONST and
>> LOAD_FAST, modifying them to store the values they'd otherwise push
>> onto the stack in local variables and adding a version of BINARY_ADD
>> which reads its arguments from those local variables rather than the
>> stack (this reduces dispatch time in addition to pops and pushes).
>
> The problem is that this only optimizes code like "x + 1" but not "1 + x" or
> "x + y". To make this generic, a first step would be to try to fuse LOAD_CONST
> and LOAD_FAST into a single opcode (and check it doesn't slow down the VM).
> This could be possible by copying the constants table into the start of the
> frame's variables array when the frame is created, so that the LOAD_FAST code
> still does a single indexed array dereference. Since constants are constants,
> they don't need to be copied again when the frame is re-used by a subsequent
> call of the same function (but this would slow down recursive functions a bit,
> since those have to create new frames each time they are called).
>
> Then fusing e.g. LOAD_FAST LOAD_FAST BINARY_ADD into ADD_FAST_FAST would cover
> many more cases than the optimization you are writing about, without any
> explosion in the number of opcodes.
>
> Regards
>
> Antoine.
>

I don't know that I'd call it an explosion. Currently there are ~150
superinstructions in all (problematic when using bytecode but
inconsequential once one is committed to threaded code).

A superinstruction definition in Vmgen, btw, looks as follows:

 fcbinary_add = load_fast load_const binary_add

-jakob



Re: [Python-Dev] [ANN] VPython 0.1

2008-10-24 Thread J. Sievers
"M.-A. Lemburg" <[EMAIL PROTECTED]> writes:

[snip]
> BTW: I hope you did not use pybench to get profiles of the opcodes.
> That would most certainly result in good results for pybench, but
> less good ones for general applications such as Django or Zope/Plone.

Algorithm used for superinstruction selection:

1) idea: LOAD_CONST/LOAD_FAST + some suffix
2) potential suffixes:
   $ grep '..*(..*--..*)$' ceval.vmg | grep 'a1 a2 --' > INSTRUCTIONS
3) delete any instruction that I felt wouldn't be particularly frequent
   from INSTRUCTIONS (e.g. raise_varargs)
4) use awk to generate superinstruction definitions

Not only is this relatively unbiased, it's also very low effort.

-jakob



Re: [Python-Dev] [ANN] VPython 0.1

2008-10-24 Thread J. Sievers
"Daniel Stutzbach" <[EMAIL PROTECTED]> writes:

[snip]

>I searched around for information on how threaded code interacts with
>branch prediction, and here's what I found.  The short answer is that
>threaded code significantly improves branch prediction.

See ``Optimizing indirect branch prediction accuracy in virtual machine
interpreters'' and ``The Structure and Performance of Efficient
Interpreters''.

Cheers,
-jakob



Re: [Python-Dev] [ANN] VPython 0.1

2008-10-24 Thread J. Sievers
Greg Ewing <[EMAIL PROTECTED]> writes:

> Daniel Stutzbach wrote:
>
>> With threaded code, every handler ends with its own dispatcher, so
>> the processor can make fine-grained predictions.
>
> I'm still wondering whether all this stuff makes a
> noticeable difference in real-life Python code, which
> spends most of its time doing expensive things like
> attribute lookups and function calls, rather than
> fiddling with integers in local variables.

Does it? Typically, VM interpreters spend a large percentage of their
cycles on opcode dispatch rather than on opcode execution.
Running a WITH_TSC-compiled Python on some computation-heavy application
(Mercurial maybe?) processing a typical workload would be interesting.

Also, I'd estimate that operations such as string and list
concatenation (opcode BINARY_ADD) are fairly common, even in real-life
Python code.

Cheers,
-jakob



Re: [Python-Dev] [ANN] VPython 0.1

2008-10-24 Thread J. Sievers
[EMAIL PROTECTED] writes:

> On 23 Oct, 10:42 pm, [EMAIL PROTECTED] wrote:
>>Guido van Rossum wrote:
>>>there already is something else called VPython
>>
>>Perhaps it could be called Fython (Python with a Forth-like VM)
>>or Thython (threaded-code Python).
>
> I feel like I've missed something important, but, why not just call it
> "Python"? ;-)
>
> It's a substantial patch, but from what I understand it's a huge
> performance improvement and completely compatible, both at the C API
> and Python source levels.
>
> Is there any reason this should be a separate project rather than just
> be rolled in to the core?  Obviously once issues like the 'Cell' macro
> are dealt with.

While it seems to work reliably, I don't think the current
implementation is really ``core-ready'' as it stands.
I consider it more of a prototype to demonstrate the potential impact
of these kinds of low-level dispatch optimizations (and, for me
personally, an opportunity to familiarize myself with the CPython VM).

IMO the main issues are:
 - Right now, CPython's bytecode is translated to direct threaded code
 lazily (when a code object is first evaluated). This would have to be
 merged into compile.c in some way, plus some assorted minor changes
 (a toy sketch of the lazy approach follows this list).
 - The various things mentioned in TODO.
 - Finally, the core developers probably won't want to depend on Gforth
 (required to run Vmgen), so one might have to re-implement Vmgen (not
 a huge deal; I know that there are a number of unpublished versions of
 Vmgen floating around and IIRC one of them is even written in Python;
 even if not, Vmgen isn't really that difficult to implement).
 Once that's done, however, I'd consider readability to have
 _increased_ if anything (compare the switch statement in vanilla
 Python 2.5.2's ceval.c to the patchset's ceval.vmg).
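
To illustrate the first point, here's the shape of the lazy translation
step as a self-contained toy (GCC computed gotos again; this is not the
actual patch: real bytecode also carries arguments, and the real thing
has proper error handling).

  /* Toy sketch of lazy bytecode -> threaded-code translation: a "code
     object" caches its translated form the first time it is run. */
  #include <stdio.h>
  #include <stdlib.h>

  enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

  typedef struct {
      const unsigned char *bytecode;
      size_t               len;
      void               **threaded;   /* NULL until first evaluation */
  } Code;

  static long eval(Code *co)
  {
      /* handler addresses, indexed by opcode */
      static void *labels[] = { &&op_push1, &&op_add, &&op_print, &&op_halt };
      long stack[8], *sp = stack;
      void **ip;

      /* Lazy translation: done on the first call, cached afterwards. */
      if (co->threaded == NULL) {
          co->threaded = malloc(co->len * sizeof(void *));
          for (size_t i = 0; i < co->len; i++)
              co->threaded[i] = labels[co->bytecode[i]];
      }

      ip = co->threaded;
      goto **ip++;

  op_push1:  *sp++ = 1;                goto **ip++;
  op_add:    --sp; sp[-1] += *sp;      goto **ip++;
  op_print:  printf("%ld\n", sp[-1]);  goto **ip++;
  op_halt:   return sp[-1];
  }

  int main(void)
  {
      unsigned char prog[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
      Code co = { prog, sizeof prog, NULL };
      eval(&co);   /* translates on this call...              */
      eval(&co);   /* ...reuses the cached threaded code here */
      free(co.threaded);
      return 0;
  }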

Note that none of the above are really show stoppers from a user's
point of view. It's about conforming to CPython's standards of
neatness.

Cheers,
-jakob
