[Python-Dev] [ANN] VPython 0.1
Hi,

I implemented a variant of the CPython VM on top of Gforth's Vmgen; this made it fairly straightforward to add direct threaded code and superinstructions for the various permutations of LOAD_CONST, LOAD_FAST, and most of the two-argument VM instructions.

Sources: http://svirfneblin.org/stuff/VPython-0.1.tar.gz

Pybench output:

Test                        minimum run-time          average run-time
                             this   other    diff     this   other    diff
--------------------------------------------------------------------------
BuiltinFunctionCalls:       104ms   126ms  -17.3%   110ms   128ms  -14.0%
BuiltinMethodLookup:         98ms   136ms  -27.9%   100ms   139ms  -28.1%
CompareFloats:               59ms   121ms  -51.1%    60ms   123ms  -50.7%
CompareFloatsIntegers:       67ms   121ms  -45.0%    73ms   136ms  -46.7%
CompareIntegers:             57ms   157ms  -63.7%    58ms   161ms  -64.1%
CompareInternedStrings:      55ms   143ms  -61.6%    57ms   161ms  -64.5%
CompareLongs:                56ms   122ms  -54.0%    61ms   141ms  -56.7%
CompareStrings:              71ms   126ms  -43.6%    72ms   131ms  -44.7%
CompareUnicode:              82ms   123ms  -33.5%    83ms   131ms  -36.7%
ConcatStrings:              119ms   146ms  -18.4%   129ms   158ms  -18.4%
ConcatUnicode:               90ms   109ms  -17.2%    97ms   121ms  -19.8%
CreateInstances:            116ms   124ms   -6.6%   118ms   127ms   -7.0%
CreateNewInstances:         109ms   119ms   -7.9%   113ms   121ms   -6.6%
CreateStringsWithConcat:     97ms   162ms  -40.3%    99ms   169ms  -41.5%
CreateUnicodeWithConcat:     90ms   116ms  -22.8%    97ms   122ms  -20.8%
DictCreation:                87ms   122ms  -28.6%    91ms   127ms  -28.0%
DictWithFloatKeys:           98ms   139ms  -29.5%   105ms   148ms  -29.3%
DictWithIntegerKeys:         71ms   133ms  -46.7%    74ms   136ms  -46.0%
DictWithStringKeys:          62ms   126ms  -51.0%    64ms   128ms  -50.3%
ForLoops:                    68ms   135ms  -49.2%    69ms   136ms  -49.2%
IfThenElse:                  63ms   130ms  -51.6%    64ms   134ms  -51.9%
ListSlicing:                122ms   123ms   -0.9%   126ms   125ms   +0.8%
NestedForLoops:              89ms   149ms  -40.2%    93ms   152ms  -38.9%
NormalClassAttribute:        88ms   132ms  -33.1%    95ms   134ms  -29.5%
NormalInstanceAttribute:     72ms   116ms  -37.9%    77ms   118ms  -34.8%
PythonFunctionCalls:         90ms   122ms  -26.1%    94ms   125ms  -24.7%
PythonMethodCalls:          117ms   144ms  -18.8%   121ms   147ms  -17.8%
Recursion:                  121ms   180ms  -32.6%   124ms   184ms  -32.4%
SecondImport:               144ms   139ms   +3.5%   150ms   143ms   +4.8%
SecondPackageImport:        151ms   145ms   +3.9%   156ms   149ms   +4.3%
SecondSubmoduleImport:      178ms   168ms   +5.8%   186ms   176ms   +5.4%
SimpleComplexArithmetic:     71ms   112ms  -36.7%    76ms   123ms  -38.3%
SimpleDictManipulation:      77ms   139ms  -44.3%    78ms   140ms  -44.3%
SimpleFloatArithmetic:       61ms   124ms  -50.7%    63ms   126ms  -50.2%
SimpleIntFloatArithmetic:    61ms   121ms  -49.4%    62ms   123ms  -49.5%
SimpleIntegerArithmetic:     61ms   121ms  -49.5%    62ms   123ms  -49.8%
SimpleListManipulation:      58ms   116ms  -50.0%    58ms   117ms  -50.2%
SimpleLongArithmetic:        89ms   121ms  -26.3%    91ms   124ms  -27.0%
SmallLists:                  79ms   116ms  -31.8%    82ms   122ms  -32.6%
SmallTuples:                 91ms   117ms  -22.6%    93ms   122ms  -23.6%
SpecialClassAttribute:       84ms   132ms  -36.4%    93ms   134ms  -30.4%
SpecialInstanceAttribute:   111ms   153ms  -27.6%   114ms   155ms  -26.2%
StringMappings:             102ms   115ms  -11.1%   104ms   117ms  -10.9%
StringPredicates:           100ms   136ms  -26.7%   101ms   137ms  -26.1%
StringSlicing:               79ms   114ms  -30.2%    84ms   119ms  -29.9%
TryExcept:                   68ms   145ms  -53.2%    69ms   148ms  -53.6%
TryRaiseExcept:             106ms   104ms   +2.7%   109ms   106ms   +2.8%
TupleSlicing:               108ms   126ms  -14.4%   113ms   130ms  -13.0%
UnicodeMappings:            150ms   150ms   -0.4%   152ms   154ms   -1.7%
UnicodePredicates:          106ms   130ms  -18.3%   108ms   133ms  -18.3%
UnicodeProperties:           94ms   111ms  -15.2%    97ms   115ms  -15.5%
UnicodeSlicing:             101ms   130ms  -22.5%   105ms   136ms  -22.6%
--------------------------------------------------------------------------
Totals:                    4750ms  6788ms  -30.0%  4929ms  7035ms  -29.9%

``other'' is vanilla Python 2.5.2. For details, see README, TODO, and PYBENCH, which come with the sources.
Re: [Python-Dev] [ANN] VPython 0.1
Hey,

I hope you don't mind my replying in digest form.

First off, I guess I should be a little clearer about what VPython is and what it does. VPython is essentially a set of patches for CPython (it touches only three files; diff -b is about 800 lines IIRC, plus the switch statement in ceval.c's EvalFrameEx()). The main change is moving the VM instruction implementations (in CPython, the blocks of code following each case label) into a separate file, adding Vmgen stack comments, and removing the explicit stack manipulation code (plus some minor modifications, such as renaming variables to work with Vmgen's type prefixes and adding labels to enable the generation of superinstructions). Vmgen parses the stack comments and prints some macros around the provided instruction code (incidentally, this reduced the 1500-line switch body to about 1000 lines). Interested parties should consult ceval.vmg and ceval-vm.i.

The nice things about this are:

a) It's fairly easy to implement different types of dispatch, simply by changing a few macros (and while I haven't done this, it shouldn't be a problem to add some switch-dispatch #ifdefs for non-GCC platforms). In particular, direct threaded code leads to less horrible branch prediction than switch dispatch on many machines (exactly how pronounced this effect is depends heavily on the specific architecture).

b) Vmgen can generate superinstructions. A quick primer: a sequence of code such as LOAD_CONST LOAD_FAST BINARY_ADD will, in CPython, push some constant onto the stack, push some local onto the stack, then pop both off the stack, add them, and push the result back onto the stack. Turning this into a superinstruction means inlining LOAD_CONST and LOAD_FAST, modifying them to store the values they'd otherwise push onto the stack in local variables, and adding a version of BINARY_ADD which reads its arguments from those local variables rather than the stack (this reduces dispatch time in addition to the pops and pushes).
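The primer above can be sketched in Python (a toy model of the idea, not VPython's actual C implementation; the handler bodies are illustrative):

```python
# Toy model of the superinstruction idea: the fused version keeps both
# operands in local variables instead of routing them through the stack.

def run_plain(consts, fast, stack):
    # LOAD_CONST; LOAD_FAST; BINARY_ADD: three dispatches, two pushes,
    # two pops.
    stack.append(consts[0])
    stack.append(fast[0])
    b = stack.pop()
    a = stack.pop()
    stack.append(a + b)
    return stack

def run_fused(consts, fast, stack):
    # A CONST_FAST_ADD superinstruction: one dispatch, operands held in
    # locals, a single push of the result.
    a = consts[0]
    b = fast[0]
    stack.append(a + b)
    return stack

assert run_plain([1], [41], []) == run_fused([1], [41], []) == [42]
```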
David Gregg (and friends) recently published a paper comparing stack-based and register-based VMs for Java and found that register-based VMs were substantially faster. The main reason for this appears to be the absence of the various LOAD_ instructions in a register VM. They looked at mitigating this using superinstructions, but Java would have required (again, IIRC) about 1000 of them (which leads to substantial code growth). Since Python doesn't have multiple (typed) versions of every instruction (Java has iadd and so on), far fewer superinstructions are necessary. On my system, superinstructions account for about 10% of the 30% performance gain.

As for limitations: as the README points out, currently 2 cases in test_doctest fail, as well as 1 case each in test_hotshot, test_inspect, and test_subprocess, and most of the cases in test_trace. The reason for this is, I suspect, that I removed the line tracing code from ceval.c (I didn't want to look at it in detail, and it doesn't seem to affect anything else). I expect this would be a bit of work to fix, but I don't see it as a huge problem (in fact, if you don't use settrace it shouldn't affect you).

Stack caching: a previous version of VPython supported this, but the performance gain was minimal (maybe 1-2%; though if done really well, e.g. using x as the top-of-stack cache, who knows, more may be possible). Also, it led to some problems with the garbage collector seeing an out-of-date stack_pointer[-1].

``Cell'' is, unfortunately, hardcoded into Vmgen. I could either patch that or run ceval-vm.i through sed or something.

Finally, to the people who pointed out that VPython (the name) is already taken: Darn! I really should have checked that! Will call it something else in the future.

Anyway, HTH,
-jakob

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] [ANN] superinstructions (VPython 0.1)
Antoine Pitrou <[EMAIL PROTECTED]> writes:
> Hi,
>
> J. Sievers gmail.com> writes:
>>
>> A sequence of code such as LOAD_CONST LOAD_FAST BINARY_ADD will, in
>> CPython, push some constant onto the stack, push some local onto the
>> stack, then pop both off the stack, add them and push the result back
>> onto the stack.
>> Turning this into a superinstruction means inlining LOAD_CONST and
>> LOAD_FAST, modifying them to store the values they'd otherwise push
>> onto the stack in local variables and adding a version of BINARY_ADD
>> which reads its arguments from those local variables rather than the
>> stack (this reduces dispatch time in addition to pops and pushes).
>
> The problem is that this only optimizes code like "x + 1" but not
> "1 + x" or "x + y". To make this generic, a first step would be to try
> to fuse LOAD_CONST and LOAD_FAST into a single opcode (and check it
> doesn't slow down the VM). This could be possible by copying the
> constants table into the start of the frame's variables array when the
> frame is created, so that the LOAD_FAST code still does a single
> indexed array dereference. Since constants are constants, they don't
> need to be copied again when the frame is re-used by a subsequent call
> of the same function (but this would slow down recursive functions a
> bit, since those have to create new frames each time they are called).
>
> Then fusing e.g. LOAD_FAST LOAD_FAST BINARY_ADD into ADD_FAST_FAST
> would cover many more cases than the optimization you are writing
> about, without any explosion in the number of opcodes.
>
> Regards
>
> Antoine.

I don't know that I'd call it an explosion. Currently there are ~150 superinstructions in all (problematic when using bytecode, but inconsequential once one is committed to threaded code).
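Which opcode pairs actually occur often enough to be worth fusing can be measured with the stdlib dis module (Python 3.4+ API); a rough sketch, which one would run against a corpus of real code rather than the toy function here:

```python
# Count adjacent opcode pairs in a function's bytecode; frequent pairs
# are the natural superinstruction candidates.
import dis
from collections import Counter

def pair_frequencies(func):
    ops = [ins.opname for ins in dis.get_instructions(func)]
    return Counter(zip(ops, ops[1:]))

def sample(x, y):
    return x + y

freqs = pair_frequencies(sample)
assert sum(freqs.values()) > 0   # at least one adjacent pair was counted
```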
A superinstruction definition in Vmgen, btw, looks as follows:

fcbinary_add = load_fast load_const binary_add

-jakob
Re: [Python-Dev] [ANN] VPython 0.1
"M.-A. Lemburg" <[EMAIL PROTECTED]> writes:
[snip]
> BTW: I hope you did not use pybench to get profiles of the opcodes.
> That would most certainly result in good results for pybench, but
> less good ones for general applications such as Django or Zope/Plone.

Algorithm used for superinstruction selection:

1) idea: LOAD_CONST/LOAD_FAST + some suffix
2) potential suffixes: $ grep '..*(..*--..*)$' ceval.vmg | grep 'a1 a2 --' > INSTRUCTIONS
3) delete from INSTRUCTIONS any instruction that I felt wouldn't be particularly frequent (e.g. raise_varargs)
4) use awk to generate the superinstruction definitions

Not only is this relatively unbiased, it's also very low effort.

-jakob
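The grep step keys off Vmgen's stack comments; the whole pipeline could equally be sketched in Python (the .vmg fragment and instruction names here are illustrative stand-ins, not the real ceval.vmg):

```python
# Steps 2-4 of the selection pipeline, in Python rather than grep/awk:
# find instructions that consume two stack arguments by their
# "( a1 a2 -- ... )" stack comments, drop the infrequent ones, and emit
# one superinstruction definition per remaining suffix.
import re

VMG = """\
binary_add ( a1 a2 -- r )
binary_subtract ( a1 a2 -- r )
unary_negative ( a1 -- r )
raise_varargs ( a1 a2 -- )
"""

def two_arg_instructions(vmg_text, skip=("raise_varargs",)):
    # steps 2 and 3: two-operand instructions, minus judged-infrequent ones
    pat = re.compile(r"^(\w+)\s*\(\s*a1 a2 --", re.M)
    return [name for name in pat.findall(vmg_text) if name not in skip]

def emit_definitions(suffixes):
    # step 4 (the awk part): one Vmgen definition per suffix
    return ["fc%s = load_fast load_const %s" % (s, s) for s in suffixes]

defs = emit_definitions(two_arg_instructions(VMG))
assert defs[0] == "fcbinary_add = load_fast load_const binary_add"
```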
Re: [Python-Dev] [ANN] VPython 0.1
"Daniel Stutzbach" <[EMAIL PROTECTED]> writes:
[snip]
> I searched around for information on how threaded code interacts with
> branch prediction, and here's what I found. The short answer is that
> threaded code significantly improves branch prediction.

See ``Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters'' and ``The Structure and Performance of Efficient Interpreters''.

Cheers,
-jakob
Re: [Python-Dev] [ANN] VPython 0.1
Greg Ewing <[EMAIL PROTECTED]> writes:
> Daniel Stutzbach wrote:
>
>> With threaded code, every handler ends with its own dispatcher, so
>> the processor can make fine-grained predictions.
>
> I'm still wondering whether all this stuff makes a
> noticeable difference in real-life Python code, which
> spends most of its time doing expensive things like
> attribute lookups and function calls, rather than
> fiddling with integers in local variables.

Does it? Typically, VM interpreters spend a large percentage of their cycles on opcode dispatch rather than opcode execution. Running a WITH_TSC build of Python on some computation-heavy application (Mercurial, maybe?) processing a typical workload would be interesting. Also, I'd estimate that operations such as string concat and list append (opcode BINARY_ADD) are fairly common, even in real-life Python code.

Cheers,
-jakob
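The estimate about string concat can be checked with the stdlib dis module: even in code that looks like plain string handling, each loop iteration dispatches a BINARY_* opcode (BINARY_ADD on older CPythons, BINARY_OP on 3.11+), so dispatch cost applies to it too.

```python
# Disassemble a string-concatenation loop and look for the binary-add
# opcode that runs once per iteration.
import dis

def concat(parts):
    s = ""
    for p in parts:
        s = s + p   # compiles to BINARY_ADD / BINARY_OP
    return s

ops = [ins.opname for ins in dis.get_instructions(concat)]
assert any(op.startswith("BINARY") for op in ops)
```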
Re: [Python-Dev] [ANN] VPython 0.1
[EMAIL PROTECTED] writes:
> On 23 Oct, 10:42 pm, [EMAIL PROTECTED] wrote:
>>Guido van Rossum wrote:
>>>there already is something else called VPython
>>
>>Perhaps it could be called Fython (Python with a Forth-like VM)
>>or Thython (threaded-code Python).
>
> I feel like I've missed something important, but, why not just call it
> "Python"? ;-)
>
> It's a substantial patch, but from what I understand it's a huge
> performance improvement and completely compatible, both at the C API
> and Python source levels.
>
> Is there any reason this should be a separate project rather than just
> be rolled in to the core? Obviously once issues like the 'Cell' macro
> are dealt with.

While it seems to work reliably, I don't think the current implementation is really ``core-ready'' as it stands. I consider it more of a prototype to demonstrate the potential impact of these kinds of low-level dispatch optimizations (and, for me personally, an opportunity to familiarize myself with the CPython VM). IMO the main issues are:

- Right now, CPython's bytecode is translated to direct threaded code lazily (when a code object is first evaluated). This would have to be merged into compile.c in some way, plus some assorted minor changes.

- The various things mentioned in TODO.

- Finally, the core developers probably won't want to depend on Gforth (required to run Vmgen), so one might have to re-implement Vmgen (not a huge deal; I know there are a number of unpublished versions of Vmgen floating around, and IIRC one of them is even written in Python; even if not, Vmgen isn't really that difficult to implement).

Once that's done, however, I'd consider readability to have _increased_ if anything (compare the switch statement in vanilla Python 2.5.2's ceval.c to the patchset's ceval.vmg).

Note that none of the above are really show-stoppers from a user's point of view. It's about conforming to CPython's standards of neatness.
Cheers,
-jakob