Re: [Python-Dev] Micro-optimizations by adding special-case bytecodes?

2017-06-28 Thread Ben Hoyt
I've now got a patch to add new COMPARE_IS_NONE and
COMPARE_IS_NOT_NONE bytecodes to CPython to attempt to speed up the
very common "is None" and "is not None" expressions. It's a very
simple change overall. Here is my patch:
https://gist.github.com/benhoyt/e5ba19afe7b869fd743c1c39fc2afdf8

Note: The intent here is a proof of concept, not a CPython-quality
implementation. I've put the code to generate these in compile.c,
though it may be better off in peephole.c. Also, I'm not sure two new
bytecodes are needed -- it'd be simpler and probably almost as fast to
have one bytecode instruction COMPARE_NONE with an argument that
determines the "is" vs "is not" part.

Anyway, it looks like a nice improvement on a micro-benchmark, with a
20% speed increase for "is None" and "is not None" expressions using
timeit. See results at [1] below.

However, I'm not very familiar with the python/performance benchmarks
and with my first try at running those I'm getting wildly varying
results (most faster, but some tests say 1.7x slower, so I'm pretty
sure this is an artifact of how I'm running them). See [2] below for
my results. Would someone with more experience running those be able
to run the performance tests and post results?

Any other thoughts welcome as well.

-Ben



[1] Here are the micro-benchmarks using timeit against the latest
master (normal build and also PGO) as well as with my patch:

latest master (e613e6add5f07ff6aad5802924596b631b707d2a, June 27)

./configure
make -j
../test_none.sh
x = 1234; x is None -- 2000 loops, best of 5: 19.8 nsec per loop
x = 1234; x is not None -- 1000 loops, best of 5: 20 nsec per loop
x = None; x is None -- 1000 loops, best of 5: 20.7 nsec per loop
x = None; x is not None -- 1000 loops, best of 5: 20.8 nsec per loop
avg 20.3 nsec per loop

./configure --enable-optimizations
make -j
../test_none.sh
x = 1234; x is None -- 2000 loops, best of 5: 18.6 nsec per loop
x = 1234; x is not None -- 2000 loops, best of 5: 19.2 nsec per loop
x = None; x is None -- 1000 loops, best of 5: 19.5 nsec per loop
x = None; x is not None -- 2000 loops, best of 5: 19.4 nsec per loop
avg 19.2 nsec per loop


is_none_bytecode branch (with my patch)

./configure
make -j
../test_none.sh
x = 1234; x is None -- 2000 loops, best of 5: 16 nsec per loop
x = 1234; x is not None -- 2000 loops, best of 5: 16.4 nsec per loop
x = None; x is None -- 2000 loops, best of 5: 16.6 nsec per loop
x = None; x is not None -- 2000 loops, best of 5: 15.5 nsec per loop
avg 16.1 nsec per loop (21% faster than master)

./configure --enable-optimizations
make -j
../test_none.sh
x = 1234; x is None -- 2000 loops, best of 5: 15.6 nsec per loop
x = 1234; x is not None -- 2000 loops, best of 5: 16.4 nsec per loop
x = None; x is None -- 2000 loops, best of 5: 15.7 nsec per loop
x = None; x is not None -- 2000 loops, best of 5: 15.4 nsec per loop
avg 15.8 nsec per loop (18% faster than master)


[2] Benchmarks comparing master and is_none_bytecode patch (each
compiled with --enable-optimizations) using python/performance:

+-++--+
| Benchmark   | master_opt | is_none_bytecode_opt |
+=++==+
| 2to3| 617 ms | 541 ms: 1.14x faster (-12%)  |
+-++--+
| chameleon   | 19.9 ms| 18.6 ms: 1.07x faster (-7%)  |
+-++--+
| crypto_pyaes| 208 ms | 201 ms: 1.04x faster (-3%)   |
+-++--+
| deltablue   | 13.8 ms| 12.9 ms: 1.07x faster (-7%)  |
+-++--+
| django_template | 245 ms | 233 ms: 1.05x faster (-5%)   |
+-++--+
| dulwich_log | 132 ms | 126 ms: 1.05x faster (-5%)   |
+-++--+
| fannkuch| 898 ms | 849 ms: 1.06x faster (-5%)   |
+-++--+
| float   | 204 ms | 191 ms: 1.07x faster (-7%)   |
+-++--+
| genshi_text | 57.5 ms| 54.8 ms: 1.05x faster (-5%)  |
+-++--+
| genshi_xml  | 122 ms | 114 ms: 1.07x faster (-6%)   |
+-++--+
| hexiom  | 18.4 ms| 26.9 ms: 1.46x slower (+46%) |
+-++--+
| html5lib| 155 ms | 265 ms: 1.72x slower (+72%)  |
+---

Re: [Python-Dev] Micro-optimizations by adding special-case bytecodes?

2017-06-28 Thread Victor Stinner
(Victor wears his benchmark hat.)

2017-06-28 15:50 GMT+02:00 Ben Hoyt :
> ../test_none.sh
> x = 1234; x is None -- 2000 loops, best of 5: 19.8 nsec per loop
> x = 1234; x is not None -- 1000 loops, best of 5: 20 nsec per loop
> x = None; x is None -- 1000 loops, best of 5: 20.7 nsec per loop
> x = None; x is not None -- 1000 loops, best of 5: 20.8 nsec per loop
> avg 20.3 nsec per loop

Hum, please use perf timeit instead of timeit, it's more reliable. See also:

"How to get reproductible benchmark results"
http://perf.readthedocs.io/en/latest/run_benchmark.html#how-to-get-reproductible-benchmark-results

> [2] Benchmarks comparing master and is_none_bytecode patch (each
> compiled with --enable-optimizations) using python/performance:
>
> +-++--+
> | Benchmark   | master_opt | is_none_bytecode_opt |
> +=++==+
> | 2to3| 617 ms | 541 ms: 1.14x faster (-12%)  |
> +-++--+
> | chameleon   | 19.9 ms| 18.6 ms: 1.07x faster (-7%)  |
> +-++--+
> | crypto_pyaes| 208 ms | 201 ms: 1.04x faster (-3%)   |
> +-++--+
> | deltablue   | 13.8 ms| 12.9 ms: 1.07x faster (-7%)  |

FYI you can add the -G option to perf compare_to to sort results by
speed (group faster & slower). It gives a more readable table. I also
like using --min-speed=5 to ignore changes smaller than 5%, it reduces
the noise and makes the stable more readable.

Victor
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Micro-optimizations by adding special-case bytecodes?

2017-06-28 Thread Antoine Pitrou
On Wed, 28 Jun 2017 09:50:10 -0400
Ben Hoyt  wrote:
> I've now got a patch to add new COMPARE_IS_NONE and
> COMPARE_IS_NOT_NONE bytecodes to CPython to attempt to speed up the
> very common "is None" and "is not None" expressions. It's a very
> simple change overall. Here is my patch:
> https://gist.github.com/benhoyt/e5ba19afe7b869fd743c1c39fc2afdf8

I don't want to discourage you, but this is unlikely to produce
significant speedups on real-world code.  The simple reason is that
comparing to None is not a bottleneck for any real application --
comparing to None is probably damn fast compared to everything else
the real application does.

That said, it would be nice if you could get stable benchmark results
to validate that theory (or not!) ;-)

Regards

Antoine.

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Micro-optimizations by adding special-case bytecodes?

2017-06-28 Thread Ben Hoyt
Thanks for the pointers -- I wasn't aware of perf timeit. I'll try to
get some more stable benchmark results. -Ben

On Wed, Jun 28, 2017 at 10:20 AM, Victor Stinner
 wrote:
> (Victor wears his benchmark hat.)
>
> 2017-06-28 15:50 GMT+02:00 Ben Hoyt :
>> ../test_none.sh
>> x = 1234; x is None -- 2000 loops, best of 5: 19.8 nsec per loop
>> x = 1234; x is not None -- 1000 loops, best of 5: 20 nsec per loop
>> x = None; x is None -- 1000 loops, best of 5: 20.7 nsec per loop
>> x = None; x is not None -- 1000 loops, best of 5: 20.8 nsec per loop
>> avg 20.3 nsec per loop
>
> Hum, please use perf timeit instead of timeit, it's more reliable. See also:
>
> "How to get reproductible benchmark results"
> http://perf.readthedocs.io/en/latest/run_benchmark.html#how-to-get-reproductible-benchmark-results
>
>> [2] Benchmarks comparing master and is_none_bytecode patch (each
>> compiled with --enable-optimizations) using python/performance:
>>
>> +-++--+
>> | Benchmark   | master_opt | is_none_bytecode_opt |
>> +=++==+
>> | 2to3| 617 ms | 541 ms: 1.14x faster (-12%)  |
>> +-++--+
>> | chameleon   | 19.9 ms| 18.6 ms: 1.07x faster (-7%)  |
>> +-++--+
>> | crypto_pyaes| 208 ms | 201 ms: 1.04x faster (-3%)   |
>> +-++--+
>> | deltablue   | 13.8 ms| 12.9 ms: 1.07x faster (-7%)  |
>
> FYI you can add the -G option to perf compare_to to sort results by
> speed (group faster & slower). It gives a more readable table. I also
> like using --min-speed=5 to ignore changes smaller than 5%, it reduces
> the noise and makes the stable more readable.
>
> Victor
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Micro-optimizations by adding special-case bytecodes?

2017-06-28 Thread Ben Hoyt
> I don't want to discourage you, but this is unlikely to produce
> significant speedups on real-world code.  The simple reason is that
> comparing to None is not a bottleneck for any real application --
> comparing to None is probably damn fast compared to everything else
> the real application does.

That's true. However, I don't think that makes small optimizations
worthless. For example, the core devs speed up memory allocation or
method calls or whatever by a few percent, and consider it worthwhile,
because it all adds up.

> That said, it would be nice if you could get stable benchmark results
> to validate that theory (or not!) ;-)

Yep, I'm more than willing to accept that outcome if that's what the
results show! :-)

-Ben
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com