[issue32534] Speed-up list.insert

2018-01-11 Thread Jeethu Rao

New submission from Jeethu Rao :

I've noticed that replacing the for loop in the ins1 function in listobject.c 
with a memmove when the number of pointers to move is greater than 16 seems to 
speed up list.insert by about 3 to 4x on a contrived benchmark.

# Before
jeethu@dev:cpython  (master)$ ./python -m timeit -s "l = []" "for _ in 
range(100): l.insert(0, None)"
200 loops, best of 5: 3.07 msec per loop

# After
jeethu@dev:cpython  (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" 
"for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 744 usec per loop

Both builds were configured with --enable-optimizations and --with-lto

--
components: Interpreter Core
messages: 309806
nosy: jeethu
priority: normal
severity: normal
status: open
title: Speed-up list.insert
type: performance
versions: Python 3.7

___
Python tracker 
<https://bugs.python.org/issue32534>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32534] Speed-up list.insert

2018-01-11 Thread Jeethu Rao

Change by Jeethu Rao :


--
keywords: +patch
pull_requests: +5017
stage:  -> patch review




[issue32534] Speed-up list.insert

2018-01-11 Thread Jeethu Rao

Jeethu Rao  added the comment:

I tried it with a couple of different thresholds, twice each, ignoring the 
results of the first run. 16 seems to be the sweet spot.

THRESHOLD = 0
jeethu@dev:cpython  (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" 
"for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 787 usec per loop

THRESHOLD = 4
jeethu@dev:cpython  (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" 
"for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 781 usec per loop

THRESHOLD = 8
jeethu@dev:cpython  (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" 
"for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 780 usec per loop

THRESHOLD = 16
jeethu@dev:cpython  (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" 
"for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 758 usec per loop

THRESHOLD = 32
jeethu@dev:cpython  (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" 
"for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 764 usec per loop

--




[issue32534] Speed-up list.insert

2018-01-14 Thread Jeethu Rao

Jeethu Rao  added the comment:

I managed to tune an i7-7700K desktop running Ubuntu 17.10 per this doc[1], and 
ran the pyperformance benchmarks[2].
I also tried various thresholds with this benchmark, and 16 still seems to be 
the sweet spot.
The geometric mean of the relative changes across all benchmarks indicates a 
12.85% improvement.

[1]: http://pyperformance.readthedocs.io/usage.html#how-to-get-stable-benchmarks
[2]: https://gist.github.com/jeethu/4d9a8bd3eecbd067c00f085f7d2e7595
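The quoted 12.85% figure is a geometric mean of per-benchmark speedups. A minimal sketch of that computation (the input numbers here are made-up placeholders, not the actual benchmark results):

```python
from math import prod

def geometric_mean_speedup(baseline, patched):
    """Geometric mean of baseline/patched time ratios across benchmarks."""
    ratios = [b / p for b, p in zip(baseline, patched)]
    return prod(ratios) ** (1.0 / len(ratios))

# Hypothetical per-benchmark timings in ms (baseline vs. patched build).
baseline = [3.07, 10.0, 5.0]
patched = [2.70, 9.0, 4.5]

speedup = geometric_mean_speedup(baseline, patched)
improvement_pct = (speedup - 1.0) * 100.0
```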

--




[issue32534] Speed-up list.insert

2018-01-15 Thread Jeethu Rao

Jeethu Rao  added the comment:

I rebased my branch off of master and rebuilt it, and also rebuilt the baseline 
from master. Both versions were configured with --with-lto and 
--enable-optimizations. The benchmark numbers are rather different this 
time[1]: pidigits is slower, but nbody is still inexplicably faster. Should I 
run the benchmark without a PGO build (i.e. without --enable-optimizations)?

Also, I've attached the output from the pyperformance run to the aforementioned 
gist.

[1]: https://gist.github.com/jeethu/dc0811d415dd6d1a1621761e43842f88

--




[issue32534] Speed-up list.insert

2018-01-16 Thread Jeethu Rao

Jeethu Rao  added the comment:

Built and benchmarked both the baseline and the patch without PGO; the 
differences are less pronounced, but still present.

https://gist.github.com/jeethu/abd404e39c6dfcbabb4c01661b9238d1

--




[issue32534] Speed-up list.insert: use memmove()

2018-01-16 Thread Jeethu Rao

Jeethu Rao  added the comment:

Victor: I’m booting with the isolcpus and rcu_nocbs flags, and running 
pyperformance with the --affinity flag to pin the benchmark to the isolated CPU 
cores. I’ve also run `perf system tune`. And the OS is Ubuntu 17.10. Thanks for 
the tip on using perf timeit instead of timeit. I’ve run the benchmark that 
you've suggested with a minor change (to avoid the cost of LOAD_ATTR) and 
attached the output on a gist[1].
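For readers unfamiliar with this setup, a rough reconstruction of the steps described follows. The core IDs, and the current `pyperf`/`pyperformance` command spellings, are assumptions; adjust for your own machine.

```shell
# 1. Kernel boot parameters (e.g. in /etc/default/grub) to isolate CPUs
#    2-3 from the scheduler and from RCU callback processing:
#    GRUB_CMDLINE_LINUX="isolcpus=2,3 rcu_nocbs=2,3"
#    (then update grub and reboot)

# 2. Let pyperf tune the system for stable benchmarks (CPU frequency
#    governor, Turbo Boost, ASLR, etc.).  At the time of this thread the
#    module was named `perf`; it is `pyperf` today.
sudo python3 -m pyperf system tune

# 3. Pin the benchmark suite to the isolated cores:
python3 -m pyperformance run --affinity=2,3 -o results.json
```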

Antoine: Thanks for benchmarking it. After looking at the generated 
assembly[2], I found that ins1 is being inlined and the call to memmove was 
appearing before the loop (possibly because the compiler assumes that the call 
to memmove is more likely). I made a minor change and increased the threshold 
to 32. I’ve attached the generated assembly in a gist[3] (The relevant sequence 
is around line 8406, if you’re interested). And here’s the pyperformance 
comparison[4]. Could you please try benchmarking this version on your machine?

[1]: https://gist.github.com/jeethu/2d2de55afdb8ea4ad03b6a5d04d5227f
[2]: Generated with “gcc -DNDEBUG -fwrapv -O3 -std=c99 -I. -I./Include 
-DPy_BUILD_CORE -S -masm=intel Objects/listobject.c”
[3]: https://gist.github.com/jeethu/596bfc1251590bc51cc230046b52fb38
[4]: https://gist.github.com/jeethu/d6e4045f7932136548a806380eddd030

--




[issue32534] Speed-up list.insert: use memmove()

2018-01-16 Thread Jeethu Rao

Jeethu Rao  added the comment:

> Be careful. Moving the "l.insert" lookup out of the loop might make the code 
> slower. I never looked into why. But Python 3.7 was also optimized in many 
> places to call methods, so I'm not sure anymore :)

Thanks again! Here's a gist without the hack[1].

[1]: https://gist.github.com/jeethu/19430d802aa08e28d1cb5eb20a47a470

--




[issue28521] _PyEval_RequestCodeExtraIndex should return a globally valid index, not a ThreadState specific one

2018-01-16 Thread Jeethu Rao

Change by Jeethu Rao :


--
nosy: +jeethu

___
Python tracker 
<https://bugs.python.org/issue28521>
___



[issue30604] co_extra_freefuncs is stored thread locally and can lead to crashes

2018-01-16 Thread Jeethu Rao

Change by Jeethu Rao :


--
nosy: +jeethu

___
Python tracker 
<https://bugs.python.org/issue30604>
___



[issue32534] Speed-up list.insert: use memmove()

2018-01-17 Thread Jeethu Rao

Jeethu Rao  added the comment:

> FWIW, we've encountered a number of situations in the past when something 
> that improved the timings on one compiler would make timings worse on another 
> compiler.  There was also variance between timings on 32-bit builds versus 
> 64-bit builds.

I've verified that both clang and gcc generate similar assembly (memcpy is not 
inlined and the loop is not vectorized) in 32-bit mode. I'd wager that the 
improvement from vectorization (in memmove) would be even more pronounced on 
32-bit systems, given that pointers are half the size while cache lines are 
still 64 bytes wide.

> It's 1.08x faster (-7.8%). It's small for a microbenchmark, usually an 
> optimization should make a *microbenchmark* at least 10% faster.

That's true if we assume lists have 100 or fewer elements. On the other hand, 
in the pyperformance comparison I posted yesterday[1], there seems to be an 
average improvement of 1.27x on the first seven benchmarks, and the largest 
slowdown is only 1.03x. That said, the improvement can be no better than a 
constant factor, given the vectorized loop in memmove.

> Using memmove() for large copy is a good idea. The main question is the "if 
> (n <= INS1_MEMMOVE_THRESHOLD)" test. Is it slower if we always call memmove()?

The overhead of calling memmove() makes it slower for small lists; that's how 
I arrived at this patch in the first place. I initially tried replacing the 
loop with an unconditional memmove(), which was slower on pyperformance; 
switching to memmove() only above a threshold was faster.

> Previously, Python had a Py_MEMCPY() macro which also had such threshold. 
> Basically, it's a workaround for compiler performance issues:

That's very interesting! I think it boils down to the pointer aliasing problem. 
The pointers in memcpy()'s signature have the `restrict` qualifier, which gives 
the compiler more leeway to optimize calls to memcpy, while the compiler has to 
be more conservative with memmove(). I wonder if it's worth trying out a 
Py_MEMMOVE() :)


[1]: https://gist.github.com/jeethu/d6e4045f7932136548a806380eddd030

--




[issue32534] Speed-up list.insert: use memmove()

2018-01-17 Thread Jeethu Rao

Jeethu Rao  added the comment:

> > I still think those numbers are misleading or downright bogus.  There is no 
> > existing proof that list.insert() is a critical path in those benchmarks.

> Can someone check if these benchmarks really use list.insert() in hot code? 
> If yes, why? :-) The cost of list.insert() is known, no?

Sure! https://gist.github.com/jeethu/000a2d3ecd9033c0ef51331f062ac294

--




[issue32534] Speed-up list.insert: use memmove()

2018-01-17 Thread Jeethu Rao

Jeethu Rao  added the comment:

> What is 54640?

That's the pid of the process.

> I'm interested to know which benchmarks call list.insert() 40k times.

The django_template benchmark.

--




[issue32584] Uninitialized free_extra in code_dealloc

2018-01-17 Thread Jeethu Rao

New submission from Jeethu Rao :

In one of the patches I'm building (yet another attempt at caching 
LOAD_GLOBALS)[1], I'm using the private APIs from PEP 523 to store an array 
with every code object. I'm calling _PyEval_RequestCodeExtraIndex with 
PyMem_Free as the freefunc argument. While running the CPython test suite, I 
found that the test_embed case crashes with a segfault. The gdb backtrace[2] 
seems to indicate that PyInterpreterState::co_extra_freefuncs is 
uninitialized, when it should contain a pointer to the PyMem_Free function.

One way to work around this is to make the array a member of the PyCodeObject 
struct and use it directly, and I've verified that this works. Am I using the 
PEP 523 private API correctly? Also, on Linux this crashes consistently, while 
on OSX it occasionally doesn't, which makes me wonder if it's some kind of 
race condition involving sub-interpreters. The attached gist[2] has steps to 
reproduce.

[1]: 
https://github.com/python/cpython/compare/master...jeethu:py3.7_load_global_cache
[2]: https://gist.github.com/jeethu/6d92185ca97dd692e7fadcd105e0ef70

--
components: Interpreter Core
messages: 310191
nosy: jeethu
priority: normal
severity: normal
status: open
title: Uninitialized free_extra in code_dealloc
type: crash
versions: Python 3.7

___
Python tracker 
<https://bugs.python.org/issue32584>
___



[issue32534] Speed-up list.insert: use memmove()

2018-01-17 Thread Jeethu Rao

Jeethu Rao  added the comment:

It's also interesting that in 
https://gist.github.com/pitrou/29eb7592fa1eae2be390f3bfa3db0a3a :

| django_template | 307 ms| 312 ms   | 1.02x slower 
| Not significant|

It seems to be slower, while the benchmarks before it (2to3, chameleon, chaos, 
crypto_pyaes, deltablue), which we now know barely use list.insert, are 
slightly faster :)

¯\_(ツ)_/¯

--
