[issue45261] Unreliable (?) results from timeit (cache issue?)

2021-09-22 Thread Tim Holy


New submission from Tim Holy:

This is a lightly-edited reposting of 
https://stackoverflow.com/questions/69164027/unreliable-results-from-timeit-cache-issue

I am interested in timing certain operations in numpy and skimage, but I'm 
occasionally seeing surprising transitions (not entirely reproducible) in the 
reported times. Briefly, sometimes timeit returns results that differ by about 
5-fold from earlier runs. Here's the setup:

import skimage
import numpy as np
import timeit

nrep = 16

def run_complement(img):
    def inner():
        skimage.util.invert(img)
    return inner

img = np.random.randint(0, 65535, (512, 512, 3), dtype='uint16')

and here's an example session:

In [1]: %run python_timing_bug.py

In [2]: t = timeit.Timer(run_complement(img))

In [3]: t.repeat(nrep, number=1)
Out[3]: 
[0.0024439050030196086,
 0.0020311699918238446,
 0.00033007100864779204,
 0.0002889479947043583,
 0.0002851780009223148,
 0.0002851030003512278,
 0.00028487699455581605,
 0.00032116699730977416,
 0.00030912700458429754,
 0.0002877369988709688,
 0.0002840430097421631,
 0.00028515000303741544,
 0.00030791999597568065,
 0.00029302599432412535,
 0.00030723700183443725,
 0.0002916679950430989]

In [4]: t = timeit.Timer(run_complement(img))

In [5]: t.repeat(nrep, number=1)
Out[5]: 
[0.0006320849934127182,
 0.0004014919977635145,
 0.00030359599622897804,
 0.00029224599711596966,
 0.0002907510061049834,
 0.0002920039987657219,
 0.0002918920072261244,
 0.0003095199936069548,
 0.00029789700056426227,
 0.0002885590074583888,
 0.00040198900387622416,
 0.00037131100543774664,
 0.00040271600300911814,
 0.0003492849937174469,
 0.0003378120018169284,
 0.00029762100894004107]

In [6]: t = timeit.Timer(run_complement(img))

In [7]: t.repeat(nrep, number=1)
Out[7]: 
[0.00026428700948599726,
 0.00012682100350502878,
 7.380900206044316e-05,
 6.346100417431444e-05,
 6.29679998382926e-05,
 6.278700311668217e-05,
 6.320899410638958e-05,
 6.25409884378314e-05,
 6.262199894990772e-05,
 6.247499550227076e-05,
 6.293901242315769e-05,
 6.259800284169614e-05,
 6.285199197009206e-05,
 6.293600017670542e-05,
 6.309800664894283e-05,
 6.248900899663568e-05]

Notice that in the final run the minimum times are on the order of 0.6e-4 s, 
versus ~3e-4 s in the earlier runs: about 5x faster. It's not entirely 
predictable when this "kicks in."

The faster time corresponds to 0.08 ns/element of the array. Given that the 
2.6 GHz clock on my i7-8850H CPU ticks every ~0.4 ns, that seems to be pushing 
the limits of credibility (though thanks to SIMD on my AVX2 CPU, it cannot be 
entirely ruled out). My understanding is that this operation is implemented as 
a subtraction and most likely gets reduced to a bitwise-not by the compiler. So 
you do indeed expect it to be fast, but it's not entirely certain it should be 
this fast, and in any event the non-reproducibility is problematic.
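
For reference, here is the arithmetic behind that per-element figure (my own 
sanity check, not part of the original session):

# Back-of-the-envelope check of the per-element time, using values quoted above
t_min = 6.25e-5                      # fastest repeat, in seconds
n_elements = 512 * 512 * 3           # img.shape
per_element = t_min / n_elements     # ~7.9e-11 s, i.e. ~0.08 ns/element
clock_period = 1 / 2.6e9             # ~0.385 ns per tick at 2.6 GHz
print(per_element / clock_period)    # ~0.21 cycles per element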

It may be relevant to note that the total amount of data, in bytes, is

In [15]: img.size * 2
Out[15]: 1572864

and lshw reports that I have 384KiB L1 cache and 1536KiB of L2 cache:

In [16]: 384*1024
Out[16]: 393216

In [17]: 1536*1024
Out[17]: 1572864
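
On Linux the same numbers can also be read programmatically from sysfs; a 
minimal sketch (Linux-specific, and the index layout varies by CPU):

# Read per-level cache sizes from sysfs (Linux-only sketch)
from pathlib import Path

for idx in range(4):
    cache = Path(f"/sys/devices/system/cpu/cpu0/cache/index{idx}")
    if cache.exists():
        level = (cache / "level").read_text().strip()
        ctype = (cache / "type").read_text().strip()   # Data / Instruction / Unified
        size = (cache / "size").read_text().strip()
        print(f"L{level} ({ctype}): {size}")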

So it seems possible that this result is being influenced by just fitting in L2 
cache. (Note that even in the "fast block," the first run was not fast.) If I 
increase the size of the image:

 img = np.random.randint(0, 65535, (2048, 2048, 3), dtype='uint16')

then my results seem more reproducible in the sense that I have not yet seen 
one of these transitions.
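
If one wanted to probe the cache hypothesis directly, something like the 
following sketch would do it; the chosen sizes and the formatting are my own, 
not from the runs above:

# Sketch: per-element timing across image sizes inside and outside L2
import timeit
import numpy as np
import skimage.util

for side in (128, 256, 512, 1024, 2048):
    img = np.random.randint(0, 65535, (side, side, 3), dtype='uint16')
    t = timeit.Timer(lambda: skimage.util.invert(img))
    best = min(t.repeat(repeat=16, number=1))   # best of 16 single-call repeats
    print(f"{side:5d}px  {img.nbytes:>9d} bytes  {best / img.size * 1e9:6.3f} ns/element")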

--
components: Library (Lib)
messages: 402411
nosy: timholy
priority: normal
severity: normal
status: open
title: Unreliable (?) results from timeit (cache issue?)
type: behavior
versions: Python 3.8

[issue45261] Unreliable (?) results from timeit (cache issue?)

2021-09-22 Thread Tim Holy


Tim Holy added the comment:

To make sure it's clear: it's not 0.08 ns per function call, it's a loop, and 
it comes out to 0.08 ns per element. The purpose of quoting that number was to 
compare it to the CPU clock interval, which on my machine is ~0.4 ns. Any 
operation that takes less than one clock cycle per element is suspicious, but 
not automatically wrong, because of SIMD (if the compiler generates such 
instructions for this operation, though I'm not sure how one checks that from 
Python). On my AVX2 processor, as many as 16 `uint16` values fit in a register 
simultaneously, so you can't entirely rule out times well below one clock cycle 
per element (although the need for load, manipulation, store, and increment 
means that 1/16th of a clock cycle is not plausible).

Interestingly, increasing `number` does seem to make it consistent, without 
obvious transitions. I'm curious why the reported times are not "per number"; I 
find myself making comparisons using

list(map(lambda tm : tm / 1000, t.repeat(repeat=nrep, number=1000)))

Should the documentation mention that the timing of the core operation should 
be divided by `number`?
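
The same normalization reads a bit more directly as a comprehension 
(equivalent to the map/lambda above):

# Per-iteration times: divide each raw repeat by `number`
number = 1000
per_call = [tm / number for tm in t.repeat(repeat=nrep, number=number)]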

However, in the bigger picture I suspect this issue should be closed. I'll let 
others chime in first, in case they think the documentation or other things 
need to be changed.

[issue45261] Unreliable (?) results from timeit (cache issue?)

2021-09-22 Thread Tim Holy


Tim Holy added the comment:

> And I hope that Tim Holy is interested too :-)

Sure, I'll bite :-). On the topic of which statistic to show, I am a real fan 
of the histogram. As has been pointed out, timing in the real world can be 
pretty complicated, and I don't think it does anyone good to hide that 
complexity. Even in cases where machines aren't busy doing other things, you 
can get weird multimodal distributions. A great example (maybe not relevant to 
a lot of Python benchmarks...) is in multithreaded algorithms where the main 
thread is both responsible for scheduling other threads and for a portion of 
the work. Even in an almost-idle machine, you can get little race conditions 
where the main thread decides to pick up some work instants before another 
thread starts looking for more work. That can generate peaks in the histogram 
that are separated by the time for one "unit" of work.
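
For what it's worth, a crude text histogram needs only the standard library; a 
sketch, reusing the Timer `t` from my first message and an arbitrary 10 µs 
bucket width:

# Crude text histogram of raw timings (10 µs buckets)
from collections import Counter

times = t.repeat(repeat=200, number=1)
buckets = Counter(round(tm * 1e5) for tm in times)
for b in sorted(buckets):
    print(f"{b * 10:4d} µs | {'#' * buckets[b]}")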

But if you have to quote one and only one number, I'm a fan of the minimum (as 
long as you can trust it, which requires that you've accurately calibrated away 
all the overhead of your timing apparatus, and no more than that).
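
In timeit terms, that single number is just the following (reusing `t` and 
`nrep` from earlier):

# Single-number summary: best per-iteration time across repeats
number = 1000
best = min(t.repeat(repeat=nrep, number=number)) / number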

[issue45261] Unreliable (?) results from timeit (cache issue?)

2021-09-26 Thread Tim Holy


Change by Tim Holy:


--
resolution: not a bug -> 
