On Tue, May 17, 2016 at 10:41 AM, Robert Kern <robert.k...@gmail.com> wrote:
> On Tue, May 17, 2016 at 6:24 PM, Nathaniel Smith <n...@pobox.com> wrote:
>>
>> On May 17, 2016 1:50 AM, "Robert Kern" <robert.k...@gmail.com> wrote:
>> >
>> [...]
>> > What you want is a function that returns many RandomState objects that
>> > are hopefully spread around the MT19937 space enough that they are
>> > essentially independent (in the absence of true jumpahead). The better
>> > implementation of such a function would look something like this:
>> >
>> > def spread_out_prngs(n, root_prng=None):
>> >     if root_prng is None:
>> >         root_prng = np.random
>> >     elif not isinstance(root_prng, np.random.RandomState):
>> >         root_prng = np.random.RandomState(root_prng)
>> >     sprouted_prngs = []
>> >     for i in range(n):
>> >         seed_array = root_prng.randint(1<<32, size=624)  #
>> > dtype=np.uint32 under 1.11
>> >         sprouted_prngs.append(np.random.RandomState(seed_array))
>> >     return spourted_prngs
>>
>> Maybe a nice way to encapsulate this in the RandomState interface would be
>> a method RandomState.random_state() that generates and returns a new child
>> RandomState.
>
> I disagree. This is a workaround in the absence of proper jumpahead or
> guaranteed-independent streams. I would not encourage it.
>
>> > Internally, this generates seed arrays of about the size of the MT19937
>> > state so make sure that you can access more of the state space. That will 
>> > at
>> > least make the chance of collision tiny. And it can be easily rewritten to
>> > take advantage of one of the newer PRNGs that have true independent 
>> > streams:
>> >
>> >   https://github.com/bashtage/ng-numpy-randomstate
>>
>> ... But unfortunately I'm not sure how to make my interface suggestion
>> above work on top of one of these RNGs, because for RandomState.random_state
>> you really want a tree of independent RNGs and the fancy new PRNGs only
>> provide a single flat namespace :-/. And even more annoyingly, the tree API
>> is actually a nicer API, because with a flat namespace you have to know up
>> front about all possible RNGs your code will use, which is an unfortunate
>> global coupling that makes it difficult to compose programs out of
>> independent pieces, while the RandomState.random_state approach composes
>> beautifully. Maybe there's some clever way to allocate a 64-bit namespace to
>> make it look tree-like? I'm not sure 64 bits is really enough...
>
> MT19937 doesn't have a "tree" any more than the others. It's the same flat
> state space. You are just getting the illusion of a tree by hoping that you
> never collide. You ought to think about precisely the same global coupling
> issues with MT19937 as you do with guaranteed-independent streams.
> Hope-and-prayer isn't really a substitute for properly engineering your
> problem. It's just a moral hazard to promote this method to the main API.

Nonsense.

If your definition of "hope and prayer" includes assuming that we
won't encounter a random collision in a 2**19937 state space, then
literally all engineering is hope-and-prayer. A collision could
happen, but if it does it's overwhelmingly more likely to happen
because of a flaw in the mathematical analysis, or a bug in the
implementation, or because random quantum fluctuations caused you and
your program to suddenly be transported to a parallel world where 1 +
1 = 1, than that you just got unlucky with your random state. And all
of these hazards apply equally to both MT19937 and more modern PRNGs.

...anyway, the real reason I'm a bit grumpy is because there are solid
engineering reasons why users *want* this API, so whether or not it
turns out to be possible I think we should at least be allowed to have
a discussion about whether there's some way to give it to them. It's
not even 100% out of the question that we conclude that existing PRNGs
are buggy because they don't take this use case into account -- it
would be far from the first time that numpy found itself going beyond
the limits of older numerical tools that weren't designed to build the
kind of large composable systems that numpy gets used for.

MT19937's state space is large enough that you could explicitly encode
a "tree seed" into it, even if you don't trust the laws of probability
-- e.g., you start with a RandomState with id [], then its children
have id [0], [1], [2], ..., and their children have ids [0, 0], [0,
1], ..., [1, 0], ..., and you write these into the state (probably
after sending them through some bit-mixing permutation), to guarantee
non-collision.

Or if you do trust the laws of probability, then the
randomly-sample-a-PRNG approach is not 100% out of the question even
for more modern PRNGs. For example, let's take a PRNG with a 64-bit
stream id and a 64-bit state space per stream. Suppose that we know
that in our application each PRNG will be used to draw 2**48 64-bit
samples (= 2 pebibytes of random data), and that we will use 2**32
PRNGs (= total of 8 yobibytes of random data). If we randomly
initialize each PRNG with a 128-bit (stream id, state) pair, then the
chance that two of our streams will overlap is about the chance of
getting a collision in an 80-bit state space (128 - 48) on 2**32
draws, which can be approximated[1] as

1 - exp(-(2**32)**2 / (2 * 2**80))
= 1 - exp(-2**64 / 2 ** 81)
= 1 - exp(-1/2**17)
= 7.6-chances-in-a-million

To put this number in some kind of context, Sandia says that on 128
pebibyte clusters (which I assume is the kind of system you use for a
simulation that needs multiple yobibytes of random data!) they see a
double-bit (!) DRAM fault ~every 4 minutes [2]. Also, by the time
we've drawn 2**48 samples from a single 64-bit stream than we've
already violated independence because in 2**48 draws you should see a
repeated value with probability ~1.0 (by the same birthday
approximation as above), but we see them with probability 0.

Alternatively, for a PRNG with a 256 bit state space, or 128-bit
stream id + 128-bit state space, then one can draw 2**64 64-bit
samples without violating independence, *and* do this for, say, 2**70
randomly initialized streams, and the probability of collision would
still be on the order of 10**-16.

Unless I totally messed up these calculations, I'd conclude that
almost everyone would be totally fine with a .random_state() method
combined with any reasonable generator, and for some reasonable
generators (those with more than, like, ~192 bits of state?) it's just
unconditionally safe for all physically realistic problem sizes.

-n

[1] https://en.wikipedia.org/wiki/Birthday_problem#Approximations
[2] http://www.fiala.me/pubs/papers/sc12-redmpi.pdf

-- 
Nathaniel J. Smith -- https://vorpus.org
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion

Reply via email to