On Tue, May 17, 2016 at 10:41 AM, Robert Kern <robert.k...@gmail.com> wrote: > On Tue, May 17, 2016 at 6:24 PM, Nathaniel Smith <n...@pobox.com> wrote: >> >> On May 17, 2016 1:50 AM, "Robert Kern" <robert.k...@gmail.com> wrote: >> > >> [...] >> > What you want is a function that returns many RandomState objects that >> > are hopefully spread around the MT19937 space enough that they are >> > essentially independent (in the absence of true jumpahead). The better >> > implementation of such a function would look something like this: >> > >> > def spread_out_prngs(n, root_prng=None): >> > if root_prng is None: >> > root_prng = np.random >> > elif not isinstance(root_prng, np.random.RandomState): >> > root_prng = np.random.RandomState(root_prng) >> > sprouted_prngs = [] >> > for i in range(n): >> > seed_array = root_prng.randint(1<<32, size=624) # >> > dtype=np.uint32 under 1.11 >> > sprouted_prngs.append(np.random.RandomState(seed_array)) >> > return spourted_prngs >> >> Maybe a nice way to encapsulate this in the RandomState interface would be >> a method RandomState.random_state() that generates and returns a new child >> RandomState. > > I disagree. This is a workaround in the absence of proper jumpahead or > guaranteed-independent streams. I would not encourage it. > >> > Internally, this generates seed arrays of about the size of the MT19937 >> > state so make sure that you can access more of the state space. That will >> > at >> > least make the chance of collision tiny. And it can be easily rewritten to >> > take advantage of one of the newer PRNGs that have true independent >> > streams: >> > >> > https://github.com/bashtage/ng-numpy-randomstate >> >> ... But unfortunately I'm not sure how to make my interface suggestion >> above work on top of one of these RNGs, because for RandomState.random_state >> you really want a tree of independent RNGs and the fancy new PRNGs only >> provide a single flat namespace :-/. And even more annoyingly, the tree API >> is actually a nicer API, because with a flat namespace you have to know up >> front about all possible RNGs your code will use, which is an unfortunate >> global coupling that makes it difficult to compose programs out of >> independent pieces, while the RandomState.random_state approach composes >> beautifully. Maybe there's some clever way to allocate a 64-bit namespace to >> make it look tree-like? I'm not sure 64 bits is really enough... > > MT19937 doesn't have a "tree" any more than the others. It's the same flat > state space. You are just getting the illusion of a tree by hoping that you > never collide. You ought to think about precisely the same global coupling > issues with MT19937 as you do with guaranteed-independent streams. > Hope-and-prayer isn't really a substitute for properly engineering your > problem. It's just a moral hazard to promote this method to the main API.
Nonsense. If your definition of "hope and prayer" includes assuming that we won't encounter a random collision in a 2**19937 state space, then literally all engineering is hope-and-prayer. A collision could happen, but if it does it's overwhelmingly more likely to happen because of a flaw in the mathematical analysis, or a bug in the implementation, or because random quantum fluctuations caused you and your program to suddenly be transported to a parallel world where 1 + 1 = 1, than that you just got unlucky with your random state. And all of these hazards apply equally to both MT19937 and more modern PRNGs. ...anyway, the real reason I'm a bit grumpy is because there are solid engineering reasons why users *want* this API, so whether or not it turns out to be possible I think we should at least be allowed to have a discussion about whether there's some way to give it to them. It's not even 100% out of the question that we conclude that existing PRNGs are buggy because they don't take this use case into account -- it would be far from the first time that numpy found itself going beyond the limits of older numerical tools that weren't designed to build the kind of large composable systems that numpy gets used for. MT19937's state space is large enough that you could explicitly encode a "tree seed" into it, even if you don't trust the laws of probability -- e.g., you start with a RandomState with id [], then its children have id [0], [1], [2], ..., and their children have ids [0, 0], [0, 1], ..., [1, 0], ..., and you write these into the state (probably after sending them through some bit-mixing permutation), to guarantee non-collision. Or if you do trust the laws of probability, then the randomly-sample-a-PRNG approach is not 100% out of the question even for more modern PRNGs. For example, let's take a PRNG with a 64-bit stream id and a 64-bit state space per stream. Suppose that we know that in our application each PRNG will be used to draw 2**48 64-bit samples (= 2 pebibytes of random data), and that we will use 2**32 PRNGs (= total of 8 yobibytes of random data). If we randomly initialize each PRNG with a 128-bit (stream id, state) pair, then the chance that two of our streams will overlap is about the chance of getting a collision in an 80-bit state space (128 - 48) on 2**32 draws, which can be approximated[1] as 1 - exp(-(2**32)**2 / (2 * 2**80)) = 1 - exp(-2**64 / 2 ** 81) = 1 - exp(-1/2**17) = 7.6-chances-in-a-million To put this number in some kind of context, Sandia says that on 128 pebibyte clusters (which I assume is the kind of system you use for a simulation that needs multiple yobibytes of random data!) they see a double-bit (!) DRAM fault ~every 4 minutes [2]. Also, by the time we've drawn 2**48 samples from a single 64-bit stream than we've already violated independence because in 2**48 draws you should see a repeated value with probability ~1.0 (by the same birthday approximation as above), but we see them with probability 0. Alternatively, for a PRNG with a 256 bit state space, or 128-bit stream id + 128-bit state space, then one can draw 2**64 64-bit samples without violating independence, *and* do this for, say, 2**70 randomly initialized streams, and the probability of collision would still be on the order of 10**-16. Unless I totally messed up these calculations, I'd conclude that almost everyone would be totally fine with a .random_state() method combined with any reasonable generator, and for some reasonable generators (those with more than, like, ~192 bits of state?) it's just unconditionally safe for all physically realistic problem sizes. -n [1] https://en.wikipedia.org/wiki/Birthday_problem#Approximations [2] http://www.fiala.me/pubs/papers/sc12-redmpi.pdf -- Nathaniel J. Smith -- https://vorpus.org _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion