Looking at the dask helper function again reminds me of an important caveat to this approach, which was pointed out to me by Clark Fitzgerald.
If you generate a moderately large number of random seeds in this fashion, you are quite likely to have collisions due to the Birthday Paradox. For example, you have a 50% chance of encountering at least one collision if you generate only 77,000 seeds:
https://en.wikipedia.org/wiki/Birthday_attack

The docstring for this function should document this limitation of the approach, which is still appropriate for a small number of seeds.

Our implementation can also encourage creating these seeds in a single vectorized call to random_seed, which can significantly reduce the likelihood of collisions between the seeds generated in that call, with something like the following:

    def random_seed(size):
        base = np.random.randint(2 ** 32)
        offset = np.arange(size)
        return (base + offset) % (2 ** 32)

In principle, I believe this could generate the full 2 ** 32 unique seeds without any collisions.

Cryptography experts, please speak up if I'm mistaken here.

On Mon, May 16, 2016 at 8:54 PM, Stephan Hoyer <sho...@gmail.com> wrote:

> I have recently encountered several use cases for randomly generating random
> number seeds:
>
> 1. When writing a library of stochastic functions that take a seed as an
> input argument, and some of these functions call multiple other such
> stochastic functions. Dask is one such example [1].
>
> 2. When a library needs to produce results that are reproducible after
> calling numpy.random.seed, but does not want to use the functions in
> numpy.random directly. This came up recently in a pandas pull request [2],
> because we want to allow using RandomState objects as an alternative to
> global state in numpy.random. A major advantage of this approach is that it
> provides an obvious alternative to reusing the private numpy.random._mtrand
> [3].
>
> The implementation of this function (and the corresponding method on
> RandomState) is almost trivial, and I've already written such a utility for
> my code:
>
>     def random_seed():
>         # numpy.random uses uint32 seeds
>         return np.random.randint(2 ** 32)
>
> The advantage of adding a new method is that it avoids the need for
> explanation by making the intent of code using this pattern obvious. So I
> think it is a good candidate for inclusion in numpy.random.
>
> Any opinions?
>
> [1] https://github.com/dask/dask/blob/e0b246221957c4bd618e57246f3a7ccc8863c494/dask/utils.py#L336
> [2] https://github.com/pydata/pandas/pull/13161
> [3] On a side note, if there's no longer a good reason to keep this object
> private, perhaps we should expose it in our public API. It would certainly
> be useful -- scikit-learn is already using it (see links in the pandas PR
> above).
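
P.S. For anyone who wants to check the numbers, here is a rough sketch of the birthday estimate and of the vectorized variant discussed above. The names collision_probability and random_seeds are hypothetical (neither exists in NumPy), the probability uses the standard exponential approximation rather than the exact product, and I am assuming a NumPy recent enough for randint to accept a dtype argument.

```python
import math

import numpy as np

SEED_SPACE = 2 ** 32  # numpy.random's legacy seeds are uint32


def collision_probability(n_seeds, space=SEED_SPACE):
    """Approximate chance of at least one collision among n_seeds uniform draws."""
    # Standard birthday approximation: 1 - exp(-n * (n - 1) / (2 * N)).
    return -math.expm1(-n_seeds * (n_seeds - 1) / (2 * space))


def random_seeds(size, rng=np.random):
    """Hypothetical vectorized helper: `size` distinct seeds from one random base."""
    # Work in uint64 so base + offset cannot overflow, even on platforms
    # where the default integer type is 32 bits wide.
    base = np.uint64(rng.randint(0, SEED_SPACE, dtype=np.uint64))
    offset = np.arange(size, dtype=np.uint64)
    # Wrap around the seed space; consecutive offsets stay distinct
    # as long as size <= 2 ** 32.
    return ((base + offset) % np.uint64(SEED_SPACE)).astype(np.uint32)
```

collision_probability(77000) comes out at roughly 0.5, in line with the figure quoted above. Note that random_seeds only rules out collisions within one call; seeds from separate calls can still overlap if their base values land close together.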
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion