Re: [Numpy-discussion] SeedSequence.spawn()

2021-08-27 Thread Stig Korsnes
Thank you Robert!
This scheme fits perfectly into what I'm trying to accomplish! :) The
"smooshing" of ints by supplying a list of ints had eluded me. Thank you
also for the pointer about the built-in hash(). I would not be able to rely
on it anyway, because it can return negative ints, and SeedSequence requires
non-negative ones. If you have a minute to spare: could you briefly explain
"int(joblib.hash(key), 16)", and would this always return non-negative
integers?
Thanks again!

On Thu, Aug 26, 2021 at 22:59, Robert Kern wrote:

> On Thu, Aug 26, 2021 at 2:22 PM Stig Korsnes 
> wrote:
>
>> Hi,
>> Is there a way to uniquely spawn child seeds?
>> I'm doing Monte Carlo analysis, where I have n random processes, each
>> with its own generator.
>> All process models instantiate a generator with default_rng(), i.e.
>> ss = SeedSequence(); cs = ss.spawn(n), using cs[i] for process i. Now, the
>> problem I'm facing is that the results of an individual process depend on
>> the order of process initialization and on the number of processes used.
>> However, if I could spawn children with a unique identifier, I would be
>> able to reproduce my individual results without having to pickle/log
>> states. For example, all my models have an id (tuple) field which is
>> hashable.
>> If I had the ability to call SeedSequence(x).spawn([objects]), where the
>> objects support hash(object), I would have reproducibility for all my
>> processes. I could do without the spawning, but then I would probably lose
>> independence when I do multiprocessing? Is there a way to achieve my goal
>> in the current version 1.21 of numpy?
>>
>
> I would probably not rely on `hash()` as it is only intended to be pretty
> good at getting distinct values from distinct inputs. If you can combine
> the tuple objects into a string of bytes in a reliable, collision-free way
> and use one of the cryptographic hashes to get them down to a 128-bit
> number, that'd be ideal. `int(joblib.hash(key), 16)` should do nicely. You
> can combine that with your main process's seed
> easily. SeedSequence can take arbitrary amounts of integer data and smoosh
> them all together. The spawning functionality builds off of that, but you
> can also just manually pass in lists of integers.
>
> Let's call that function `stronghash()`. Let's call your main process seed
> number `seed` (this is the thing that the user can set on the command-line
> or something you get from `secrets.randbits(128)` if you need a fresh one).
> Let's call the unique tuple `key`. You can build the `SeedSequence` for
> each job according to the `key` like so:
>
>     root_ss = SeedSequence(seed)
>     for key, data in jobs:
>         child_ss = SeedSequence([stronghash(key), seed])
>         submit_job(key, data, seed=child_ss)
>
> Now each job will get its own unique stream regardless of the order the
> job is assigned. When the user reruns it with the same root `seed`, they
> will get the same results. When the user chooses a different `seed`, they
> will get another set of results (this is why you don't want to just use
> `SeedSequence(stronghash(key))` all by itself).
>
> I put the job-specific seed data ahead of the main program's seed to be on
> the super-safe side. The spawning mechanism will append integers to the
> end, so there's a super-tiny chance somewhere down a long line of
> `root_ss.spawn()`s that there would be a collision (and I mean
> super-extra-tiny). But best practices cost nothing.
>
> I hope that helps and is not too confusing!
>
> --
> Robert Kern
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
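
A minimal sketch of the per-key seeding scheme quoted above, assuming the keys
are tuples of ints and strings. The `stronghash()` here is only one possible
implementation (BLAKE2b with a 128-bit digest over `repr(key)`), not the
`joblib.hash()` call suggested in the thread; `rng_for`, the seed value and
the keys are hypothetical names and data chosen for illustration.

```python
import hashlib

import numpy as np
from numpy.random import SeedSequence, default_rng


def stronghash(key):
    """Map a key to a non-negative integer below 2**128.

    Assumption: repr(key) is a stable, unambiguous encoding of the key.
    That holds for tuples of ints and strs, not for arbitrary objects.
    """
    data = repr(key).encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=16).hexdigest()
    return int(digest, 16)


def rng_for(key, seed):
    # Job-specific data first, root seed last, as in the quoted scheme.
    return default_rng(SeedSequence([stronghash(key), seed]))


seed = 12345  # hypothetical root seed chosen by the user
keys = [("model", 1), ("model", 2), ("model", 3)]

# Draw from each stream with the jobs created in two different orders.
forward = {k: rng_for(k, seed).random(3) for k in keys}
backward = {k: rng_for(k, seed).random(3) for k in reversed(keys)}

# Each key reproduces the same numbers regardless of creation order.
same = all(np.array_equal(forward[k], backward[k]) for k in keys)
print(same)
```

Because each stream depends only on `(key, seed)`, adding or removing jobs
does not change the numbers drawn by the remaining ones.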


Re: [Numpy-discussion] SeedSequence.spawn()

2021-08-27 Thread Robert Kern
joblib is a library that uses clever caching of function call results to
make the development of certain kinds of data-heavy computational pipelines
easier. In order to derive the key to be used to check the cache, joblib
has to look at the arguments passed to the function, which may
involve usually-nonhashable things like large numpy arrays.

  https://joblib.readthedocs.io/en/latest/

So they constructed joblib.hash() which basically takes the arguments,
pickles them into a bytestring (with some implementation details), then
computes an MD5 hash on that. It's probably overkill for your keys, but
it's easily available and quite generic. It returns a hex-encoded string of
the 128-bit MD5 hash. `int(..., 16)` will convert that to a non-negative
(almost-certainly positive!) integer that can be fed into SeedSequence.

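
A rough illustration of the hex-to-integer step described above, using only
the standard library. The `hashlib.md5(pickle.dumps(key))` call is a stand-in
for `joblib.hash(key)`, which uses its own pickling details, so the digest
values will differ from joblib's; the key is a hypothetical example.

```python
import hashlib
import pickle

key = ("model", 7)  # hypothetical job identifier

# Stand-in for joblib.hash(key): pickle the object, then MD5 the bytes.
# The result is a 32-character hex string encoding 128 bits.
hex_digest = hashlib.md5(pickle.dumps(key)).hexdigest()

# Parsing a hex string without a sign always gives a non-negative int.
value = int(hex_digest, 16)

print(len(hex_digest), 0 <= value < 2**128)
```

The integer is zero only if every digit of the digest is zero, which is why
the result is non-negative and almost certainly positive.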

-- 
Robert Kern


[Numpy-discussion] Documentation Team meeting - Monday August 30

2021-08-27 Thread Melissa Mendonça
Hi all!

Our next Documentation Team meeting will be tomorrow - *Monday, August 30*
at ***4PM UTC***.

All are welcome - you don't need to already be a contributor to join. If
you have questions or are curious about what we're doing, we'll be happy to
meet you!

If you wish to join on Zoom, use this link:

https://zoom.us/j/96219574921?pwd=VTRNeGwwOUlrYVNYSENpVVBRRjlkZz09#success

Here's the permanent hackmd document with the meeting notes (still being
updated in the next few days!):

https://hackmd.io/oB_boakvRqKR-_2jRV-Qjg


Hope to see you around!

** You can click this link to get the correct time at your timezone:
https://www.timeanddate.com/worldclock/fixedtime.html?msg=NumPy+Documentation+Team+Meeting&iso=20210830T16&p1=1440&ah=1


*** You can add the NumPy community calendar to your google calendar by
clicking this link:
https://calendar.google.com/calendar/r?cid=YmVya2VsZXkuZWR1X2lla2dwaWdtMjMyamJobGRzZmIyYzJqODFjQGdyb3VwLmNhbGVuZGFyLmdvb2dsZS5jb20

- Melissa