[Numpy-discussion] SeedSequence.spawn()

2021-08-26 Thread Stig Korsnes
Hi,
Is there a way to uniquely spawn child seeds?
I'm doing Monte Carlo analysis, where I have n random processes, each with
their own generator.
All process models instantiate a generator with default_rng(), i.e.
ss = SeedSequence(); cs = ss.spawn(n), using cs[i] for process i. Now, the
problem I'm facing is that the results for an individual process depend on
the order of process initialization and on the number of processes used.
However, if I could spawn children with a unique identifier, I would be
able to reproduce my individual results without having to pickle/log
states. For example, all my models have an id (tuple) field which is
hashable.
If I had the ability to do SeedSequence(x).spawn([objects]), where the
objects support hash(object), I would have reproducibility for all my
processes. I could do without the spawning, but then I would probably lose
independence when I do multiprocessing? Is there a way to achieve my goal
in the current version 1.21 of numpy?

Best Stig
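(For reference, a minimal sketch of the order-dependence described above;
the seed value is arbitrary.)

from numpy.random import SeedSequence, default_rng

ss = SeedSequence(12345)
cs = ss.spawn(3)  # children are tied to spawn position, not process identity
rngs = [default_rng(c) for c in cs]
# If the processes are initialized in a different order, process i is handed
# a different child seed, and its results change.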


Re: [Numpy-discussion] SeedSequence.spawn()

2021-08-27 Thread Stig Korsnes
Thank you Robert!
This scheme fits perfectly into what I'm trying to accomplish! :) The
"smooshing" of ints by supplying a list of ints had eluded me. Thank you
also for the pointer about built-in hash(). I would not be able to rely on
it anyway, because it does not return the non-negative ints which
SeedSequence requires. If you have a minute to spare: could you briefly
explain "int(joblib.hash(key)
<https://joblib.readthedocs.io/en/latest/generated/joblib.hash.html>, 16)",
and would this always return non-negative integers?
Thanks again!

On Thu, 26 Aug 2021 at 22:59, Robert Kern wrote:

> I would probably not rely on `hash()` as it is only intended to be pretty
> good at getting distinct values from distinct inputs. If you can combine
> the tuple objects into a string of bytes in a reliable, collision-free way
> and use one of the cryptographic hashes to get them down to a 128-bit
> number, that'd be ideal. `int(joblib.hash(key)
> <https://joblib.readthedocs.io/en/latest/generated/joblib.hash.html>,
> 16)` should do nicely. You can combine that with your main process's seed
> easily. SeedSequence can take arbitrary amounts of integer data and smoosh
> them all together. The spawning functionality builds off of that, but you
> can also just manually pass in lists of integers.
>
> Let's call that function `stronghash()`. Let's call your main process seed
> number `seed` (this is the thing that the user can set on the command-line
> or something you get from `secrets.randbits(128)` if you need a fresh one).
> Let's call the unique tuple `key`. You can build the `SeedSequence` for
> each job according to the `key` like so:
>
> root_ss = SeedSequence(seed)
> for key, data in jobs:
>     child_ss = SeedSequence([stronghash(key), seed])
>     submit_job(key, data, seed=child_ss)
>
> Now each job will get its own unique stream regardless of the order the
> job is assigned. When the user reruns it with the same root `seed`, they
> will get the same results. When the user chooses a different `seed`, they
> will get another set of results (this is why you don't want to just use
> `SeedSequence(stronghash(key))` all by itself).
>
> I put the job-specific seed data ahead of the main program's seed to be on
> the super-safe side. The spawning mechanism will append integers to the
> end, so there's a super-tiny chance somewhere down a long line of
> `root_ss.spawn()`s that there would be a collision (and I mean
> super-extra-tiny). But best practices cost nothing.
>
> I hope that helps and is not too confusing!
>
> --
> Robert Kern
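(For reference: a runnable sketch of the scheme Robert describes, with a
hashlib-based stronghash() standing in for joblib.hash(); the key format is
an assumption.)

import hashlib
from numpy.random import SeedSequence, default_rng

def stronghash(key) -> int:
    # Serialize the key reproducibly (fine for tuples of ints and strings)
    # and reduce it to a non-negative 128-bit integer via MD5.
    return int.from_bytes(hashlib.md5(repr(key).encode("utf-8")).digest(), "big")

seed = 12345  # main program seed, e.g. from secrets.randbits(128)
child_ss = SeedSequence([stronghash((7, "pump")), seed])
rng = default_rng(child_ss)  # same stream for this key whenever seed is reused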


Re: [Numpy-discussion] SeedSequence.spawn()

2021-08-28 Thread Stig Korsnes
Thank you again Robert.
I am using a NamedTuple for my keys, which are also keys in a dictionary.
Each key will be unique (a tuple of a distinct int and an enum), so I am
thinking the risk of producing duplicate hashes may not be present, but I
could as always be wrong :)
For positive ints I followed this tip
https://stackoverflow.com/questions/18766535/positive-integer-from-python-hash-function
and did:

import ctypes

def stronghash(key: ComponentId) -> int:
    return ctypes.c_size_t(hash(key)).value  # reinterpret the signed hash as unsigned

Since I will be using each process/random sample several times, and keeping
all of them in memory at once is not feasible (dimensionality), I did the
following:

self._rng = default_rng(cs)
self._state = dict(self._rng.bit_generator.state)

def scenarios(self) -> npt.NDArray[np.float64]:
    self._rng.bit_generator.state = self._state
    return ...

Would you consider this bad practice, or an ok solution?
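(For concreteness, a minimal self-contained version of the snapshot/rewind
pattern above; the class name and the normal sampling are illustrative
assumptions.)

import numpy as np
from numpy.random import SeedSequence, default_rng

class Process:
    def __init__(self, key_seed: int, root_seed: int, n: int = 1_000_000):
        self._n = n
        self._rng = default_rng(SeedSequence([key_seed, root_seed]))
        self._state = self._rng.bit_generator.state  # snapshot the fresh state

    def scenarios(self) -> np.ndarray:
        # Rewind to the snapshot so every call replays identical draws.
        self._rng.bit_generator.state = self._state
        return self._rng.standard_normal(self._n)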


In Norway we have a saying which directly translates: "He asked for the
finger... and took the whole arm".

Best,
Stig


On Fri, 27 Aug 2021 at 17:01, Robert Kern wrote:

> joblib is a library that uses clever caching of function call results to
> make the development of certain kinds of data-heavy computational pipelines
> easier. In order to derive the key to be used to check the cache, joblib
> has to look at the arguments passed to the function, which may
> involve usually-nonhashable things like large numpy arrays.
>
>   https://joblib.readthedocs.io/en/latest/
>
> So they constructed joblib.hash() which basically takes the arguments,
> pickles them into a bytestring (with some implementation details), then
> computes an MD5 hash on that. It's probably overkill for your keys, but
> it's easily available and quite generic. It returns a hex-encoded string of
> the 128-bit MD5 hash. `int(..., 16)` will convert that to a non-negative
> (almost-certainly positive!) integer that can be fed into SeedSequence.
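(A hypothetical illustration of the int(..., 16) step; the digest value is
made up.)

hex_digest = "9a1158154dfa42caddbd0694a4e9bdc8"  # what joblib.hash(key) might return
seed_material = int(hex_digest, 16)  # a 128-bit, necessarily non-negative integer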

Re: [Numpy-discussion] SeedSequence.spawn()

2021-08-29 Thread Stig Korsnes
Thanks again Robert!
Got rid of dict(state).

Not sure I followed you completely on the test case. The "calculator" I am
writing will, for the specific use case, depend on ~200-1000 processes.
Each process object will return say 1m floats when its scenarios method is
called. If I am not mistaken, that would require 7-8 GiB just to keep these
in memory. Furthermore, I would possibly have to add the size of the
dependent calculations on top of these (but would likely aggregate outside
of testing). A given object that depends on processes will calculate its
results based on 1-4 of these processes (1-4 × 1m samples, non-multiproc),
and I will loop over objects with a process pool. So my reasoning is that
running memory consumption would then be (1-4) × the size of 1m floats ×
the number of processes + all the other overhead. Since sampling 1m normals
is pretty fast, I can happily live with sampling (vs lookup in a presampled
array), but since two objects might depend on the same process, they need
the exact same array of samples. Hence the state. If I understood you
correctly, another solution is to add a duplicate process with the same
seed, instead of using one where I "reset" state.

I promised that this could run on any laptop..



On Sun, 29 Aug 2021 at 02:42, Robert Kern wrote:

> On Sat, Aug 28, 2021 at 5:56 AM Stig Korsnes wrote:
>
>> Thank you again Robert.
>> I am using a NamedTuple for my keys, which are also keys in a dictionary.
>> Each key will be unique (a tuple of a distinct int and an enum), so I am
>> thinking the risk of producing duplicate hashes may not be present, but I
>> could as always be wrong :)
>>
>
> Present, but possibly ignorably small. 128-bit spaces give enough
> breathing room for me to be comfortable; 64-bit spaces like what hash()
> will use for its results make me just a little claustrophobic.
>
> If the structure of the keys is pretty fixed, just these two integers
> (counting the enum as an integer), then I might just use both in the
> seeding material.
>
> def get_key_seed(key: ComponentId, root_seed: int):
>     return np.random.SeedSequence([key.the_int, int(key.the_enum), root_seed])
>
>
>> For positive ints I followed this tip
>> https://stackoverflow.com/questions/18766535/positive-integer-from-python-hash-function
>> and did:
>>
>> def stronghash(key: ComponentId) -> int:
>>     return ctypes.c_size_t(hash(key)).value
>>
>
> np.uint64(possibly_negative_integer) will also work for this purpose
> (somewhat more reliably).
>
>> Since I will be using each process/random sample several times, and
>> keeping all of them in memory at once is not feasible (dimensionality), I
>> did the following:
>>
>> self._rng = default_rng(cs)
>> self._state = dict(self._rng.bit_generator.state)
>>
>> def scenarios(self) -> npt.NDArray[np.float64]:
>>     self._rng.bit_generator.state = self._state
>>     return ...
>>
>> Would you consider this bad practice, or an ok solution?
>>
>
> It's what that property is there for. No need to copy; `.state` creates a
> new dict each time.
>
> In a quick test, I measured a process with 1 million Generator instances
> to use ~1.5 GiB while 1 million state dicts ~1.0 GiB (including all of the
> other overhead of Python and numpy; not a scientific test). Storing just
> the BitGenerator is half-way in between. That's something, but not a huge
> win. If that is really crossing the border from feasible to infeasible, you
> may be about to run into your limits anyways for other reasons. So balance
> that out with the complications of swapping state in and out of a single
> instance.
>
>> In Norway we have a saying which directly translates: "He asked for the
>> finger... and took the whole arm".
>>
>
> Well, when I craft an overly-complicated system, I feel responsible to
> help shepherd people along in using it well. :-)
>
> --
> Robert Kern
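(A self-contained version of the get_key_seed suggestion above; the
ComponentId and Kind definitions are illustrative assumptions.)

from enum import IntEnum
from typing import NamedTuple
import numpy as np

class Kind(IntEnum):
    PUMP = 1
    VALVE = 2

class ComponentId(NamedTuple):
    the_int: int
    the_enum: Kind

def get_key_seed(key: ComponentId, root_seed: int) -> np.random.SeedSequence:
    # Both key fields plus the root seed go straight into SeedSequence;
    # no hashing is needed when the key structure is fixed.
    return np.random.SeedSequence([key.the_int, int(key.the_enum), root_seed])

rng = np.random.default_rng(get_key_seed(ComponentId(7, Kind.PUMP), 12345))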


Re: [Numpy-discussion] SeedSequence.spawn()

2021-08-29 Thread Stig Korsnes
And big kudos for building AND shepherding :)



Re: [Numpy-discussion] SeedSequence.spawn()

2021-08-29 Thread Stig Korsnes
I am indeed making ~200-1000 generator instances, as many as I have
processes. Each process is an instance of a component class, which has a
generator. Every time I ask this process for 1m numbers, I need the same 1m
numbers. I could instead make a new generator with the same seed every time
I ask for the 1m numbers, but I presumed that this would be more
computationally expensive than setting state on an existing generator.

Thank you, Robert.
Best Stig

On Sun, 29 Aug 2021 at 16:08, Robert Kern wrote:

> On Sun, Aug 29, 2021 at 6:58 AM Stig Korsnes wrote:
>
>> Thanks again Robert!
>> Got rid of dict(state).
>>
>> Not sure I followed you completely on the test case.
>>
>
> In the code that you showed, you were pulling out and storing the `.state`
> dict and then punching that back into a single `Generator` instance.
> Instead, you can just make the ~200-1000 `Generator` instances.
>
> --
> Robert Kern


Re: [Numpy-discussion] SeedSequence.spawn()

2021-08-29 Thread Stig Korsnes
Agreed, I already have a flag on the class to toggle fixed "state". I could
just set self._rng instead of its state. Will check it out.
Must say, I had not in my wildest dreams expected such help on any given
Sunday. Have a great day and week, sir.
Best,
Stig


On Sun, 29 Aug 2021 at 18:29, Robert Kern wrote:

> On Sun, Aug 29, 2021 at 10:55 AM Stig Korsnes wrote:
>
>> I am indeed making ~200-1000 generator instances, as many as I have
>> processes. Each process is an instance of a component class, which has a
>> generator. Every time I ask this process for 1m numbers, I need the same
>> 1m numbers. I could instead make a new generator with the same seed every
>> time I ask for the 1m numbers, but I presumed that this would be more
>> computationally expensive than setting state on an existing generator.
>>
>
> Nominally, but it's overwhelmed by the actual computation. You will have
> less to juggle if you just compute it from the key each time.
>
> --
> Robert Kern
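(A sketch of the recreate-per-call alternative Robert recommends; names are
illustrative.)

import numpy as np
from numpy.random import SeedSequence, default_rng

class Process:
    def __init__(self, key_seed: int, root_seed: int, n: int = 1_000_000):
        self._n = n
        self._ss = SeedSequence([key_seed, root_seed])  # store only the seed

    def scenarios(self) -> np.ndarray:
        # A fresh Generator from the same SeedSequence replays identical
        # draws; construction cost is negligible next to sampling 1m normals.
        return default_rng(self._ss).standard_normal(self._n)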