Re: [Numpy-discussion] A bite of C++

2021-08-26 Thread Ralf Gommers
On Thu, Aug 26, 2021 at 12:51 AM Sebastian Berg 
wrote:

> On Wed, 2021-08-25 at 17:48 +0200, Serge Guelton wrote:
>
> > Potential follow-ups :
> >
> > - do we want to use -nostdlib, to be sure we don't bring any C++
> > runtime dep?
>
> What does this mean for compatibility?  It sounds reasonable to me for
> now if it increases systems we can run on, but I really don't know.
>

The only platform where we'd need to bundle a runtime is Windows, I
believe. Here's what we do for SciPy:
https://github.com/MacPython/scipy-wheels/blob/72cb8ab580ed5ca1b95eb60243fef4284ccc52b0/LICENSE_win32.txt#L125

That is indeed a bit of a pain and hard to test, so if we can get away with
not doing that by adding `-nostdlib`, that sounds great.
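
For illustration, a rough sketch of where such flags would go if one wrote a
plain setuptools extension by hand (NumPy's real build goes through
numpy.distutils, the module/file names here are made up, and `-nostdlib` is
only viable if the C++ code avoids every runtime feature, so treat this as a
sketch rather than a recipe):

from setuptools import Extension, setup

ext = Extension(
    "mypkg._loops",                      # hypothetical extension module
    sources=["mypkg/_loops.cpp"],
    language="c++",
    # Restrict C++ usage so no runtime support is needed ...
    extra_compile_args=["-fno-exceptions", "-fno-rtti"],
    # ... and then don't link against libstdc++/libc++ at all.
    extra_link_args=["-nostdlib"],
)

setup(name="mypkg", ext_modules=[ext])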

Cheers,
Ralf


[Numpy-discussion] SeedSequence.spawn()

2021-08-26 Thread Stig Korsnes
Hi,
Is there a way to uniquely spawn child seeds?
I'm doing Monte Carlo analysis, where I have n random processes, each with
their own generator.
All process models instantiate a generator with default_rng(), i.e.
ss = SeedSequence(); cs = ss.spawn(n), using cs[i] for process i. Now, the
problem I'm facing is that the results for an individual process depend on
the order of process initialization and on the number of processes used.
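For concreteness, a minimal sketch of what I'm doing now (names purely
illustrative):

from numpy.random import SeedSequence, default_rng

n_processes = 4                      # example value
ss = SeedSequence()                  # fresh entropy on every run
cs = ss.spawn(n_processes)           # one child seed per process model
rngs = [default_rng(c) for c in cs]  # process i uses rngs[i]

# Which stream a model receives depends only on its position in the
# spawn order, so reordering processes or changing their number
# changes every downstream result.
samples = [rng.standard_normal(1000) for rng in rngs]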
However, if I could spawn children with a unique identifier, I would be
able to reproduce my individual results without having to pickle/log
states. For example, all my models have an id (tuple) field which is
hashable.
If I had the ability to SeedSequence(x).spawn([objects]), where the objects
support hash(object), I would have reproducibility for all my processes. I
could do without the spawning, but then I would probably lose independence
when I do multiprocessing? Is there a way to achieve my goal in the current
version 1.21 of NumPy?

Best Stig


Re: [Numpy-discussion] A bite of C++

2021-08-26 Thread Serge Guelton
On Wed, Aug 25, 2021 at 05:50:49PM -0500, Sebastian Berg wrote:
> On Wed, 2021-08-25 at 17:48 +0200, Serge Guelton wrote:
> > Hi folks,
> > 
> > https://github.com/numpy/numpy/pull/19713 showcases what *could* be a
> > first step
> > toward getting rid of generated C code within numpy, in favor of some
> > C++ code,
> > coupled with a single macro trick.
> > 
> > Basically, templated code is an easy and robust way to replace
> > generated code
> > (the C++ compiler becomes the code generator when instantiating
> > code), and a
> > single X-macro takes care of the glue with the C world.

Hi Sebastian and thanks for the feedback.


> I am not a C++ expert, and really have to get used to this code.  So
> I would prefer it if some C++ experts could look at it and give feedback.

I don't know if I'm a C++ expert, but I've a decent background with that
language. I'll try to give as much clarification as I can.

> This will be a bit harder to read for me than our `.c.src` code for a
> while.  But on the up-side, I am frustrated by my IDE not being able to
> deal with the `c.src` templating.

> One reaction reading the X-macro trick is that I would be more
> comfortable with a positive list rather than block-listing.  It just
> felt a bit like too much magic and I am not sure how good it is to
> assume we usually want to export everything (for one, datetimes are
> pretty special).
> 
> Even if it is verbose, I would not mind if we just list everything, so
> long we have short-hands for all-integers, all-float, all-inexact, etc.

There have been similar comments on the PR; I've reverted to an explicit listing.

> > 
> > Some changes in distutils were needed to cope with C++-specific
> > flags, and
> > extensions that consist of mixed C and C++ code.
> 
> 
> 
> > Potential follow-ups :
> > 
> > - do we want to use -nostdlib, to be sure we don't bring any C++
> > runtime dep?
> 
> What does this mean for compatibility?  It sounds reasonable to me for
> now if it increases systems we can run on, but I really don't know.

It basically means fewer packaging issues, as one doesn't need to link against
the standard C++ library. It doesn't prevent using some of the headers, but it
does remove some aspects of the language. If NumPy wants to use C++ as a
preprocessor on steroids, that's fine. If NumPy wants to embrace more of C++,
then it's a bad idea (you lose, e.g., the `new` operator).

> > - what about -fno-exceptions, -fno-rtti?
> 
> How do C++ exceptions work at run-time?  What if I store a C++ function
> pointer that raises an exception and use it from a C program?  Does it
> add run-time overhead?  Do we need that `-fno-exceptions` to ensure that
> our functions really follow the C "calling convention" in this regard?!

Exceptions add runtime overhead and imply larger binaries. If an exception is
raised at the C++ level and not caught at the C++ level, it will unwind the
whole C stack and then call a default handler that terminates the program.

> Run-time calling convention changes worry me, because I am not sure C++
> exception have a place in the current or even future ABI.  All our
> current API use a `-1` return value for exceptions.
> 
> This is just like Python's type slots, so there must be "off the
> shelf" approaches for this?
> 
> Embracing C++ exceptions seems a bit steep to me right now, unless I am
> missing something awesome?

I totally second your opinion. In the spirit of C++ as a preprocessor
on steroids, I don't see why exceptions would be needed.

> I will note that a lot of the functions that we want to template like
> this, are – and should be – accessible as public API (i.e. you can ask
> NumPy to give you the function pointer).

As of now, I've kept the current C symbol names, which requires a thin forwarder
to the C++ implementation. I would be glad to remove those, but I think it's a
nice second step, something that could be done once the custom preprocessor has
been removed.


Re: [Numpy-discussion] SeedSequence.spawn()

2021-08-26 Thread Robert Kern
On Thu, Aug 26, 2021 at 2:22 PM Stig Korsnes  wrote:

> Hi,
> Is there a way to uniquely spawn child seeds?
> I'm doing Monte Carlo analysis, where I have n random processes, each with
> their own generator.
> All process models instantiate a generator with default_rng(), i.e.
> ss = SeedSequence(); cs = ss.spawn(n), using cs[i] for process i. Now, the
> problem I'm facing is that the results for an individual process depend on
> the order of process initialization and on the number of processes used.
> However, if I could spawn children with a unique identifier, I would be
> able to reproduce my individual results without having to pickle/log
> states. For example, all my models have an id (tuple) field which is
> hashable.
> If I had the ability to SeedSequence(x).spawn([objects]), where the objects
> support hash(object), I would have reproducibility for all my processes. I
> could do without the spawning, but then I would probably lose independence
> when I do multiprocessing? Is there a way to achieve my goal in the current
> version 1.21 of NumPy?
>

I would probably not rely on `hash()` as it is only intended to be pretty
good at getting distinct values from distinct inputs. If you can combine
the tuple objects into a string of bytes in a reliable, collision-free way
and use one of the cryptographic hashes to get them down to a 128-bit
number, that'd be ideal. `int(joblib.hash(key), 16)` should do nicely.
You can combine that with your main process's seed easily. SeedSequence
can take arbitrary amounts of integer data and smoosh them all together.
The spawning functionality builds off of that, but you can also just
manually pass in lists of integers.

Let's call that function `stronghash()`. Let's call your main process seed
number `seed` (this is the thing that the user can set on the command-line
or something you get from `secrets.randbits(128)` if you need a fresh one).
Let's call the unique tuple `key`. You can build the `SeedSequence` for
each job according to the `key` like so:

root_ss = SeedSequence(seed)
for key, data in jobs:
    child_ss = SeedSequence([stronghash(key), seed])
    submit_job(key, data, seed=child_ss)
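
For concreteness, here is one possible shape for that `stronghash()` helper.
This is only a sketch: it uses the standard-library hashlib instead of
joblib.hash and assumes the key tuples have a stable, unambiguous repr()
(tuples of ints/strings do):

import hashlib

def stronghash(key):
    # Serialize the key tuple reproducibly and crush it down to a
    # 128-bit integer with a cryptographic hash.
    digest = hashlib.blake2b(repr(key).encode("utf-8"), digest_size=16)
    return int.from_bytes(digest.digest(), "big")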

Now each job will get its own unique stream regardless of the order in which
jobs are assigned. When the user reruns it with the same root `seed`, they
will get the same results. When the user chooses a different `seed`, they will
get another set of results (this is why you don't want to just use
`SeedSequence(stronghash(key))` all by itself).
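
As a quick illustration of that property (again a sketch, reusing the
hypothetical `stronghash` above):

from numpy.random import SeedSequence, default_rng

seed = 123456789            # stand-in for the user-chosen root seed
key = ("pump", 7)           # stand-in for one model's id tuple

# The same (key, seed) pair reconstructs the same stream, no matter
# when or in what order the job runs.
rng_first = default_rng(SeedSequence([stronghash(key), seed]))
rng_again = default_rng(SeedSequence([stronghash(key), seed]))
assert (rng_first.standard_normal(3) == rng_again.standard_normal(3)).all()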

I put the job-specific seed data ahead of the main program's seed to be on
the super-safe side. The spawning mechanism will append integers to the
end, so there's a super-tiny chance somewhere down a long line of
`root_ss.spawn()`s that there would be a collision (and I mean
super-extra-tiny). But best practices cost nothing.

I hope that helps and is not too confusing!

-- 
Robert Kern