[Numpy-discussion] Re: Support for Multiple Interpreters (Subinterpreters) in numpy

2022-08-24 Thread Petr Viktorin

On 23. 08. 22 16:19, Sebastian Berg wrote:

On Tue, 2022-08-23 at 14:00 +0200, Petr Viktorin wrote:

On 23. 08. 22 11:46, Sebastian Berg wrote:

[snip]

One thing that I am not clear about is e.g. creation functions.  They
are public C-API, so they have no way of getting a "self" or type/module
passed in.  How will such a function get the module state?

Now, we could likely replace those functions in the long run (or even
just remove many).  But it seems to me that we may need a
`PyType_GetModuleByDef()` that is passed _only_ the `module_def`?


Then you're looking at per-interpreter state, or thread-locals.  That's
problematic, e.g. you now need to handle clean-up at interpreter
shutdown, and that isn't well supported. (Or leak -- AFAIK that's
what NumPy currently does when Python's single interpreter is
finalized?)
I do urge you to assume that there can be multiple isolated NumPy
modules created from a single def, even in a single interpreter. It's an
additional constraint, but since it's conceptually simple I do think it
makes up for itself in regularity/maintainability/reviewability.

And if the CPython API is lacking, it would be best to solve that in
CPython.



The issue is that we have public C-API that will be lacking the
necessary information. Possibly quite deep into the API (I am not certain).


Let's find out!


Now that I think about it, even things like the type are unclear to me.
`&PyArray_Type` would not be per-interpreter (unless we figure out
immortality).  But it exists as public API just like `Py_None`, etc.?


Exposing PyArray_Type that way means that it must be a static type. 
Those are immortal. (That said, static types are not compatible with 
Stable ABI -- which is related but not strictly necessary for 
subinterpreter support -- so if there's a chance to make it 
`my_numpy_api->PyArray_Type`, it would be better.)



Our public C-API is currently exported as a single static struct into
the library loading NumPy.  If types depend on the interpreter, it
would seem we need to redo the whole mechanism?


Right, sounds like it needs to be a dynamically allocated struct.
In the interim, one instance of the struct is static: that's the one 
used for anything that doesn't support multiple interpreters yet, and 
also as the module state in one “main” module object. (That would be the 
first module to be loaded, and until everything switches over, it'd get 
an unpaired incref to become “immortal” and leak at exit.)



Further, many of the functions would need to be adapted.  We might be
able to hack it so that the API looks the same [1].  However, it cannot be
ABI compatible, so we would need a whole new API table/export mechanism
and some sort of shim to allow compiling against older NumPy versions
but using it with all versions (otherwise we need 2+ years of
patience).


Having one static “main” module state in the interim would also help here.


Of course there might be a point in saying that most C-API use is
initially not subinterpreter ready, but it does seem like a pretty huge
limitation...


A huge limitation, but it might be a good way to break up the work to 
make it more manageable :)




Cheers,

Sebastian


[1] I.e. smuggle in module state without the library importing the
NumPy C-API having to change its code.


___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] help wanted to review Portuguese translation

2022-08-24 Thread Inessa Pawson
Angélica Cardozo took the initiative to translate into Portuguese the subtitles
for the video  “Find your way in the NumPy codebase” (
https://www.youtube.com/watch?v=mTWpBf1zewc) posted on the official NumPy
YouTube channel (https://www.youtube.com/c/NumPy_team).🎉 We are currently
looking for a contributor who could review the translation. If you are
available, please respond to this message in a thread.

-- 
Cheers,
Inessa

Inessa Pawson
Contributor Experience Lead | NumPy
https://numpy.org/
GitHub: inessapawson


[Numpy-discussion] Re: An extension of the .npy file format

2022-08-24 Thread Matti Picus
Sorry for the late reply. Adding a new "*.npy" format feature to allow 
writing to the file in chunks is nice but seems a bit limited. As I 
understand the proposal, reading the file back can only be done in the 
chunks that were originally written. I think other libraries like zarr or 
h5py have solved this problem in a more flexible way. Is there a reason 
you cannot use a third-party library to solve this? I would think if you 
have an array too large to write in one chunk you will need third-party 
support to process it anyway.


Matti



[Numpy-discussion] Re: help wanted to review Portuguese translation

2022-08-24 Thread Ricardo Prins
Hi Inessa,

I can help. I'm mostly sitting still at home these days, still recovering,
so I have plenty of time. Feel free to send me the link to the PR and I'll
be more than happy to help.

Regards,
Ricardo

On Wed, Aug 24, 2022 at 9:59 AM Inessa Pawson  wrote:

> Angélica Cardozo took the initiative to translate into Portuguese the
> subtitles for the video  “Find your way in the NumPy codebase” (
> https://www.youtube.com/watch?v=mTWpBf1zewc) posted on the official NumPy
> YouTube channel (https://www.youtube.com/c/NumPy_team).🎉 We are
> currently looking for a contributor who could review the translation. If
> you are available, please respond to this message in a thread.
>
> --
> Cheers,
> Inessa
>
> Inessa Pawson
> Contributor Experience Lead | NumPy
> https://numpy.org/
> GitHub: inessapawson


[Numpy-discussion] Re: Support for Multiple Interpreters (Subinterpreters) in numpy

2022-08-24 Thread Eric Snow
On Tue, Aug 23, 2022 at 3:47 AM Sebastian Berg
 wrote:
> What is the status of immortality?  None of these seem forbidding on
> first sight, so long as we can get the state everywhere.
> Having immortal objects seems convenient, but probably not particularly
> necessary.

The current proposal for immortal objects (PEP 683) will be going to
the steering council soon.  However, it only applies to the CPython
runtime (internally).  We don't have plans right now for a public API
to make an object immortal.  (That would be a separate proposal.)  If
isolating the extension, a la PEP 630, isn't feasible in the short
term, we would certainly be open to discussing alternatives
(incl. immortal objects).

> One other thing I am not quite sure about right now is GIL grabbing.
> `PyGILState_Ensure()` will continue to work reliably?
> This used to be one of my main worries.  It is also something we can
> fix-up (pass through additional information), but where a fallback
> seems needed.

Compatibility of the GIL state API with subinterpreters has been a long-standing
bug. [1]  That will be fixed.  Otherwise, PyGILState_Ensure() should
work correctly.

-eric


[1] https://github.com/python/cpython/issues/59956


[Numpy-discussion] Re: Support for Multiple Interpreters (Subinterpreters) in numpy

2022-08-24 Thread Eric Snow
On Tue, Aug 23, 2022 at 6:01 AM Petr Viktorin  wrote:
> And if the CPython API is lacking, it would be best to solve that in
> CPython.

+1

In some ways, new CPython APIs would be the most important artifacts of
this discussion.  We want to minimize the effort it takes to support
multiple interpreters, so we definitely want to know what we could
provide that would help.

> Per-interpreter GIL is an *additional* step. I believe it will need its
> own opt-in mechanism. But subinterpreter support is a prerequisite for it.

Yeah, that is an evolving point of discussion in PEP 684.

-eric


[Numpy-discussion] Re: Support for Multiple Interpreters (Subinterpreters) in numpy

2022-08-24 Thread Eric Snow
On Wed, Aug 24, 2022 at 4:42 AM Petr Viktorin  wrote:
> On 23. 08. 22 16:19, Sebastian Berg wrote:
> > Our public C-API is currently exported as a single static struct into
> > the library loading NumPy.  If types depend on the interpreter, it
> > would seem we need to redo the whole mechanism?
>
> Right, sounds like it needs to be a dynamically allocated struct.
> In the interim, one instance of the struct is static: that's the one
> used for anything that doesn't support multiple interpreters yet, and
> also as the module state in one “main” module object. (That would be the
> first module to be loaded, and until everything switches over, it'd get
> an unpaired incref to become “immortal” and leak at exit.)
>
> > Further, many of the functions would need to be adapted.  We might be
> > able to hack it so that the API looks the same [1].  However, it cannot be
> > ABI compatible, so we would need a whole new API table/export mechanism
> > and some sort of shim to allow compiling against older NumPy versions
> > but using it with all versions (otherwise we need 2+ years of
> > patience).
>
> Having one static “main” module state in the interim would also help here.
>
> > Of course there might be a point in saying that most C-API use is
> > initially not subinterpreter ready, but it does seem like a pretty huge
> > limitation...
>
> A huge limitation, but it might be a good way to break up the work to
> make it more manageable :)

FWIW, in CPython there's a similar issue.  We currently expose static
pointers to all the builtin exceptions in the C-API.  Even worse, we
expose the object *values* for all the static types and the several
singletons.  On top of that, these are all exposed in the limited API
(stable ABI).

As a result, moving to one each per interpreter is messy.  PEP 684
talks about the possible solutions.  The simplest for us is to make
all those objects immortal.  However, in some cases we also have to do
an interpreter-specific lookup internally.  I expect you would have to
do similar where/when compatibility remains essential.

-eric


[Numpy-discussion] Re: An extension of the .npy file format

2022-08-24 Thread Michael Siebert
Hi Matti, hi all,

@Matti: I don’t know exactly what you are referring to (the pull request or the 
GitHub project; links below). Maybe some clarification is needed, which I 
hereby attempt ;)

A .npy file created by some appending process is a regular .npy file and does 
not need to be read in chunks. Processing arrays larger than the system's 
memory can already be done with memory mapping (numpy.load(… mmap_mode=...)), 
so no third-party support is needed to do so.
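To illustrate the memory-mapping point, a minimal sketch (file path and array contents are made up for the example):

```python
import os
import tempfile

import numpy as np

# Save a regular .npy file, then open it memory-mapped: only the pages
# actually touched are read from disk, so opening is near-instant even
# for arrays much larger than RAM.
path = os.path.join(tempfile.mkdtemp(), "big.npy")
np.save(path, np.arange(1_000_000, dtype=np.int64))

m = np.load(path, mmap_mode="r")  # no full read into memory
value = int(m[123_456])           # touches only a single page
```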

The idea is not necessarily to only write some known-but-fragmented content to 
a .npy file in chunks or to only handle files larger than the RAM.

It is more about the ability to append to a .npy file at any time and between 
program runs. For example, in our case, we have a large database-like file 
containing all (preprocessed) images of all videos used to train a neural 
network. When new video data arrives, it can simply be appended to the existing 
.npy file. When training the neural net, the data is simply memory mapped, 
which happens basically instantly and does not use extra space between multiple 
training processes. We have tried out various fancy, advanced data formats for 
this task, but most of them don’t provide the memory-mapping feature, which is 
very handy for keeping the time required to test a code change comfortably low; 
rather, they have excessive parse/decompress times. Other libraries can also be 
difficult to handle, see below.

The .npy array format is designed to be limited. There is a NEP for it, which 
summarizes the .npy features and concepts very well:

https://numpy.org/neps/nep-0001-npy-format.html 


One of my favorite features (besides memory mapping perhaps) is this one:

“… Be reverse engineered. Datasets often live longer than the programs that 
created them. A competent developer should be able to create a solution in his 
preferred programming language to read most NPY files that he has been given 
without much documentation. ..."

This is a big disadvantage with all the fancy formats out there: they require 
dedicated libraries. Some of these libraries don’t come with nice and free 
documentation (especially lacking easy-to-use/easy-to-understand code examples 
for the target language, e.g. C) and/or can be extremely complex, like HDF5. 
Yes, HDF5 has its users and is totally valid if one operates the world’s 
largest particle accelerator, but we have spent weeks finding a C/C++ 
library for it which does not expose bugs and is somehow documented. We 
actually failed and posted a bug which was fixed a year later or so. This can 
ruin entire projects - fortunately not ours, but it ate up a lot of time we 
could have spent more meaningfully. On the other hand, I don’t see how e.g. zarr 
provides added value over .npy if one only needs the .npy features and maybe 
some append-data-along-one-axis feature. Yes, maybe there are some uses for two 
or three appendable axes, but I think having one axis to append to should cover 
a lot of use cases. This axis is typically time - video, audio, GPS, signal 
data in general, binary log data, "binary CSV" (lines in a file) all only need 
one axis to append to.

The .npy format is so simple, it can be read e.g. in C in a few lines. Or 
accessed easily through Numpy and ctypes by pointers for high speed custom 
logic - not even requiring libraries besides Numpy.
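As a sketch of how little is needed, here is a version-1.0 header parser using only the Python standard library (the function name `read_npy_header` is mine, not part of any library; it assumes a version-1.x file, which is what numpy.save writes for ordinary headers):

```python
import ast
import os
import struct
import tempfile

import numpy as np

def read_npy_header(path):
    """Parse the header of a version-1.x .npy file using only the stdlib."""
    with open(path, "rb") as f:
        assert f.read(6) == b"\x93NUMPY"           # magic string
        major, minor = f.read(2)                   # format version bytes
        (hlen,) = struct.unpack("<H", f.read(2))   # v1.x: little-endian uint16
        header = ast.literal_eval(f.read(hlen).decode("latin1"))
    return header["descr"], header["fortran_order"], header["shape"]

# Write a file with NumPy, then read the header back by hand.
path = os.path.join(tempfile.mkdtemp(), "demo.npy")
np.save(path, np.zeros((2, 3)))
descr, fortran, shape = read_npy_header(path)
```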

Making .npy appendable is easy to implement. Yes, appending along one axis is 
limited, as is the .npy format itself. But I consider that to be a feature 
rather than an (actual) limitation, as it allows for fast and simple appends.

The question is whether there is support in the NumPy community for an 
append-to-.npy-files-along-one-axis feature and, if so, what the details of the 
actual implementation should be. I made one suggestion in

https://github.com/numpy/numpy/pull/20321/ 


and I offer to invest time to update/modify/finalize the PR. I’ve also created 
a library that can already append to .npy:

https://github.com/xor2k/npy-append-array 


However, due to current limitations in the .npy format, the code is more 
complex than it needs to be (the library initializes and checks spare 
space in the header) and it has to rewrite the header every time. Both could be 
made unnecessary with a very small addition to the .npy file format. Data would 
stay contiguous (no fragmentation!); there would just need to be a way to 
indicate that the actual shape of the array should be derived from the file size.
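A hedged sketch of the header-rewrite approach described above (this is not the actual npy-append-array implementation; `append_npy` is a hypothetical helper, and it only works as long as the 64-byte-padded header keeps the same length):

```python
import os
import tempfile

import numpy as np
from numpy.lib import format as npy_format

def append_npy(path, arr):
    """Append along axis 0 by writing raw data at the end of the file
    and rewriting the header with the grown shape.  Only valid while
    the padded header length stays the same (hence the spare-space
    trick mentioned above)."""
    with open(path, "rb") as f:
        npy_format.read_magic(f)
        shape, fortran, dtype = npy_format.read_array_header_1_0(f)
    assert not fortran and arr.dtype == dtype and arr.shape[1:] == shape[1:]
    new_shape = (shape[0] + arr.shape[0],) + shape[1:]
    with open(path, "r+b") as f:
        f.seek(0, 2)               # append raw C-order data at the end
        f.write(arr.tobytes())
        f.seek(0)                  # rewrite magic + header in place
        npy_format.write_array_header_1_0(
            f, {"descr": npy_format.dtype_to_descr(dtype),
                "fortran_order": False, "shape": new_shape})

# Usage: grow a 1-D array across two writes, then load it as one array.
path = os.path.join(tempfile.mkdtemp(), "log.npy")
np.save(path, np.arange(3))
append_npy(path, np.arange(3, 6))
out = np.load(path)
```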

Best, Michael

> On 24. Aug 2022, at 19:16, Matti Picus  wrote:
> 
> Sorry for the late reply. Adding a new "*.npy" format feature to allow 
> writing to the file in chunks is nice but seems a bit limited. As I 
> understand the proposal, reading the file back can only be done in the chunks 
> that were originally written. I think other libraries like zarr or h5py have 
> solved this problem in a more flexible way.

[Numpy-discussion] Re: help wanted to review Portuguese translation

2022-08-24 Thread Inessa Pawson
Hi, Ricardo!
Wonderful! I’ll send you all the details via email.

-- 
Cheers,
Inessa

Inessa Pawson
Contributor Experience Lead | NumPy
https://numpy.org/
GitHub: inessapawson