[Numpy-discussion] Re: NEP 55 - Add a UTF-8 Variable-Width String DType to NumPy

2023-08-31 Thread Stephan Hoyer
On Wed, Aug 30, 2023 at 4:25 AM Ralf Gommers  wrote:

>
>
> On Tue, Aug 29, 2023 at 4:08 PM Nathan  wrote:
>
>> The NEP was merged in draft form, see below.
>>
>> https://numpy.org/neps/nep-0055-string_dtype.html
>>
>
> This is a really nice NEP, thanks Nathan! I see that questions and
> constructive feedback is still coming in on GitHub, but for now it seems
> like everyone is pretty happy with moving forward with implementing this
> new dtype in NumPy.
>
> Cheers,
> Ralf
>

To echo Ralf's comments, thank you for this very well-written proposal! I
particularly appreciate the detailed consideration of how to handle
different models of missing values.

Overall, I am very excited about this work. A UTF-8 dtype in NumPy is long
overdue, and will bring significant benefits to the entire scientific
Python ecosystem.


[Numpy-discussion] Re: Adding NumpyUnpickler to Numpy 1.26 and future Numpy 2.0

2023-10-09 Thread Stephan Hoyer
On Mon, Oct 9, 2023 at 2:29 PM Nathan  wrote:

> However, one thing we can do now, for that one particular symbol that
> we know is going to be in every pickle file and probably never elsewhere,
> is intercept that one import and, instead of generating a generic warning
> about np.core being deprecated, make that specific version of the
> deprecation warning mention NumpyUnpickler. I'll make sure this gets
> done.
>
> We *could* just allow that import to happen without a warning, but then
> we're stuck keeping np.core around even longer and we also will still
> generate a deprecation warning for an import from np.core if the pickle
> file happens to include any other numpy types that might generate imports
> in np.core.
>

My preferred option would be to keep restoring old NumPy pickles working
indefinitely, and also to preserve backwards compatibility for pickles
written in newer versions of NumPy. We can still do the rest of the
numpy.core cleanup, but it's OK if we keep a bit of compatibility code in
NumPy indefinitely.

I don't think warnings would help much in this case, because if somebody is
currently distributing pickled numpy arrays despite all of our warnings not
to do so, they are unlikely to go back and update their old files.

We could keep around numpy.core.multiarray as a minimal stub for only this
purpose, or potentially only define the object
numpy.core.multiarray._reconstruct.
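
For concreteness, a minimal sketch of such a stub (this assumes the
implementation moves to numpy._core, per the NumPy 2.0 plan; it is an
illustration, not the actual shim):

# Hypothetical contents of a numpy/core/multiarray.py stub. Every
# ndarray pickle stores a global reference to this one function, so
# re-exporting it is enough to keep old pickles loadable:
from numpy._core.multiarray import _reconstruct  # noqa: F401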


[Numpy-discussion] Re: Fixing definition of reduceat for Numpy 2.0?

2023-12-22 Thread Stephan Hoyer
On Fri, Dec 22, 2023 at 12:34 PM Martin Ling  wrote:

> Hi folks,
>
> I don't follow numpy development in much detail these days but I see
> that there is a 2.0 release planned soon.
>
> Would this be an opportunity to change the behaviour of 'reduceat'?
>
> This issue has been open in some form since 2006!
> https://github.com/numpy/numpy/issues/834
>
> The current behaviour was originally inherited from Numeric, and makes
> reduceat often unusable in practice, even where it should be the
> perfect, concise, efficient solution. But it has been impossible to
> change it without breaking compatibility with existing code.
>
> As a result, horrible hacks are needed instead, e.g. my answer here:
> https://stackoverflow.com/questions/57694003
>
> Is this something that could finally be fixed in 2.0?


The reduceat API is certainly problematic, but I don't think fixing it is
really a NumPy 2.0 thing.

As discussed in that issue, the right way to fix that is to add a new API
with the correct behavior, and then we can think about deprecating (and
maybe eventually removing) the current reduceat method. If the new
reducebins() method were available, I would say removing reduceat() would
be appropriate to consider for NumPy 2, but we don't have the new method
with fixed behavior yet, which is the bigger blocker.
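
For readers unfamiliar with the quirk: reduceat substitutes x[i] for an
empty slice instead of the reduction identity. A rough sketch of the fixed
semantics, with the reducebins name taken from this thread rather than any
existing API:

import numpy as np

x = np.arange(10)
# Bin edges [0, 3), [3, 3), [3, 10): the middle bin is empty.
np.add.reduceat(x, [0, 3, 3])  # -> array([ 3,  3, 42]); the empty bin
                               # yields x[3] == 3 rather than 0 (gh-834).

def reducebins(x, edges, ufunc=np.add):
    # Hypothetical fixed semantics: n+1 edges define n bins, and an
    # empty bin yields the ufunc identity (0 for np.add).
    return np.array([ufunc.reduce(x[i:j])
                     for i, j in zip(edges[:-1], edges[1:])])

reducebins(x, [0, 3, 3, 10])  # -> array([ 3,  0, 42])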


>
>
> Martin


[Numpy-discussion] Re: NEP 56: array API standard support in the main numpy namespace

2024-01-16 Thread Stephan Hoyer
On Sun, Jan 7, 2024 at 8:08 AM Ralf Gommers  wrote:

> This NEP will supersede the following NEPs:
>
> - :ref:`NEP30` (never implemented)
> - :ref:`NEP31` (never implemented)
> - :ref:`NEP37` (never implemented; the ``__array_module__`` idea is
> basically
>   the same as ``__array_namespace__``)
> - :ref:`NEP47` (implemented with an experimental label in
> ``numpy.array_api``,
>   will be removed)
>

Thanks Ralf, Mateusz and Nathan for putting this together.

I just wanted to comment briefly to voice my strong support for this
proposal, and especially for marking these other NEPs as superseded. This
will go a long way towards clarifying NumPy's support for generic array
interfaces.


Re: [Numpy-discussion] __array_ufunc__ counting down to launch, T-24 hrs.

2017-03-31 Thread Stephan Hoyer
I agree with Nathaniel -- let's finish the design doc first.

On Thu, Mar 30, 2017 at 10:08 PM, Nathaniel Smith  wrote:

> On Thu, Mar 30, 2017 at 7:40 PM, Charles R Harris
>  wrote:
> > Hi All,
> >
> > Just a note that the __array_ufunc__ PR is ready to merge. If you are
> > interested, you can review here.
>
> I want to get this in too, but 24 hours seems like a very short
> deadline for getting feedback on such a large and complex change? I'm
> pretty sure the ndarray.__array_ufunc__ code that was just added a few
> hours ago is wrong (see comments on the diff)...
>
> My main comment, also relevant to the kind of high-level discussion we
> tend to use the mailing list for:
>   https://github.com/numpy/numpy/pull/8247#issuecomment-290616432
>
> -n
>
> --
> Nathaniel J. Smith -- https://vorpus.org


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Stephan Hoyer
Julian -- thanks for taking this on. NumPy's handling of strings on Python
3 certainly needs fixing.

On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald 
wrote:

> Variable-length encodings, of which UTF-8 is obviously the one that makes
> good handling essential, are indeed more complicated. But is it strictly
> necessary that string arrays hold fixed-length *strings*, or can the
> encoding length be fixed instead? That is, currently if you try to assign a
> longer string than will fit, the string is truncated to the number of
> characters in the data type. Instead, for encoded Unicode, the string could
> be truncated so that the encoding fits. Of course this is not completely
> trivial for variable-length encodings, but it should be doable, and it
> would allow UTF-8 to be used just the way it usually is - as an encoding
> that's almost 8-bit.
>

I agree with Anne here. Variable-length encoding would be great to have,
but even fixed length UTF-8 (in terms of memory usage, not characters)
would solve NumPy's Python 3 string problem. NumPy's memory model needs a
fixed size per array element, but that doesn't mean we need a fixed sized
per character. Each element in a UTF-8 array would be a string with a fixed
number of codepoints, not characters.

In fact, we already have this sort of distinction between element size and
memory usage: np.string_ uses null padding to store shorter strings in a
larger dtype.
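
For instance, the padding in action:

>>> import numpy as np
>>> a = np.array([b'abcde'], dtype='S5')  # five bytes per element
>>> a[0] = b'xy'                          # shorter values are NUL-padded
>>> a.tobytes()
b'xy\x00\x00\x00'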

The only reason I see for supporting encodings other than UTF-8 is for
memory-mapping arrays stored with those encodings, but that seems like a
lot of extra trouble for little gain.


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Stephan Hoyer
On Thu, Apr 20, 2017 at 10:43 AM, Chris Barker 
wrote:

> On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer  wrote:
>
>> I agree with Anne here. Variable-length encoding would be great to have,
>> but even fixed length UTF-8 (in terms of memory usage, not characters)
>> would solve NumPy's Python 3 string problem. NumPy's memory model needs a
>> fixed size per array element, but that doesn't mean we need a fixed sized
>> per character. Each element in a UTF-8 array would be a string with a fixed
>> number of codepoints, not characters.
>>
>
> Ah, yes -- the nightmare of Unicode!
>
> No, it would not be a fixed number of codepoints -- it would be a fixed
> number of bytes (or "code units"). and an unknown number of characters.
>

Apologies for confusing the terminology! Yes, this would mean a fixed
number of bytes and an unknown number of characters.


> As Julian pointed out, if you wanted to specify that a numpy element would
> be able to hold, say, N characters (actually code points, combining
> characters make this even more confusing) then you would need to allocate
> N*4 bytes to make sure you could hold any string that long. Which would be
> pretty pointless -- better to use UCS-4.
>

It's already unsafe to try to insert arbitrary length strings into a numpy
string_ or unicode_ array. When determining the dtype automatically (e.g.,
with np.array(list_of_strings)), the difference is that numpy would need to
check the maximum encoded length instead of the character length (i.e.,
len(x.encode()) instead of len(x)).
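
For example:

>>> x = "αβγ"           # three characters
>>> len(x)
3
>>> len(x.encode())     # six bytes when encoded as UTF-8
6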

I certainly would not over-allocate. If users want more space, they can
explicitly choose an appropriate size. (This is a hazard of not having
variable-length dtypes.)

If users really want to be able to fit an arbitrary number of unicode
characters and aren't concerned about memory usage, they can still use
np.unicode_ -- that won't be going away.


> So Anne's suggestion that numpy truncates as needed would make sense --
> you'd specify say N characters, numpy would arbitrarily (or user specified)
> over-allocate, maybe N*1.5 bytes, and you'd truncate if someone passed in a
> string that didn't fit. Then you'd need to make sure you truncated
> correctly, so as not to create an invalid string (that's just code, it
> could be made correct).
>

NumPy already does this sort of silent truncation with longer strings
inserted into shorter string dtypes. The difference here would indeed be the
need to check the number of bytes represented by the string instead of the
number of characters.

But I don't think this is useful behavior to bring over to a new dtype. We
should error instead of silently truncating. This is certainly easier than
trying to figure out when we would be splitting a character.
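
For example, naive byte truncation can split a multi-byte character and
leave invalid UTF-8 behind:

>>> "café".encode('utf-8')[:4]  # chop to 4 bytes, splitting the 2-byte 'é'
b'caf\xc3'
>>> b'caf\xc3'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 3:
unexpected end of data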


> But how much to over allocate? for english text, with an occasional
> scientific symbol, only a little. for, say, Japanese text, you'd need a
> factor 2 maybe?
>
> Anyway, the idea that "just use utf-8" solves your problems is really
> dangerous. It simply is not the right way to handle text if:
>
> you need fixed-length storage
> you care about compactness
>
>> In fact, we already have this sort of distinction between element size and
>> memory usage: np.string_ uses null padding to store shorter strings in a
>> larger dtype.
>>
>
> sure -- but it is clear to the user that the dtype can hold "up to this
> many" characters.
>

As Yu Feng points out in this GitHub comment, non-latin language speakers
are already aware of the difference between string length and bytes length:
https://github.com/numpy/numpy/pull/8942#issuecomment-294409192

Making an API based on code units instead of code points really seems like
the saner way to handle unicode strings. I agree with this section of the
DyND design docs for its string type, which notes precedent from Julia and
Go:
https://github.com/libdynd/libdynd/blob/master/devdocs/string-design.md#code-unit-api-not-code-point

> I think a 1-byte-per-char latin-* encoded string is a good idea though --
> scientific use tends to be latin only and space constrained.
>

I think scientific users tend to be ASCII only, so UTF-8 would also work
transparently :).


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Stephan Hoyer
On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern  wrote:

> I don't know of a format off-hand that works with numpy uniform-length
> strings and Unicode as well. HDF5 (to my recollection) supports arrays of
> NULL-terminated, uniform-length ASCII like FITS, but only variable-length
> UTF8 strings.
>

HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed and
variable length versions:
https://github.com/PyTables/PyTables/issues/499
https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html

"Fixed length UTF-8" for HDF5 refers to the number of bytes used for
storage, not the number of characters.


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Stephan Hoyer
On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern  wrote:

> On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer  wrote:
> >
> > On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern 
> wrote:
> >>
> >> I don't know of a format off-hand that works with numpy uniform-length
> strings and Unicode as well. HDF5 (to my recollection) supports arrays of
> NULL-terminated, uniform-length ASCII like FITS, but only variable-length
> UTF8 strings.
> >
> >
> > HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed
> and variable length versions:
> > https://github.com/PyTables/PyTables/issues/499
> > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html
> >
> > "Fixed length UTF-8" for HDF5 refers to the number of bytes used for
> storage, not the number of characters.
>
> Ah, okay, I was interpolating from a quick perusal of the h5py docs, which
> of course are also constrained by numpy's current set of dtypes. The
> NULL-terminated ASCII works well enough with np.string_'s semantics.
>

Yes, except that on Python 3, "Fixed length ASCII" in HDF5 should
correspond to a string type, not np.string_ (which is really bytes).


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-21 Thread Stephan Hoyer
On Fri, Apr 21, 2017 at 11:34 AM, Chris Barker 
wrote:

> 1) Use with/from  Python -- both creating and working with numpy arrays.
>

> In this case, we want something compatible with Python's string (i.e. full
> Unicode supporting) and I think should be as transparent as possible.
> Python's string has made the decision to present a character oriented API
> to users (despite what the manifesto says...).
>

Yes, but NumPy doesn't really implement string operations, so fortunately
this is pretty irrelevant to us -- except for our API for specifying dtype
size.

We already have a strong precedent for dtypes reflecting the number of bytes
used for storage even when Python doesn't: consider numeric types like
int64 and float32 compared to the Python equivalents. It's an intrinsic
aspect of NumPy that users need to think about how their data is actually
stored.


> However, there is a challenge here: numpy requires fixed-number-of-bytes
> dtypes. And full unicode support with fixed number of bytes matching fixed
> number of characters is only possible with UCS-4 -- hence the current
> implementation. And this is actually just fine! I know we all want to be
> efficient with data storage, but really -- in the early days of Unicode,
> when folks thought 16 bits were enough, doubling the memory usage for
> western language storage was considered fine -- how long in computer life
> time does it take to double your memory? But now, when memory, disk space,
> bandwidth, etc, are all literally orders of magnitude larger, we can't
> handle a factor of 4 increase in "wasted" space?
>

Storage cost is always going to be a concern. Arguably, it's even more of a
concern today than it used to be, because compute has been improving
faster than storage.


> But as scientific text data often is 1-byte compatible, a
> one-byte-per-char dtype is a fine idea, too -- and we pretty much have that
> already with the existing string type -- that could simply be enhanced by
> enforcing the encoding to be latin-9 (or latin-1, if you don't want the
> Euro symbol). This would get us what scientists expect from strings in a
> way that is properly compatible with Python's string type. You'd get
> encoding errors if you tried to stuff anything else in there, and that's
> that.
>

I still don't understand why a latin encoding makes sense as a preferred
one-byte-per-char dtype. The world, including Python 3, has standardized on
UTF-8, which is also one-byte-per-char for (ASCII) scientific data.
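
For ASCII data the encodings literally coincide, byte for byte:

>>> s = "numpy"
>>> s.encode('utf-8') == s.encode('latin-1') == s.encode('ascii')
True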

> So -- I think we should address the use-cases separately -- one for
> "normal" python use and simple interoperability with python strings, and
> one for interoperability at the binary level. And an easy way to convert
> between the two.
>
> For Python use -- a pointer to a Python string would be nice.
>

Yes, absolutely. If we want to be really fancy, we could consider a
parametric object dtype that allows for object arrays of *any* homogeneous
Python type. Even if NumPy itself doesn't do anything with that
information, there are lots of use cases for it.

> Then use a native flexible-encoding dtype for everything else.

No opposition here from me. Though again, I think utf-8 alone would also be
enough.


> Thinking out loud -- another option would be to set defaults for the
> multiple-encoding dtype so you'd get UCS-4 -- with its full compatibility
> with the python string type -- and make folks make an effort to get
> anything else.
>

The np.unicode_ type is already UCS-4 and the default for dtype=str on
Python 3. We probably shouldn't change that, but if we set any default
encoding for the new text type, I strongly believe it should be utf-8.

> One more note: if a user tries to assign a value to a numpy string array
> that doesn't fit, they should get an error:
>
> EncodingError if it can't be encoded into the defined encoding.
>
> ValueError if it is too long -- it should not be silently truncated.
>

I think we all agree here.


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Stephan Hoyer
On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker 
wrote:

> latin-1 or latin-9 buys you (over ASCII):
>
> ...
>
> - round-tripping of binary data (at least with Python's encoding/decoding)
> -- ANY string of bytes can be decodes as latin-1 and re-encoded to get the
> same bytes back. You may get garbage, but you won't get an EncodingError.
>

For a new application, it's a good thing if a text type breaks when you try
to stuff arbitrary bytes in it (see Python 2 vs Python 3 strings).

Certainly, I would argue that nobody should write data in latin-1 unless
they're doing so for the sake of a legacy application.

I do understand the value in having some "string" data type that could be
used by default by loaders for legacy file formats/applications (i.e.,
netCDF3) that support unspecified "one byte strings." Then you're a few
short calls away from viewing (i.e., array.view('text[my_real_encoding]'),
if we support arbitrary encodings) or decoding (i.e.,
np.char.decode(array.view(bytes), 'my_real_encoding') ) the data in the
proper encoding. It's not realistic to expect users to know the true
encoding for strings from a file before they even look at the data.

On the other hand, if this is the use-case, perhaps we really want an
encoding closer to "Python 2" string, i.e, "unknown", to let this be
signaled more explicitly. I would suggest that "text[unknown]" should
support operations like a string if it can be decoded as ASCII, and
otherwise error. But unlike "text[ascii]", it will let you store arbitrary
bytes.


>>> Then use a native flexible-encoding dtype for everything else.
>>>
>>
>> No opposition here from me. Though again, I think utf-8 alone would also
>> be enough.
>>
>
> maybe so -- the major reason for supporting others is binary data exchange
> with other libraries -- but maybe most of them have gone to utf-8 anyway.
>

Indeed, it would be helpful for this discussion to know what other
encodings are actually currently used by scientific applications.

So far, we have real use cases for at least UTF-8, UTF-32, ASCII and
"unknown".

> The current 'S' dtype truncates silently already:

One advantage of a new (non-default) dtype is that we can change this
behavior.


> Also -- if utf-8 is the default -- what do you get when you create an
> array from a python string sequence? Currently with the 'S' and 'U' dtypes,
> the dtype is set to the longest string passed in. Are we going to pad it a
> bit? stick with the exact number of bytes?
>

It might be better to avoid this for now, and force users to be explicit
about encoding if they use the dtype for encoded text. We can keep
bytes/str mapped to the current choices.


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Stephan Hoyer
On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker 
wrote:

>> On the other hand, if this is the use-case, perhaps we really want an
>> encoding closer to "Python 2" string, i.e, "unknown", to let this be
>> signaled more explicitly. I would suggest that "text[unknown]" should
>> support operations like a string if it can be decoded as ASCII, and
>> otherwise error. But unlike "text[ascii]", it will let you store arbitrary
>> bytes.
>>
>
> I _think_ that is what using latin-1 (or latin-9) gets you -- if it really
> is ascii, then it's perfect. If it really is latin-*, then you get some
> extra useful stuff, and if it's corrupted somehow, you still get the ascii
> text correct, and the rest won't barf and can be passed on through.
>

I am totally in agreement with Thomas that "We are living in a messy world
right now with messy legacy datasets that have character type data that are
*mostly* ASCII, but not infrequently contain non-ASCII characters."

My question: What are those non-ASCII characters? How often are they truly
latin-1/9 vs. some other text encoding vs. non-string binary data?

I don't think that silently (mis)interpreting non-ASCII characters as
latin-1/9 is a good idea, which is why I think it would be a mistake to use
'latin-1' for text data with unknown encoding.

I could get behind a data type that compares equal to strings for ASCII
only and allows for *storing* other characters, but making blind
assumptions about characters 128-255 seems like a recipe for disaster.
Imagine text[unknown] as a one character string type, but it supports
.decode() like bytes and every character in the range 128-255 compares for
equality with other characters like NaN -- not even equal to itself.


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Stephan Hoyer
On Mon, Apr 24, 2017 at 4:08 PM, Robert Kern  wrote:

> Let me make a counter-proposal for your latin-1 dtype (your #2) that might
> address your, Thomas's, and Julian's use cases:
>
> 2) We want a single-byte-per-character, NULL-terminated string dtype that
> can be used to represent mostly-ASCII textish data that may have some
> high-bit characters from some 8-bit encoding. It should be able to read
> arbitrary bytes (that is, up to the NULL-termination) and write them back
> out as the same bytes if unmodified. This lets us read this text from files
> where the encoding is unspecified (or is lying about the encoding) into
> `unicode/str` objects. The encoding is specified as `ascii` but the
> decoding/encoding is done with the `surrogateescape` option so that
> high-bit characters are faithfully represented in the `unicode/str` string
> but are not erroneously reinterpreted as other characters from an arbitrary
> encoding.
>
> I'd even be happy if Julian or someone wants to go ahead and implement
> this right now and leave the UTF-8 dtype for a later time.
>
> As long as this ASCII-surrogateescape dtype is not called np.realstring
> (it's *really* important to me that the bikeshed not be this color). ;-)
>

This sounds quite similar to my text[unknown] proposal, with the advantage
that the concept of "surrogateescape" already exists. Surrogate-escape
characters compare equal to themselves, which is maybe less than ideal, but
it looks like you can put them in real unicode strings, which is nice.
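
For reference, the round-trip in action:

>>> raw = b'caf\xe9'  # a high-bit byte that is not valid ASCII
>>> s = raw.decode('ascii', errors='surrogateescape')
>>> s
'caf\udce9'
>>> s.encode('ascii', errors='surrogateescape') == raw
True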


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Stephan Hoyer
On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith  wrote:

> But also, is it important whether strings we're loading/saving to an
> HDF5 file have the same in-memory representation in numpy as they
> would in the file? I *know* [1] no-one is reading HDF5 files using
> np.memmap :-).


Of course they do :)
https://github.com/jjhelmus/pyfive/blob/98d26aaddd6a7d83cfb189c113e172cc1b60d5f8/pyfive/low_level.py#L682


> Also, further searching suggests that HDF5 actually supports all of
> nul termination, nul padding, and space padding, and that nul
> termination is the default? How much does it help to have in-memory
> compatibility with just one of these options (and not even the default
> one)? Would we need to add the other options to be really useful for
> HDF5?


h5py actually ignores this option and only uses null termination. I have
not heard any complaints about this (though I have heard complaints about
the lack of fixed-length UTF-8).

But more generally, you're right. h5py doesn't need a corresponding NumPy
dtype for each HDF5 string dtype, though that would certainly be
*convenient*. In fact, it already (ab)uses NumPy's dtype metadata with
h5py.special_dtype to indicate a homogeneous string type for object arrays.

I would guess h5py users have the same needs for efficient string
representations (including surrogate-escape options) as other scientific
users.


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Stephan Hoyer
On Tue, Apr 25, 2017 at 9:21 PM Robert Kern  wrote:

> On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris <
> charlesr.har...@gmail.com> wrote:
>
> > The maximum length of a UTF-8 character is 4 bytes, so we could use
> that to size arrays by character length. The advantage over UTF-32 is that
> it is easily compressible, probably by a factor of 4 in many cases. That
> doesn't solve the in memory problem, but does have some advantages on disk
> as well as making for easy display. We could compress it ourselves after
> encoding by truncation.
>
> The major use case that we have for a UTF-8 array is HDF5, and it
> specifies the width in bytes, not Unicode characters.
>

It's not just HDF5. Counting bytes is the Right Way to measure the size of
UTF-8 encoded text:
http://utf8everywhere.org/#myths

I also firmly believe (though clearly this is not universally agreed upon)
that UTF-8 is the Right Way to encode strings for *non-legacy*
applications. So if we're adding any new string encodings, it needs to be
one of them.


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Stephan Hoyer
On Wed, Apr 26, 2017 at 3:27 PM, Chris Barker  wrote:

> When a numpy user wants to put a string into a numpy array, they should
> know how long a string they can fit -- with "length" defined how python
> strings define it.
>

Sorry, I remain unconvinced (for the reasons that Robert, Nathaniel and
myself have already given), but we seem to be talking past each other here.

I am still -1 on any new string encoding support unless that includes at
least UTF-8, with length indicated by the number of bytes.


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Stephan Hoyer
On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith  wrote:

> It's worthwhile enough that both major HDF5 bindings don't support Unicode
> arrays, despite user requests for years. The sticking point seems to be the
> difference between HDF5's view of a Unicode string array (defined in size
> by the bytes of UTF-8 data) and numpy's current view of a Unicode string
> array (because of UCS-4, defined by the number of
> characters/codepoints/whatever). So there are HDF5 files out there that
> none of our HDF5 bindings can read, and it is impossible to write certain
> data efficiently.
>
>
> I would really like to hear more from the authors of these libraries about
> what exactly it is they feel they're missing. Is it that they want numpy to
> enforce the length limit early, to catch errors when the array is modified
> instead of when they go to write it to the file? Is it that they really
> want an O(1) way to look at a array and know the maximum number of bytes
> needed to represent it in utf-8? Is it that utf8<->utf-32 conversion is
> really annoying and files that need it are rare so they haven't had the
> motivation to implement it? My impression is similar to Julian's: you
> *could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few
> dozen lines of code, which is nothing compared to all the other hoops these
> libraries are already jumping through, so if this is really the roadblock
> then I must be missing something.
>

I actually agree with you. I think it's mostly a matter of convenience that
h5py matched up HDF5 dtypes with numpy dtypes:
fixed width ASCII -> np.string_/bytes
variable length ASCII -> object arrays of np.string_/bytes
variable length UTF-8 -> object arrays of unicode

This was tenable in a Python 2 world, but on Python 3 it's broken and
there's not an easy fix.

We absolutely could fix h5py by mapping everything to object arrays of
Python unicode strings, as has been discussed (
https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this would
be a fine but non-ideal solution, since there is currently no fixed width
UTF-8 support.

For fixed width ASCII arrays, this would mean increased convenience for
Python 3 users, at the price of decreased convenience for Python 2 users
(arrays now contain boxed Python objects), unless we made the h5py behavior
dependent on the version of Python. Hence, we're back here, waiting for
better dtypes for encoded strings.

So for HDF5, I see good use cases for ASCII-with-surrogateescape (for
handling ASCII arrays as strings) and UTF-8 with length equal to the number
of bytes.


Re: [Numpy-discussion] Proposal: np.search() to complement np.searchsorted()

2017-05-09 Thread Stephan Hoyer
On Tue, May 9, 2017 at 9:46 AM, Martin Spacek  wrote:

> Looking at my own habits and uses, it seems to me that finding the indices
> of matching values of one array in another is a more common use case than
> finding insertion indices of one array into another sorted array. So, I
> propose that np.search(), or something like it, could be even more useful
> than np.searchsorted().
>

The current version of this PR only returns the indices of the *first*
match (rather than all matches), which is an important detail. I would
strongly consider including that detail in the name (e.g., by calling this
"find_first" rather than "search"), because my naive expectation for a
method called "search" is to find all matches.

In any case, I agree that this functionality would be welcome. Getting the
details right for a high performance solution is tricky, and there is
strong evidence of interest given the 200+ upvotes on this StackOverflow
question:
http://stackoverflow.com/questions/432112/is-there-a-numpy-function-to-return-the-first-index-of-something-in-an-array


Re: [Numpy-discussion] NumPy v1.13.0rc1 released.

2017-05-11 Thread Stephan Hoyer
Also, as a friendly reminder, GitHub is a better place for bug reports than
mailing lists with hundreds of subscribers :).

On Thu, May 11, 2017 at 6:56 AM, Eric Wieser 
wrote:

> Nadav: Can you provide a testcase that fails?
>
> I don't think you're correct - it works just fine when `axis = a.ndim` -
> the issue arises when `axis > a.ndim`, but I'd argue that in that case an
> error is correct behaviour. But still a change, so perhaps needs a release
> note entry
>
> On Thu, 11 May 2017 at 14:25 Nadav Horesh  wrote:
>
>> There is a change to the "expand_dims" function: it now does not
>> allow axis = a.ndim.
>>
>> This influences matplotlib function get_bending_matrices in
>> triinterpolate.py
>>
>>
>>   Nadav
>> --
>> *From:* NumPy-Discussion on behalf of Charles R Harris
>> *Sent:* 11 May 2017 04:48:34
>> *To:* numpy-discussion; SciPy-User; SciPy Developers List;
>> python-announce-l...@python.org
>> *Subject:* [Numpy-discussion] NumPy v1.13.0rc1 released.
>>
>> Hi All,
>>
>> I'm pleased to announce the NumPy 1.13.0rc1 release. This release supports
>> Python 2.7 and 3.4-3.6 and contains many new features. It is one of the
>> most ambitious releases in the last several years. Some of the highlights
>> and new functions are
>>
>> *Highlights*
>>
>>- Operations like ``a + b + c`` will reuse temporaries on some
>>platforms, resulting in less memory use and faster execution.
>>- Inplace operations check if inputs overlap outputs and create
>>temporaries to avoid problems.
>>- New __array_ufunc__ attribute provides improved ability for classes
>>to override default ufunc behavior.
>>-  New np.block function for creating blocked arrays.
>>
>>
>> *New functions*
>>
>>- New ``np.positive`` ufunc.
>>- New ``np.divmod`` ufunc provides more efficient divmod.
>>- New ``np.isnat`` ufunc tests for NaT special values.
>>- New ``np.heaviside`` ufunc computes the Heaviside function.
>>- New ``np.isin`` function, improves on ``in1d``.
>>- New ``np.block`` function for creating blocked arrays.
>>- New ``PyArray_MapIterArrayCopyIfOverlap`` added to NumPy C-API.
>>
>> Wheels for the pre-release are available on PyPI. Source tarballs,
>> zipfiles, release notes, and the Changelog are available on github
>> <https://github.com/numpy/numpy/releases/tag/v1.13.0rc1>.
>>
>> A total of 100 people contributed to this release.  People with a "+" by
>> their
>> names contributed a patch for the first time.
>>
>>- A. Jesse Jiryu Davis +
>>- Alessandro Pietro Bardelli +
>>- Alex Rothberg +
>>- Alexander Shadchin
>>- Allan Haldane
>>- Andres Guzman-Ballen +
>>- Antoine Pitrou
>>- Antony Lee
>>- B R S Recht +
>>- Baurzhan Muftakhidinov +
>>- Ben Rowland
>>- Benda Xu +
>>- Blake Griffith
>>- Bradley Wogsland +
>>- Brandon Carter +
>>- CJ Carey
>>- Charles Harris
>>- Danny Hermes +
>>- Duke Vijitbenjaronk +
>>- Egor Klenin +
>>- Elliott Forney +
>>- Elliott M Forney +
>>- Endolith
>>- Eric Wieser
>>- Erik M. Bray
>>- Eugene +
>>- Evan Limanto +
>>- Felix Berkenkamp +
>>- François Bissey +
>>- Frederic Bastien
>>- Greg Young
>>- Gregory R. Lee
>>- Importance of Being Ernest +
>>- Jaime Fernandez
>>- Jakub Wilk +
>>- James Cowgill +
>>- James Sanders
>>- Jean Utke +
>>- Jesse Thoren +
>>- Jim Crist +
>>- Joerg Behrmann +
>>- John Kirkham
>>- Jonathan Helmus
>>- Jonathan L Long
>>- Jonathan Tammo Siebert +
>>- Joseph Fox-Rabinovitz
>>- Joshua Loyal +
>>- Juan Nunez-Iglesias +
>>- Julian Taylor
>>- Kirill Balunov +
>>- Likhith Chitneni +
>>- Loïc Estève
>>- Mads Ohm Larsen
>>- Marein Könings +
>>- Marten van Kerkwijk
>>- Martin Thoma
>>- Martino Sorbaro +
>>- Marvin Schmidt +
>>- Matthew Brett
>>- Matthias Bussonnier +
>>- Matthias C. M. Troffaes +
>>- Matti Picus
>>- Michael Seifert
>>- Mikhail Pak +
>>- Mortada Mehyar
>>

Re: [Numpy-discussion] Proposal: np.search() to complement np.searchsorted()

2017-05-15 Thread Stephan Hoyer
I like the idea of a strategy keyword argument. strategy='auto' leaves the
door open for future improvements, e.g., if we ever add hash tables to
numpy.

For the algorithm, I think we actually want to sort the needles array as
well in most (all?) cases.

If haystack is also sorted, advancing through both arrays at once brings
the cost of the actual search itself down to O(n+k). (Possibly this is
worth exposing as np.searchbothsorted or something similar?)

If we don't sort haystack, we can still use binary search on the needles to
bring the search cost down to O(n log k).
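
For concreteness, a rough sketch of the sort-then-search strategy (the
function name and the error behavior are placeholders from this
discussion, not the PR's actual API):

import numpy as np

def search(haystack, needles):
    # Stable sort so that, among duplicate values, the earliest original
    # index comes first in sorted order; searchsorted then finds the
    # first match for each needle, O((n + k) log n) overall.
    sorter = np.argsort(haystack, kind='mergesort')
    pos = np.searchsorted(haystack, needles, sorter=sorter)
    pos = np.clip(pos, 0, len(haystack) - 1)
    result = sorter[pos]
    if not np.array_equal(haystack[result], needles):
        raise ValueError("not all needles are present in haystack")
    return result

search(np.array([5, 2, 7, 2]), np.array([2, 7]))  # -> array([1, 2])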
On Mon, May 15, 2017 at 5:00 PM Nathaniel Smith  wrote:

> On May 9, 2017 9:47 AM, "Martin Spacek"  wrote:
>
> Hello,
>
> I've opened up a pull request to add a function called np.search(), or
> something like it, to complement np.searchsorted():
>
> https://github.com/numpy/numpy/pull/9055
>
> There's also this issue I opened before starting the PR:
>
> https://github.com/numpy/numpy/issues/9052
>
> Proposed API changes require discussion on the list, so here I am!
>
> This proposed function (and perhaps array method?) does the same as
> np.searchsorted(a, v), but doesn't require `a` to be sorted, and explicitly
> checks if all the values in `v` are a subset of those in `a`. If not, it
> currently raises an error, but that could be controlled via a kwarg.
>
> As I mentioned in the PR, I often find myself abusing np.searchsorted() by
> not explicitly checking these assumptions. The temptation to use it is
> great, because it's such a fast and convenient function, and most of the
> time that I use it, the assumptions are indeed valid. Explicitly checking
> those assumptions each and every time before I use np.searchsorted() is
> tedious, and easy to forget to do. I wouldn't be surprised if many others
> abuse np.searchsorted() in the same way.
>
>
> It's worth noting though that the "sorted" part is a critical part of what
> makes it fast. If we're looking for k needles in an n-item haystack, then:
>
> If the haystack is already sorted and we know it, using searchsorted does
> it in k*log2(n) comparisons. (Could be reduced to average case O(k log log
> n) for simple scalars by using interpolation search, but I don't think
> searchsorted is that clever atm.)
>
> If the haystack is not sorted, then sorting it and then using searchsorted
> requires a total of O(n log n) + k*log2(n) comparisons.
>
> And if the haystack is not sorted, then doing linear search to find the
> first item like list.index does requires on average 0.5*k*n comparisons.
>
> This analysis ignores memory effects, which are important -- linear memory
> access is faster than random access, and searchsorted is all about making
> memory access maximally unpredictable. But even so, I think
> sorting-then-searching will be reasonably competitive pretty much from the
> start, and for moderately large k and n values the difference between (n +
> k)*log(n) and n*k is huge.
>
> Another issue is that sorting requires an O(n)-sized temporary buffer
> (assuming you can't mutate the haystack in place). But if your haystack is
> a large enough fraction of memory that you can't afford the buffer, then
> it's likely large enough that you can't afford linear searching either...
>
>
> Looking at my own habits and uses, it seems to me that finding the indices
> of matching values of one array in another is a more common use case than
> finding insertion indices of one array into another sorted array. So, I
> propose that np.search(), or something like it, could be even more useful
> than np.searchsorted().
>
>
> My main concern here would be creating a trap for the unwary, where people
> use search() naively because it's so nice and convenient, and then
> eventually get surprised by a nasty quadratic slowdown. There's a whole
> blog about these traps :-) https://accidentallyquadratic.tumblr.com/
>
> Otoh there are also a huge number of numpy use cases where it doesn't matter
> if some calculation is 1000x slower than it should be, as long as it works
> and is discoverable...
>
> So it sounds like one obvious thing would be to have a version of
> searchsorted that checks for matches (maybe side="exact"? Though that's not
> easy to find...). That's clearly useful, and orthogonal to the
> linear/binary search issue, so we shouldn't make it a reason people are
> tempted to choose the inferior algorithm.
>
> ...ok, how's this for a suggestion. Give np.search a strategy= kwarg, with
> options "linear", "searchsorted", and "auto". Linear does the obvious
> thing, searchsorted generates a sorter array using argsort (unless the user
> provided one) and then calls searchsorted, and auto picks one of them
> depending on whether a sorter array was provided and how large the arrays
> are. The default is auto. In all cases it looks for exact matches.
>
> I guess by default "not found" should be signaled with an exception, and
> then there should be some option to have it return a sentinel value
> instead

Re: [Numpy-discussion] UC Berkeley hiring developers to work on NumPy

2017-05-22 Thread Stephan Hoyer
On Mon, May 22, 2017 at 11:52 AM, Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> My sentence "adapt the typical academic rule for conflicts of
> interests to PRs, that non-trivial ones cannot be merged by someone
> who has a conflict of interest with the author, i.e., it cannot be a
> superviser, someone from the same institute, etc." was meant as a
> suggestion for part of this blueprint!
>

This sounds like a good rule of thumb to me. As a practical matter, asking
someone outside to approve changes is a good way to ensure that decisions
are not short-circuited by offline discussions. But remember that per our
governance procedures, we already require consensus for decision making. So
I don't think we need an actual change here.

> I'll readily admit, though, that since I'm not overly worried, I
> haven't even looked at the policies that are in place, nor do I intend
> to contribute much beyond this e-mail.


I am also not worried about this, really not at all. NumPy already has
governance procedures and a steering committee for handling exactly these
sorts of concerns, should they arise (which I also consider extremely
unlikely in the case of BIDS and their non-profit funder).


Re: [Numpy-discussion] Controlling NumPy __mul__ method or forcing it to use __rmul__ of the "other"

2017-06-19 Thread Stephan Hoyer
I answered your question on StackOverflow:
https://stackoverflow.com/questions/40694380/forcing-multiplication-to-use-rmul-instead-of-numpy-array-mul-or-byp/44634634#44634634

In brief, you need to set __array_priority__ or __array_ufunc__ on your
object.
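
For illustration, a minimal sketch of the __array_ufunc__ = None route
(the Transfer name follows your example; this is not your library's real
class):

import numpy as np

class Transfer:
    # __array_ufunc__ = None (NumPy >= 1.13) makes ndarray's binary
    # operators return NotImplemented for this operand, so Python falls
    # back to the reflected method below.
    __array_ufunc__ = None

    def __rmul__(self, other):
        # 'other' is the left-hand ndarray, e.g. np.eye(3); build and
        # return whatever MIMO object is appropriate here.
        print("dispatched to Transfer.__rmul__ with shape", other.shape)
        return self

np.eye(3) * Transfer()  # prints: dispatched to Transfer.__rmul__ with shape (3, 3)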

On Mon, Jun 19, 2017 at 5:27 AM, Ilhan Polat  wrote:

> I will assume some simple linear systems knowledge but the question can be
> generalized to any operator that implements __mul__ and __rmul__ methods.
>
> Motivation:
>
> I am trying to implement a gain matrix, say 3x3 identity matrix, for time
> being with a single input single output (SISO) system that I have
> implemented as a class modeling a Transfer or a state space representation.
>
> In the typical usecase, suppose you would like to create an n-many
> parallel connections with the same LTI system sitting at each branch.
> MATLAB implements this as an elementwise multiplication and returning a
> multi input multi output(MIMO) system.
>
> G = tf(1,[1,1]);
> eye(3)*G
>
> produces (manually compactified)
>
> ans =
>
>   From input 1 to output...
>   [ 1/(s+1)     0         0     ]
>   [    0      1/(s+1)     0     ]
>   [    0         0      1/(s+1) ]
>
> Notice that the result type is an LTI system but, in our context, not a
> NumPy array with "object" dtype.
>
> In order to achieve a similar behavior, I would like to let the __rmul__
> of G take care of the multiplication. In fact, when I do
> G.__rmul__(np.eye(3)) I can control what the behavior should be and I
> receive the exception/result I've put in. However the array never looks for
> this method and carries out the default array __mul__ behavior.
>
> The situation is similar if we go about it as left multiplication G*eye(3)
> has no problems since this uses directly the __mul__ of G. Therefore we get
> a different result depending on the direction of multiplication.
>
> Is there anything I can do about this without forcing users to subclass or
> just letting them know about this particular quirk in the documentation?
>
> What I have in mind is to force the users to create static LTI objects and
> then multiply and reject this possibility. But then I still need to stop
> NumPy returning "object" dtyped array to be able to let the user know about
> this.
>
>
> Relevant links just in case
>
> the library : https://github.com/ilayn/harold/
>
> the issue discussion (monologue actually): https://github.com/ilayn/harold/issues/7
>
> The question I've asked on SO (but with a rather offtopic answer):
> https://stackoverflow.com/q/40694380/4950339
>
>
> ilhan


Re: [Numpy-discussion] Controlling NumPy __mul__ method or forcing it to use __rmul__ of the "other"

2017-06-19 Thread Stephan Hoyer
Coming up with a single number for a sane "array priority" is basically an
impossible task :). If you only need compatibility with the latest version
of NumPy, this is one good reason to set __array_ufunc__ instead, even if
only to write __array_ufunc__ = None.

On Mon, Jun 19, 2017 at 9:14 AM, Nathan Goldbaum 
wrote:

> I don't think there's any real standard here. Just doing a github search
> reveals many different choices people have used:
>
> https://github.com/search?l=Python&q=__array_priority__&type=Code&utf8=%E2%9C%93
>
> On Mon, Jun 19, 2017 at 11:07 AM, Ilhan Polat 
> wrote:
>
>> Thank you. I didn't know that it existed. Is there any place where I can
>> get a feeling for a sane priority number compared to what's being done in
>> production? Just to make sure I'm not stepping on any toes.
>>
>> On Mon, Jun 19, 2017 at 5:36 PM, Stephan Hoyer  wrote:
>>
>>> I answered your question on StackOverflow:
>>> https://stackoverflow.com/questions/40694380/forcing-multiplication-to-use-rmul-instead-of-numpy-array-mul-or-byp/44634634#44634634
>>>
>>> In brief, you need to set __array_priority__ or __array_ufunc__ on your
>>> object.


Re: [Numpy-discussion] Vector stacks

2017-07-02 Thread Stephan Hoyer
I would also prefer separate functions. These are much easier to understand
that custom operator overloads.

Side note: implementing this class with __array_ufunc__ for ndarray @ cvec
actually isn't possible to do currently, until we fix this bug:
https://github.com/numpy/numpy/issues/9028

On Sat, Jul 1, 2017 at 5:31 PM, Juan Nunez-Iglesias 
wrote:

> I’m with Nathaniel on this one. Subclasses make code harder to read and
> reason about because you now have to be sure of the exact type of things
> that users are passing you — which are array-like but subtly different.
>
> On 2 Jul 2017, 9:46 AM +1000, Marten van Kerkwijk <
> m.h.vankerkw...@gmail.com>, wrote:
>
> I'm not sure there is *that* much against a class that basically just
> passes through views of itself inside `__matmul__` and `__rmatmul__`
> or calls new gufuncs, but I think the lower hurdle is to first get
> those gufuncs implemented.
> -- Marten


Re: [Numpy-discussion] Scipy 2017 NumPy sprint

2017-07-03 Thread Stephan Hoyer
On Sun, Jul 2, 2017 at 8:33 AM Sebastian Berg 
wrote:

> If someone who does subclasses/array-likes or so (e.g. like Stefan
> Hoyer ;)) and is interested, and also we do some
> teleconferencing/chatting (and I have time) I might be interested
> in discussing and possibly trying to develop the new indexer ideas,
> which I feel are pretty far, but I got stuck on how to get subclasses
> right.


I am of course very happy to discuss this (online or via teleconference,
sadly I won't be at scipy), but to be clear I use array likes, not
subclasses. I think Marten van Kerkwijk is the last one who thinks that is
still a good idea :).



Re: [Numpy-discussion] Scipy 2017 NumPy sprint

2017-07-05 Thread Stephan Hoyer
On Wed, Jul 5, 2017 at 10:40 AM, Chris Barker  wrote:

> Along those lines, there was some discussion of having a set of utilities
> (or maybe eve3n an ABC?) that would make it easier to create a ndarray-like
> object.
>
> That is, the boilerplate needed for multi-dimensional indexing and
> slicing, etc...
>
> That could be a nice little sprint-able project.
>

Indeed. Let me highlight a few mixins that I wrote for xarray that might
be more broadly useful. The challenge here is
that there are quite a few different meanings to "ndarray-like", so mixins
really need to be mix-and-match-able. But at least defining a base list of
methods to implement/override would be useful.

In NumPy, this could go along with NDArrayOperatorsMixin in
numpy/lib/mixins.py.

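For instance, a stripped-down sketch of the kind of duck array the mixin
supports (the Wrapped class is mine, for illustration only):

import numpy as np
from numpy.lib.mixins import NDArrayOperatorsMixin

class Wrapped(NDArrayOperatorsMixin):
    # A minimal duck array: the mixin supplies +, -, *, comparisons,
    # etc., all funneled through __array_ufunc__ below.
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Unwrap any Wrapped inputs, apply the ufunc, and re-wrap.
        arrays = tuple(x.data if isinstance(x, Wrapped) else x
                       for x in inputs)
        return Wrapped(getattr(ufunc, method)(*arrays, **kwargs))

print((Wrapped([1, 2]) + 3).data)  # -> [4 5]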


Re: [Numpy-discussion] Scipy 2017 NumPy sprint

2017-07-06 Thread Stephan Hoyer
On Thu, Jul 6, 2017 at 4:42 AM, Ben Rowland  wrote:

> Slightly off topic, but as someone who has just spent a fair amount of
> time implementing various subclasses of nd-array, I am interested (and a
> little concerned) that the consensus is not to use them. Is there anything
> available which explains why this is the case and what the alternatives
> are?
>

Writing such docs (especially to explain how to write array-like objects
that aren't subclasses) would be another good topic for the sprint ;).

But more seriously: numpy.ndarray subclasses are supported, but inherently
error prone, because we don't have a well defined subclassing API. As
Martin will attest, this means seemingly harmless internal refactoring in
NumPy has a tendency to break downstream subclasses, which often
unintentionally end up relying on untested implementation details.

This is particularly problematic when subclasses are implemented in a
different code-base, as is the case for user subclasses of numpy.ndarray.
Due to diligent testing efforts, we often (but not always) catch these
issues before making a release, but the process is inherently error prone.
Writing NumPy functionality in a manner that is robust to all possible
subclassing approaches turns out to be very difficult (nearly impossible).

This is actually a classic OOP problem, e.g., see
https://en.wikipedia.org/wiki/Composition_over_inheritance


Re: [Numpy-discussion] Scipy 2017 NumPy sprint

2017-07-06 Thread Stephan Hoyer
On Thu, Jul 6, 2017 at 9:42 AM, Chris Barker  wrote:

>> In NumPy, this could go along with NDArrayOperatorsMixin in
>> numpy/lib/mixins.py
>>
>
> Yes! I had no idea that existed.
>

It's brand new for NumPy 1.13 :). I wrote it to go along with
__array_ufunc__.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ENH: ratio function to mimic diff

2017-07-29 Thread Stephan Hoyer
This is an interesting idea, but I don't understand the use cases for this
function. In particular, what would you use n-th order ratios for?

One use case I can think of is estimating the slope of a log-scaled plot.
But here exp(diff(log(x))) is an easy substitute.
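
For instance (a quick sketch, for strictly positive x):

>>> import numpy as np
>>> x = np.array([1.0, 2.0, 8.0, 4.0])
>>> np.exp(np.diff(np.log(x)))  # successive ratios x[1:] / x[:-1]
array([2. , 4. , 0.5])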

I guess ratio() would work in cases where values are both positive and
negative, but again I don't know when that would be useful. If your signal
crosses zero, ratios are likely to diverge.
On Fri, Jul 28, 2017 at 3:25 PM Joseph Fox-Rabinovitz <
jfoxrabinov...@gmail.com> wrote:

> I have created PR#9481 to introduce a `ratio` function that behaves very
> similarly to `diff`, except that it divides successive elements instead of
> subtracting them. It has some handling built in for zero division, as well
> as the ability to select between `/` and `//` operators.
>
> There is currently no masked version. Perhaps someone could suggest a
> simple mechanism for hooking np.ma.true_divide and np.ma.floor_divide in as
> the operators instead of the regular np.* versions.
>
> Please let me know your thoughts.
>
> Regards,
>
> -Joe
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Tensor Contraction (HPTT) and Tensor Transposition (TCL)

2017-08-16 Thread Stephan Hoyer
On Wed, Aug 16, 2017 at 2:39 AM, Paul Springer  wrote:

>
> What version of Numpy are you comparing to? Note that in 1.13 you can
> enable some optimization in einsum, and the coming 1.14 makes that the
> default and uses CBLAS when possible.
>
> I was using 1.10.4; however, I am currently running the benchmark with
> 1.13.1 and 'optimize=True'; this, however, seems to yield even worse
> performance (see attached).
> If you are interested, you can check the performance difference yourself
> via: ./benchmark/python/bechmark.sh
>

This sounds like you may be using relatively small matrices, where the
overhead of calculating the optimal strategy dominates. Can you try with a
few bigger test cases?
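
Something along these lines (an illustrative benchmark sketch; the sizes
are picked arbitrarily):

import numpy as np
import timeit

a, b, c = (np.random.rand(200, 200) for _ in range(3))

def chain(**kw):
    return np.einsum('ij,jk,kl->il', a, b, c, **kw)

print(timeit.timeit(chain, number=10))                         # default order
print(timeit.timeit(lambda: chain(optimize=True), number=10))  # planned order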
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Why are empty arrays False?

2017-08-18 Thread Stephan Hoyer
I agree, this behavior seems actively harmful. Let's fix it.

On Fri, Aug 18, 2017 at 2:45 PM, Michael Lamparski  wrote:

> Greetings, all.  I am troubled.
>
> The TL;DR is that `bool(array([])) is False` is misleading, dangerous, and
> unnecessary. Let's begin with some examples:
>
> >>> bool(np.array(1))
> True
> >>> bool(np.array(0))
> False
> >>> bool(np.array([0, 1]))
> ValueError: The truth value of an array with more than one element is
> ambiguous. Use a.any() or a.all()
> >>> bool(np.array([1]))
> True
> >>> bool(np.array([0]))
> False
> >>> bool(np.array([]))
> False
>
> One of these things is not like the other.
>
> The first three results embody a design that is consistent with some of
> the most fundamental design choices in numpy, such as the choice to have
> comparison operators like `==` work elementwise.  And it is the only such
> design I can think of that is consistent in all edge cases. (see footnote 1)
>
> The next two examples (involving arrays of shape (1,)) are a
> straightforward extension of the design to arrays that are isomorphic to
> scalars.  I can't say I recall ever finding a use for this feature... but
> it seems fairly harmless.
>
> So how about that last example, with array([])?  Well... it's /kind of/
> like how other python containers work, right? Falseness is emptiness (see
> footnote 2)...  Except that this is actually *a complete lie*, due to /all
> of the other examples above/!
>
> Here's what I would like to see:
>
> >>> bool(np.array([]))
> ValueError: The truth value of a non-scalar array is ambiguous. Use
> a.any() or a.all()
>
> Why do I care?  Well, I myself wasted an hour barking up the wrong tree
> while debugging some code when it turned out that I was mistakenly using
> truthiness to identify empty arrays. It just so happened that the arrays
> always contained 1 or 0 elements, so it /appeared/ to work except in the
> rare case of array([0]) where things suddenly exploded.
>
> I posit that there is no usage of the fact that `bool(array([])) is False`
> in any real-world code which is not accompanied by a horrible bug writhing
> in hiding just beneath the surface. For this reason, I wish to see this
> behavior *abolished*.
>
> Thank you.
> -Michael
>
> Footnotes:
> 1: Every now and then, I wish that `ndarray.__{bool,nonzero}__` would just
> implicitly do `all()`, which would make `if a == b:` work like it does for
> virtually every other reasonably-designed type in existence.  But then I
> recall that, if this were done, then the behavior of `if a != b:` would
> stand out like a sore thumb instead.  Truly, punting on 'any/all' was the
> right choice.
>
> 2: np.array() is also False, which makes this an interesting sort
> of n-dimensional emptiness test; but if that's really what you're looking
> for, you can achieve this much more safely with `np.all(x.shape)` or
> `bool(x.flat)`
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Interface numpy arrays to Matlab?

2017-08-28 Thread Stephan Hoyer
If you can use Octave instead of Matlab, I've had a very good experience
with Oct2Py:
https://github.com/blink1073/oct2py

On Mon, Aug 28, 2017 at 12:20 PM, Neal Becker  wrote:

> I've searched but haven't found any decent answer.  I need to call Matlab
> from python.  Matlab has a python module for this purpose, but it doesn't
> understand numpy AFAICT.  What solutions are there for efficiently
> interfacing numpy arrays to Matlab?
>
> Thanks,
> Neal
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Is there a better way to write a stacked matrix multiplication

2017-10-26 Thread Stephan Hoyer
I would certainly use einsum. It is almost perfect for these use cases,
e.g.,
np.einsum('ki,kij,kj->k', A, inv(B), A)
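
A quick consistency check against the matmul formulation (a sketch, with A
stored as an N x 3 stack of vectors rather than N x 3 x 1):

import numpy as np
from numpy.linalg import inv

N = 5
A = np.random.rand(N, 3)
B = np.random.rand(N, 3, 3) + 3 * np.eye(3)  # keep B comfortably invertible
y1 = np.einsum('ki,kij,kj->k', A, inv(B), A)
y2 = np.squeeze(A[:, None, :] @ inv(B) @ A[:, :, None])
print(np.allclose(y1, y2))  # True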

On Thu, Oct 26, 2017 at 12:38 PM Charles R Harris 
wrote:

> On Thu, Oct 26, 2017 at 12:11 PM, Daniele Nicolodi 
> wrote:
>
>> Hello,
>>
>> is there a better way to write the dot product between a stack of
>> matrices?  In my case I need to compute
>>
>> y = A.T @ inv(B) @ A
>>
>> with A a 3x1 matrix and B a 3x3 matrix, N times, with N in the few
>> hundred thousands range.  I thus "vectorize" the thing using stack of
>> matrices, so that A is a Nx3x1 matrix and B is Nx3x3 and I can write:
>>
>> y = np.matmul(np.transpose(A, (0, 2, 1)), np.matmul(inv(B), A))
>>
>> which I guess could be also written (in Python 3.6 and later):
>>
>> y = np.transpose(A, (0, 2, 1)) @ inv(B) @ A
>>
>> and I obtain a Nx1x1 y matrix which I can collapse to the vector I need
>> with np.squeeze().
>>
>> However, the need for the second argument of np.transpose() seems odd to
>> me, because all other functions handle transparently the matrix stacking.
>>
>> Am I missing something?  Is there a more natural matrix arrangement that
>> I could use to obtain the same results more naturally?
>
>
> There has been discussion of adding a operator for transposing the
> matrices in a stack, but no resolution at this point. However, if you have
> a stack of vectors (not matrices) you can turn then into transposed
> matrices like `A[..., None, :]`, so `A[..., None, :] @ inv(B) @ A[...,
> None]`  and then squeeze.
>
> Another option is to use einsum.
>
> Chuck
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] is __array_ufunc__ ready for prime-time?

2017-10-27 Thread Stephan Hoyer
Hi Will,

We spent a *long time* sorting out the messy details of __array_ufunc__
[1], especially for handling interactions between different types, e.g.,
between numpy arrays, non-numpy array-like objects, builtin Python objects,
objects that override arithmetic to act in non-numpy-like ways, and of
course subclasses of all the above.

We hope that we have it right this time, but as we wrote in the NumPy 1.13
release notes "The API is provisional, we do not yet guarantee backward
compatibility as modifications may be made pending feedback." That said,
let's give it a try!

If any changes are necessary, I expect it would likely relate to how we
handle interactions between different types. That's where we spent the
majority of the design effort, but debate is a poor substitute for
experience. I would be very surprised if the basic cases (one argument or
two arguments of the same type) need any changes.

Best,
Stephan

[1] https://docs.scipy.org/doc/numpy-1.13.0/neps/ufunc-overrides.html


On Fri, Oct 27, 2017 at 12:39 PM William Sheffler 
wrote:

> Right before 1.12, I arranged an API around an np.ndarray subclass, making
> use of __array_ufunc__ to customize behavior based on structured dtype (we
> come from c++ and really like operator overloading). Having seen
> __array_ufunc__ featured in Travis Oliphant's Guide to NumPy: 2nd Edition,
> I assumed this was the way to go. But it was removed in 1.12. Now that 1.13
> has reintroduced __array_ufunc__, can I now rely on its continued
> availability? I am considering using it as a base-level component in
> several libraries... is this a dangerous idea?
>
> Thanks!
> Will
>
> --
> William H. Sheffler Ph.D.
> Principal Engineer
> Institute for Protein Design
> University of Washington
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] is __array_ufunc__ ready for prime-time?

2017-11-02 Thread Stephan Hoyer
On Thu, Nov 2, 2017 at 9:45 AM  wrote:

> similar, scipy.special has ufuncs
> what units are those?
>
> Most code that I know (i.e. scipy.stats and statsmodels) does not use only
> "normal mathematical operations with ufuncs"
> I guess there are a lot of "abnormal" mathematical operations
> where just simply propagating the units will not work.
>

> Aside: The problem is more general also for other datastructures.
> E.g. statsmodels for most parts uses only numpy ndarrays inside the
> algorithm and computations because that provides well defined
> behavior. (e.g. pandas behaved too differently in many cases)
> I don't have much idea yet about how to change the infrastructure to
> allow the use of dask arrays, sparse matrices and similar and possibly
> automatic differentiation.
>

This is the exact same reason why pandas and xarray do not support wrapping
arbitrary ndarray subclasses or duck array types. The operations we use
internally (on numpy.ndarray objects) may not be what you would expect
externally, and may even be implementation details not considered part of
the public API. For example, in xarray we use numpy.nanmean() or
bottleneck.nanmean() instead of numpy.mean().

For NumPy and xarray, I think we could (and should) define an interface to
support subclasses and duck types for generic operations for core
use-cases. My main concern with subclasses / duck-arrays is
undefined/untested behavior, especially where we might silently give the
wrong answer or trigger some undesired operation (e.g., loading a lazily
computed array into memory) rather than raising an informative error. Leaking
implementation details is another concern: we have already had several
cases in NumPy where a function only worked on a subclass if a particular
method was called internally, and broke when that was changed.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] is __array_ufunc__ ready for prime-time?

2017-11-02 Thread Stephan Hoyer
On Thu, Nov 2, 2017 at 12:42 PM Nathan Goldbaum 
wrote:

> Would this issue be ameliorated given Nathaniel's proposal to try to move
> away from subclasses and towards storing data in dtypes? Or would that just
> mean that xarray would need to ban dtypes it doesn't know about?
>

Yes, I think custom dtypes would definitely help. Custom dtypes have a well
contained interface, so lots of operations (e.g., concatenate, reshaping,
indexing) are guaranteed to work in a dtype independent way. If you try to
do an unsupported operation for such a dtype (e.g., np.datetime64), you
will generally get a good error message about an invalid dtype.

In contrast, you can overload a subclass with totally arbitrary semantics
(e.g., np.matrix), and of course the same is true for duck types.

This makes a big difference for libraries like dask or xarray, which need a
standard interface to guarantee they do the right thing. I'm pretty sure we
can wrap a custom dtype ndarray with units, but there's no way we're going
to support np.matrix without significant work. It's hard to know which is
which without well defined interfaces.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] is __array_ufunc__ ready for prime-time?

2017-11-02 Thread Stephan Hoyer
Maybe the best of both worlds would require explicit opt-in for classes
that shouldn't be coerced, e.g.,
xarray.register_data_type(MyArray)

or maybe better yet ;)
xarray.void_my_nonexistent_warranty_its_my_fault_if_my_buggy_duck_array_breaks_everything(MyArray)

On Thu, Nov 2, 2017 at 3:39 PM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> I guess my argument boils down to it being better to state that a
> function only accepts arrays and happily let it break on, e.g.,
> matrix, than use `asarray` to make a matrix into an array even though
> it really isn't.
>
> I do like the dtype ideas, but think I'd agree they're likely to come
> with their own problems. But just making new numerical types possible
> is interesting.
>
> -- Marten
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] is __array_ufunc__ ready for prime-time?

2017-11-02 Thread Stephan Hoyer
On Thu, Nov 2, 2017 at 3:35 PM Nathan Goldbaum 
wrote:

> Ah, but what if the dtype modifies the interface? That might sound evil,
> but it's something that's been proposed. For example, if I wanted to
> replace yt's YTArray in a backward compatibile way with a dtype and just
> use plain ndarrays everywhere, the dtype would need to *at least* modify
> ndarray's API, adding e.g. to(), convert_to_unit(), a units attribute, and
> several other things.
>

I suppose we'll need to sort this out. But adding new methods/properties
feels pretty safe to me, as long as existing ones are guaranteed to work in
the same way.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] is __array_ufunc__ ready for prime-time?

2017-11-06 Thread Stephan Hoyer
On Mon, Nov 6, 2017 at 2:29 PM Ryan May  wrote:

> On Mon, Nov 6, 2017 at 3:18 PM, Chris Barker 
> wrote:
>
>> Klunky, and maybe we could come up with a standard way to do it and
>> include that in numpy, but I'm not sure that ABCs are the way to do it.
>>
>
> ABCs are *absolutely* the way to go about it. It's the only way baked into
> the Python language itself that allows you to register a class for purposes
> of `isinstance` without needing to subclass--i.e. duck-typing.
>
> What's needed, though, is not just a single ABC. Some thought and design
> needs to go into segmenting the ndarray API to declare certain behaviors,
> just like was done for collections:
>
> https://docs.python.org/3/library/collections.abc.html
>
> You don't just have a single ABC declaring a collection, but rather "I am
> a mapping" or "I am a mutable sequence". It's more of a pain for developers
> to properly specify things, but this is not a bad thing to actually give
> code some thought.
>

I agree, it would be nice to nail down a hierarchy of duck-arrays, if
possible. Although, there are quite a few options, so I don't know how
doable this is. Any interest in opening up an issue on GitHub to discuss?
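
For concreteness, registration might look something like this (a sketch
with a hypothetical ArrayLike ABC, not an actual NumPy API):

import abc
import numpy as np

class ArrayLike(abc.ABC):
    @property
    @abc.abstractmethod
    def shape(self): ...

    @property
    @abc.abstractmethod
    def dtype(self): ...

# register() makes isinstance() work without any subclassing:
ArrayLike.register(np.ndarray)
print(isinstance(np.ones(3), ArrayLike))  # True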
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] is __array_ufunc__ ready for prime-time?

2017-11-07 Thread Stephan Hoyer
On Tue, Nov 7, 2017 at 12:23 PM Chris Barker  wrote:

>
> And then a third abc for indexing support, although, I am not sure how
>> that could get implemented...
>
>
> This is the really tricky one -- all ABCs really check is the existence of
> methods -- making sure they behave the same way is up to the developer of
> the ducktype.
>
> which is K, but will require discipline.
>
> But indexing, specifically fancy indexing, is another matter -- I'm not
> sure if there even a way with an ABC to check for what types of indexing
> are support, but we'd still have the problem with whether the semantics are
> the same!
>
> For example, I work with netcdf variable objects, which are partly
> duck-typed as ndarrays, but I think n-dimensional fancy indexing works
> differently... how in the world do you detect that with an ABC???
>

We recently worked out a hierarchy of indexing types for xarray. To a crude
approximation, we have:
- "Basic" indexing support for slices and integers. Nearly every array type
satisfies this.
- "Outer" or "orthogonal" indexing with slices, integers and 1D arrays.
This is what netCDF4-Python and Fortran/MATLAB support.
- "Vectorized" indexing with broadcasting and multi-dimensional indexers.
NumPy supports a generalization of this, but I would not wish the edge
cases involving mixed slices/arrays upon anyone.
- "Logical" indexing by a boolean array with the same shape.
- "Exactly like NumPy" for subclasses or wrappers around NumPy arrays.

There's some ambiguities in this, but that's what specs are for. For most
applications, we probably don't need most of these: ABCs for "Basic",
"Logical" and "Exactly like NumPy" would go a long ways.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] np.vstack vs. np.stack

2017-11-09 Thread Stephan Hoyer
I'm pretty sure I wrote the offending line in the vstack() docs.

The original motivation for stack() was that stacking behavior of hstack(),
vstack() and dstack() was somewhat inconsistent, especially with regard to
lower dimensional input. stack() is conceptually much simpler and more
general.
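
For example (sketch):

>>> import numpy as np
>>> a = np.ones(3)
>>> np.vstack([a, a]).shape  # 1-d inputs silently promoted to rows
(2, 3)
>>> np.hstack([a, a]).shape  # but here they are joined end to end
(6,)
>>> np.stack([a, a], axis=0).shape  # stack always adds exactly one new axis
(2, 3)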

That said, if you know vstack() and find it useful, great, use it. It is
not going away in NumPy. We don't remove functions just because there's a
better alternative API, but rather use the docs to try to point new users
in a better direction.

On Thu, Nov 9, 2017 at 2:11 PM Eric Wieser 
wrote:

> I think the primary problems with it are:
>
>- A poor definition of “vertical” in the world of stacked matrices -
>in np.linalg land, this means axis=-2, but in vstack land, it means
>axis=0.
>- Mostly undocumented auto-2d behavior that doesn’t make you think
>well enough about dimensions. Numpy deliberately distinguishes between “row
>vectors” (1, N) and vectors (N,), so it’s a shame when APIs like vstack
>and np.matrix try to hide this distinction.
>
> Eric
>
> On Thu, 9 Nov 2017 at 13:59 Mark Bakker  wrote:
>
> On 11/09/2017 04:30 AM, Joe wrote:
>>> > Hello,
>>> >
>>> > I have a question and hope that you can help me.
>>> >
>>> > The doc for vstack mentions that "this function continues to be
>>> > supported for backward compatibility, but you should prefer
>>> > np.concatenate or np.stack."
>>> >
>>> > Using vstack was convenient because "the arrays must have the same
>>> shape
>>> > along all but the first axis."
>>> >
>>> > So it was possible to stack an array (3,) and (2, 3) to a (3, 3) array
>>> > without using e.g. atleast_2d on the (3,) array.
>>> >
>>> > Is there a possibility to mimic that behavior with np.concatenate or
>>> > np.stack?
>>> >
>>>
>> > Joe
>>>
>>>
>> Can anybody explain why vstack is going the way of the dodo?
>> Why are stack / concatenate better? What is 'bad' about vstack?
>>
>> Thanks,
>>
>> Mark
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] np.vstack vs. np.stack

2017-11-10 Thread Stephan Hoyer
On Thu, Nov 9, 2017 at 2:49 PM Allan Haldane  wrote:

> Maybe we should reword the vstack docstring so that it doesn't imply
> that vstack is going away. It should say something weaker
> like "the functions np.stack, np.concatenate, and np.block are often
> more general/useful/less confusing alternatives", or better explain
> what the problem is.
>

Yes, I would support this.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Type annotations for NumPy

2017-11-25 Thread Stephan Hoyer
There's been growing interest in supporting PEP-484 style type annotations
in NumPy: https://github.com/numpy/numpy/issues/7370

This would allow NumPy users to add type-annotations to their code that
uses NumPy, which they could check with mypy, pycharm or pytype. For
example:

def f(x: np.ndarray) -> np.ndarray:
"""Identity function on a NumPy array."""
return x

Eventually, we could include data types and potentially array shapes as
part of the type. This gets quite a bit more complicated, and doing it in a
really satisfying way would require new features in Python's typing system.
To help guide discussion, I wrote a doc describing use-cases and needs for
typing array shapes in more detail:
https://docs.google.com/document/d/1vpMse4c6DrWH5rq2tQSx3qwP_m_0lyn-Ij4WHqQqRHY

Nathaniel Smith and I recently met with a group in San Francisco interested
in this topic, including several mypy/typeshed developers (Jelle Zijlstra
and Ethan Smith). We discussed and came up with a plan for moving forward:
1. Release basic type stubs for numpy.ndarray without dtypes or shapes, as
separate "numpy_stubs" package on PyPI per PEP 561. This will let us
iterate rapidly on (experimental) type annotations without coupling to
NumPy's release cycle.
2. Add support for dtypes in ndarray type-annotations. This might be as
simple as writing np.ndarray[np.float64], but will need a decision about
appropriate syntax for shape typing to ensure that this is forwards
compatible with typing shapes. Note: this will likely require minor changes
to NumPy itself, e.g., to add __class_getitem__ per PEP 560.
3. Add support for shapes in ndarray type-annotations, and define a broader
standard for typing array shapes. This will require collaboration with
type-checker developers on the required typing features (for details, see
my doc above). Eventually, this may entail writing a PEP.
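
For step 1, the stubs themselves are plain PEP 484 stub files. A sketch of
what a first cut might contain (file layout and signatures hypothetical):

# numpy/__init__.pyi
from typing import Any, Tuple

class ndarray:
    @property
    def shape(self) -> Tuple[int, ...]: ...
    def __add__(self, other: Any) -> 'ndarray': ...

def sin(x: ndarray) -> ndarray: ...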

I'm writing to gauge support for this general plan, and specifically to get
support for step 1.

Cheers,
Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Type annotations for NumPy

2017-11-25 Thread Stephan Hoyer
On Sat, Nov 25, 2017 at 7:21 AM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> A question of perhaps broader scope than what you were asking for, and
> more out of curiosity than anything else, but can one mix type
> annotations with others? E.g., in astropy, we have a decorator that
> looks for units in the annotations (not dissimilar from dtype, I
> guess). Could one mix annotations or does one have to stick with one
> purpose?
>

Hi Marten,

I took a look at Astropy's units decorator:
http://docs.astropy.org/en/stable/api/astropy.units.quantity_input.html

Annotations for return values that "coerce" units would be hard to make
compatible with typing, because type annotations are used to check
programs, not change runtime semantics. But in principle, I think you could
even make a physical units library that relies entirely on static type
checking for correctness, imposing almost no run-time overhead at all.
There are several examples for Haskell:
https://wiki.haskell.org/Physical_units

I don't see any obvious way to support mixing of annotations for typing
and runtime effects in the same function, though doing so in the same
program might be possible. My guess is that the preferred way to do this
would be to use decorators for runtime changes to arguments, and keep
annotations for typing. The Python community seems to be standardizing on
using annotations for typing:
https://www.python.org/dev/peps/pep-0563/#non-typing-usage-of-annotations

Cheers,
Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Type annotations for NumPy

2017-11-26 Thread Stephan Hoyer
On Sat, Nov 25, 2017 at 3:34 PM Matthew Rocklin  wrote:

> Thoughts on basing this on a more generic Array type rather than the
> np.ndarray?  I can imagine other nd-array libraries (XArray, Tensorflow,
> Dask.array) wanting to reuse this work.  For dask.array in particular we
> would want to copy this entirely, but we probably can't specify that
> dask.arrays are np.ndarrays.  It would be nice to ensure that the container
> type was swappable.
>

Yes, absolutely. I do briefly mention this in my longer doc (see the
"Syntax" section). This is also one of my personal goals for this project.

This will be most relevant when we start working on typing support for
array shapes and broadcasting: details like data types can be more library
specific, and can probably be expressed with the existing generics system
in the typing module.

After we do some experimentation to figure out appropriate syntax and
semantics for array shape typing, I would like to standardize the rules for
typing multi-dimensional arrays in Python. This will probably entail
writing a PEP, so we can add appropriate base classes in the typing module.
I view this as the natural complement to existing standard library features
that make it easier to interchange between multiple multi-dimensional array
libraries, such as memory views and the buffer protocol.

>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Type annotations for NumPy

2017-11-28 Thread Stephan Hoyer
On Tue, Nov 28, 2017 at 5:11 PM Robert T. McGibbon 
wrote:

> I'm strongly in support of this proposal.  Type annotations have really
> helped me write more correct code.
>
> I started working on numpy type stubs a few months ago. I needed a mypy
> plugin to support shape-aware functions. The whole thing is pretty
> tricky. Still very WIP, but I'll clean them up a little bit and open-source
> it shortly.
>

Great to hear -- I'd love to see what this looks like, or hear any lessons
you learned from the experience!

Actual experience using and writing such a type checker gives you a
valuable perspective to share, as opposed to my speculation.

Cheers,
Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Type annotations for NumPy

2017-12-05 Thread Stephan Hoyer
This discussion has died down, but I don't want to lose momentum.

It sounds like there is at least strong interest from a subset of our
community in type annotations. Are there any objections to the first part
of my plan, to start developing type stubs for NumPy in separate repository?

We'll come back to the mailing list when we have concrete proposals for
typing dtypes and shapes.

On Tue, Nov 28, 2017 at 4:05 PM Chris Barker - NOAA Federal <
chris.bar...@noaa.gov> wrote:

>
>
> (a) it would be good if NumPy type annotations could include an
> “array_like” type that allows lists, tuples, etc.
>
>
> I think that would be a sequence — already supported by the Typing system.
>
> (b) I’ve always thought (since PEP561) that it would be cool for type
> annotations to replace compiler type annotations for e.g. Cython and Numba.
> Is this in the realm of possibility for the future?
>
>
> Well, this was brought up early in the Typing discussion, and it was made
> clear that these kinds of truly static types, as needed by Cython, was a
> non-goal of the project.
>
> That being said, perhaps it could be made to work with a bunch of
> additional type objects.
>
> And we should look to Cython for ideas about how to type numpy arrays.
>
> One note: in addition to shape (rank) and types, there is contiguity and C
> or F order. That may want to be considered.
>
> -CHB
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Type annotations for NumPy

2017-12-05 Thread Stephan Hoyer
OK, in that case let's get to work over in
https://github.com/numpy/numpy_stubs!

On Tue, Dec 5, 2017 at 2:43 PM Fernando Perez  wrote:

> On Tue, Dec 5, 2017 at 2:19 PM, Nathaniel Smith  wrote:
>
>> On Tue, Dec 5, 2017 at 10:04 AM, Stephan Hoyer  wrote:
>> > This discussion has died down, but I don't want to lose momentum .
>> >
>> > It sounds like there is at least strong interest from a subset of our
>> > community in type annotations. Are there any objections to the first
>> part of
>> > my plan, to start developing type stubs for NumPy in separate
>> repository?
>>
>> I think there's been plenty of time for folks to object to this if
>> they wanted, so we can assume consensus until we hear otherwise.
>>
>
> peanut_gallery.throw('+1', thanks=True)   # very happy to see this moving
> forward
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What is the pythonic way to write a function that handles arrays and scalars?

2017-12-12 Thread Stephan Hoyer
On Tue, Dec 12, 2017 at 5:07 PM Mark Campanelli 
wrote:

> I think I saw some other discussion recently about numpy joining forces
> with Python 3's gradual type system. Is there any draft formal proposal for
> this yet? If numpy+scipy wants to scale to "bigger" projects, I think it
> behooves the community to get rid of this messiness.
>

We're still figuring this out, but if you're interested please follow along
(and pitch in!):
https://github.com/numpy/numpy_stubs
https://github.com/numpy/numpy/issues/7370
https://github.com/python/typing/issues/516

Cheers,
Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What is the pythonic way to write a function that handles arrays and scalars?

2017-12-12 Thread Stephan Hoyer
On Tue, Dec 12, 2017 at 6:20 PM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> The real magic happens when you ducktype, and ensure your function
> works both for arrays and scalars on its own. This is more often
> possible than you might think!


Sadly, this still doesn't work in a type-stable way.

NumPy's ufuncs convert 0-dimensional arrays into scalars. The typing rules
for functions like np.sin() look like:
- scalar or 0d array -> scalar
- 1d or higher dimensional array -> array
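
Concretely:

>>> import numpy as np
>>> type(np.sin(np.array(1.0)))  # 0d array in, scalar out
<class 'numpy.float64'>
>>> type(np.sin(np.array([1.0])))  # 1d array in, array out
<class 'numpy.ndarray'>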

I'm not entirely sure, but I suspect this was a practical rather than
principled choice.

NumPy scalars are "duck arrays" of sorts (e.g., with shape and dtype
attributes) which helps to some extent, but the bugs that slip through are
even harder to understand. This wart reminds me of how mixed basic/advanced
indexing reorders sliced dimensions to make the result more "intuitive",
which only works in some cases.

I usually favor coercing all arguments to my functions to NumPy arrays with
np.asarray(), but to guarantee the return type you would also need to
coerce the result with np.asarray().
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Moving NumPy's PRNG Forward

2018-01-19 Thread Stephan Hoyer
On Fri, Jan 19, 2018 at 6:57 AM Robert Kern  wrote:

> As an alternative, we may also want to leave `np.random.RandomState`
> entirely fixed in place as deprecated legacy code that is never updated.
> This would allow current unit tests that depend on the stream-compatibility
> that we previously promised to still pass until they decide to update.
> Development would move to a different class hierarchy with new names.
>

I like this alternative, but I would hesitate to call it "deprecated".
Users who care about exact reproducibility across NumPy versions (e.g., for
testing) are probably less concerned about performance, and could continue
to use it.

New random number generator classes could implement their own guarantees
about compatibility across their methods.

I am personally not at all interested in preserving any stream
> compatibility for the `numpy.random.*` aliases or letting the user swap out
> the core PRNG for the global PRNG that underlies them. `np.random.seed()`
> should be discouraged (if not outright deprecated) in favor of explicitly
> passing around instances.
>

I agree that np.random.seed() should be discouraged, but it feels very late
in NumPy's development to remove it.
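
The instance-passing style works fine today; a minimal sketch:

import numpy as np

def simulate(n, random_state):
    # all randomness flows through the explicitly passed instance
    return random_state.normal(size=n)

rng = np.random.RandomState(12345)
print(simulate(3, rng))  # reproducible without touching global state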

If we do alter the random number streams for numpy.random.*, it seems that
we should probably issue a warning (at least for a several major versions)
whenever numpy.random.seed() is called. This could get pretty noisy. I
guess that's all the more incentive to switch to random state objects.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] NumPy should not silently promote numbers to strings

2018-02-08 Thread Stephan Hoyer
This is one of my oldest NumPy pain-points:
>>> np.array([1, 2, 'three'])
array(['1', '2', 'three'], dtype='<U21')

Occasionally that is what you want (e.g.,
https://github.com/pydata/xarray/pull/1847), but mostly it just hides bugs
until later. It's certainly very un-Pythonic.

The sane promotion rule would be `np.promote_types(str, float) -> object`,
not a size 32 string.
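
For reference, what NumPy currently does:

>>> np.promote_types(str, float)
dtype('<U32')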

Is it way too late to fix this for NumPy, or is this something we could
change in a major release? It would certainly need at least a deprecation
cycle. This is easy enough to introduce accidentally that there are
undoubtedly many users whose code would break if we changed this.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy should not silently promote numbers to strings

2018-02-08 Thread Stephan Hoyer
On Thu, Feb 8, 2018 at 11:00 PM Eric Wieser 
wrote:

> Presumably you would extend that to all (str, np.number), or even (str,
> np.generic_)?
>
Yes, I'm currently doing (np.character, np.number) and (np.character,
np.bool_). But only in direct consultation with the diagram of NumPy's type
hierarchy :).

> I suppose there’s the argument that with python-3-only support around the
> corner, even (str, bytes) should go to object.
>
Yes, that's also pretty bad.

The current behavior (str, bytes) -> str relies on bytes being valid ASCII:
>>> np.array([b'\xFF', u'cd'])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0:
ordinal not in range(128)

It exactly matches Python 2's str/unicode behavior, but doesn't make sense
at all in a Python 3 world.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Permissable NumPy logo usage

2018-02-16 Thread Stephan Hoyer
I don't know the history of the NumPy logo, or who officially owns the
rights to NumPy's branding at this point. In principle, that might be
NumFOCUS, but the logo far predates NumFOCUS and NumFOCUS's fiscal
sponsorship of NumPy. Looking at the Git history, it looks like David
Cournapeau added it to NumPy's repo back in 2009:
https://github.com/numpy/numpy/commit/c5b2f31aeafa32c705f87f5801a952e394063a3d

Just speaking for myself, I think this use for NumPy's logo and name is
appropriate and within community norms. I don't think anyone would be
confuse your project with NumPy or assume any sort of official endorsement.

On Fri, Feb 16, 2018 at 10:52 AM Daniel Smith  wrote:

> Hello everyone,
> I have a project which combines NumPy and a quantum chemistry program
> (Psi4, psicode.org) for education and rapid prototyping. One of the
> authors has proposed a new logo which is a tweaked form of the the NumPy
> and Psi4 logos combined. I was curious if we were violating any NumPy
> copyright or community taboos in either using the NumPy name or a modified
> version of the NumPy logo. Any advice or direction would be most welcome.
>
> Current logo:
>
> https://github.com/psi4/psi4numpy/blob/master/media/psi4banner_numpy_interactive.png
>
> Proposed logo:
>
> https://github.com/loriab/psi4numpy/blob/0866d0fb67f2c9629e2ba37bc4a091e20695a09f/media/psi4numpybanner_eqn.png
>
> Project link:
> https://github.com/psi4/psi4numpy
>
> ChemRxiv:
>
> https://chemrxiv.org/articles/Psi4NumPy_An_Interactive_Quantum_Chemistry_Programming_Environment_for_Reference_Implementations_and_Rapid_Development/5746059
>
> Cheers,
> -Daniel
>
> —
> Daniel G. A. Smith
> Software Scientist
> The Molecular Sciences Software Institute (MolSSI )
> @dgas_smith
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] new NEP: np.AbstractArray and np.asabstractarray

2018-03-08 Thread Stephan Hoyer
Hi Nathaniel,

Thanks for starting the discussion!

Like Marten says, I think it would be useful to more clearly define what it
means to be an abstract array. ndarray has lots of methods/properties that
expose internal implementation (e.g., view, strides) that presumably we
don't want to require as part of this interfaces. On the other hand, dtype
and shape are almost assuredly part of this interface.

To help guide the discussion, it would be good to identify concrete
examples of types that should and should not satisfy this interface, e.g.,
Marten's case 1: works exactly like ndarray, but stores data differently:
parallel arrays (e.g., dask.array), sparse arrays (e.g.,
https://github.com/pydata/sparse), hypothetical non-strided arrays (e.g.,
always C ordered).
Marten's case 2: same methods as ndarray, but gives different results:
np.ma.MaskedArray, arrays with units (quantities), maybe labeled arrays
like xarray.DataArray

I don't think we have a hope of making a single base class for case 2 work
with everything in NumPy, but we can define interfaces with different
levels of functionality.

Because there is such a gradation of "duck array" types, I agree with
Marten that we should not deprecate NDArrayOperatorsMixin. It's useful for
types like xarray.Dataset that define __array_ufunc__ but cannot satisfy
the full abstract array interface.

Finally, for the name, what about `asduckarray`? Though perhaps that could
be a source of confusion, given the gradation of duck-array-like types.

Cheers,
Stephan

On Thu, Mar 8, 2018 at 7:07 AM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> Hi Nathaniel,
>
> Overall, hugely in favour!  For detailed comments, it would be good to
> have a link to a PR; could you put that up?
>
> A larger comment: you state that you think `np.asanyarray` is a
> mistake since `np.matrix` and `np.ma.MaskedArray` would pass through
> and that those do not strictly mimic `NDArray`. Here, I agree with
> `matrix` (but since we're deprecating it, let's remove that from the
> discussion), but I do not see how your proposed interface would not
> let `MaskedArray` pass through, nor really that one would necessarily
> want that.
>
> I think it may be good to distinguish two separate cases:
> 1. Everything has exactly the same meaning as for `ndarray` but the
> data is stored differently (i.e., only `view` does not work). One can
> thus expect that for `output = function(inputs)`, at the end all
> `duck_output == ndarray_output`.
> 2. Everything is implemented but operations may give different output
> (depending on masks for masked arrays, units for quantities, etc.), so
> generally `duck_output != ndarray_output`.
>
> Which one of these are you aiming at? By including
> `NDArrayOperatorsMixin`, it would seem option (2), but perhaps not? Is
> there a case for both separately?
>
> Smaller general comment: at least in the NEP I would not worry about
> deprecating `NDArrayOperatorsMixin` - this may well be handy in itself
> (for things that implement `__array_ufunc__` but do not have shape,
> etc. (I have been doing some work on creating ufunc chains that would
> use this -- but they definitely are not array-like). Similarly, I
> think there is room for an `NDArrayShapeMixin` which might help with
> `concatenate` and friends.
>
> Finally, on the name: `asarray` and `asanyarray` are just shims over
> `array`, so one option would be to add an argument in `array` (or
> broaden the scope of `subok`).
>
> As an explicit suggestion, one could introduce a `duck` or `abstract`
> argument to `array` which is used in `asarray` and `asanyarray` as
> well (corresponding to options 1 and 2), and eventually default to
> something sensible (I would think `False` for `asarray` and `True` for
> `asanyarray`).
>
> All the best,
>
> Marten
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] new NEP: np.AbstractArray and np.asabstractarray

2018-03-08 Thread Stephan Hoyer
On Thu, Mar 8, 2018 at 5:54 PM Juan Nunez-Iglesias 
wrote:

> On Fri, Mar 9, 2018, at 5:56 AM, Stephan Hoyer wrote:
>
> Marten's case 1: works exactly like ndarray, but stores data differently:
> parallel arrays (e.g., dask.array), sparse arrays (e.g.,
> https://github.com/pydata/sparse), hypothetical non-strided arrays (e.g.,
> always C ordered).
>
>
> Two other "hypotheticals" that would fit nicely in this space:
> - the Open Connectome folks (https://neurodata.io) proposed linearising
> indices using space-filling curves, which minimizes cache misses (or IO
> reads) for giant volumes. I believe they implemented this but can't find it
> currently.
> - the N5 format for chunked arrays on disk:
> https://github.com/saalfeldlab/n5
>

I think these fall into another important category of duck arrays:
"indexable" arrays that serve as storage, but that don't support
computation. These sorts of arrays typically support operations like
indexing and define a handful of array-like properties (e.g., dtype and
shape), but not arithmetic, reductions or reshaping.

This means you can't quite use them as a drop-in replacement for NumPy
arrays in all cases, but that's OK. In contrast, both dask.array and sparse
do aspire to fill out nearly the full numpy.ndarray API.
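
A minimal sketch of the storage-only category (hypothetical class,
illustration only):

import numpy as np

class StorageArray:
    """Indexable storage: metadata and __getitem__, but no computation."""
    def __init__(self, data):
        self._data = np.asarray(data)
        self.shape = self._data.shape
        self.dtype = self._data.dtype

    def __getitem__(self, key):
        return self._data[key]

x = StorageArray([[1, 2], [3, 4]])
print(x.shape, x.dtype, x[0, 1])  # e.g. (2, 2) int64 2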
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Where to discuss NEPs (was: Re: new NEP: np.AbstractArray and np.asabstractarray)

2018-03-09 Thread Stephan Hoyer
I also have a slight preference for managing the discussion on GitHub,
which is a bit more fully featured than email for long discussion (e.g., it
supports code formatting and editing comments). But I'm really OK either
way as long as discussion is kept in one place.

We could still stipulate that NEPs are advertised on the mailing list:
first, to announce them, and second, before merging them marked as
accepted. We could even still merge rejected/abandoned NEPs as long as they
are clearly marked.

On Fri, Mar 9, 2018 at 7:24 AM Charles R Harris 
wrote:

> On Thu, Mar 8, 2018 at 11:26 PM, Ralf Gommers 
> wrote:
>
>>
>>
>> On Thu, Mar 8, 2018 at 8:22 PM, Nathaniel Smith  wrote:
>>
>>> On Thu, Mar 8, 2018 at 7:06 AM, Marten van Kerkwijk
>>>  wrote:
>>> > Hi Nathaniel,
>>> >
>>> > Overall, hugely in favour!  For detailed comments, it would be good to
>>> > have a link to a PR; could you put that up?
>>>
>>> Well, there's a PR here: https://github.com/numpy/numpy/pull/10706
>>>
>>> But, this raises a question :-). (One which also came up here:
>>> https://github.com/numpy/numpy/pull/10704#issuecomment-371684170)
>>>
>>> There are two sensible workflows we could use (or at least, two that I
>>> can think of):
>>>
>>> 1. We merge updates to the NEPs as we go, so that whatever's in the
>>> repo is the current draft. Anyone can go to the NEP webpage at
>>> http://numpy.org/neps (WIP, see #10702) to see the latest version of
>>> all NEPs, whether accepted, rejected, or in progress. Discussion
>>> happens on the mailing list, and line-by-line feedback can be done by
>>> quote-replying and commenting on individual lines. From time to time,
>>> the NEP author takes all the accumulated feedback, updates the
>>> document, and makes a new post to the list to let people know about
>>> the updated version.
>>>
>>> This is how python-dev handles PEPs.
>>>
>>> 2. We use Github itself to manage the review. The repo only contains
>>> "accepted" NEPs; draft NEPs are represented by open PRs, and rejected
>>> NEPs are represented by PRs that were closed-without-merging.
>>> Discussion uses Github's commenting/review tools, and happens in the
>>> PR itself.
>>>
>>> This is roughly how Rust handles their RFC process, for example:
>>> https://github.com/rust-lang/rfcs
>>>
>>> Trying to do some hybrid version of these seems like it would be
>>> pretty painful, so we should pick one.
>>>
>>> Given that historically we've tried to use the mailing list for
>>> substantive features/planning discussions, and that our NEP process
>>> has been much closer to workflow 1 than workflow 2 (e.g., there are
>>> already a bunch of old NEPs already in the repo that are effectively
>>> rejected/withdrawn), I think we should maybe continue that way, and
>>> keep discussions here?
>>>
>>> So my suggestion is discussion should happen on the list, and NEP
>>> updates should be merged promptly, or just self-merged. Sound good?
>>
>>
>> Agreed that overall (1) is better than (2), rejected NEPs should be
>> visible. However there's no need for super-quick self-merge, and I think it
>> would be counter-productive.
>>
>> Instead, just send a PR, leave it open for some discussion, and update
>> for detailed comments (as well as long in-depth discussions that only a
>> couple of people care about) in the Github UI and major ones on the list.
>> Once it's stabilized a bit, then merge with status "Draft" and update once
>> in a while. I think this is also much more in line with what python-dev
>> does, I have seen substantial discussion on Github and have not seen quick
>> self-merges.
>>
>>
> I have a slight preference for managing the discussion on Github. Note
> that I added a `component: NEP` label and that discussion can take place on
> merged/closed PRs, the index could also contain links to proposed NEP PRs.
> If we just left PR open until acceptance/rejection the label would allow
> the proposed NEPs to be easily found, especially if we include the NEP
> number in the title, something like `NEP-10111: ` .
>
> Chuck
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Where to discuss NEPs (was: Re: new NEP: np.AbstractArray and np.asabstractarray)

2018-03-09 Thread Stephan Hoyer
I'll note that we basically used GitHub for revising __array_ufunc__ NEP,
and I think that worked out better for everyone involved. The discussion
was a little too specialized and high volume to be well handled on the
mailing list.

On Fri, Mar 9, 2018 at 8:58 AM Stephan Hoyer  wrote:

> I also have a slight preference for managing the discussion on GitHub,
> which is a bit more fully featured than email for long discussion (e.g., it
> supports code formatting and editing comments). But I'm really OK either
> way as long as discussion is kept in one place.
>
> We could still stipulate that NEPs are advertised on the mailing list:
> first, to announce them, and second, before merging them marked as
> accepted. We could even still merge rejected/abandoned NEPs as long as they
> are clearly marked.
>
> [...]
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] PR to add an initializer kwarg to ufunc.reduce (and similar functions)

2018-03-26 Thread Stephan Hoyer
This looks like a very logical addition to the reduce interface. It has my
support!

I would have preferred the more descriptive name "initial_value", but
consistency with functools.reduce makes a compelling case for "initializer".
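
For illustration, with the kwarg from PR #10635 (proposed API, not part of
any released NumPy yet):

import numpy as np

np.minimum.reduce(np.array([]), initializer=10.0)   # -> 10.0 instead of raising
np.add.reduce(np.array([1, 2, 3]), initializer=10)  # -> 16, a sum with a start value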

On Sun, Mar 25, 2018 at 1:15 PM Eric Wieser 
wrote:

> To reiterate my comments in the issue - I'm in favor of this.
>
> It seems seem especially valuable for identity-less functions (`min`,
> max`, `lcm`), and the argument name is consistent with `functools.reduce`,
> too.
>
> The only argument I can see against merging this would be `kwarg`-creep of
> `reduce`, and I think this has enough use cases to justify that.
>
> I'd like to merge in a few days, if no one else has any opinions.
>
> Eric
>
> On Fri, 16 Mar 2018 at 10:13 Hameer Abbasi 
> wrote:
>
>> Hello, everyone. I’ve submitted a PR to add a initializer kwarg to
>> ufunc.reduce. This is useful in a few cases, e.g., it allows one to supply
>> a “default” value for identity-less ufunc reductions, and specify an
>> initial value for reductions such as sum (other than zero.)
>>
>> Please feel free to review or leave feedback, (although I think Eric and
>> Marten have picked it apart pretty well).
>>
>> https://github.com/numpy/numpy/pull/10635
>>
>> Thanks,
>>
>> Hameer
>> Sent from Astro  for Mac
>>
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Extending ufunc signature syntax for matmul, frozen dimensions

2018-04-30 Thread Stephan Hoyer
On Sun, Apr 29, 2018 at 2:48 AM Matti Picus  wrote:

> The  proposed solution to issue #9029 is to extend the meaning of a
> signature so "syntax like (n?,k),(k,m?)->(n?,m?) could mean that n and m
> are optional dimensions; if missing in the input, they're treated as 1, and
> then dropped from the output"


I agree that this is an elegant fix for matmul, but are there other
use-cases for "optional dimensions" in gufuncs?

It feels a little wrong to add gufunc features if we can only think of one
function that can use them.
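
For reference, the existing matmul shape behavior that the optional
dimensions would encode (sketch):

>>> import numpy as np
>>> np.matmul(np.ones((3, 4)), np.ones((4, 5))).shape
(3, 5)
>>> np.matmul(np.ones(4), np.ones((4, 5))).shape  # n missing, dropped from output
(5,)
>>> np.matmul(np.ones((3, 4)), np.ones(4)).shape  # m missing
(3,)
>>> np.matmul(np.ones(4), np.ones(4)).shape  # both missing: scalar
()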
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] summary of "office Hours" open discusison April 25

2018-05-01 Thread Stephan Hoyer
I'm happy to chat about how pandas has done things. It's worth noting that
although it may *look* like Jeff Reback is a full-time maintainer (he does
a lot of work!), he has actually been maintaining pandas as a side-project.
Mostly the project bumbles along without a clear direction, somewhat
similar to the case for NumPy for the past few years, with new
contributions coming from either interested users or core developers when
they have time and interest.

On Tue, May 1, 2018 at 10:00 AM Nelle Varoquaux 
wrote:

> Furher resources to consider:
>> - How did Jupyter organize their roadmap (ask Brian Granger)?
>> - How did Pandas run the project with a full time maintainer (Jeff
>> Reback)?
>> - Can we copy other projects' management guidelines?
>>
>
> scikit-learn also has a number of full time developers. Might be worth
> checking out what they did.
>
> Cheers,
> N
>
>
>>
>> We did not set a time for another online discussion, since it was felt
>> that maybe near/during the sprint in May would be appropriate.
>>
>> I apologize for any misrepresentation.
>>
>> Matti Picus
>>
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Extending ufunc signature syntax for matmul, frozen dimensions

2018-05-02 Thread Stephan Hoyer
On Wed, May 2, 2018 at 8:39 AM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> I think we should not decide too readily on what is "reasonable" to
> expect for a ufunc input.
>

I agree strongly with this.

I can think of a couple of other use-cases off hand:
- xarray.Dataset is a dict-like container of multiple arrays.
Matrix-multiplication with a numpy array could make sense (just map over
all the contained arrays), but xarray.Dataset itself is not an array and
thus does not define shape.
- tensorflow.Tensor can have a dynamic shape that is only known when
computation is explicitly run, not when computation is defined in Python.

The problem is even bigger for np.matmul because NumPy also wants to use
the same logic for overriding @, and Python's built-in operators definitely
should not have such restrictions.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy sprint May 24-25 at BIDS

2018-05-18 Thread Stephan Hoyer
I will also be attending, on at least Thursday (and hopefully Friday, too).

Best,
Stephan

On Thu, May 17, 2018 at 1:40 PM Jaime Fernández del Río <
jaime.f...@gmail.com> wrote:

> $#!#, was looking at the wrong calendar month: Thursday half day, Friday
> all day.
>
> Jaime
>
> On Thu, May 17, 2018 at 4:37 PM Jaime Fernández del Río <
> jaime.f...@gmail.com> wrote:
>
>> OK, make that all day Friday only, if it's Friday and Saturday.
>>
>> Jaime
>>
>> On Thu, May 17, 2018 at 4:36 PM Jaime Fernández del Río <
>> jaime.f...@gmail.com> wrote:
>>
>>> Hi Matti,
>>>
>>> I will be joining you on Thursday, sometime around noon, and all day
>>> Friday.
>>>
>>> Jaime
>>>
>>> On Thu, May 17, 2018 at 4:11 PM Matti Picus 
>>> wrote:
>>>
 On 09/05/18 13:33, Matti Picus wrote:
 > A reminder - we will take advantage of a few NumPy developers being
 at
 > Berkeley to hold a two day sprint May 24-25
 > https://scisprints.github.io/#may-numpy-developer-sprint.
 > We invite any core contributors who would like to attend and can help
 > if needed with travel and accomodations.
 >
 > Stefan and Matti
 So far I know about Stefan, Nathaniel, Chuck and me. Things will work
 better if we can get organized ahead of time. Anyone else planning on
 attending for both days or part of the sprint, please drop me a line.
 If
 there are any issues, pull requests, NEPs, or ideas you would like us
 to
 work on please let me know, or add it to the Trello card
 https://trello.com/c/fvSYkm2w

 Matti
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@python.org
 https://mail.python.org/mailman/listinfo/numpy-discussion

>>>
>>>
>>> --
>>> (\__/)
>>> ( O.o)
>>> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus
>>> planes de dominación mundial.
>>>
>>
>>
>> --
>> (\__/)
>> ( O.o)
>> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
>> de dominación mundial.
>>
>
>
> --
> (\__/)
> ( O.o)
> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
> de dominación mundial.
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] matmul as a ufunc

2018-05-28 Thread Stephan Hoyer
On Mon, May 21, 2018 at 5:42 PM Matti Picus  wrote:

> - create a wrapper that can convince the ufunc mechanism to call
> __array_ufunc__ even on functions that are not true ufuncs
>

I am somewhat opposed to this approach, because __array_ufunc__ is about
overloading ufuncs, and as soon as we relax this guarantee the set of
invariants __array_ufunc__ implementors rely on becomes much more limited.

We really should have another mechanism for arbitrary function overloading
in NumPy (NEP to follow shortly!).


> - expand the semantics of core signatures so that a single matmul ufunc
> can implement matrix-matrix, vector-matrix, matrix-vector, and
> vector-vector multiplication.


I was initially concerned that adding optional dimensions for gufuncs would
introduce additional complexity for only the benefit of a single function
(matmul), but I'm now convinced that it makes sense:
1. All other arithmetic overloads use __array_ufunc__, and it would be nice
to keep @/matmul in the same place.
2. There's a common family of gufuncs for which optional dimensions like
np.matmul make sense: matrix functions where 1D arrays should be treated as
2D row- or column-vectors.

One example of this class of behavior would be np.linalg.solve, which could
support vectors like Ax=b and matrices like Ax=B with the signature
(m,m),(m,n?)->(m,n?). We couldn't immediately make np.linalg.solve a gufunc
since it uses a subtly different dispatching rule, but it's the same
use-case.
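As a sanity check, np.linalg.solve already behaves this way for plain
(non-stacked) inputs, which is exactly the shape behavior the
(m,m),(m,n?)->(m,n?) signature would encode:

    import numpy as np

    A = np.eye(3)
    np.linalg.solve(A, np.ones(3)).shape       # (3,):   Ax = b with a vector b
    np.linalg.solve(A, np.ones((3, 2))).shape  # (3, 2): AX = B with a matrix B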

Another example would be the "matrix transpose" function that has been
occasionally proposed, to swap the last two dimensions of an array. It
could have the signature (m?,n)->(n,m?), which ensures that it is still well
defined (as the identity) on 1d arrays.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] matmul as a ufunc

2018-05-28 Thread Stephan Hoyer
On Mon, May 28, 2018 at 7:36 PM Eric Wieser 
wrote:

> which ensure that it is still well defined (as the identity) on 1d arrays.
>
> This strikes me as a bad idea. There’s already enough confusion from
> beginners that array_1d.T is a no-op. If we introduce a matrix-transpose,
> it should either error on <1d inputs with a useful message, or insert the
> extra dimension. I’d favor the former.
>
To be clear: matrix transpose is an example use-case rather than a serious
proposal in this discussion.

But given that idiomatic NumPy code uses 1D arrays in favor of explicit
row/column vectors with shapes (1,n) and (n,1), I do think it does make
sense for matrix transpose on 1D arrays to be the identity, because matrix
transpose should convert back and forth between row and column vectors
representations.

Certainly, matrix transpose should error on 0d arrays, because it doesn't
make sense to transpose a scalar.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Adding take_along_axis and put_along_axis functions

2018-05-28 Thread Stephan Hoyer
As I'm sure I stated in the GitHub discussion, I strongly support adding
these functions to NumPy. This logic is non-trivial to get right and is
quite broadly useful.

These names also seem natural to me.
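For readers following along, the 2-D example in Eric's message below
evaluates like this (checked against NumPy 1.15, where these functions
landed):

    import numpy as np

    a = np.array([[4, 1], [5, 2], [6, 3]])
    b = np.array([["four", "one"], ["five", "two"], ["six", "three"]])
    i = a.argsort(axis=1)             # [[1, 0], [1, 0], [1, 0]]
    np.take_along_axis(b, i, axis=1)
    # array([['one', 'four'],
    #        ['two', 'five'],
    #        ['three', 'six']], dtype='<U5')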

On Mon, May 28, 2018 at 8:07 PM Eric Wieser 
wrote:

> These functions provide a vectorized way of using one array to look up
> items in another. In particular, they extend the 1d:
>
> a = np.array([4, 5, 6, 1, 2, 3])
> b = np.array(["four", "five", "six", "one", "two", "three"])
> i = a.argsort()
> b_sorted = b[i]
>
> To work for higher-dimensions:
>
> a = np.array([[4, 1], [5, 2], [6, 3]])
> b = np.array([["four", "one"],  ["five", "two"], ["six", "three"]])
> i = a.argsort(axis=1)
> b_sorted = np.take_along_axis(b, i, axis=1)
>
> put_along_axis is the obvious but less useful dual to this operation,
> inserting elements rather than extracting them. (Unlike put and take
> which are not obvious duals).
>
> These have been merged in gh-11105, but as a new addition this
> probably should have gone by the mailing list first.
>
> There was a lack of consensus in gh-8714 about how best to generalize
> to differing dimensions, so only the non-controversial case where the
> indices and array have the same dimensions was implemented.
>
> These names were chosen to mirror apply_along_axis, which behaves
> similarly. Do they seem reasonable?
>
> Eric
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Where to discuss NEPs (was: Re: new NEP: np.AbstractArray and np.asabstractarray)

2018-05-29 Thread Stephan Hoyer
Reviving this discussion --
I don't really care what our policy is, but can we make a decision one way
or the other about where we discuss NEPs? We've had a revival of NEP
writing recently, so this is very timely.

Previously, I was in slight favor of doing discussion on GitHub. Now that
I've started doing a bit of NEP writing, I've started to swing the other
way, since it would be nice to be able to reference draft/rejected NEPs in
a consistent way -- and rendered HTML is more readable than raw RST in pull
requests.

On Wed, Mar 14, 2018 at 6:52 PM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> Apparently, where and how to discuss enhancement proposals was
> recently a topic on the python mailing list as well -- see the
> write-up at LWN:
> https://lwn.net/SubscriberLink/749200/4343911ee71e35cf/
> The conclusion seems to be that one should switch to mailman3...
> -- Marten
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs

2018-05-30 Thread Stephan Hoyer
On Wed, May 30, 2018 at 11:15 AM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> My PR provides the ability to indicate in the signature that a core
> dimension can be broadcast, by using a suffix of "|1". Thus, the
> signature of `all_equal` would become:
>
> ```
> (n|1),(n|1)->()
> ```
>

I read this as "dimensions may have size n or 1", which would exclude the
possibility of scalars.

For all_equal, I think you could also use a signature like "(m?),(n?)->()",
with a short-cut to automatically return False if m != n.
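As a rough pure-Python illustration of the plain (n),(n)->() core signature
(without the proposed "|1" broadcasting), np.vectorize accepts gufunc-style
signatures:

    import numpy as np

    # Emulates an all_equal gufunc; the core dimension n must match exactly.
    all_equal = np.vectorize(lambda a, b: bool((a == b).all()),
                             signature='(n),(n)->()')

    all_equal(np.arange(3), np.arange(3))      # array(True)
    all_equal(np.zeros((4, 3)), np.arange(3))  # array([False, False, False, False])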
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs

2018-05-31 Thread Stephan Hoyer
On Thu, May 31, 2018 at 4:21 AM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> I think the case for frozen dimensions is much more solid that just
> `cross1d` - there are many operations that work on size-3 vectors.
> Indeed, as I noted in the PR, I have just been wrapping a
> Standards-of-Astronomy library in gufuncs, and many of its functions
> require size-3 vectors or 3x3 matrices [1]. Of course, I can put
> checks on the sizes, and I've now done that in a custom type resolver
> (which I needed anyway since, as you say, user dtypes is currently not
> easy), but there is a real problem for functions that take scalars and
> produce vectors: with a signature like `(),()->(n)`, I am forced to
> pass in an output with size 3, which is very inconvenient (especially
> if I then also want to override with `__array_ufunc__` - now my
> Quantity implementation also has to start changing an output already
> put in. So, having frozen dimensions is definitely helpful for
> developers of new gufuncs.
>

I agree that the use-cases for frozen dimensions are well motivated. It's
not as common as writing code that supports arbitrary dimensions, but given
that the real world is three dimensional it comes up with some regularity.
Certainly for these use-cases it would add significant value (not
requiring pre-allocation of output arrays).

Furthermore, with frozen dimensions, the signature is not just
> immediately clear, `(),()->(3)` for the example above, it is also
> better in telling users about what a function does.
>

Yes, frozen dimensions really do feel like a natural fit. There is no
ambiguity about what an integer means in a gufunc signature, so the
complexity of the gufunc model (for users and __array_ufunc__ implementors)
would remain roughly fixed.

In contrast, broadcasting would certainly increase the complexity of the
model, as evidenced by the new syntax we would need. This may or may not be
justified. Currently I am at -0.5 along with Nathaniel here.


> Indeed, I think this addition has much more justification than the `?`
> which is much more complex than the fixed size, yet neither
> particularly clear nor useful beyond the single purpose of matmul. (It
> is just that that single purpose has fairly high weight...)


Agreed, though at least in principle there is the slightly broader use
case of handling arguments that are either matrices or column/row vectors.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs

2018-05-31 Thread Stephan Hoyer
On Wed, May 30, 2018 at 5:01 PM Matthew Harrigan 
wrote:

> "short-cut to automatically return False if m != n", that seems like a
> silent bug
>

I guess it depends on the use-cases. This is how np.array_equal() works:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.array_equal.html

We could even imagine incorporating this hypothetical "equality along some
axes with broadcasting" functionality into axis/axes arguments for
array_equal() if we choose this behavior.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs

2018-06-01 Thread Stephan Hoyer
On Fri, Jun 1, 2018 at 2:42 PM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> Having done that, I felt the examples actually justified the frozen
> dimensions quite well. Given that you're the one who expressed most doubts
> about them, could you have a look? Ideally, I'd avoid having to write a NEP
> for this, and the examples do seem to make it quite obvious that this
> change to the signature is the way to go, as its meaning is dead obvious.
> And the implementation is super-straightforward...
>

I do think it would be valuable to have a brief NEP on this, especially on
the solution for matmul. NEPs don't have to be long, and don't need to go
into the full detail of implementations. But they are a nice place to
summarize design discussions.

In fact, I would say the text you have below is nearly enough for one or
two NEPs. The parts that are missing would be valuable to add anyways:
- A brief discussion (a sentence or two) of potential broader use-cases for
optional dimensions (ufuncs that act on row/column vectors and matrices).
- A brief discussion of rejected alternatives (only a few sentences for
each alternative).
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] NEP: Dispatch Mechanism for NumPy’s high level API

2018-06-02 Thread Stephan Hoyer
Matthew Rocklin and I have written NEP-18, which proposes a new dispatch
mechanism for NumPy's high level API:
http://www.numpy.org/neps/nep-0018-array-function-protocol.html

There has already been a little bit of scattered discussion on the pull
request (https://github.com/numpy/numpy/pull/11189), but per NEP-0 let's
try to keep high-level discussion here on the mailing list.

The full text of the NEP is reproduced below:

==================================================
NEP: Dispatch Mechanism for NumPy's high level API
==================================================

:Author: Stephan Hoyer 
:Author: Matthew Rocklin 
:Status: Draft
:Type: Standards Track
:Created: 2018-05-29

Abstract
--------

We propose a protocol to allow arguments of numpy functions to define
how that function operates on them. This allows other libraries that
implement NumPy's high level API to reuse Numpy functions. This allows
libraries that extend NumPy's high level API to apply to more NumPy-like
libraries.

Detailed description
--------------------

Numpy's high level ndarray API has been implemented several times
outside of NumPy itself for different architectures, such as for GPU
arrays (CuPy), Sparse arrays (scipy.sparse, pydata/sparse) and parallel
arrays (Dask array) as well as various Numpy-like implementations in the
deep learning frameworks, like TensorFlow and PyTorch.

Similarly there are several projects that build on top of the Numpy API
for labeled and indexed arrays (XArray), automatic differentiation
(Autograd, Tangent), higher order array factorizations (TensorLy), etc.
that add additional functionality on top of the Numpy API.

We would like to be able to use these libraries together, for example we
would like to be able to place a CuPy array within XArray, or perform
automatic differentiation on Dask array code. This would be easier to
accomplish if code written for NumPy ndarrays could also be used by
other NumPy-like projects.

For example, we would like for the following code example to work
equally well with any Numpy-like array object:

.. code:: python

    def f(x):
        y = np.tensordot(x, x.T)
        return np.mean(np.exp(y))

Some of this is possible today with various protocol mechanisms within
Numpy.

-  The ``np.exp`` function checks the ``__array_ufunc__`` protocol
-  The ``.T`` method works using Python's method dispatch
-  The ``np.mean`` function explicitly checks for a ``.mean`` method on
   the argument

However other functions, like ``np.tensordot`` do not dispatch, and
instead are likely to coerce to a Numpy array (using the ``__array__``)
protocol, or err outright. To achieve enough coverage of the NumPy API
to support downstream projects like XArray and autograd we want to
support *almost all* functions within Numpy, which calls for a more
reaching protocol than just ``__array_ufunc__``. We would like a
protocol that allows arguments of a NumPy function to take control and
divert execution to another function (for example a GPU or parallel
implementation) in a way that is safe and consistent across projects.

Implementation
--------------

We propose adding support for a new protocol in NumPy,
``__array_function__``.

This protocol is intended to be a catch-all for NumPy functionality that
is not covered by existing protocols, like reductions (like ``np.sum``)
or universal functions (like ``np.exp``). The semantics are very similar
to ``__array_ufunc__``, except the operation is specified by an
arbitrary callable object rather than a ufunc instance and method.

The interface
~~~~~~~~~~~~~

We propose the following signature for implementations of
``__array_function__``:

.. code-block:: python

    def __array_function__(self, func, types, args, kwargs)

-  ``func`` is an arbitrary callable exposed by NumPy's public API,
   which was called in the form ``func(*args, **kwargs)``.
-  ``types`` is a list of types for all arguments to the original NumPy
   function call that will be checked for an ``__array_function__``
   implementation.
-  The tuple ``args`` and dict ``**kwargs`` are directly passed on from the
   original call.

Unlike ``__array_ufunc__``, there are no high-level guarantees about the
type of ``func``, or about which of ``args`` and ``kwargs`` may contain
objects
implementing the array API. As a convenience for ``__array_function__``
implementors of the NumPy API, the ``types`` keyword contains a list of all
types that implement the ``__array_function__`` protocol.  This allows
downstream implementations to quickly determine if they are likely able to
support the operation.

Still to be determined: what guarantees can we offer for ``types``? Should
we promise that types are unique, and appear in the order in which they
are checked?

Example for a project implementing the NumPy API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Most implementations of ``__array_function__`` will start with two
checks:

1.  Is the given functio
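For context, a minimal implementation following these two checks might look
like the sketch below (MyArray and HANDLED_FUNCTIONS are hypothetical names
used for illustration, not part of the proposal's text):

    import numpy as np

    class MyArray:
        # Hypothetical registry mapping NumPy functions to our overrides.
        HANDLED_FUNCTIONS = {}

        def __array_function__(self, func, types, args, kwargs):
            # Check 1: is the given function one we know how to overload?
            if func not in self.HANDLED_FUNCTIONS:
                return NotImplemented
            # Check 2: are all argument types ones we know how to handle?
            if not all(issubclass(t, (np.ndarray, MyArray)) for t in types):
                return NotImplemented
            return self.HANDLED_FUNCTIONS[func](*args, **kwargs)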

Re: [Numpy-discussion] NEP: Dispatch Mechanism for NumPy’s high level API

2018-06-03 Thread Stephan Hoyer
On Sun, Jun 3, 2018 at 8:19 AM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> My more general comment is one of speed: for *normal* operation
> performance should be impacted as minimally as possible. I think this is a
> serious issue and feel strongly it *has* to be possible to avoid all
> arguments being checked for the `__array_function__` attribute, i.e., there
> should be an obvious way to ensure no type checking dance is done.
>

I agree that we should try to minimize the impact of dispatching on normal
operations. It would be helpful to identify examples of real workflows, so
we can measure the impact of doing these checks empirically. That said, I
think a small degradation in performance for code that works with small
arrays should be acceptable, because performance is already an accepted
limitation of using NumPy/Python for these use cases.

In most cases, I suspect that the overhead of a function call and checking
several arguments for "__array_function__" will be negligible, like the
situation for __array_ufunc__. I'm not strongly opposed to either of your
proposed solutions, but I do think it would be a little strange to insist
that we need a solution for __array_function__ when __array_ufunc__ was
fine.


> A. Two "namespaces", one for the undecorated base functions, and one
> completely trivial one for the decorated ones. The idea would be that if
> one knows one is dealing with arrays only, one would do `import
> numpy.array_only as np` (i.e., the reverse of the suggestion currently in
> the NEP, where the decorated ones are in their own namespace - I agree with
> the reasons for discounting that one).
>

I will mention this as a possibility.

I do think there is something to be said for clear separation of overloaded
and non-overloaded APIs. But if I were to choose between adding numpy.api
and numpy.array_only, I would pick numpy.api, because of the virtue of
preserving the existing numpy namespace as it currently exists.


> B. Automatic insertion by the decorator of an `array_only=np._NoValue` (or
> `coerce` and perhaps `subok=...` if not present) in the function signature,
> so that users who know that they have arrays only could pass
> `array_only=True` (name to be decided).
>

Rather than adding another argument to every NumPy function, I would rather
encourage writing np.asarray() explicitly.


> Note that both A and B could also address, at least partially, the problem
> of sometimes wanting to just use the old coercion methods, i.e., not having
> to implement every possible numpy function in one go in a new
> `__array_function__` on one's class.
>

Yes, agreed.


> 1. I'm rather unclear about the use of `types`. It can help me decide what
> to do, but I would still have to find the argument in question (e.g., for
> Quantity, the unit of the relevant argument). I'd recommend passing instead
> a tuple of all arguments that were inspected, in the inspection order;
> after all, it is just a `arg.__class__` away from the type, and in your
> example you'd only have to replace `issubclass` by `isinstance`.
>

The virtue of a `types` argument is that we can deduplicate arguments once,
rather than in each __array_function__ check. This could result in
significantly more efficient code, e.g., when np.concatenate() is called on
10,000 arrays with only two unique types, we don't need to loop through all
10,000 objects again to check that overloading is valid.

Even for Quantity, I suspect you will want two layers of checks:
1. A check to verify that every argument is a Quantity (or something
coercible to a Quantity). This could use `types` and return
`NotImplemented` when it fails.
2. A check to verify that units match. This will have custom logic for
different operations and will require checking all arguments -- not just
their unique types.

For many Quantity functions, the second check will indeed probably be super
simple (i.e., verifying that all units match). But the first check (with
`types`) really is something that basically every overload should do.


> 2. For subclasses, it would be very handy to have
> `ndarray.__array_function__`, so one can call super after changing
> arguments. (For `__array_ufunc__`, there was lots of question about whether
> this was useful, but it really is!!). [I think you already agreed with
> this, but want to have it in-place, as for subclasses of ndarray this is
> just as useful as it would be for subclasses of dask arrays.)
>

Yes, indeed.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP: Dispatch Mechanism for NumPy’s high level API

2018-06-03 Thread Stephan Hoyer
On Sun, Jun 3, 2018 at 11:12 AM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> On Sun, Jun 3, 2018 at 2:00 PM, Hameer Abbasi 
> wrote:
>
>>
>>- Objects that don’t implement ``__array_function__`` should be
>>treated as having returned ``np.NotImplementedButCoercible``.
>>   - This has the effect of coercing ``list``, etc.
>>   - At a minimum, to maintain compatibility, if all objects don’t
>>   implement ``__array_function__``, the old behaviour should stay.
>>
>> I think that in the proposed scheme this is effectively what happens.
>

The current proposal is to copy the behavior of __array_ufunc__. So the
non-existence of an __array_function__ attribute is indeed *not* equivalent
to returning NotImplemented: if no arguments implement __array_function__,
then yes they will all be coerced to NumPy arrays.

I do think there is elegance in defining a return value of
np.NotImplementedButCoercible as equivalent to the existence of
__array_function__. This resolves my design question about how coercible
arguments would be coerced with NotImplementedButCoercible: we would fall
back to the current behavior, which in most cases means all arguments are
coerced to NumPy arrays directly. Mixed return values of
NotImplementedButCoercible and NotImplemented would still result in
TypeError, and there would be no second chances for overloads.

This is simple enough that I am inclined to update the NEP to incorporate
the suggestion (thank you!).

My main question is whether we should also update __array_ufunc__ to
support returning NotImplementedButCoercible for consistency. My
inclination is yes: even though it's easy to implement a fallback of
converting all arguments to NumPy arrays for ufuncs, it is hard to do this
correctly from an __array_ufunc__ implementation, because __array_ufunc__
implementations do not know in what order they have been called.

The counter-argument would be that it's not worth adding new features to
__array_ufunc__ if use-cases haven't come up yet. But my guess is that most
users/implementors of __array_ufunc__ are ignorant of these finer details,
and not really worrying about them. Also, the list of binary operators in
Python is short enough that most implementations are OK with supporting
either all or none.

Actually, a return value of NotImplementedButCoercible would probably be
the right answer for some cases in xarray's current __array_ufunc__ method,
when we encounter ufunc methods for which we haven't written an
implementation (e.g., "outer" or "at").
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP: Dispatch Mechanism for NumPy’s high level API

2018-06-03 Thread Stephan Hoyer
On Sun, Jun 3, 2018 at 4:25 PM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> I think one might still want to know *where* the type occurs (e.g., as an
> output or index would have different implications).
>

This in certainly true in general, but given the complete flexibility of
__array_function__ there's no way we can make every check convenient. The
best we can do is make it easy to handle the common cases, where the
argument position does not matter.


> Possibly, a solution would rely on the same structure as used for the
> "dance". But as a general point, I don't see the advantage of passing types
> rather than arguments - less information for no benefit.
>

Maybe this is premature optimization, but there will certainly be fewer
unique types than arguments to check for types. I suspect this may make for
a noticeable difference in performance in use cases involving a large
number of arguments.

For example, suppose np.concatenate() is called on a list of 10,000 dask
arrays. Now dask.array.Array.__array_function__ needs to check all
arguments to decide whether it can use dask.array.concatenate() or needs to
return NotImplemented. By using the `types` argument, it only needs to do
isinstance() checks on the single argument in `types`, rather than all
10,000 overloaded function arguments.
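Sketching that check (DaskLikeArray is a hypothetical stand-in): because
`types` is deduplicated, the loop below runs over roughly two entries
rather than 10,000:

    import numpy as np

    class DaskLikeArray:
        def __array_function__(self, func, types, args, kwargs):
            # `types` holds unique types only, so this stays cheap even when
            # `args` contains thousands of arrays.
            if not all(issubclass(t, (np.ndarray, DaskLikeArray)) for t in types):
                return NotImplemented
            ...  # dispatch to the dask-style implementation of func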
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP: Random Number Generator Policy

2018-06-03 Thread Stephan Hoyer
On Sat, Jun 2, 2018 at 12:06 PM Robert Kern  wrote:

> We propose first freezing ``RandomState`` as it is and developing a new RNG
> subsystem alongside it.  This allows anyone who has been relying on our old
> stream-compatibility guarantee to have plenty of time to migrate.
> ``RandomState`` will be considered deprecated, but with a long deprecation
> cycle, at least a few years.  Deprecation warnings will start silent but
> become
> increasingly noisy over time.  Bugs in the current state of the code will
> *not*
> be fixed if fixing them would impact the stream.  However, if changes in
> the
> rest of ``numpy`` would break something in the ``RandomState`` code, we
> will
> fix ``RandomState`` to continue working (for example, some change in the
> C API).  No new features will be added to ``RandomState``.  Users should
> migrate to the new subsystem as they are able to.
>

Robert, thanks for this proposal. I think it makes a lot of sense and will
help maintain the long-term viability of numpy.random.

The main clarification I would like to see addressed is what "freezing
RandomState" means for top level functions in numpy.random. I think we
could safely swap out the underlying implementation if numpy.random.seed()
is not explicitly called, but how would we handle cases where a seed is
explicitly set?

You and I both agree that this is an anti-pattern for numpy.random, but
certainly there is plenty of code that relies on the stability of random
numbers when seeds are set by np.random.seed(). Similar to the case for
RandomState, we would presumably need to start issuing warnings when seed()
is explicitly called, which begs the question of what (if anything) we
propose to replace seed() with. I suppose this will be your next NEP :).
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP: Random Number Generator Policy

2018-06-03 Thread Stephan Hoyer
On Sun, Jun 3, 2018 at 5:39 PM Robert Kern  wrote:

> You and I both agree that this is an anti-pattern for numpy.random, but
>> certainly there is plenty of code that relies on the stability of random
>> numbers when seeds are set by np.random.seed(). Similar to the case for
>> RandomState, we would presumably need to start issuing warnings when seed()
>> is explicitly called, which begs the question of what (if anything) we
>> propose to replace seed() with.
>>
>
> Well, *I* propose `AttributeError`, myself…
>
>
>> I suppose this will be your next NEP :).
>>
>
> I deliberately left it out of this one as it may, depending on our
> choices, impinge upon the design of the new PRNG subsystem, which I
> declared out of scope for this NEP. I have ideas (besides the glib "Let
> them eat AttributeErrors!"), and now that I think more about it, that does
> seem like it might be in scope just like the discussion of freezing
> RandomState and StableRandom are. But I think I'd like to hold that thought
> a little bit and get a little more screaming^Wfeedback on the core proposal
> first. I'll return to this in a few days if not sooner.
>

For this NEP, it might be enough here to say that the current behavior of
np.random.seed() will be deprecated just like np.random.RandomState(),
since the current implementation of np.random.seed() is intimately tied to
RandomState.

The nature of the exact replacement (if any) can be left for future
discussion.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP: Dispatch Mechanism for NumPy’s high level API

2018-06-03 Thread Stephan Hoyer
On Sun, Jun 3, 2018 at 5:44 PM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> Although I'm still not 100% convinced by NotImplementedButCoercible, I do
> like the idea that this is the default for items that do not implement
> `__array_function__`. And it might help avoid trying to find oneself in a
> possibly long list.
>

Another potential consideration in favor of NotImplementedButCoercible is
for subclassing: we could use it to write the default implementations of
ndarray.__array_ufunc__ and ndarray.__array_function__, e.g.,

class ndarray:
    def __array_ufunc__(self, *args, **kwargs):
        return NotImplementedButCoercible
    def __array_function__(self, *args, **kwargs):
        return NotImplementedButCoercible

I think (not 100% sure yet) this would result in exactly equivalent
behavior to what ndarray.__array_ufunc__ currently does:
http://www.numpy.org/neps/nep-0013-ufunc-overrides.html#subclass-hierarchies
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP: Random Number Generator Policy

2018-06-03 Thread Stephan Hoyer
On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers  wrote:

> It may be worth having a look at test suites for scipy, statsmodels,
> scikit-learn, etc. and estimate how much work this NEP causes those
> projects. If the devs of those packages are forced to do large scale
> migrations from RandomState to StableState, then why not instead keep
> RandomState and just add a new API next to it?
>

Tests that explicitly create RandomState objects would not be difficult to
migrate. The goal of "StableState" is that it could be used directly in
cases where RandomState is current used in tests, so I would guess that
"RandomState" could be almost mechanistically replaced by "StableState".

The challenging case are calls to np.random.seed(). If no replacement API
is planned, then these would need to be manually converted to use
StableState instead. This is probably not too onerous (and is a good
cleanup to do anyways) but it would be a bit of work.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP: Dispatch Mechanism for NumPy’s high level API

2018-06-03 Thread Stephan Hoyer
On Sun, Jun 3, 2018 at 9:54 PM Hameer Abbasi 
wrote:

> Mixed return values of NotImplementedButCoercible and NotImplemented would
> still result in TypeError, and there would be no second chances for
> overloads.
>
>
> I would like to differ with you here: It can be quite useful to have
> second chances for overloads. Think ``np.func(list, custom_array)``: If
> second rounds did not exist, custom_array would need to have a list of
> coercible types (which is not nice IMO).
>

Even if we did this, we would still want to preserve the equivalence
between:
1. Returning NotImplementedButCoercible from __array_ufunc__ or
__array_function__, and
2. Not implementing __array_ufunc__ or __array_function__ at all.

Changing __array_ufunc__ to do multiple rounds of checks could indeed be
useful in some cases, and you're right that it would not change existing
behavior (in these cases we currently raise TypeError). But I'd rather
leave that for a separate discussion, because it's orthogonal to our
proposal here for __array_function__.

(Personally, I don't think it would be worth the additional complexity.)
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] A roadmap for NumPy - longer term planning

2018-06-04 Thread Stephan Hoyer
PEP-574 isn't on the roadmap (yet!), but I think we would clearly welcome
it. Like all NumPy improvements, it would need to be implemented by an
interested party.
On Mon, Jun 4, 2018 at 1:52 AM Antoine Pitrou  wrote:

>
> Hi,
>
> Do you plan to consider trying to add PEP 574 / pickle5 support? There's
> an implementation ready (and a PyPI backport) that you can play with.
> https://www.python.org/dev/peps/pep-0574/
>
> PEP 574 implicitly targets Numpy arrays as one of its primary producers,
> since Numpy arrays is how large scientific or numerical data often ends
> up represented and where zero-copy is often desired by users.
>
> PEP 574 could certainly be useful even without Numpy arrays supporting
> it, but less so.  So I would welcome any feedback on that front (and,
> given that I'd like PEP 574 to be accepted in time for Python 3.8, I'd
> ideally like to have that feedback sometimes in the forthcoming months
> ;-)).
>
> Best regards
>
> Antoine.
>
>
> On Thu, 31 May 2018 16:50:02 -0700
> Matti Picus  wrote:
> > At the recent NumPy sprint at BIDS (thanks to those who made the trip)
> > we spent some time brainstorming about a roadmap for NumPy, in the
> > spirit of similar work that was done for Jupyter. The idea is that a
> > document with wide community acceptance can guide the work of the
> > full-time developer(s), and be a source of ideas for expanding
> > development efforts.
> >
> > I put the document up at
> > https://github.com/numpy/numpy/wiki/NumPy-Roadmap, and hope to discuss
> > it at a BOF session during SciPy in the middle of July in Austin.
> >
> > Eventually it could become a NEP or formalized in another way.
> >
> > Matti
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP: Dispatch Mechanism for NumPy’s high level API

2018-06-05 Thread Stephan Hoyer
On Mon, Jun 4, 2018 at 5:39 AM Matthew Harrigan 
wrote:

> Should there be discussion of typing (pep-484) or abstract base classes in
> this nep?  Are there any requirements on the result returned by
> __array_function__?
>

This is a good question that should be addressed in the NEP. Currently, we
impose no limitations on the types returned by __array_function__ (or
__array_ufunc__, for that matter). Given the complexity of potential
__array_function__ implementations, I think this would be hard/impossible
to do in general.

I think the best case scenario we could hope for is that type checkers
would identify that result of NumPy functions as:
- numpy.ndarray if all inputs are numpy.ndarray objects
- Any if any non-numpy.ndarray inputs implement the __array_function__ protocol

Based on my understanding of proposed rules for typing protocols [1] and
overloads [2], I think this could just work, e.g.,

@overload
def func(array: np.ndarray) -> np.ndarray: ...
@overload
def func(array: ImplementsArrayFunction) -> Any: ...

[1] https://www.python.org/dev/peps/pep-0544/
[2] https://github.com/python/typing/issues/253#issuecomment-389262904
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP: Dispatch Mechanism for NumPy’s high level API

2018-06-05 Thread Stephan Hoyer
On Mon, Jun 4, 2018 at 7:35 AM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> Hi Stephan,
>
> Another potential consideration in favor of NotImplementedButCoercible is
>> for subclassing: we could use it to write the default implementations of
>> ndarray.__array_ufunc__ and ndarray.__array_function__, e.g.,
>>
>> class ndarray:
>>     def __array_ufunc__(self, *args, **kwargs):
>>         return NotImplementedButCoercible
>>     def __array_function__(self, *args, **kwargs):
>>         return NotImplementedButCoercible
>>
>> I think (not 100% sure yet) this would result in exactly equivalent
>> behavior to what ndarray.__array_ufunc__ currently does:
>>
>> http://www.numpy.org/neps/nep-0013-ufunc-overrides.html#subclass-hierarchies
>>
>
> As written, this would not work for ndarray subclasses, because the subclass
> will generically change itself before calling super. At least for Quantity,
> say if I add two quantities, the quantities will both be converted to
> arrays (with one scaled so that the units match) and then the super call is
> done with those modified arrays. This expects that the super call will
> actually return a result (which it now can because all inputs are arrays).
>

Thanks for clarifying. This is definitely trickier than I had thought.

If Quantity.__array_ufunc__ implemented overrides by calling the public
ufunc method again (instead of calling super), then it would still work
fine with this change. But of course, in that case you would not need
ndarray.__array_ufunc__ defined at all.

I will say that personally, I find the complexity of the current
ndarray.__array_ufunc__ implementation a little inelegant, and I would
welcome simplifying it. But I also try to avoid implementation inheritance
entirely [2], for exactly the same reasons why refactoring
ndarray.__array_ufunc__ here would be difficult (inheritance is fragile).
So I would be happy to defer to your judgment, as someone who actually uses
subclassing.

[2] https://hackernoon.com/inheritance-based-on-internal-structure-is-evil-7474cc8e64dc
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP: Dispatch Mechanism for NumPy’s high level API

2018-06-05 Thread Stephan Hoyer
On Tue, Jun 5, 2018 at 12:35 PM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> Things would, I think, make much more sense if `ndarray.__array_ufunc__`
> (or `*_function__`) actually *were* the implementation for array-only. But
> while that is something I'd like to eventually get to, it seems out of
> scope for the current discussion.
>

If this is a desirable end-state, we should at least consider it now while
we are designing the __array_function__ interface.

With the current proposal, I think this would be nearly impossible. The
challenge is that ndarray.__array_function__ would somehow need to call the
non-overloaded version of the provided function when no other
arguments overload __array_function__. However, we currently don't expose this
information in any way.

Some ways this could be done (including some of your prior suggestions):
- Add a coerce=True argument to all NumPy functions, which could be used by
non-overloaded implementations.
- A separate namespace for non-overloaded functions (e.g.,
numpy.array_only).
- Adding another argument to the __array_function__ interface to explicitly
provide the non-overloaded implementation (e.g., func_impl).

I don't like any of these options and I'm not sure I agree with your goal,
but the NEP should make clear that we are precluding this possibility.

Given that, I think that perhaps it is also best not to do
> `NotImplementedButCoercible` - as I think the implementers of
> `__array_function__` perhaps should just do that themselves. But I may well
> swing the other way again... Good examples of non-trivial benefits would
> help.
>

This would also be my default stance, and of course we can always add
NotImplementedButCoercible later.

I can think of two main use cases:
1. Libraries that only want to overload *some* NumPy functions, but want
the rest of NumPy's API by coercing arguments to NumPy arrays.
2. Libraries that want to eventually overload all of NumPy's high level API,
but need to do so incrementally, in a way that preserves backwards
compatibility.

I'm not sure I agree with use case 1. Arguably, libraries that only
overload a limited part of NumPy's API shouldn't encourage their users
to rely on it. This state of affairs is pretty confusing to
users.

However, case 2 is valid and potentially important. Consider the case of a
library with existing users that would like to start implementing
__array_function__ (e.g., dask, astropy, xarray, pandas). The right
strategy really depends upon whether the library considers the current
behavior of NumPy functions on their objects (silent coercion to numpy
arrays) a feature or a bug:
- If coercion is a bug and something that the library never intended to
support, then perhaps it would be OK to suddenly change all existing
overloads to return the correct type.
- However, if coercion is a feature (which is probably the attitude of at
least some users), ideally there really should be a graceful way to enable
the new overloaded behavior incrementally. For example, a library might
want to start issuing FutureWarning in version X, before switching over to
the new overloaded behavior in version X+1. I can't think of how to do this
without NotImplementedButCoercible.
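A sketch of that graceful path, with everything here hypothetical: the
sentinel is the proposal under discussion, and HANDLED is an illustrative
registry of implemented overloads:

    import warnings

    NotImplementedButCoercible = object()  # stand-in for the proposed sentinel
    HANDLED = {}  # hypothetical registry: NumPy function -> override

    class LegacyDuckArray:
        def __array_function__(self, func, types, args, kwargs):
            if func not in HANDLED:
                # Warn now; in the next release, return NotImplemented instead.
                warnings.warn("silent coercion of LegacyDuckArray in %s is "
                              "deprecated" % func.__name__, FutureWarning)
                return NotImplementedButCoercible
            return HANDLED[func](*args, **kwargs)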

For projects like dask and xarray, the benefits of __array_function__ are
so large that we will accept a hard transition that breaks some user code
without warning. But this may not be the case for other projects.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP: Dispatch Mechanism for NumPy’s high level API

2018-06-05 Thread Stephan Hoyer
On Tue, Jun 5, 2018 at 2:47 PM Matti Picus  wrote:

> What is the difference between the `func` provided as the first argument
> to `__array_function__` and `__array_ufunc__` and the "non-overloaded
> version of the provided function"?
>

The ""non-overloaded version of the provided function" is entirely
hypothetical at this point.

If we use a decorator to implement overloads, it would be the undecorated
function, e.g., the original definition of concatenate here:

@overload_for_array_function(['arrays', 'out'])
def concatenate(arrays, axis=0, out=None):
    ...  # continue with the definition of concatenate


This NEP calls it an "arbitrary callable".
> In `__array_ufunc__` it turns out people count on it being exactly the
> `np.ufunc`.


Right, I think this is a good guarantee to provide. Certainly it's one that
people find useful.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP: Dispatch Mechanism for NumPy’s high level API

2018-06-08 Thread Stephan Hoyer
On Fri, Jun 8, 2018 at 8:58 AM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> I think we're getting to the stage where an updated text would be useful.
>

Yes, I plan to work on this over the weekend. Stay tuned!


> For that, you may want to consider an actual implementation of, e.g., a
> very simple function like `np.reshape` as well as a more complicated one
> like `np.concatenate`
>

Yes, I agree that actual implementation (in Python rather than C for now)
would be useful.


> and in particular how the implementation finds out where its own instances
> are located.
>

I think we've discussed this before, but I don't think this is feasible to
solve in general given the diversity of wrapped APIs. If you want to find
the arguments in which a class' own instances appear, you will need to do
that in your overloaded function.

That said, if merely pulling out the flat list of arguments that are
checked for and/or implement __array_function__ would be enough, we can
probably figure out a way to expose that information.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP: Dispatch Mechanism for NumPy’s high level API

2018-06-08 Thread Stephan Hoyer
(offlist)

To clarify, by "where_i_am" you mean something like the name of the
argument where it was found?

On Fri, Jun 8, 2018 at 4:49 PM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> and in particular how the implementation finds out where its own instances
>>> are located.
>>>
>>
>> I think we've discussed this before, but I don't think this is feasible
>> to solve in general given the diversity of wrapped APIs. If you want to
>> find the arguments in which a class' own instances appear, you will need to
>> do that in your overloaded function.
>>
>> That said, if merely pulling out the flat list of arguments that are
>> checked for and/or implement __array_function__ would be enough, we can
>> probably figure out a way to expose that information.
>>
>
> In the end, somewhere inside the "dance", you are checking for
> `__array_function__` - it would seem to me that at that point you know
> exactly where you are, and it would not be difficult to something like
> ```
> types[new_type] += [where_i_am]
> ```
> (where here I assume types is a defaultdict(list))  - has the set of types
> in keys and locations as values.
>
> But easier to discuss whether this is easy with some sample code to look
> at!
>
> -- Marten
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs

2018-06-10 Thread Stephan Hoyer
On Sun, Jun 10, 2018 at 4:31 PM Eric Wieser 
wrote:

> Thanks for the writeup Marten,
>
Indeed, thank you Marten!

> This hits on an interesting alternative to frozen dimensions - np.cross
> could just become a regular ufunc with signature np.dtype((float64, 3)),
> np.dtype((float64, 3)) → np.dtype((float64, 3))
>
> Another alternative to mention is returning multiple arrays, e.g., two
arrays for a fixed dimension of size 2.

That said, I still think frozen dimension are a better proposal than either
of these.


>- I’m -1 on optional dimensions: they seem to legitimize creating many
>overloads of gufuncs. I’m already not a fan of how matmul has special cases
>for lower dimensions that don’t generalize well. To me, the best way to
>handle matmul would be to use the proposed __array_function__ to
>handle the shape-based special-case dispatching, either by:
>   - Inserting dimensions, and calling the true gufunc
>   np.linalg.matmul_2d (which is a function I’d like direct access to
>   anyway).
>   - Dispatching to one of four ufuncs
>
> I don't understand your alternative here. If we overload np.matmul using
__array_function__, then it would not use *ether* of these options for
writing the operation in terms of other gufuncs. It would simply look for
an __array_function__ attribute, and call that method instead.

My concern with either inserting dimensions or dispatching to one of four
ufuncs is that some objects (e.g., xarray.DataArray) define matrix
multiplication, but in an incompatible way with NumPy (e.g., xarray sums
over axes with the same name, instead of last / second-to-last axes). NumPy
really ought to provide a way to overload the entire operation, without either
inserting/removing dummy dimensions or inspecting input shapes to dispatch
to other gufuncs.

That said, if you don't want to make np.matmul a gufunc, then I would much
rather use Python's standard overloading rules with __matmul__/__rmatmul__
than use __array_function__, for two reasons:
1. You *already* need to use __matmul__/__rmatmul__ if you want to support
matrix multiplication with @ on your class, so __array_function__ would be
additional and redundant. __array_function__ is really intended as a
fall-back, for cases where there is no other alternative.
2. With the current __array_function__ proposal, this would imply that
calling other unimplemented NumPy functions on your object would raise
TypeError rather than doing coercion. This sort of additional coupled
behavior is probably not what an implementor of operator.matmul/@ is
looking for.

In summary, I would either support:
1. (This proposal) Adding additional optional dimensions to gufuncs for
np.matmul/operator.matmul, or
2. Making operator.matmul a special case for mathematical operators that
always checks overloads with __matmul__/__rmatmul__ even if __array_ufunc__
is defined.

Either way, matrix-multiplication becomes somewhat of a special case. It's
just a matter of whether it's a special case for gufuncs (using optional
dimensions) or a special case for arithmetic overloads in NumPy (not using
__array_ufunc__). Given that I think optional dimensions have other
conceivable uses in gufuncs (for row/column vectors), I think that's the
better option.

I would not support either expanding dimensions or dispatching to multiple
gufuncs in NumPy's implementation of operator.matmul (i.e.,
ndarray.__matmul__). We could potentially only do this for numpy.matmul
rather than operator.matmul/@, but that opens the door to potential
inconsistency between the NumPy version of an operator and Python's version
of an operator, which is something we tried very hard to avoid with
__array_ufunc__.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP: Random Number Generator Policy

2018-06-10 Thread Stephan Hoyer
On Sun, Jun 10, 2018 at 8:10 PM Ralf Gommers  wrote:

> On Sun, Jun 10, 2018 at 5:57 PM, Robert Kern 
> wrote:
>
>> > Conclusion: the current proposal will cause work for the vast majority
>> of libraries that depends on numpy. The total amount of that work will
>> certainly not be counted in person-days/weeks, and more likely in years
>> than months. So I'm not convinced yet that the current proposal is the best
>> way forward.
>>
>
>> The mere usage of np.random.seed() doesn't imply that these packages
>> actually require stream-compatibility. Some might, for sure, like where
>> they are used in the unit tests, but that's not what you counted. At best,
>> these numbers just mean that we can't eliminate np.random.seed() in a new
>> system right away.
>>
>
> Well, mere usage has been called an antipattern (also on your behalf),
> plus for scipy over half of the usages do give test failures (Warren's
> quick test). So I'd say that counting usages is a decent proxy for the work
> that has to be done.
>

Let me suggest another possible concession for backwards compatibility. We
should make a dedicated module, e.g., "numpy.random.stable" that contains
functions implemented as methods on StableRandom. These functions should
include "seed", which is too pervasive to justify removing.

Transitioning to the new module should be as simple as mechanistically
replacing all uses of "numpy.random" with "numpy.random.stable".

This module would add virtually no maintenance overhead, because the
implementations would be entirely contained on StableRandom, and would
simply involve creating a single top-level StableRandom object (like what
is currently done in numpy.random).
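The module itself could be tiny, mirroring how numpy.random already binds
its module-level functions to a hidden global instance (a sketch;
StableRandom does not exist yet, so RandomState stands in for it):

    # numpy/random/stable.py -- hypothetical layout
    from numpy.random import RandomState as StableRandom  # stand-in class

    _global_instance = StableRandom()

    # Re-export bound methods as module-level functions, as numpy.random does:
    seed = _global_instance.seed
    random_sample = _global_instance.random_sample
    randint = _global_instance.randint
    normal = _global_instance.normal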


Re: [Numpy-discussion] Dropping Python 3.4 support for NumPy 1.16

2018-06-13 Thread Stephan Hoyer
This sounds good to me. Most of the downstream projects I work with have
already dropped Python 3.4 support.

On Wed, Jun 13, 2018 at 2:30 PM Charles R Harris 
wrote:

> Hi All,
>
> I think NumPy 1.16 would be a good time to drop Python 3.4 support. We
> will want to do that anyway once we drop 2.7 so that we will only be using
> recent Windows compilers, and with Python 3.7 due at the end of the month I
> think supporting 3.5-7 for 1.16 should be sufficient.
>
> Thoughts?
>
> Chuck


Re: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs

2018-06-15 Thread Stephan Hoyer
On Mon, Jun 11, 2018 at 11:59 PM Eric Wieser 
wrote:

> I don’t understand your alternative here. If we overload np.matmul using
> __array_function__, then it would not use *either* of these options for
> writing the operation in terms of other gufuncs. It would simply look for
> an __array_function__ attribute, and call that method instead.
>
> Let me explain that suggestion a little more clearly.
>
>1. There’d be a linalg.matmul2d that performs the real matrix case,
>which would be easy to make as a ufunc right now.
>2. __matmul__ and __rmatmul__ would just call np.matmul, as they
>currently do (for consistency between np.matmul and operator.matmul,
>needed in python pre-@-operator)
>    3. np.matmul would be implemented as:
>
>       @do_array_function_overrides
>       def matmul(a, b):
>           if a.ndim != 1 and b.ndim != 1:
>               return matmul2d(a, b)
>           elif a.ndim != 1:
>               return matmul2d(a, b[:, None])[..., 0]
>           elif b.ndim != 1:
>               return matmul2d(a[None, :], b)[0]
>           else:
>               # this one probably deserves its own ufunc
>               return matmul2d(a[None, :], b[:, None])[0, 0]
>
>4. Quantity can just override __array_ufunc__ as with any other ufunc
>5. DataArray, knowing the above doesn’t work, would implement
>something like
>
>       @matmul.register_array_function(DataArray)
>       def __array_function__(a, b):
>           if a.ndim != 1 and b.ndim != 1:
>               return matmul2d(a, b)
>           else:
>               # either:
>               # - add/remove dummy dimensions in a dataarray-specific way
>               # - downcast to ndarray and do the dimension juggling there
>               ...
>
>
> Advantages of this approach:
>
>-
>
>Neither the ufunc machinery, nor __array_ufunc__, nor the inner loop,
>need to know about optional dimensions.
>-
>
>We get a matmul2d ufunc, that all subclasses support out of the box if
>they support matmul
>
> Eric
>
OK, this sounds pretty reasonable to me -- assuming we manage to figure out
the __array_function__ proposal!

There's one additional ingredient we would need to make this work well:
some way to guarantee that "ndim" and indexing operations are available
without casting to a base numpy array.

For now, np.asanyarray() would probably suffice, but that isn't quite right
(e.g., this would fail for np.matrix).
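
A quick illustration of the np.matrix problem (this only demonstrates
long-documented np.matrix behavior):

import numpy as np

m = np.matrix([[1, 2], [3, 4]])
a = np.asanyarray(m)

print(type(a))    # <class 'numpy.matrix'> -- the subclass passes through
print(a[0].ndim)  # 2 -- a matrix row stays 2-D, unlike an ndarray row, so
                  # dimension-juggling code written for ndarrays breaks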

In the long term, I think we need a new coercion protocol for "duck"
arrays. Nathaniel Smith and I started writing a NEP on this, but it isn't
quite ready yet.



Re: [Numpy-discussion] NEP: Random Number Generator Policy

2018-06-16 Thread Stephan Hoyer
>
>> This is a little weird; "mtrand" is an implementation detail already.
>> There's exactly 3 instances of that in scikit-learn, so replacing those
>> with a sane name (with a long timeline, say 4 numpy versions at least plus
>> a major version number bump) doesn't seem unreasonable.
>>
>
> Everything in this paragraph is explicitly just about the initial release
> with the new subsystem. A following paragraph says that we should revisit
> all of these in following releases.
>

This already read a little strangely to me -- it sounded like an indefinite
pronouncement. It would be good to clarify :).

Otherwise, I am quite happy with this NEP! It avoids unnecessary churn, and
opens the door to much needed improvements in numpy.random.


Re: [Numpy-discussion] question about array slicing and element assignment

2018-06-19 Thread Stephan Hoyer
You will need to convert "a[(2,3,5),][mask]" into a single indexing
expression, e.g., by using utility functions like np.nonzero() on mask.
NumPy can't support assignment in chained indexing.
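
For example, here is a sketch of the conversion for the case below (the
variable names are mine):

import numpy as np

a = np.random.randn(6, 3)
mask = np.random.randn(3, 3) > 0.0

rows = np.array([2, 3, 5])
# np.nonzero(mask) yields the (i, j) positions within the selected rows;
# rows[i] maps those back to row indices of "a", so the assignment happens
# in a single (non-chained) indexing operation.
i, j = np.nonzero(mask)
a[rows[i], j] = 1.0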

On Tue, Jun 19, 2018 at 1:25 PM Emil Sidky  wrote:

> Hello,
> The following is an example where an array element assignment didn't work
> as I expected.
> Create a 6 x 3 matrix:
>
> In [70]: a =  randn(6,3)
>
> In [71]: a
> Out[71]:
> array([[ 1.73266816,  0.948849  ,  0.69188222],
> [-0.61840161, -0.03449826,  0.15032552],
> [ 0.4963306 ,  0.77028209, -0.63076396],
> [-1.92273602, -1.03146536,  0.27744612],
> [ 0.70736325,  1.54687964, -0.75573888],
> [ 0.16316043, -0.34814532,  0.3683143 ]])
>
> Create a 3x3 boolean array:
> In [72]: mask = randn(3,3)>0.
>
> In [73]: mask
> Out[73]:
> array([[ True,  True,  True],
> [False,  True,  True],
> [ True, False,  True]], dtype=bool)
>
> Try to modify elements of "a" with the following line:
> In [74]: a[(2,3,5),][mask] = 1.
> No elements are changed in "a":
> In [75]: a
> Out[75]:
> array([[ 1.73266816,  0.948849  ,  0.69188222],
> [-0.61840161, -0.03449826,  0.15032552],
> [ 0.4963306 ,  0.77028209, -0.63076396],
> [-1.92273602, -1.03146536,  0.27744612],
> [ 0.70736325,  1.54687964, -0.75573888],
> [ 0.16316043, -0.34814532,  0.3683143 ]])
>
> Instead try to modify elements of "a" with this line:
> In [76]: a[::2,][mask] = 1.
>
> This time it works:
> In [77]: a
> Out[77]:
> array([[ 1.,  1.,  1.],
> [-0.61840161, -0.03449826,  0.15032552],
> [ 0.4963306 ,  1.,  1.],
> [-1.92273602, -1.03146536,  0.27744612],
> [ 1.,  1.54687964,  1.],
> [ 0.16316043, -0.34814532,  0.3683143 ]])
>
>
> Is there a way where I can modify the elements of "a" selected by an
> expression like "a[(2,3,5),][mask]" ?
>
> Thanks, Emil


[Numpy-discussion] NEP 21: Simplified and explicit advanced indexing

2018-06-25 Thread Stephan Hoyer
Sebastian and I have revised a Numpy Enhancement Proposal that he started
three years ago for overhauling NumPy's advanced indexing. We'd now like to
present it for official consideration.

Minor inline comments (e.g., typos) can be added to the latest pull request
(https://github.com/numpy/numpy/pull/11414/files), but otherwise let's keep
discussion on the mailing list. The NumPy website should update shortly
with a rendered version (
http://www.numpy.org/neps/nep-0021-advanced-indexing.html), but until then
please see the full text below.

Cheers,
Stephan

=========================================
Simplified and explicit advanced indexing
=========================================

:Author: Sebastian Berg
:Author: Stephan Hoyer 
:Status: Draft
:Type: Standards Track
:Created: 2015-08-27


Abstract
--------

NumPy's "advanced" indexing support for indexing arrays with other arrays is
one of its most powerful and popular features. Unfortunately, the existing
rules for advanced indexing with multiple array indices are typically
confusing to both new, and in many cases even old, users of NumPy. Here we
propose an overhaul and simplification of advanced indexing, including two
new "indexer" attributes ``oindex`` and ``vindex`` to facilitate explicit
indexing.

Background
----------

Existing indexing operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NumPy arrays currently support a flexible range of indexing operations:

- "Basic" indexing involving only slices, integers, ``np.newaxis`` and
ellipsis
  (``...``), e.g., ``x[0, :3, np.newaxis]`` for selecting the first element
  from the 0th axis, the first three elements from the 1st axis and
inserting a
  new axis of size 1 at the end. Basic indexing always return a view of the
  indexed array's data.
- "Advanced" indexing, also called "fancy" indexing, includes all cases
where
  arrays are indexed by other arrays. Advanced indexing always makes a copy:

  - "Boolean" indexing by boolean arrays, e.g., ``x[x > 0]`` for
selecting positive elements.
  - "Vectorized" indexing by one or more integer arrays, e.g., ``x[[0, 1]]``
for selecting the first two elements along the first axis. With multiple
arrays, vectorized indexing uses broadcasting rules to combine indices
along
multiple dimensions. This allows for producing a result of arbitrary
shape
with arbitrary elements from the original arrays.
  - "Mixed" indexing involving any combinations of the other advancing
types.
This is no more powerful than vectorized indexing, but is sometimes more
convenient.

For clarity, we will refer to these existing rules as "legacy indexing".
This is only a high-level summary; for more details, see NumPy's
documentation and `Examples` below.

Outer indexing
~~~~~~~~~~~~~~

One broadly useful class of indexing operations is not supported:

- "Outer" or orthogonal indexing treats one-dimensional arrays equivalently
to
  slices for determining output shapes. The rule for outer indexing is that
the
  result should be equivalent to independently indexing along each dimension
  with integer or boolean arrays as if both the indexed and indexing arrays
  were one-dimensional. This form of indexing is familiar to many users of
other
  programming languages such as MATLAB, Fortran and R.

The reason why NumPy omits support for outer indexing is that the rules for
outer and vectorized indexing conflict. Consider indexing a 2D array by two
1D integer arrays, e.g., ``x[[0, 1], [0, 1]]``:

- Outer indexing is equivalent to combining multiple integer indices with
  ``itertools.product()``. The result in this case is another 2D array with
  all combinations of indexed elements, e.g.,
  ``np.array([[x[0, 0], x[0, 1]], [x[1, 0], x[1, 1]]])``
- Vectorized indexing is equivalent to combining multiple integer indices
  with ``zip()``. The result in this case is a 1D array containing the diagonal
  elements, e.g., ``np.array([x[0, 0], x[1, 1]])``.

This difference is a frequent stumbling block for new NumPy users. The outer
indexing model is easier to understand, and is a natural generalization of
slicing rules. But NumPy instead chose to support vectorized indexing,
because it is strictly more powerful.

It is always possible to emulate outer indexing by vectorized indexing with
the right indices. To make this easier, NumPy includes utility objects and
functions such as ``np.ogrid`` and ``np.ix_``, e.g.,
``x[np.ix_([0, 1], [0, 1])]``. However, there are no utilities for emulating
fully general/mixed outer indexing, which could unambiguously allow for
slices, integers, and 1D boolean and integer arrays.
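
A short example makes the conflict concrete:

.. code:: python

    import numpy as np

    x = np.arange(9).reshape(3, 3)

    # Vectorized indexing pairs the index arrays like zip(); the result
    # here is the diagonal:
    x[[0, 1], [0, 1]]           # array([0, 4])

    # Outer indexing takes all combinations, like itertools.product();
    # today it must be emulated with np.ix_:
    x[np.ix_([0, 1], [0, 1])]   # array([[0, 1],
                                #        [3, 4]])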

Mixed indexing
~~

NumPy's existing rules for combining multiple types of indexing in the same
operation are quite complex, involving a number of edge cases.

One reason why mixed indexing is particularly confusing is that at first

Re: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing

2018-06-26 Thread Stephan Hoyer
On Tue, Jun 26, 2018 at 4:34 PM Robert Kern  wrote:

> I maintain that considering deprecation is premature at this time. Please
> take it out of this NEP. Let us get a feel for how people actually use
> .oindex/.vindex. Then we can talk about deprecation. This NEP gets my
> enthusiastic approval, except for the deprecation. I will be happy to talk
> about deprecation with an open mind in a few years. With some more actual
> experience under our belt, rather than prediction and theory, we can be
> more confident about the approach we want to take. Deprecation is not a
> fundamental part of this NEP and can be decided independently at a later
> time.
>

I agree, we should scale back most of the deprecations proposed in this
NEP, leaving them for possible future work. In particular, you're not
convinced yet that "outer indexing" is a more intuitive default indexing
mode than "vectorized indexing", so it is premature to deprecate vectorized
indexing behavior that conflicts with outer indexing. OK, fair enough.

I would still like to include at least two more limited form of deprecation
that I hope will be less controversial:
- Mixed boolean/integer array indexing. This is neither intuitive nor
useful, and I don't think I've ever seen it used. Usually "outer indexing"
behavior is what is desired here.
- Mixed array/slice indexing, for cases with arrays separated by slices so
NumPy can't do the "intuitive" transpose on the output. As noted in the
NEP, this is a common source of bugs. Users who want this should really
switch to vindex.

In the long term, although I agree with Sebastian that "outer indexing" is
more intuitive for default indexing behavior, I would really like to
eliminate the "dimension reordering" behavior of mixed array/slice indexing
altogether. This is a weird special case where array[...] and
array.vindex[...] differ. So if we don't choose to deprecate
all cases where [] and oindex[] are different, I would at least like to
deprecate all cases where [] and vindex[] are different.


Re: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing

2018-06-26 Thread Stephan Hoyer
On Tue, Jun 26, 2018 at 9:38 AM Eric Wieser 
wrote:

> We can expose some of the internals
>
> These could be expressed as methods on the internal indexing objects I
> proposed in the first reply to this thread, which has seen no responses.
>
> I think Hameer Abbasi is looking for something like
> OrthogonalIndexer(...).to_vindex() -> VectorizedIndexer, such that
> arr.oindex[ind] selects the same elements as
> arr.vindex[OrthogonalIndexer(ind).to_vindex()]
>
> Eric
>

It is probably worth noting that xarray already uses very similar classes
internally for keeping track of indexing operations. See BasicIndexer,
OuterIndexer and VectorizedIndexer:
https://github.com/pydata/xarray/blob/v0.10.7/xarray/core/indexing.py#L295-L428

This turns out to be a pretty convenient model even when not using
subclassing. In xarray, we use them internally in various "partial duck
array" classes that do some lazy computation upon indexing with
__getitem__. It's nice to simply be able to forward on Indexer objects
rather than implement separate vindex/oindex methods.

We also have utility functions for converting between different forms,
e.g., from OuterIndexer to VectorizedIndexer:
https://github.com/pydata/xarray/blob/v0.10.7/xarray/core/indexing.py#L654
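
As a rough sketch of what that conversion does (simplified here; the real
xarray classes carry more metadata than plain index tuples):

import numpy as np

# Converting outer (orthogonal) indices to vectorized ones amounts to
# reshaping each 1-D index so that it broadcasts along its own axis --
# which is exactly what np.ix_ does.
def outer_to_vectorized(*outer_indices):
    return np.ix_(*outer_indices)

x = np.arange(12).reshape(3, 4)
print(x[outer_to_vectorized([0, 2], [1, 3])])
# [[ 1  3]
#  [ 9 11]]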

I guess this is a case for using such classes internally in NumPy, and
possibly for exposing them publicly as well.


Re: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing

2018-06-26 Thread Stephan Hoyer
On Tue, Jun 26, 2018 at 6:39 PM Robert Kern  wrote:

> I'd still prefer not talking deprecation, per se, in this NEP (but my
> objection is weaker). I would definitely start adding in informative, noisy
> warnings in these cases, though. Along the lines of, "Hey, this is a dodgy
> construction that typically gives unexpected results. Here are
> .oindex/.vindex that might do what you actually want, but you can use
> .legacy_index if you just want to silence this warning". Rather than "Hey,
> this is going to go away at some point."
>

Yes, agreed. These will use a new warning class, perhaps
numpy.IndexingWarning.


Re: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing

2018-06-26 Thread Stephan Hoyer
On Tue, Jun 26, 2018 at 12:46 AM Robert Kern  wrote:

> I think having more self-contained descriptions of the semantics of each
> of these would be a good idea. The current description of `.vindex` spends
> more time talking about what it doesn't do, compared to the other methods,
> than what it does.
>

Will do.


> I'm still leaning towards not warning on current, unproblematic common
> uses. It's unnecessary churn for currently working, understandable code. I
> would still reserve warnings and deprecation for the cases where the
> current behavior gives us something that no one wants. Those are the real
> traps that people need to be warned away from.
>
> If someone is mixing slices and integer indices, that's a really good sign
> that they thought indexing behaved in a different way (e.g. orthogonal
> indexing).
>

I agree, but I'm still not entirely sure where to draw the line on
behavior that should issue a warning. Some options, in roughly descending
order of severity:
1. Warn if [] would give a different result than .oindex[]. This is the
current proposal in the NEP, but based on the feedback we should hold back
on it for now.
2. Warn if there is a mixture of arrays/slice objects in indices for [],
even implicitly (e.g., including arr[idx] when it is equivalent to
arr[idx, :]). In this case, array indices end up at the end for both
legacy_index and vindex, but arguably that is only a happy coincidence.
3. Warn if [] would give a different result from .vindex[]. This is a
little weaker than the previous condition, because arr[idx, :] or arr[idx,
...] would not give a warning. However, cases like arr[..., idx] or arr[:,
idx, :] would still start to give warnings, even though they are arguably
well defined according to either outer indexing (if idx.ndim == 1) or
legacy indexing (due to dimension reordering rules that will be omitted
from vindex).
4. Warn if there are multiple arrays/integer indices separated by a slice
object, e.g., arr[idx1, :, idx2]. This is the edge case that really trips
up users.
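
To make the trap in option 4 concrete:

import numpy as np

x = np.zeros((5, 6, 7))
idx1 = np.array([0, 1])
idx2 = np.array([2, 3])

# Arrays separated by a slice: the broadcast index dimension is moved to
# the front of the result -- the classic legacy-indexing surprise.
print(x[idx1, :, idx2].shape)  # (2, 6)

# Adjacent arrays keep the index dimension in place:
print(x[:, idx1, idx2].shape)  # (5, 2)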

As I said in my other response, in the long term, I would prefer to either
(a) drop support for vectorized indexing in [] or (b) if we stick with
supporting vectorized indexing in [], at least ensure consistent dimension
ordering rules for [] and vindex[]. That would suggest using either my
proposed rule 2 or 3.

I also agree with you that anyone mixing slices and integers probably is
confused about how indexing works, at least in edge cases. But given the
lengths to which legacy indexing goes to support "outer indexing-like"
behavior in the common case of a single integer array and many slices, I am
hesitant to start warning in this case. The result of arr[..., idx, :] is
relatively easy to understand, even though it uses its own set of rules,
which happen to be more consistent with oindex[] than vindex[].

We certainly could make the conservative choice of only adopting 4 for now
and leaving further cleanup for later. I guess this uncertainty about
whether direct indexing should be more like vindex[] or oindex[] in the
long term is a good argument for holding off on other warnings for now. But
I think we are almost certainly going to want to make further
warnings/deprecations of some form.


Re: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing

2018-06-26 Thread Stephan Hoyer
On Tue, Jun 26, 2018 at 10:22 PM Robert Kern  wrote:

> We certainly could make the conservative choice of only adopting 4 for now
>> and leaving further cleanup for later. I guess this uncertainty about
>> whether direct indexing should be more like vindex[] or oindex[] in the
>> long term is a good argument for holding off on other warnings for now. But
>> I think we are almost certainly going to want to make further
>> warnings/deprecations of some form.
>>
>
> I'd prefer 4, could be talked into 3, but any higher is not a good idea, I
> don't think.
>

OK, I think 4 is the safe option for now.

Eventually, I want either 1 or 3. But:
- We don't agree yet on whether the right long-term solution would be for
[] to support vectorized indexing, outer indexing or neither.
- This will certainly cause some amount of churn, so let's save it for
later, when vindex/oindex are widely used and libraries don't need to worry
about whether they are available in all NumPy versions they support.


[Numpy-discussion] Revised NEP-18, __array_function__ protocol

2018-06-26 Thread Stephan Hoyer
After much discussion (and the addition of three new co-authors!), I’m
pleased to present a significantly revised version of NumPy Enhancement Proposal
18: A dispatch mechanism for NumPy's high level array functions:
http://www.numpy.org/neps/nep-0018-array-function-protocol.html

The full text is also included below.

Best,
Stephan

============================================================
A dispatch mechanism for NumPy's high level array functions
============================================================

:Author: Stephan Hoyer 
:Author: Matthew Rocklin 
:Author: Marten van Kerkwijk 
:Author: Hameer Abbasi 
:Author: Eric Wieser 
:Status: Draft
:Type: Standards Track
:Created: 2018-05-29

Abstract
--------

We propose the ``__array_function__`` protocol, to allow arguments of NumPy
functions to define how that function operates on them. This will allow
using NumPy as a high level API for efficient multi-dimensional array
operations, even with array implementations that differ greatly from
``numpy.ndarray``.

Detailed description
--------------------

NumPy's high level ndarray API has been implemented several times
outside of NumPy itself for different architectures, such as for GPU
arrays (CuPy), Sparse arrays (scipy.sparse, pydata/sparse) and parallel
arrays (Dask array) as well as various NumPy-like implementations in the
deep learning frameworks, like TensorFlow and PyTorch.

Similarly there are many projects that build on top of the NumPy API
for labeled and indexed arrays (XArray), automatic differentiation
(Autograd, Tangent), masked arrays (numpy.ma), physical units
(astropy.units,
pint, unyt), etc. that add additional functionality on top of the NumPy API.
Most of these project also implement a close variation of NumPy's level high
API.

We would like to be able to use these libraries together, for example we
would like to be able to place a CuPy array within XArray, or perform
automatic differentiation on Dask array code. This would be easier to
accomplish if code written for NumPy ndarrays could also be used by
other NumPy-like projects.

For example, we would like for the following code example to work
equally well with any NumPy-like array object:

.. code:: python

    def f(x):
        y = np.tensordot(x, x.T)
        return np.mean(np.exp(y))

Some of this is possible today with various protocol mechanisms within
NumPy.

-  The ``np.exp`` function checks the ``__array_ufunc__`` protocol
-  The ``.T`` method works using Python's method dispatch
-  The ``np.mean`` function explicitly checks for a ``.mean`` method on
   the argument

However, other functions, like ``np.tensordot``, do not dispatch, and
instead are likely to coerce to a NumPy array (using the ``__array__``
protocol), or error outright. To achieve enough coverage of the NumPy API
to support downstream projects like XArray and autograd, we want to
support *almost all* functions within NumPy, which calls for a more
far-reaching protocol than just ``__array_ufunc__``. We would like a
protocol that allows arguments of a NumPy function to take control and
divert execution to another function (for example a GPU or parallel
implementation) in a way that is safe and consistent across projects.

Implementation
--------------

We propose adding support for a new protocol in NumPy,
``__array_function__``.

This protocol is intended to be a catch-all for NumPy functionality that
is not covered by the ``__array_ufunc__`` protocol for universal functions
(like ``np.exp``). The semantics are very similar to ``__array_ufunc__``,
except the operation is specified by an arbitrary callable object rather
than a ufunc instance and method.

A prototype implementation can be found in
`this notebook <
https://nbviewer.jupyter.org/gist/shoyer/1f0a308a06cd96df20879a1ddb8f0006
>`_.

The interface
~~~~~~~~~~~~~

We propose the following signature for implementations of
``__array_function__``:

.. code-block:: python

    def __array_function__(self, func, types, args, kwargs)

-  ``func`` is an arbitrary callable exposed by NumPy's public API,
   which was called in the form ``func(*args, **kwargs)``.
-  ``types`` is a ``frozenset`` of unique argument types from the original
   NumPy function call that implement ``__array_function__``.
-  The tuple ``args`` and dict ``kwargs`` are directly passed on from the
   original call.

Unlike ``__array_ufunc__``, there are no high-level guarantees about the
type of ``func``, or about which of ``args`` and ``kwargs`` may contain
objects
implementing the array API.
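
A minimal sketch of a class opting in to the protocol; the
``HANDLED_FUNCTIONS`` registry and ``MyArray`` below are illustrative
helpers for this sketch, not part of the proposed API:

.. code:: python

    import numpy as np

    HANDLED_FUNCTIONS = {}

    class MyArray:
        def __init__(self, data):
            self.data = np.asarray(data)

        def __array_function__(self, func, types, args, kwargs):
            if func not in HANDLED_FUNCTIONS:
                return NotImplemented
            # Defer unless every participating type is one we recognize.
            if not all(issubclass(t, MyArray) for t in types):
                return NotImplemented
            return HANDLED_FUNCTIONS[func](*args, **kwargs)

    # Register an override for np.mean on MyArray objects.
    def my_mean(arr, *args, **kwargs):
        return MyArray(np.mean(arr.data, *args, **kwargs))

    HANDLED_FUNCTIONS[np.mean] = my_mean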

As a convenience for ``__array_function__`` implementors, ``types`` provides
all argument types with an ``'__array_function__'`` attribute. This allows
downstream implementations to quickly determine if they are likely able to
support the operation. A ``frozenset`` is used to ensure that
``__array_function__`` implementations cannot rely on the iteration order of
``types``, which would facilitate violating the well-defi
