[Numpy-discussion] Fwd: SciPy2017 Sprints FinAid for sprint leaders/core devs

2017-03-28 Thread Nathaniel Smith
In case anyone is interested in helping run a NumPy sprint at SciPy this year:

-- Forwarded message --
From: Jonathan Rocher 
Date: Mon, Mar 27, 2017 at 7:06 AM
Subject: SciPy2017 Sprints FinAid for sprint leaders/core devs
To: Nathaniel Smith , charlesr.har...@gmail.com,
jtaylor.deb...@googlemail.com, matthew.br...@gmail.com


Dear Nathaniel, Chuck, Julian and Matt,

cc: SciPy2017 FinAid and Sprints co-chairs

This year, the SciPy2017 Sprint and Financial Aid committees would like
to encourage more Scientific Python leaders to lead sprints at
the conference and grow our foundations and the contributor base! As
such, we're launching a new "Sprint Leader FinAid", which
will cover two nights of lodging for sprint leaders who can offer
their knowledge of a package central to our community and pedagogical
skills to help make sprints more accessible and grow the contributor
pool.

Would you be interested in applying? If not, would you be kind enough
to help us encourage qualified individuals to lead a sprint for NumPy?
Please feel free to forward them our emails and this note!

Cheers,

Scott Collis, Eric Ma & Jonathan Rocher



--
Jonathan Rocher
Austin TX, USA
twitter:@jonrocher, linkedin:jonathanrocher
-


-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] __array_ufunc__ counting down to launch, T-24 hrs.

2017-03-30 Thread Nathaniel Smith
On Thu, Mar 30, 2017 at 7:40 PM, Charles R Harris
 wrote:
> Hi All,
>
> Just a note that the __array_ufunc__ PR is ready to merge. If you are
> interested, you can review here.

I want to get this in too, but 24 hours seems like a very short
deadline for getting feedback on such a large and complex change? I'm
pretty sure the ndarray.__array_ufunc__ code that was just added a few
hours ago is wrong (see comments on the diff)...

My main comment, also relevant to the kind of high-level discussion we
tend to use the mailing list for:
  https://github.com/numpy/numpy/pull/8247#issuecomment-290616432

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: [numfocus] Grants up to $3k available to NumFOCUS projects (sponsored & affiliated)

2017-03-31 Thread Nathaniel Smith
On Mar 31, 2017 1:15 AM, "Ralf Gommers"  wrote:



On Mon, Mar 27, 2017 at 11:42 PM, Ralf Gommers 
wrote:

>
>
> On Mon, Mar 27, 2017 at 11:33 PM, Julian Taylor <
> jtaylor.deb...@googlemail.com> wrote:
>
>> I have two ideas under one big important topic: make numpy python3
>> compatible.
>>
>> The first fits pretty well with the grant size and nobody wants to do it
>> for free:
>> - fix our text IO functions under python3 and support multiple
>> encodings, not only latin1.
>> Reasonably simple to do, slap encoding arguments on the functions,
>> generate test cases and somehow keep backward compatibility. Some
>> preliminary unfinished work is in https://github.com/numpy/numpy/pull/4208
>
>
> I like that idea, it's a recurring pain point. Are you interested in working
> on it, or are you thinking of advertising the idea here to see if anyone
> steps up?
>

More thoughts on this anyone? Or preferences for this idea or the numpy.org
one? Submission deadline is April 3rd and we can only put in one proposal
this time, so we need to (a) make a choice between these ideas, and (b)
write up a proposal.

If there aren't enough replies to this to make the choice clear cut, I will
send out a poll to the core devs.


Do we have anyone interested in doing the work in either case? That seems
like the most important consideration to me...

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] speed of random number generator compared to Julia

2017-04-03 Thread Nathaniel Smith
On Apr 3, 2017 8:59 AM, "Pierre Haessig"  wrote:

On 03/04/2017 at 15:44, Jaime Fernández del Río wrote:

This says that Julia uses this library, which is different from the
home-brewed version of the Mersenne Twister in NumPy. The second link I
posted claims their speed comes from generating double-precision numbers
directly, rather than generating random bytes that have to be converted to
doubles, as is the case for NumPy through this magical incantation. They
also throw the SIMD acronym around, which likely means their random number
generation is parallelized.

My guess is that most of the speed-up comes from the SIMD parallelization:
the Mersenne algorithm does a lot of work to produce 32 random bits, so
that likely dominates over a couple of arithmetic operations, even if
divisions are involved.

Thanks for the feedback.

I'm not good enough at reading Julia to be 100% sure, but I feel like
random.jl (https://github.com/JuliaLang/julia/blob/master/base/random.jl)
contains a Julia implementation of the Mersenne Twister... but I have no
idea whether it is the "fancy" SIMD version or the "old" 32-bit version.


That code contains many references to "dSFMT", which is the name of the
"fancy" algorithm. IIUC dSFMT is related to the mersenne twister but is
actually a different generator altogether -- advertising that Julia uses
the mersenne twister is somewhat misleading IMHO. Of course this is really
the fault of the algorithm's designers for creating multiple algorithms
that have "mersenne twister" as part of their names...

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Long term plans for dropping Python 2.7

2017-04-14 Thread Nathaniel Smith
On Fri, Apr 14, 2017 at 5:19 PM, Charles R Harris
 wrote:
> Hi All,
>
> It may be early to discuss dropping support for Python 2.7, but there is a
> disturbance in the force that suggests that it might be worth looking
> forward to the year 2020 when Python itself will drop support for 2.7. There
> is also a website, http://www.python3statement.org, where several projects
> in the scientific python stack have pledged to be Python 2.7 free by that
> date.  Given that, a preliminary discussion of the subject might be
> interesting, if only to gather information of where the community currently
> stands.

One reasonable position would be that numpy releases that happen while
2.7 is supported upstream will also support 2.7, and releases after
that won't.

From numpy's perspective, I feel like the most important reason to
continue supporting 2.7 is our ability to convince people to keep
upgrading. (Not the only reason, but the most important.) What I mean
is: if we dropped 2.7 support tomorrow then it wouldn't actually make
numpy unavailable on python 2.7; it would just mean that lots of users
stayed at 1.12 indefinitely. Which is awkward, but it wouldn't be the
end of the world – numpy is mature software and 1.12 works pretty
well. The big problem IMO would be if this then meant that lots of
downstream projects felt that they had to continue supporting 1.12
going forward, which makes it very difficult for us to effectively
ship new features or even bug fixes – I mean, we can ship them, but
no-one will use them. And if a downstream project finds a bug in numpy
and can't upgrade numpy, then the tendency is to work around it
instead of reporting it upstream. I think this is the main thing we
want to avoid.

This kind of means that we're at the mercy of downstream projects,
though – if scipy/pandas/etc. decide they want to support 2.7 until
2022, it might be in our best interest to do the same. But there's a
collective action problem here: we want to keep supporting 2.7 so long
as they do, but at the same time they may feel they need to keep
supporting 2.7 as long as we do. And all of us would prefer to drop
2.7 support sooner rather than later, but we might all get stuck
because we're waiting for someone else to move first.

So my suggestion would be that numpy make some official announcement
that our plan is to drop support for python 2 immediately after
cpython upstream does. If worst comes to worst we can always decide to
extend it at the time... but if we make the announcement now, then
it's less likely that we'll need to :-).

Another interesting project to look at here is django, since they
occupy a similar place in the ecosystem (e.g. last I checked numpy and
django are the two most-imported python packages on github):
https://www.djangoproject.com/weblog/2015/jun/25/roadmap/
Their approach isn't directly applicable, because unlike us they have
a strict time-based release schedule, defined support period for each
release, and a distinction between regular and long-term support
releases, where regular releases act sort of like
pre-releases-on-steroids for the next LTS release. But basically what
they settled on is philosophically similar to what I'm suggesting:
they don't want an LTS to be supporting 2.7 beyond when cpython is
supporting it. Then on top of that they don't want to support 2.7 in
the regular releases leading up to that LTS either, so the net effect
is that their last release with 2.7 support came out last week, and it
will be supported until 2020 :-). And another useful precedent I think
is that they announced this two years ago, back in 2015; if we make an
announcement now, we'll be giving a similar amount of warning.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Long term plans for dropping Python 2.7

2017-04-15 Thread Nathaniel Smith
On Fri, Apr 14, 2017 at 10:47 PM, Ralf Gommers  wrote:
>
>
> On Sat, Apr 15, 2017 at 5:19 PM, Nathaniel Smith  wrote:
[...]
>> From numpy's perspective, I feel like the most important reason to
>> continue supporting 2.7 is our ability to convince people to keep
>> upgrading. (Not the only reason, but the most important.) What I mean
>> is: if we dropped 2.7 support tomorrow then it wouldn't actually make
>> numpy unavailable on python 2.7; it would just mean that lots of users
>> stayed at 1.12 indefinitely. Which is awkward, but it wouldn't be the
>> end of the world – numpy is mature software and 1.12 works pretty
>> well. The big problem IMO would be if this then meant that lots of
>> downstream projects felt that they had to continue supporting 1.12
>> going forward, which makes it very difficult for us to effectively
>> ship new features or even bug fixes – I mean, we can ship them, but
>> no-one will use them. And if a downstream project finds a bug in numpy
>> and can't upgrade numpy, then the tendency is to work around it
>> instead of reporting it upstream. I think this is the main thing we
>> want to avoid.
>
>
> +1
>
>>
>>
>> This kind of means that we're at the mercy of downstream projects,
>> though – if scipy/pandas/etc. decide they want to support 2.7 until
>> 2022, it might be in our best interest to do the same. But there's a
>> collective action problem here: we want to keep supporting 2.7 so long
>> as they do, but at the same time they may feel they need to keep
>> supporting 2.7 as long as we do. And all of us would prefer to drop
>> 2.7 support sooner rather than later, but we might all get stuck
>>
>> because we're waiting for someone else to move first.
>
>
> I don't quite agree about being stuck. These kind of upgrades should and
> usually do go top of stack to bottom. Something like Jupyter which is mostly
> an end user tool goes first (they announced 2020 quite a while ago), domain
> specific packages go at a similar time, then scipy & co, and only after that
> numpy. Cython will be even later I'm sure - it still supports Python 2.6.

To make sure we're on the same page about what "2020" means here: the
latest release of IPython is 5.0, which came out in July last year.
This is the last release that supports py2; they dropped support for
py2 in master months ago, and 6.0 (whose schedule has been slipping,
but I think should be out Any Time Now?) won't support py2. Their plan
is to keep backporting bug fixes to 5.x until the end of 2017; after
that the core team won't support py2 at all. And they've also
announced that if volunteers want to step up to maintain 5.x after
that, then they're willing to keep accepting pull requests until July
2019.

Refs:
  https://blog.jupyter.org/2016/07/08/ipython-5-0-released/
  
https://github.com/jupyter/roadmap/blob/master/accepted/migration-to-python-3-only.md

I suspect that in practice that "end of 2017" date will be the
end-of-support date for most intents and purposes. And for numpy with
its vaguely defined support periods, I think it makes most sense to
talk in terms of release dates; so if we want to compare
apples-to-apples, my suggestion is that numpy drops py2 support in
2020 and in that sense IPython dropped py2 support in July last year.

>>
>> So my suggestion would be that numpy make some official announcement
>> that our plan is to drop support for python 2 immediately after
>> cpython upstream does.
>
>
> Not quite sure CPython schedule is relevant - important bug fixes haven't
> been making it into 2.7 for a very long time now, so the only change is the
> rare security patch.

Huh? 2.7 gets tons of changes: https://github.com/python/cpython/commits/2.7
Officially CPython has 2 modes for releases: "regular support" and
"security fixes only". 2.7 is special – it get regular support, and
then on top of that it also has a special exception to allow certain
kinds of major changes, like the ssl module backports.

If you know of important bug fixes that they're missing then I think
they'd like to know :-).

Anyway, the reason the CPython schedule is relevant is that once they
drop support, it *will* stop getting security patches, so it will
become increasingly impossible to use safely.

>>
>> If worst comes to worst we can always decide to
>> extend it at the time... but if we make the announcement now, then
>> it's less likely that we'll need to :-).
>
>
> I'd be in favor of putting out a schedule in coordination with
> scipy/pandas/etc, but it probably should look more like
> - 2020: what's on http://www.python3sta

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Nathaniel Smith
On Apr 21, 2017 2:34 PM, "Stephan Hoyer"  wrote:

I still don't understand why a latin encoding makes sense as a preferred
one-byte-per-char dtype. The world, including Python 3, has standardized on
UTF-8, which is also one-byte-per-char for (ASCII) scientific data.


You may already know this, but probably not everyone reading does: the
reason why latin1 often gets special attention in discussions of Unicode
encoding is that latin1 is effectively "ucs1". It's the unique one byte
text encoding where byte N represents codepoint U+N.
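A quick illustration of that property (plain Python, no numpy needed):

    data = bytes(range(256))
    s = data.decode("latin-1")
    assert [ord(c) for c in s] == list(range(256))   # byte N -> codepoint U+N
    assert s.encode("latin-1") == data               # and back, losslessly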

I can't think of any reason why this property is particularly important for
numpy's usage, because we always have a conversion step anyway to get data
in and out of an array. The potential arguments for latin1 that I can think
of are:
- if we have to implement our own en/decoding code for some reason then
it's the most trivial encoding
- if other formats standardize on latin1-with-nul-padding and we want
in-memory/mmap compatibility
- if we really want a fixed width encoding for some reason but don't care
which one, then it's in some sense the most obvious choice

I can't think of many reasons why having a fixed width encoding is
particularly important though... For our current style of string storage,
even calculating the length of a string is O(n), and AFAICT the only way to
actually take advantage of the theoretical O(1) character indexing is to
make a uint8 view. I guess it would be useful if we had a string slicing
ufunc... But why would we?
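For concreteness, here's roughly what that looks like today -- a sketch using
the existing U dtype, where the view is uint32 because U stores UCS-4 (for a
hypothetical one-byte dtype the analogous trick would be a uint8 view); the
array contents are made up:

    import numpy as np

    arr = np.array(["hello", "world"], dtype="U5")
    view = arr.view(np.uint32).reshape(2, 5)   # one codepoint per column
    print(chr(view[1, 0]))   # 'w' -- direct codepoint access through the view
    print(arr[1][0])         # 'w' -- but this boxes arr[1] into a Python str first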

That said, AFAICT what people actually want in most use cases is support
for arrays that can hold variable-length strings, and the only place where
the current approach is *optimal* is when we need mmap compatibility with
legacy formats that use fixed-width-nul-padded fields (at which point it's
super convenient). It's not even possible to *represent* all Python strings
or bytestrings in current numpy unicode or string arrays (Python
strings/bytestrings can have trailing nuls). So if we're talking about
tweaks to the current system it probably makes sense to focus on this use
case specifically.

From context I'm assuming FITS files use fixed-width-nul-padding for
strings? Is that right? I know HDF5 doesn't.

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Nathaniel Smith
On Mon, Apr 24, 2017 at 7:23 PM, Robert Kern  wrote:
> On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith  wrote:
>
>> That said, AFAICT what people actually want in most use cases is support
>> for arrays that can hold variable-length strings, and the only place where
>> the current approach is *optimal* is when we need mmap compatibility with
>> legacy formats that use fixed-width-nul-padded fields (at which point it's
>> super convenient). It's not even possible to *represent* all Python strings
>> or bytestrings in current numpy unicode or string arrays (Python
>> strings/bytestrings can have trailing nuls). So if we're talking about
>> tweaks to the current system it probably makes sense to focus on this use
>> case specifically.
>>
>> From context I'm assuming FITS files use fixed-width-nul-padding for
>> strings? Is that right? I know HDF5 doesn't.
>
> Yes, HDF5 does. Or at least, it is supported in addition to the
> variable-length ones.
>
> https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html

Doh, I found that page but it was (and is) meaningless to me, so I
went by http://docs.h5py.org/en/latest/strings.html, which says the
options are fixed-width ascii, variable-length ascii, or
variable-length utf-8 ... I guess it's just talking about what h5py
currently supports.

But also, is it important whether strings we're loading/saving to an
HDF5 file have the same in-memory representation in numpy as they
would in the file? I *know* [1] no-one is reading HDF5 files using
np.memmap :-). Is it important for some other reason?

Also, further searching suggests that HDF5 actually supports all of
nul termination, nul padding, and space padding, and that nul
termination is the default? How much does it help to have in-memory
compatibility with just one of these options (and not even the default
one)? Would we need to add the other options to be really useful for
HDF5? (Unlikely to happen within numpy itself, but potentially
something that could be done inside h5py or whatever if numpy's
user-defined dtype system were a little more useful.)

-n

[1] hope

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Nathaniel Smith
On Apr 25, 2017 11:53 AM, "Robert Kern"  wrote:

On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris <
charlesr.har...@gmail.com> wrote:
>
> On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <
peridot.face...@gmail.com> wrote:

>> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two
other packages are waiting specifically for it. But specifying this
requires two pieces of information: What is the encoding? and How is the
length specified? I know they're not numpy-compatible, but FITS header
values are space-padded; does that occur elsewhere? Are there other ways
existing data specifies string length within a fixed-size field? There are
some cryptographic length-specification tricks - ANSI X.293, ISO 10126,
PKCS7, etc. - but they are probably too specialized to need? We should make
sure we can support all the ways that actually occur.
>
>
> Agree with the UTF-8 fixed byte length strings, although I would tend
towards null terminated.

Just to clarify some terminology (because it wasn't originally clear to me
until I looked it up in reference to HDF5):

* "NULL-padded" implies that, for a fixed width of N, there can be up to N
non-NULL bytes. Any extra space left over is padded with NULLs, but no
space needs to be reserved for NULLs.

* "NULL-terminated" implies that, for a fixed width of N, there can be up
to N-1 non-NULL bytes. There must always be space reserved for the
terminating NULL.

I'm not really sure if "NULL-padded" also specifies the behavior for
embedded NULLs. It's certainly possible to deal with them: just strip
trailing NULLs and leave any embedded ones alone. But I'm also sure that
there are some implementations somewhere that interpret the requirement as
"stop at the first NULL or the end of the fixed width, whichever comes
first", effectively being NULL-terminated just not requiring the reserved
space.


And to save anyone else having to check, numpy's current NUL-padded dtypes
only strip trailing NULs, so they can round-trip strings that contain NULs,
just not strings where NUL is the last character.
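For example (a small sketch with the existing S dtype; U behaves the same way):

    import numpy as np

    a = np.array([b"ab\x00cd"], dtype="S5")
    print(a[0])   # b'ab\x00cd' -- the embedded NUL round-trips fine
    b = np.array([b"abcd\x00"], dtype="S5")
    print(b[0])   # b'abcd'     -- the trailing NUL is stripped on the way out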

So the set of strings representable by str/bytes is a strict superset of
the set of strings representable by numpy U/S dtypes, which in turn is a
strict superset of the set of strings representable by a hypothetical
NUL-terminated dtype.

(Of course this doesn't matter for most practical purposes, because people
rarely make strings with embedded NULs.)

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Nathaniel Smith
On Apr 25, 2017 9:35 AM, "Chris Barker"  wrote:


 - filenames

File names are one of the key reasons folks struggled with the python3 data
model (particularly on *nix) and why 'surrogateescape' was added. It's
pretty common to store filenames in with our data, and thus in numpy arrays
-- we need to preserve them exactly and display them mostly right. Again,
euro-centric, but if you are euro-centric, then latin-1 is a good choice
for this.


Eh... First, on Windows and MacOS, filenames are natively Unicode. So you
don't care about preserving the bytes, only the characters. It's only Linux
and the other traditional unixes where filenames are natively bytestrings.
And then, from Python, if you want to actually work with those filenames
you need to either have a bytestring type or else a Unicode type that uses
surrogateescape to represent the non-ascii characters. I'm not seeing how
latin1 really helps anything here -- best case you still have to do
something like the wsgi "encoding dance" before you could use the
filenames. IMO if you have filenames that are arbitrary bytestrings and you
need to represent this properly, you should just use bytestrings -- really,
they're perfectly friendly :-).
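(For anyone who hasn't run into it, the dance looks roughly like this; the
filename bytes are a made-up example:)

    raw = b"data_\xe9.txt"                         # arbitrary non-UTF-8 bytes from the filesystem
    name = raw.decode("utf-8", "surrogateescape")  # smuggled into a str via lone surrogates
    assert name.encode("utf-8", "surrogateescape") == raw   # recovered exactly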

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Nathaniel Smith
On Apr 25, 2017 10:13 AM, "Anne Archibald" 
wrote:


On Tue, Apr 25, 2017 at 6:05 PM Chris Barker  wrote:

> Anyway, I think I made the mistake of mingling possible solutions in with
> the use-cases, so I'm not sure if there is any consensus on the use cases
> -- which I think we really do need to nail down first -- as Robert has made
> clear.
>

I would make my use-cases more user-specific:

1) User wants an array with numpy indexing tricks that can hold python
strings but doesn't care about the underlying representation.
-> Solvable with object arrays, or Robert's string-specific object arrays;
underlying representation is python objects on the heap. Sadly UCS-4, so
zillions are going to be a memory problem.


It's possible to do much better than this when defining a specialized
variable-width string dtype. E.g. make the itemsize 8 bytes (like an object
array, assuming a 64 bit system), but then for strings that can be encoded
in 7 bytes or less of utf8 store them directly in the array; else store a
pointer to a raw utf8 string on the heap. (Possibly with a reference count
- there are some interesting tradeoffs there. I suspect 1-byte reference
counts might be the way to go; if a logical copy would make it overflow
then make an actual copy instead.) Anything involving the heap is going to
have some overhead, but we don't need full fledged Python objects and once
we give up mmap compatibility then there's a lot of room to tune.
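To make that concrete, here's a toy pure-Python sketch of the layout idea --
the real thing would live in C, and the 8-byte itemsize, tag byte, and names
are just illustrative:

    heap = []  # stand-in for out-of-line utf-8 buffers

    def pack(s):
        data = s.encode("utf-8")
        if len(data) <= 7:
            # short string: first byte is the length, rest is inline utf-8
            return bytes([len(data)]) + data.ljust(7, b"\x00")
        # long string: tag byte 0xFF, then an index into the heap
        heap.append(data)
        return b"\xff" + (len(heap) - 1).to_bytes(7, "little")

    def unpack(cell):
        n = cell[0]
        if n != 0xff:
            return cell[1:1 + n].decode("utf-8")
        return heap[int.from_bytes(cell[1:], "little")].decode("utf-8")

    assert unpack(pack("hi")) == "hi"
    assert unpack(pack("a longer string")) == "a longer string"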

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Nathaniel Smith
On Tue, Apr 25, 2017 at 4:11 PM, Chris Barker - NOAA Federal
 wrote:
>> On Apr 25, 2017, at 12:38 PM, Nathaniel Smith  wrote:
>
>> Eh... First, on Windows and MacOS, filenames are natively Unicode.
>
> Yeah, though once they are stored in a text file -- who the heck
> knows? That may be simply unsolvable.
>> [...] And then, from Python, if you want to actually work with those
>> filenames you need to either have a bytestring type or else a Unicode type
>> that uses surrogateescape to represent the non-ascii characters.
>
>
>> IMO if you have filenames that are arbitrary bytestrings and you need to 
>> represent this properly, you should just use bytestrings -- really, they're 
>> perfectly friendly :-).
>
> I thought the Python file (and Path) APIs all required (Unicode)
> strings? That was the whole complaint!

No, the path APIs all accept bytestrings (and ones that return
pathnames like listdir return bytestrings if given bytestrings). Or at
least they're supposed to.
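E.g. (a quick sketch):

    import os

    entries = os.listdir(b".")   # bytes in -> a list of bytes filenames out
    print(entries[:3])
    os.stat(b".")                # the path-accepting APIs take bytes too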

The really urgent need for surrogateescape was things like sys.argv
and os.environ where arbitrary bytes might come in (on some systems)
but the API is restricted to strs.

> And no, bytestrings are not perfectly friendly in py3.

I'm not saying you should use them everywhere or that they remove the
need for an ergonomic text dtype, but when you actually want to work
with bytes they're pretty good (esp. in modern py3).

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Nathaniel Smith
On Apr 26, 2017 9:30 AM, "Chris Barker - NOAA Federal" <
chris.bar...@noaa.gov> wrote:


UTF-8 does not match the character-oriented Python text model. Plenty
of people argue that that isn't the "correct" model for Unicode text
-- maybe so, but it is the model python 3 has chosen. I wrote a much
longer rant about that earlier.

So I think the easy to access, and particularly defaults, numpy string
dtypes should match it.


This seems a little vague? The "character-oriented Python text model" is
just that str supports O(1) indexing of characters. But... Numpy doesn't.
If you want to access individual characters inside a string inside an
array, you have to pull out the scalar first, at which point the data is
copied and boxed into a Python object anyway, using whatever representation
the interpreter prefers. So AFAICT it makes literally no difference to the
user whether numpy's internal representation allows for fast character
access.

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Nathaniel Smith
On Apr 26, 2017 12:09 PM, "Robert Kern"  wrote:

On Wed, Apr 26, 2017 at 10:43 AM, Julian Taylor <
jtaylor.deb...@googlemail.com> wrote:
[...]
> I have read every mail and it has been a large waste of time; everything
> has been said already many times in the last few years.
> Even if you memory map string arrays, of which I have not seen a
> concrete use case in the mails beyond "would be nice to have" without
> any backing in actual code, but I may have missed it.

Yes, we have stated that FITS files with string arrays are currently being
read via memory mapping.

  http://docs.astropy.org/en/stable/io/fits/index.html

You were even pointed to a minor HDF5 implementation that memory maps:

  https://github.com/jjhelmus/pyfive/blob/master/pyfive/low_level.py#L682-L683

I'm afraid that I can't share the actual code of the full variety of
proprietary file formats that I've written code for, but I can assure you that
I have memory mapped many string arrays in my time, usually embedded as
columns in structured arrays. It is not "nice to have"; it is "have done
many times and needs better support".


Since concrete examples are often helpful in focusing discussions, here's
some code for reading a lab-internal EEG file format:

https://github.com/rerpy/rerpy/blob/master/rerpy/io/erpss.py

See in particular _header_dtype with its embedded string fields, and the
code in _channel_names_from_header -- both of these really benefit from
having a quick and easy way to talk about fixed width strings of single
byte characters. (The history here of course is that the original tools for
reading/writing this format are written in C, and they just read in
sizeof(struct header) and cast to the header.)
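As a rough sketch of the pattern (not the actual erpss.py header -- the field
names and sizes here are made up):

    import numpy as np

    # A binary header with fixed-width single-byte string fields, read
    # straight out of a buffer, much like the original C code casting a struct.
    header_dtype = np.dtype([
        ("magic",     "S4"),
        ("subject",   "S10"),
        ("nchannels", "<u2"),
    ])
    buf = b"EEG0" + b"subj01".ljust(10, b"\x00") + (32).to_bytes(2, "little")
    header = np.frombuffer(buf, dtype=header_dtype)[0]
    print(header["magic"], header["subject"], header["nchannels"])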

_get_full_string in that file is also interesting: it's a nasty hack I
implemented because in some cases I actually needed *fixed width* strings,
not NUL padded ones, and didn't know a better way to do it. (Yes, there's
void, but I have no idea how those work. They're somehow related to buffer
objects, whatever those are?) In other cases though that file really does
want NUL padding.

Of course that file is python 2 and blissfully ignorant of unicode.
Thinking about what we'd want if porting to py3:

For the "pull out this fixed width chunk of the file" problem (what
_get_full_string does) then I definitely don't care about unicode; this
isn't text. np.void or an array of np.uint8 aren't actually too terrible I
suspect, but it'd be nice if there were a fixed-width dtype where indexing
gave back a native bytes or bytearray object, or something similar like
np.bytes_.
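Something like this is what I mean by "not too terrible" (a sketch):

    import numpy as np

    raw = np.frombuffer(b"abc\x00\x00def\x00\x00", dtype=np.uint8)
    chunk = raw[0:5].tobytes()   # exactly 5 bytes, trailing NULs preserved
    print(chunk)                 # b'abc\x00\x00'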

For the arrays of single-byte-encoded-NUL-padded text, then the fundamental
problem is just to convert between a chunk of bytes in that format and
something that numpy can handle. One way to do that would be with an dtype
that represented ascii-encoded-fixed-width-NUL-padded text, or any
ascii-compatible encoding. But honestly I'd be just as happy with
np.encode/np.decode ufuncs that converted between the existing S dtype and
any kind of text array; the existing U dtype would be fine given that.

The other thing that might be annoying in practice is that when writing
py2/py3 polyglot code, I can say "str" to mean "bytes on py2 and unicode on
py3", but there's no dtype with similar behavior. Maybe there's no good
solution and this just needs a few version-dependent convenience functions
stuck in a private utility library, dunno.


> What you save by having utf8 in the numpy array is replacing a decoding
> and encoding step with a stripping null padding step.
> That doesn't seem very worthwhile compared to all their other overheads
> involved.

It's worthwhile enough that both major HDF5 bindings don't support Unicode
arrays, despite user requests for years. The sticking point seems to be the
difference between HDF5's view of a Unicode string array (defined in size
by the bytes of UTF-8 data) and numpy's current view of a Unicode string
array (because of UCS-4, defined by the number of
characters/codepoints/whatever). So there are HDF5 files out there that
none of our HDF5 bindings can read, and it is impossible to write certain
data efficiently.


I would really like to hear more from the authors of these libraries about
what exactly it is they feel they're missing. Is it that they want numpy to
enforce the length limit early, to catch errors when the array is modified
instead of when they go to write it to the file? Is it that they really
want an O(1) way to look at a array and know the maximum number of bytes
needed to represent it in utf-8? Is it that utf8<->utf-32 conversion is
really annoying and files that need it are rare so they haven't had the
motivation to implement it? My impression is similar to Julian's: you
*could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few
dozen lines of code, which is nothing compared to all the other hoops these
libraries are already jumping through, so if this is really the roadblock
then I must be missing something.

Re: [Numpy-discussion] [NumPy-discussion] Wish List of Possible ufunc Enhancements

2017-04-28 Thread Nathaniel Smith
On Fri, Apr 28, 2017 at 9:53 AM, Matthew Harrigan
 wrote:
> Here is a link to a wish list of possible ufunc enhancements.  I would like
> to know what the community thinks.

It looks like a pretty good list of ideas worth thinking about as and
when someone has time :-). I'm not sure what feedback you're looking
for beyond that? Do you have a purpose in mind for this list?

The main thing I'd add is: making it possible for ufunc core loops to
access the dtype object. This is the main blocker on a *lot* of
things, probably more so than anything else on that list, because it
would allow ufunc operations to be defined for parametrized dtypes
like the S and U dtypes, categorical data, etc.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy v1.13.0rc1 released.

2017-05-10 Thread Nathaniel Smith
On Wed, May 10, 2017 at 7:06 PM, Nathan Goldbaum  wrote:
> Hi Chuck,
>
> Is there a docs build for this release somewhere? I'd like to find an
> authoritative reference about __array_ufunc__, which I'd hesistated on
> looking into until now for fear about the API changing.

A sort-of-rendered version of the end-user docs can be seen here:
  
https://github.com/numpy/numpy/blob/master/doc/source/reference/arrays.classes.rst

And the NEP has been updated to hopefully provide a more spec-like
description of the final version:
  https://github.com/numpy/numpy/blob/master/doc/neps/ufunc-overrides.rst

Note that the API is "provisional" in 1.13, i.e. it *might* change in
backwards-incompatible ways:
  https://docs.python.org/3/glossary.html#term-provisional-api

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] UC Berkeley hiring developers to work on NumPy

2017-05-13 Thread Nathaniel Smith
Hi all,

As some of you know, I've been working for... quite some time now to
try to secure funding for NumPy. So I'm excited that I can now
officially announce that BIDS [1] is planning to hire several folks
specifically to work on NumPy. These will be full-time positions at UC
Berkeley, postdoc or staff, with probably 2 year (initial) contracts,
and the general goal will be to work on some of the major priorities
we identified at the last dev meeting: more flexible dtypes, better
interoperation with other array libraries, paying down technical debt,
and so forth. Though I'm sure the details will change as we start to
dig into things and engage with the community.

More details soon; universities move slowly, so nothing's going to
happen immediately. But this is definitely happening and I wanted to
get something out publicly before the conference season starts – so if
you're someone who might be interested in coming to work with me and
the other awesome folks at BIDS, then this is a heads-up: drop me a
line and we can chat! I'll be at PyCon next week if anyone happens to
be there. And feel free to spread the word.

-n

[1] http://bids.berkeley.edu/

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] UC Berkeley hiring developers to work on NumPy

2017-05-14 Thread Nathaniel Smith
On Sun, May 14, 2017 at 2:56 PM, Charles R Harris
 wrote:
>
>
> On Sat, May 13, 2017 at 11:45 PM, Nathaniel Smith  wrote:
>>
>> Hi all,
>>
>> As some of you know, I've been working for... quite some time now to
>> try to secure funding for NumPy. So I'm excited that I can now
>> officially announce that BIDS [1] is planning to hire several folks
>> specifically to work on NumPy. These will be full-time positions at UC
>> Berkeley, postdoc or staff, with probably 2 year (initial) contracts,
>> and the general goal will be to work on some of the major priorities
>> we identified at the last dev meeting: more flexible dtypes, better
>> interoperation with other array libraries, paying down technical debt,
>> and so forth. Though I'm sure the details will change as we start to
>> dig into things and engage with the community.
>>
>> More details soon; universities move slowly, so nothing's going to
>> happen immediately. But this is definitely happening and I wanted to
>> get something out publicly before the conference season starts – so if
>> you're someone who might be interested in coming to work with me and
>> the other awesome folks at BIDS, then this is a heads-up: drop me a
>> line and we can chat! I'll be at PyCon next week if anyone happens to
>> be there. And feel free to spread the word.
>
>
> Excellent news. Do you have any sort of timeline in mind?

The exact timeline's going to be determined in large part by
university+funder logistics. I thought it was going to happen last
year, so at this point I'm just going with the flow :-). The process
for hiring staff definitely takes a few months at a minimum; with
postdocs there's a little more flexibility.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] UC Berkeley hiring developers to work on NumPy

2017-05-14 Thread Nathaniel Smith
On Sun, May 14, 2017 at 2:11 PM, Chris Barker - NOAA Federal
 wrote:
> Awesome! This is really great news.
>
> Does this mean several person-years of funding are secured?

Yes – hoping to give more details there soon. (There's nothing dire
and secretive, it's just the logistics of getting an announcement
approved by funder communication people didn't work with getting
something out by PyCon, so this is the slightly confusing compromise.)

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal: np.search() to complement np.searchsorted()

2017-05-15 Thread Nathaniel Smith
On May 9, 2017 9:47 AM, "Martin Spacek"  wrote:

Hello,

I've opened up a pull request to add a function called np.search(), or
something like it, to complement np.searchsorted():

https://github.com/numpy/numpy/pull/9055

There's also this issue I opened before starting the PR:

https://github.com/numpy/numpy/issues/9052

Proposed API changes require discussion on the list, so here I am!

This proposed function (and perhaps array method?) does the same as
np.searchsorted(a, v), but doesn't require `a` to be sorted, and explicitly
checks if all the values in `v` are a subset of those in `a`. If not, it
currently raises an error, but that could be controlled via a kwarg.

As I mentioned in the PR, I often find myself abusing np.searchsorted() by
not explicitly checking these assumptions. The temptation to use it is
great, because it's such a fast and convenient function, and most of the
time that I use it, the assumptions are indeed valid. Explicitly checking
those assumptions each and every time before I use np.searchsorted() is
tedious, and easy to forget to do. I wouldn't be surprised if many others
abuse np.searchsorted() in the same way.


It's worth noting though that the "sorted" part is a critical part of what
makes it fast. If we're looking for k needles in an n-item haystack, then:

If the haystack is already sorted and we know it, using searchsorted does
it in k*log2(n) comparisons. (Could be reduced to average case O(k log log
n) for simple scalars by using interpolation search, but I don't think
searchsorted is that clever atm.)

If the haystack is not sorted, then sorting it and then using searchsorted
requires a total of O(n log n) + k*log2(n) comparisons.

And if the haystack is not sorted, then doing linear search to find the
first item like list.index does requires on average 0.5*k*n comparisons.

This analysis ignores memory effects, which are important -- linear memory
access is faster than random access, and searchsorted is all about making
memory access maximally unpredictable. But even so, I think
sorting-then-searching will be reasonably competitive pretty much from the
start, and for moderately large k and n values the difference between (n +
k)*log(n) and n*k is huge.

Another issue is that sorting requires an O(n)-sized temporary buffer
(assuming you can't mutate the haystack in place). But if your haystack is
a large enough fraction of memory that you can't afford this buffer, then
it's likely large enough that you can't afford linear searching either...


Looking at my own habits and uses, it seems to me that finding the indices
of matching values of one array in another is a more common use case than
finding insertion indices of one array into another sorted array. So, I
propose that np.search(), or something like it, could be even more useful
than np.searchsorted().


My main concern here would be creating a trap for the unwary, where people
use search() naively because it's so nice and convenient, and then
eventually get surprised by a nasty quadratic slowdown. There's a whole
blog about these traps :-) https://accidentallyquadratic.tumblr.com/

Otoh there are also huge number of numpy use cases where it doesn't matter
if some calculation is 1000x slower than it should be, as long as it works
and is discoverable...

So it sounds like one obvious thing would be to have a version of
searchsorted that checks for matches (maybe side="exact"? Though that's not
easy to find...). That's clearly useful, and orthogonal to the
linear/binary search issue, so we shouldn't make it a reason people are
tempted to choose the inferior algorithm.

...ok, how's this for a suggestion. Give np.search a strategy= kwarg, with
options "linear", "searchsorted", and "auto". Linear does the obvious
thing, searchsorted generates a sorter array using argsort (unless the user
provided one) and then calls searchsorted, and auto picks one of them
depending on whether a sorter array was provided and how large the arrays
are. The default is auto. In all cases it looks for exact matches.
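As a sketch of what the strategy="searchsorted" path could look like under the
hood (the function name and error handling here are just illustrative):

    import numpy as np

    def search_via_sort(haystack, needles):
        sorter = np.argsort(haystack, kind="mergesort")
        pos = np.searchsorted(haystack, needles, sorter=sorter)
        idx = sorter[np.clip(pos, 0, len(haystack) - 1)]
        if not np.all(haystack[idx] == needles):
            raise ValueError("some values in `needles` are not in `haystack`")
        return idx

    haystack = np.array([30, 10, 20, 40])
    print(search_via_sort(haystack, np.array([20, 40])))   # -> [2 3]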

I guess by default "not found" should be signaled with an exception, and
then there should be some option to have it return a sentinel value
instead? The problem is that since we're returning integers then there's no
sentinel value that's necessarily an invalid index (e.g. indexing will
happily accept -1).

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] UC Berkeley hiring developers to work on NumPy

2017-05-19 Thread Nathaniel Smith
Okay, a few more details :-)

The initial funding here is a grant from the Gordon and Betty Moore
Foundation to UCB with me as PI, in the amount of $645,020. There's
also another thing in the pipeline that might supplement that, but
it'll be
~6 months yet before we know for sure. So keep your fingers crossed I guess.

Here's some text from the proposal (the references to "this year" may
give some sense of how long this has taken...):
  
https://docs.google.com/document/d/1xHjQqc8V8zJk7WSCyw9NPCpMYZ2Urh0cmFm2vDd14ZE/edit

-n

On Sun, May 14, 2017 at 3:40 PM, Nathaniel Smith  wrote:
> On Sun, May 14, 2017 at 2:11 PM, Chris Barker - NOAA Federal
>  wrote:
>> Awesome! This is really great news.
>>
>> Does this mean several person-years of funding are secured?
>
> Yes – hoping to give more details there soon. (There's nothing dire
> and secretive, it's just the logistics of getting an announcement
> approved by funder communication people didn't work with getting
> something out by PyCon, so this is the slightly confusing compromise.)
>
> -n
>
> --
> Nathaniel J. Smith -- https://vorpus.org



-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] UC Berkeley hiring developers to work on NumPy

2017-05-19 Thread Nathaniel Smith
On Mon, May 15, 2017 at 1:43 AM, Matthew Brett  wrote:
> Hi,
>
> On Sun, May 14, 2017 at 10:56 PM, Charles R Harris
>  wrote:
>>
>>
>> On Sat, May 13, 2017 at 11:45 PM, Nathaniel Smith  wrote:
>>>
>>> Hi all,
>>>
>>> As some of you know, I've been working for... quite some time now to
>>> try to secure funding for NumPy. So I'm excited that I can now
>>> officially announce that BIDS [1] is planning to hire several folks
>>> specifically to work on NumPy. These will be full-time positions at UC
>>> Berkeley, postdoc or staff, with probably 2 year (initial) contracts,
>>> and the general goal will be to work on some of the major priorities
>>> we identified at the last dev meeting: more flexible dtypes, better
>>> interoperation with other array libraries, paying down technical debt,
>>> and so forth. Though I'm sure the details will change as we start to
>>> dig into things and engage with the community.
>>>
>>> More details soon; universities move slowly, so nothing's going to
>>> happen immediately. But this is definitely happening and I wanted to
>>> get something out publicly before the conference season starts – so if
>>> you're someone who might be interested in coming to work with me and
>>> the other awesome folks at BIDS, then this is a heads-up: drop me a
>>> line and we can chat! I'll be at PyCon next week if anyone happens to
>>> be there. And feel free to spread the word.
>>
>>
>> Excellent news. Do you have any sort of timeline in mind?
>>
>> It will be interesting to see what changes this leads to, both in the code
>> and in the project sociology.
>
> I was thinking the same thing - if this does come about, it would
> likely have a big impact on practical governance.  It could also mean
> that more important development conversations happen off-list.   It
> seems to me it would be good to plan for this consciously.

Yeah, definitely. Being able to handle changes like this was one of
the major motivations for all the governance discussions we started a
few years ago, and it's something we'll need to keep an eye on going
forward. To state it explicitly though: the idea is to fund folks so
that they can contribute to numpy within our existing process of open
community review, and preserving and growing that community is very
much one of the grant's goals; no-one should get special privileges
because of where their paycheck is coming from. If at some point you
(or anyone) feel like we're deviating from that please speak up.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] UC Berkeley hiring developers to work on NumPy

2017-05-24 Thread Nathaniel Smith
On Mon, May 22, 2017 at 10:04 AM, Sebastian Berg
 wrote:
> On Mon, 2017-05-22 at 17:35 +0100, Matthew Brett wrote:
>> Hi,
>>
>> On Mon, May 22, 2017 at 4:52 PM, Marten van Kerkwijk
>>  wrote:
>> > Hi Matthew,
>> >
>> > > it seems to me that we could get 80% of the way to a reassuring
>> > > blueprint with a relatively small amount of effort.
>> >
>> > My sentence "adapt the typical academic rule for conflicts of
>> > interests to PRs, that non-trivial ones cannot be merged by someone
>> > who has a conflict of interest with the author, i.e., it cannot be
>> > a
>> > superviser, someone from the same institute, etc." was meant as a
>> > suggestion for part of this blueprint!
>> >
>> > I'll readily admit, though, that since I'm not overly worried, I
>> > haven't even looked at the policies that are in place, nor do I
>> > intend
>> > to contribute much beyond this e-mail. Indeed, it may be that the
>> > old
>> > adage "every initiative is punishable" holds here...
>>
>> I understand what you're saying, but I think a more helpful way of
>> thinking of it, is putting the groundwork in place for the most
>> fruitful possible collaboration.
>>
>> > would you, or one
>> > of the others who feels it is important to have a blueprint, be
>> > willing to provide a concrete text for discussion?
>>
>> It doesn't make sense for me to do that, I'm #13 for commits in the
>> last year.  I'm just one of the many people who completely depend on
>> numpy.  Also, taking a little time to think these things through
>> seems
>> like a small investment with the potential for significant gain, in
>> terms of improving communication and mitigating risk.
>>
>> So, I think my suggestion is that it would be a good idea for
>> Nathaniel and the current steering committee to talk through how this
>> is going to play out, how the work will be selected and directed, and
>> so on.
>>
>
> Frankly, I would suggest to wait for now and ask whoever is going to
> get the job to work out how they think it should be handled. And then
> we complain if we expect more/better ;).

This is roughly where I am as well. Certainly this is an important
issue, but we've already done a lot of groundwork in the abstract –
the dev meeting, formalizing the governance document, and so forth
(and recall that "let's get to a point where we can apply for grants"
was explicitly one of the goals in those discussions). I think at this
point the most productive thing to do is wait until we have a more
concrete picture of who/what/when will be happening, so we can make a
concrete plan.

> For now I would only say that I will expect more community-type work
> than we now often manage to do, and things such as meticulously
> sticking to writing NEPs.
> So the only thing I can see that might be good is putting "community
> work" or something like it specifically as part of the job description,

Definitely.

> and thats up to Nathaniel probably.
>
> Some things like not merging large changes by two people sitting in
> the same office should be obvious (and even if it happens, we can
> revert). But it's nothing much new there I think.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Future of ufuncs

2017-05-29 Thread Nathaniel Smith
On Mon, May 29, 2017 at 1:51 PM, Charles R Harris
 wrote:
>
>
> On Mon, May 29, 2017 at 12:32 PM, Marten van Kerkwijk
>  wrote:
>>
>> Hi Chuck,
>>
>> Like Sebastian, I wonder a little about what level you are talking
>> about. Presumably, it is the actual implementation of the ufunc? I.e.,
>> this is not about the upper logic that decides which `__array_ufunc__`
>> to call, etc.
>>
>> If so, I agree with you that it would seem to make most sense to move
>> the implementation to `multiarray`; the current structure certainly is
>> a major hurdle to understanding how things work!
>>
>> Indeed, I guess in terms of my earlier suggestion to make much of a
>> ufunc happen in `ndarray.__array_ufunc__`, one could see the type
>> resolution and iteration happening there. If one were to expose the
>> inner loops, anyone working with buffers could then use the ufuncs by
>> defining their own __array_ufunc__.
>
>
> The idea of separating ufuncs from ndarray was put forward many years ago,
> maybe five or six. What I seek here is a record that we have given up on
> that ambition, so do not need to take it into consideration in the future.
> In particular, we can feel free to couple ufuncs even more tightly with
> ndarray.

I think we do want to separate ufuncs from ndarray semantically: it
should be possible to use ufuncs on sparse arrays, dask arrays, etc.
etc.

But I don't think that altering ufuncs to work directly on
buffer/memoryview objects, or shipping them as a separate package from
the rest of numpy, is a useful step towards this goal.

Right now, handling buffers/memoryviews is easy: one can trivially
convert between them and ndarray without making any copies. I don't
know of any interesting problems that are blocked because ufuncs work
on ndarrays instead of buffer/memoryview objects. The interesting
problems are where there's a fundamentally different storage strategy
involved, like sparse/dask/... arrays.
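For example (a minimal sketch):

    import numpy as np

    a = np.arange(5, dtype=np.int64)
    m = memoryview(a)    # a view of the same buffer, no copy
    b = np.asarray(m)    # and back to an ndarray, still no copy
    b[0] = 99
    print(a[0])          # 99 -- all three share the same memory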

And similarly, I don't see what problems are solved by splitting them
out for building or distribution.

OTOH, trying to accomplish either of these things definitely has a
cost in terms of churn, complexity, double the workload for
release-management, etc. Even the current split between the multiarray
and umath modules causes problems all the time. It's mostly boring
problems like having little utility functions that are needed in both
places but awkward to share, or problems caused by the complicated
machinery needed to let them interact properly (set_numeric_ops and
all that) – this doesn't seem like stuff that's adding any value.

Plus, there's a major problem that buffers/memoryviews don't have any
way to represent all the dtypes we currently support (e.g. datetime64)
and don't have any way to add new ones, and the only way to fix this
would be to write a PEP, shepherding patches through python-dev,
waiting for the next python major release and then dropping support
for all older Python releases. None of this is going to happen soon;
probably we should plan on the assumption that it will never happen.
So I don't see how this could work at all.

So my vote is for merging the multiarray and umath code bases
together, and then taking advantage of the resulting flexibility to
refactor the internals to provide cleanly separated interfaces at the
API level.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [SciPy-Dev] PyRSB: Python interface to librsb sparse matrices library

2017-06-24 Thread Nathaniel Smith
On Jun 24, 2017 7:29 AM, "Sylvain Corlay"  wrote:


Also, one quick question: is the LGPL license a deliberate choice or is it
not important to you? Most projects in the Python scientific stack are BSD
licensed. So the LGPL choice makes it unlikely that a higher-level project
adopts it as a dependency. If you are the only copyright holder, you would
still have the possibility to license it under a more permissive license
such as BSD or MIT...


Why would LGPL be a problem in a dependency? That doesn't stop you making
your code BSD, and it's less restrictive license-wise than depending on MKL
or the windows C runtime...

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Boolean binary '-' operator

2017-06-26 Thread Nathaniel Smith
On Sun, Jun 25, 2017 at 9:45 AM, Stefan van der Walt
 wrote:
> Hi Chuck
>
> On Sun, Jun 25, 2017, at 09:32, Charles R Harris wrote:
>
>> The boolean binary '-' operator was deprecated back in NumPy 1.9 and changed
>> to an error in 1.13. This caused a number of failures in downstream
>> projects. The choices now are to continue the deprecation for another couple
>> of releases, or simply give up on the change. For booleans,  `a - b` was
>> implemented as `a xor b`, which leads to the somewhat unexpected identity `a
>> - b == b - a`, but it is a handy operator that allows simplification of some
> functions, `numpy.diff` among them. At this point I'm inclined to give up
>> on the deprecation and retain the old behavior. It is a bit impure but
>> perhaps we can consider it a feature rather than a bug.
>
>
> What was the original motivation behind the deprecation?  `xor` seems like
> exactly what one would expect when subtracting boolean arrays.
>
> But, in principle, I'm not against the deprecation (we've had to fix a few
> problems that arose in skimage, but nothing big).

I believe that this happened as part of a review of the whole
arithmetic system for np.bool_. Traditionally, we have + is "or",
binary - is "xor", and unary - is "not".

Here are some identities you might expect, if 'a' and 'b' are np.bool_ objects:

a - b = a + (-b)
a + b - b = a
bool(a + b) = bool(a) + bool(b)
bool(a - b) = bool(a) - bool(b)
bool(-a) = -bool(a)

But in fact none of these identities hold. Furthermore, the np.bool_
arithmetic operations are all confusing synonyms for operations that
could be written more clearly using the proper boolean operators |, ^,
~, so they violate TOOWTDI. So I think the general idea was to
deprecate all of this nonsense.
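
For instance, a minimal check using only '+' (the one of these that still
runs without error on 1.13):

import numpy as np

a, b = np.bool_(True), np.bool_(True)

print(bool(a + b))        # '+' is logical OR here, so this is True, i.e. 1
print(bool(a) + bool(b))  # 2 -- so bool(a + b) != bool(a) + bool(b)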

It looks like what actually happened is that binary - and unary - got
deprecated a while back and are now raising errors in 1.13.0, but +
did not. This is sort of unfortunate, because binary - is the only one
of these that's somewhat defensible (it doesn't match the builtin bool
type, but it does at least correspond to subtraction in Z/2, so
identities like 'a - (b - b) = a' do hold).

I guess my preference would be:
1) deprecate +
2) move binary - back to deprecated-but-not-an-error
3) fix np.diff to use logical_xor when the inputs are boolean, since
that seems to be what people expect (quick example below)
4) keep unary - as an error

And if we want to be less aggressive, then a reasonable alternative would be:
1) deprecate +
2) un-deprecate binary -
3) keep unary - as an error
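
For concreteness, (3) just means giving people the pairwise xor they
already expect, something like:

import numpy as np

a = np.array([True, True, False, True])
print(np.logical_xor(a[1:], a[:-1]))   # [False  True  True]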

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Boolean binary '-' operator

2017-06-27 Thread Nathaniel Smith
On Jun 26, 2017 6:56 PM, "Charles R Harris" 
wrote:


> On 27 Jun 2017, 9:25 AM +1000, Nathaniel Smith , wrote:
>
I guess my preference would be:
> 1) deprecate +
> 2) move binary - back to deprecated-but-not-an-error
> 3) fix np.diff to use logical_xor when the inputs are boolean, since
> that seems to be what people expect
> 4) keep unary - as an error
>
> And if we want to be less aggressive, then a reasonable alternative would
> be:
> 1) deprecate +
> 2) un-deprecate binary -
> 3) keep unary - as an error
>
>
Using '+' for 'or' and '*' for 'and' is pretty common and the variation of
'+' for 'xor' was common back in the day because 'and' and 'xor' make
boolean algebra a ring, which appealed to mathematicians as opposed to
everyone else ;)


'+' for 'xor' and '*' for 'and' is perfectly natural; that's just + and *
in Z/2. It's not only a ring, it's a field! '+' for 'or' is much weirder;
why would you use '+' for an operation that's not even invertible? I guess
it's a semi-ring. But we have the '|' character right there; there's no
expectation that every weird mathematical notation will be matched in
numpy... The most notable is that '*' doesn't mean matrix multiplication.


You can see the same progression in measure theory where eventually
intersection and xor (symmetric difference) was replaced with union and
complement. Using '-' for xor is something I hadn't seen outside of numpy,
but I suspect it must be standard somewhere.  I would leave '*' and '+'
alone, as the breakage and inconvenience from removing them would be
significant.


'*' doesn't bother me, because it really does have only one sensible
behavior; even built-in bool() effectively uses 'and' for '*'.

But, now I remember... The major issue here is that some people want dot(a,
b) on Boolean matrices to use these semantics, right? Because in this
particular case it leads to some useful connections to the matrix
representation for logical relations [1]. So it's sort of similar to the
diff() case. For the basic operation, using '|' or '^' is fine, but there
are these derived operations like 'dot' and 'diff' where people have
different expectations.

I guess Juan's example of 'sum' is relevant here too. It's pretty weird
that if 'a' and 'b' are one-dimensional boolean arrays, 'a @ b' and 'sum(a
* b)' give totally different results.

So that's the fundamental problem: there are a ton of possible conventions
that are each appealing in one narrow context, and they all contradict each
other, so trying to shove them all into numpy simultaneously is messy.

I'm glad we at least seem to have succeeded in getting rid of unary '-',
that one was particularly indefensible in the context of everything else
:-). For the rest, I'm really not sure whether it's better to deprecate
everything and tell people to use specialized tools for specialized
purposes (e.g. add a 'logical_dot'), or to special case the high-level
operations people want (make 'dot' and 'diff' continue to work, but
deprecate + and -), or just leave the whole incoherent mish-mash alone.
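
For reference, the kind of specialized helper I mean by 'logical_dot' could
be as small as this sketch (the name and the helper itself are hypothetical,
just to illustrate the semantics):

import numpy as np

def logical_dot(a, b):
    # Boolean matrix "product": OR over k of (a[i, k] AND b[k, j]),
    # i.e. composition of relations, instead of an integer sum.
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    return np.any(a[..., :, :, None] & b[..., None, :, :], axis=-2)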

-n

[1] https://en.wikipedia.org/wiki/Logical_matrix
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Boolean binary '-' operator

2017-06-27 Thread Nathaniel Smith
On Tue, Jun 27, 2017 at 3:09 PM, Robert Kern  wrote:
> On Tue, Jun 27, 2017 at 3:01 PM, Benjamin Root  wrote:
>>
>> Forgive my ignorance, but what is "Z/2"?
>
> https://groupprops.subwiki.org/wiki/Cyclic_group:Z2
> https://en.wikipedia.org/wiki/Cyclic_group

This might be a slightly better link?
https://en.wikipedia.org/wiki/Modular_arithmetic#Integers_modulo_n

Anyway, it's a math-nerd way of saying "the integers modulo two", i.e.
the numbers 0 and 1 with * as AND and + as XOR. But the nice thing
about Z/2 is that if you know some abstract algebra, then one of the
most fundamental theorems is that if p is prime then Z/p is a "field",
meaning that * and + are particularly well-behaved. And 2 is a prime,
so pointing out that the bools with AND and XOR is the same as Z/2 is
a way of saying "this way of defining * and + is internally consistent
and well-behaved".

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposed changes to array printing in 1.14

2017-06-30 Thread Nathaniel Smith
On Fri, Jun 30, 2017 at 7:23 PM, Juan Nunez-Iglesias  wrote:
> I agree that shipping a sane/sanitising doctest runner would go 95% of the
> way to alleviating my concerns.
>
> Regarding 2.0, this is the whole point of semantic versioning: downstream
> packages can pin their dependency as 1.x and know that they
> - will continue to work with any updates
> - won’t make their users choose between new NumPy 1.x features and running
> their software.

Semantic versioning is somewhere between useless and harmful for
non-trivial projects. It's a lovely idea, it would be lovely if it
worked, but in practice it either means you make every release a major
release, which doesn't help anything, or else you never make a major
release until eventually everyone gets so frustrated that they fork
the project or do a python 3 style break-everything major release,
which is a cure that's worse than the original disease.

NumPy's strategy instead is to make small, controlled, rolling
breaking changes in 1.x releases. Every release breaks something for
someone somewhere, but ideally only after debate and appropriate
warning, and hopefully most releases don't break things for *you*.
Change is going to happen one way or another, and it's easier to
manage a small amount of breakage every few releases than to manage a
giant chunk all at once. (The latter just seems easier because it's in
the future, so your brain is like "eh I'm sure I'll be fine" until you
get there and realize how doomed you are.)

Plus, the reality is that every numpy release ever made has
accidentally broken something for someone somewhere, so instead of
lying to ourselves and pretending that we can keep things perfectly
backwards compatible at all times, we might as well acknowledge that
and try to manage the cost of breakage and make them worthwhile. Heck,
even bug fixes are frequently compatibility-breaking changes in
reality, and here we are debating whether tweaking whitespace in reprs
is a compatibility-breaking change. There's no line of demarcation
between breaking changes and non-breaking changes, just shades of
grey, and we can do better engineering if our processes acknowledge
that.

Another critique of semantic versioning:
  https://gist.github.com/jashkenas/cbd2b088e20279ae2c8e

The Google philosophy of "error budgets", which is somewhat analogous
to the argument I'm making for a compatibility-breakage budget:
  https://www.usenix.org/node/189332
  
https://landing.google.com/sre/book/chapters/service-level-objectives.html#xref_risk-management_global-chubby-planned-outage

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Vector stacks

2017-07-01 Thread Nathaniel Smith
On Sat, Jul 1, 2017 at 3:31 PM, Charles R Harris
 wrote:
> Hi All,
>
> The  '@' operator works well with stacks of matrices, but not with stacks of
> vectors. Given the recent addition of '__array_ufunc__',  and the intent to
> make `__matmul__` use a ufunc, I've been wondering is it would make sense to
> add ndarray subclasses 'rvec' and 'cvec' that would override that operator
> so as to behave like stacks of row/column vectors. Any other ideas for the
> solution to stacked vectors are welcome.

I feel like the lesson of np.matrix is that subclassing ndarray to
change the meaning of basic operators creates more problems than it
solves?

Some alternatives include:
- if you specifically want a stack of row vectors or column vectors,
insert a new axis at position -1 or -2 (quick example below)
- if you want a stack of 1d vectors that automatically act as rows on
the left of @ and columns on the right, then we could have vecvec,
matvec, vecmat gufuncs that do that -- which isn't quite as terse as
@, but not everything can be and at least it'd be explicit what was
going on.
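
For the first alternative, the axis-insertion dance looks roughly like:

import numpy as np

vecs = np.ones((10, 3))      # a stack of 10 3-vectors
mats = np.ones((10, 3, 3))   # a stack of 10 3x3 matrices

# treat each vector as a column on the right of '@' ...
col = (mats @ vecs[..., :, None])[..., 0]     # shape (10, 3)
# ... or as a row on the left
row = (vecs[..., None, :] @ mats)[..., 0, :]  # shape (10, 3)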

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Making a 1.13.2 release

2017-07-06 Thread Nathaniel Smith
It's also possible to work around the 3.6.1 problem with a small
preprocessor hack. On my phone but there's a link in the bug report
discussion.

On Jul 6, 2017 6:10 AM, "Charles R Harris" 
wrote:

> Hi All,
>
> I've delayed the NumPy 1.13.2 release hoping for Python 3.6.2 to show up
> fixing #29943 so we can close #9272, but the Python release has
> been delayed to July 11 (expected). The Python problem means that NumPy
> compiled with Python 3.6.1 will not run in Python 3.6.0. However, I've also
> been asked to have a bugfixed version of 1.13 available for Scipy 2017 next
> week. At this point it looks like the best thing to do is release 1.13.1
> compiled with Python 3.6.1 and ask folks to upgrade Python if they have a
> problem, and then release 1.13.2 as soon as 3.6.2 is released.
>
> Thoughts?
>
> Chuck
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy steering councils members

2017-07-21 Thread Nathaniel Smith
On Jul 21, 2017 9:36 AM, "Sebastian Berg" 
wrote:

On Fri, 2017-07-21 at 16:58 +0200, Julian Taylor wrote:
> On 21.07.2017 08:52, Ralf Gommers wrote:
> > Hi all,
> >
> > It has been well over a year since we put together the governance
> > structure and steering council
> > (https://docs.scipy.org/doc/numpy-dev/dev/governance/people.html#go
> > vernance-people).
> > We haven't reviewed the people on the steering council in that
> > time.
> > Based on the criteria for membership I would like to make the
> > following
> > suggestion (note, not discussed with everyone in private
> > beforehand):
> >
> > Adding the following people to the steering council:
> > - Eric Wieser
> > - Marten van Kerkwijk
> > - Stephan Hoyer
> > - Allan Haldane
> >
>
>
> Eric and Marten have only been members with commit rights for 6
> months,
> While they have been contributing and very valuable to the project
> for
> significantly longer, I do think this is a bit too short a time to be
> considered for the steering council.
> I certainly approve of them becoming members at some point, but I do
> want to avoid the steering council growing too large too quickly as long
> as
> it does not need more members to do its job.
> What I do want to avoid is that the steering council becomes like our
> committers list, a group that only grows and never shrinks as long as
> the occasional heartbeat is heard.
>
> That said if we think the current steering council is not able to
> fulfil
> its purpose I do offer my seat for a replacement as I currently have
> not
> really been contributing much.

I doubt that ;). IIRC the rules were "at least one year", so you are
probably right that we should delay the official status until then, but
I don't care much personally.


Fwiw, the rule to qualify is at least one year of "contributions" that are
"sustained" and "substantial". Having a commit bit definitely helps with
some kinds of contributions (merging PRs, triaging bugs), but there's no
clock that starts ticking when someone gets a commit bit; contributions
before that count too.

"""
To become eligible to join the Steering Council, an individual must be a
Project Contributor who has produced contributions that are substantial in
quality and quantity, and sustained over at least one year. Potential
Council Members are nominated by existing Council members, and become
members following consensus of the existing Council members, and
confirmation that the potential Member is interested and willing to serve
in that capacity. [...]

When considering potential Members, the Council will look at candidates
with a comprehensive view of their contributions. This will include but is
not limited to code, code review, infrastructure work, mailing list and
chat participation, community help/building, education and outreach, design
work, etc.
"""

Also FWIW, the jupyter steering council is currently 15 people, or 16
including Fernando:
  https://github.com/jupyter/governance/blob/master/people.md

By comparison, Numpy's currently has 8, so Ralf's proposal would bring it
to 11:

https://docs.scipy.org/doc/numpy-dev/dev/governance/people.html#governance-people

Looking at the NumPy council, then with the exception of Alex who I haven't
heard from in a while, it looks like a list of people who regularly speak
up and have sensible things to say, so I don't personally see any problem
with keeping everyone around. It's not like the council is an active
working group; it's mainly for occasional oversight and boring logistics.

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Dropping support for Accelerate

2017-07-23 Thread Nathaniel Smith
I've been wishing we'd stop shipping Accelerate for years, because of
how it breaks multiprocessing – that doesn't seem to be on your list
yet.

On Sat, Jul 22, 2017 at 3:50 AM, Ilhan Polat  wrote:
> A few months ago, I had the innocent intention to wrap LDLt decomposition
> routines of LAPACK into SciPy but then I am made aware that the minimum
> required version of LAPACK/BLAS was due to Accelerate framework. Since then
> I've been following the core SciPy team and others' discussion on this
> issue.
>
> We have been exchanging opinions for quite a while now within various SciPy
> issues and PRs about the ever-increasing Accelerate-related issues and I've
> compiled a brief summary about the ongoing discussions to reduce the
> clutter.
>
> First, I would like to kindly invite everyone to contribute and sharpen the
> cases presented here
>
> https://github.com/scipy/scipy/wiki/Dropping-support-for-Accelerate
>
> The reason I specifically wanted to post this also in NumPy mailing list is
> to probe for the situation from the NumPy-Accelerate perspective. Is there
> any NumPy specific problem that would indirectly effect SciPy should the
> support for Accelerate is dropped?
>
>
>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>



-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Dropping support for Accelerate

2017-07-25 Thread Nathaniel Smith
I updated the bit about OpenBLAS wheel with some more information on
the status of that work. It's not super important, but FYI.

I also want to disagree with this characterization of the
Accelerate/multiprocessing issue: "This problem was due to a bug in
multiprocessing and is fixed in Python 3.4 and later; Accelerate was
POSIX compliant but multiprocessing was not."

In 3.4 it became possible to *work around* this issue, but it requires
configuring the multiprocessing module in a non-default way, which
means that the common end-user experience is still that they try using
multiprocessing, and they get random hangs with no other feedback, and
then spend hours or days debugging before they discover this
configuration option. (And the problem occurs on MacOS only, so you
get extra fun when e.g. a module is developed on Windows or Linux and
then you give it to a less-technical collaborator on MacOS and it
breaks on their computer and you have no idea why.)
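
The configuration option in question is the start-method switch that 3.4
added; roughly:

import multiprocessing as mp

if __name__ == "__main__":
    # must be called once, early, before any pools/processes are created
    mp.set_start_method("spawn")   # or "forkserver"; avoids fork() entirely
    with mp.Pool(4) as pool:
        print(pool.map(abs, [-1, -2, -3]))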

And the workaround is suboptimal -- fork()'s memory-sharing semantics
are very powerful. I've had cases where I could easily and efficiently
solve a problem using multiprocessing in fork() mode, but where
enabling the workaround for Accelerate would have made it impossible.
(Specifically this happened because I had a huge read-only data
structure that I could load once in the parent process, and then all
the child processes could share it through fork's virtual memory
magic; I didn't have enough memory to load two copies of it, yet fork
let me have 10 or 20 virtual copies.)

Technically, yes, mixing threads and fork can't be done in a
POSIX-compliant manner. But no-one runs their code on an abstract
POSIX machine, and on actual systems it's totally possible to make
this work reliably. OpenBLAS does it. Users don't care if Apple is
technically correct, they just want their stuff to work.

-n

On Tue, Jul 25, 2017 at 4:57 AM, Matthew Brett  wrote:
> Hi,
>
> On Sun, Jul 23, 2017 at 5:07 PM, Ilhan Polat  wrote:
>> Ouch, that's from 2012 :(  I'll add this thread as a reference to the wiki
>> list.
>>
>>
>> On Sun, Jul 23, 2017 at 5:22 PM, Nathan Goldbaum 
>> wrote:
>>>
>>> See
>>> https://mail.scipy.org/pipermail/numpy-discussion/2012-August/063589.html
>>> and replies in that thread.
>>>
>>> Quote from an Apple engineer in that thread:
>>>
>>> "For API outside of POSIX, including GCD and technologies like Accelerate,
>>> we do not support usage on both sides of a fork(). For this reason among
>>> others, use of fork() without exec is discouraged in general in processes
>>> that use layers above POSIX."
>>>
>>> On Sun, Jul 23, 2017 at 10:16 AM, Ilhan Polat 
>>> wrote:
>>>>
>>>> That's probably because I know nothing about the issue, is there any
>>>> reference I can read about?
>>>>
>>>> But in general, please feel free populate new items in the wiki page.
>>>>
>>>> On Sun, Jul 23, 2017 at 11:15 AM, Nathaniel Smith  wrote:
>>>>>
>>>>> I've been wishing we'd stop shipping Accelerate for years, because of
>>>>> how it breaks multiprocessing – that doesn't seem to be on your list
>>>>> yet.
>>>>>
>>>>> On Sat, Jul 22, 2017 at 3:50 AM, Ilhan Polat 
>>>>> wrote:
>>>>> > A few months ago, I had the innocent intention to wrap LDLt
>>>>> > decomposition
>>>>> > routines of LAPACK into SciPy but then I am made aware that the
>>>>> > minimum
>>>>> > required version of LAPACK/BLAS was due to Accelerate framework. Since
>>>>> > then
>>>>> > I've been following the core SciPy team and others' discussion on this
>>>>> > issue.
>>>>> >
>>>>> > We have been exchanging opinions for quite a while now within various
>>>>> > SciPy
>>>>> > issues and PRs about the ever-increasing Accelerate-related issues and
>>>>> > I've
>>>>> > compiled a brief summary about the ongoing discussions to reduce the
>>>>> > clutter.
>>>>> >
>>>>> > First, I would like to kindly invite everyone to contribute and
>>>>> > sharpen the
>>>>> > cases presented here
>>>>> >
>>>>> > https://github.com/scipy/scipy/wiki/Dropping-support-for-Accelerate
>>>>> >
>>>>> > The reason I specifically wanted to post this also in NumPy mailing
>

Re: [Numpy-discussion] Dropping support for Accelerate

2017-07-25 Thread Nathaniel Smith
On Tue, Jul 25, 2017 at 6:48 AM, Matthew Brett  wrote:
> On Tue, Jul 25, 2017 at 2:19 PM, Nathaniel Smith  wrote:
>> I updated the bit about OpenBLAS wheel with some more information on
>> the status of that work. It's not super important, but FYI.
>
> Maybe remove the bit (of my text) that you crossed out, or removed the
> strikethrough and qualify?  At the moment it's confusing, because I
> believe what I wrote is correct, so leaving in there and crossed out
> looks kinda weird.

Eh, it's a little weird because there's no specification needed
really, we can implement it any time we want to. It was stalled for a
long time because I ran into arcane technical problems dealing with
the MacOS linker, but that's solved and now it's just stalled due to
lack of attention.

I deleted the text but feel free to qualify further if you think it's useful.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Dropping support for Accelerate

2017-07-25 Thread Nathaniel Smith
On Tue, Jul 25, 2017 at 7:05 AM, Matthew Brett  wrote:
> On Tue, Jul 25, 2017 at 3:00 PM, Nathaniel Smith  wrote:
>> On Tue, Jul 25, 2017 at 6:48 AM, Matthew Brett  
>> wrote:
>>> On Tue, Jul 25, 2017 at 2:19 PM, Nathaniel Smith  wrote:
>>>> I updated the bit about OpenBLAS wheel with some more information on
>>>> the status of that work. It's not super important, but FYI.
>>>
>>> Maybe remove the bit (of my text) that you crossed out, or removed the
>>> strikethrough and qualify?  At the moment it's confusing, because I
>>> believe what I wrote is correct, so leaving in there and crossed out
>>> looks kinda weird.
>>
>> Eh, it's a little weird because there's no specification needed
>> really, we can implement it any time we want to. It was stalled for a
>> long time because I ran into arcane technical problems dealing with
>> the MacOS linker, but that's solved and now it's just stalled due to
>> lack of attention.
>>
>> I deleted the text but feel free to qualify further if you think it's useful.
>
> Are you saying that we should consider this specification approved
> already?  Or that we should go ahead without waiting for approval?  I
> guess the latter.  I guess you're saying you think there would be no
> bad consequences for doing this if the spec subsequently changed
> before being approved?  It might be worth adding something like that
> to the text, in case there's somebody who wants to do some work on
> that.

It's not a PEP. It will never be approved because there is no-one to
approve it :-). The only reason for writing it as a spec is to
potentially help coordinate with others who want to get in on making
these kinds of packages themselves, and the main motivator for that
will be if one of us starts doing it and proves it works...

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ENH: ratio function to mimic diff

2017-07-29 Thread Nathaniel Smith
I'd also like to see a more detailed motivation for this.

And, if it is useful, then that would make 3 operations that have special
case pairwise moving window variants (subtract, floor_divide, true_divide).
3 is a lot of special cases. Should there instead be a generic mechanism
for doing this for arbitrary binary operations?

-n

On Jul 28, 2017 3:25 PM, "Joseph Fox-Rabinovitz" 
wrote:

> I have created PR#9481 to introduce a `ratio` function that behaves very
> similarly to `diff`, except that it divides successive elements instead of
> subtracting them. It has some handling built in for zero division, as well
> as the ability to select between `/` and `//` operators.
>
> There is currently no masked version. Perhaps someone could suggest a
> simple mechanism for hooking np.ma.true_divide and np.ma.floor_divide in as
> the operators instead of the regular np.* versions.
>
> Please let me know your thoughts.
>
> Regards,
>
> -Joe
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Why are empty arrays False?

2017-08-18 Thread Nathaniel Smith
On Fri, Aug 18, 2017 at 2:45 PM, Michael Lamparski
 wrote:
> Greetings, all.  I am troubled.
>
> The TL;DR is that `bool(array([])) is False` is misleading, dangerous, and
> unnecessary. Let's begin with some examples:
>
> >>> bool(np.array(1))
> True
> >>> bool(np.array(0))
> False
> >>> bool(np.array([0, 1]))
> ValueError: The truth value of an array with more than one element is
> ambiguous. Use a.any() or a.all()
> >>> bool(np.array([1]))
> True
> >>> bool(np.array([0]))
> False
> >>> bool(np.array([]))
> False
>
> One of these things is not like the other.
>
> The first three results embody a design that is consistent with some of the
> most fundamental design choices in numpy, such as the choice to have
> comparison operators like `==` work elementwise.  And it is the only such
> design I can think of that is consistent in all edge cases. (see footnote 1)
>
> The next two examples (involving arrays of shape (1,)) are a straightforward
> extension of the design to arrays that are isomorphic to scalars.  I can't
> say I recall ever finding a use for this feature... but it seems fairly
> harmless.
>
> So how about that last example, with array([])?  Well... it's /kind of/ like
> how other python containers work, right? Falseness is emptiness (see
> footnote 2)...  Except that this is actually *a complete lie*, due to /all
> of the other examples above/!

Yeah, numpy tries to follow Python conventions, except sometimes you
run into these cases where it's trying to simultaneously follow two
incompatible extensions and things get... problematic.

> Here's what I would like to see:
>
> >>> bool(np.array([]))
> ValueError: The truth value of a non-scalar array is ambiguous. Use a.any()
> or a.all()
>
> Why do I care?  Well, I myself wasted an hour barking up the wrong tree
> while debugging some code when it turned out that I was mistakenly using
> truthiness to identify empty arrays. It just so happened that the arrays
> always contained 1 or 0 elements, so it /appeared/ to work except in the
> rare case of array([0]) where things suddenly exploded.

Yeah, we should probably deprecate and remove this (though it will
take some time).

> 2: np.array([[]]) is also False, which makes this an interesting sort of
> n-dimensional emptiness test; but if that's really what you're looking for,
> you can achieve this much more safely with `np.all(x.shape)` or
> `bool(x.flat)`

x.size is also useful for emptiness checking.
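
E.g.:

import numpy as np

a = np.array([0])    # not empty, but bool(a) is False
b = np.array([])     # actually empty
print(a.size == 0)   # False
print(b.size == 0)   # True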

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Why are empty arrays False?

2017-08-19 Thread Nathaniel Smith
On Fri, Aug 18, 2017 at 7:34 PM, Eric Firing  wrote:
> I don't agree.  I think the consistency between bool([]) and bool(array([]))
> is worth preserving.  Nothing you have shown is inconsistent with "Falseness
> is emptiness", which is quite fundamental in Python.  The inconsistency is
> in distinguishing between 1 element and more than one element.  To be
> consistent, bool(array([0])) and bool(array([0, 1])) should both be True.
> Contrary to the ValueError message, there need be no ambiguity, any more
> than there is an ambiguity in bool([1, 2]).

Yeah, this is a mess. But we're definitely not going to make
bool(array([0])) be True. That would break tons of code that currently
relies on the current behavior. And the current behavior does make
sense, in every case except empty arrays: bool broadcasts over the
array, and then, oh shoot, Python requires that bool's return value be
a scalar, so if this results in anything besides an array of size 1,
raise an error.

OTOH you can't really write code that depends on using the current
bool(array([])) semantics for emptiness checking, unless the only two
cases you care about are "empty" and "non-empty with exactly one
element and that element is truthy". So it's much less likely that
changing that will break existing code, plus any code that does break
was already likely broken in subtle ways.

The consistency-with-Python argument cuts two ways: if an array is a
container, then for consistency bool should do emptiness checking. If
an array is a bunch of scalars with broadcasting, then for consistency
bool should do truthiness checking on the individual elements and
raise an error on any array with size != 1. So we can't just rely on
consistency-with-Python to resolve the argument -- we need to pick one
:-). Though internal consistency within numpy would argue for the
latter option, because numpy almost always prefers the bag-of-scalars
semantics over the container semantics, e.g. for + and *, like Eric
Wieser mentioned. Though there are exceptions like iteration.

...Though actually, iteration and indexing by scalars tries to be
consistent with Python in yet a third way. They pretend that an array
is a unidimensional container holding a bunch of arrays:

In [3]: np.array([[1]])[0]
Out[3]: array([1])

In [4]: next(iter(np.array([[1]])))
Out[4]: array([1])

So according to this model, bool(np.array([])) should be False, but
bool(np.array([[]])) should be True (note that with lists, bool([[]])
is True). But alas:

In [5]: bool(np.array([])), bool(np.array([[]]))
Out[5]: (False, False)

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal - change to OpenBLAS for Windows wheels

2017-09-25 Thread Nathaniel Smith
Makes sense to me.

On Sep 25, 2017 05:54, "Matthew Brett"  wrote:

> Hi,
>
> I suggest we switch from ATLAS to OpenBLAS for our Windows wheels:
>
> * OpenBLAS is much faster, at least when Tony Kelman tested it last year
> [1];
> * We now have an automated Appveyor build for OpenBLAS [2, 3];
> * Tests are passing with 32-bit and 64-bit wheels [4];
> * The next Scipy release will have OpenBLAS wheels;
>
> Any objections / questions / alternatives?
>
> Cheers,
>
> Matthew
>
> [1] https://github.com/numpy/numpy/issues/5479#issuecomment-185033668
> [2] https://github.com/matthew-brett/build-openblas
> [3] https://ci.appveyor.com/project/matthew-brett/build-openblas
> [4] https://ci.appveyor.com/project/matthew-brett/numpy-wheels/build/1.0.50
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] numpy grant update

2017-10-18 Thread Nathaniel Smith
Hi all,

I wanted to give everyone an update on what's going on with the NumPy
grant [1]. As you may have noticed, things have been moving a bit
slower than originally hoped -- unfortunately my health is improving
but has continued to be rocky [2].

Fortunately, I have awesome co-workers, and BIDS has an institutional
interest/mandate for figuring out how to make these things happen, so
after thinking it over we've decided to reorganize how we're doing
things internally and split up the work to let me focus on the core
technical/community aspects without getting overloaded. Specifically,
Fernando Pérez and Jonathan Dugan [3] are taking on PI/administration
duties, Stéfan van der Walt will focus on handling day-to-day
management of the incoming hires, and Nelle Varoquaux & Jarrod Millman
will also be joining the team (exact details TBD).

This shouldn't really affect any of you, except that you might see
some familiar faces with @berkeley.edu emails becoming more engaged.
I'm still leading the Berkeley effort, and in any case it's still
ultimately the community and NumPy steering council who will be making
decisions about the project – this is just some internal details about
how we're planning to manage our contributions. But in the interest of
full transparency I figured I'd let you know what's happening.

In other news, the job ad to start the official hiring process has now
been submitted for HR review, so it should hopefully be up soon --
depending on how efficient the bureaucracy is. I'll definitely let
everyone know as soon as its posted.

I'll also be giving a lunch talk at BIDS tomorrow to let folks locally
know about what's going on, which I think will be recorded – I'll send
around a link after in case others are interested.

-n

[1] https://mail.python.org/pipermail/numpy-discussion/2017-May/076818.html
[2] https://vorpus.org/blog/emerging-from-the-underworld/
[3] https://bids.berkeley.edu/people/jonathan-dugan


-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] numpy grant update

2017-10-26 Thread Nathaniel Smith
On Wed, Oct 18, 2017 at 10:24 PM, Nathaniel Smith  wrote:
> I'll also be giving a lunch talk at BIDS tomorrow to let folks locally
> know about what's going on, which I think will be recorded – I'll send
> around a link after in case others are interested.

Here's that link: https://www.youtube.com/watch?v=fowHwlpGb34

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] numpy grant update

2017-10-26 Thread Nathaniel Smith
On Thu, Oct 26, 2017 at 1:14 PM, Marten van Kerkwijk
 wrote:
> Hi Nathaniel,
>
> Thanks for the link. The plans sounds great! You'll not be surprised
> to hear I'm particularly interested in the units aspect (and, no, I
> don't mind at all if we can stop subclassing ndarray...). Is the idea
> that there will be a general way for allow a dtype to define how to
> convert an array to one with another dtype? (Just as one now
> implicitly is able to convert between, say, int and float.) And, if
> so, is the idea that one of those conversion possibilities might
> involve checking units? Or were you thinking of implementing units
> more directly? The former would seem most sensible, if only so you can
> initially focus on other things than deciding how to support, say, esu
> vs emu units, or whether or not to treat radians as equal to
> dimensionless (which they formally are, but it is not always handy to
> do so).

Well, to some extent the answers here are going to be "you tell me"
:-). I'm not an expert in unit handling, and these plans are pretty
high-level right now -- there will be lots more discussions to work
out details once we've hired people and they're ramping up, and as we
work out the larger context around how to improve the dtype system.

But, generally, yeah, one of the things that a custom dtype will need
to be able to do is to hook into the casting and ufunc dispatch
systems. That means, when you define a dtype, you get to answer
questions like "can you cast yourself into float32 without loss of
precision?", or "can you cast yourself into int64, truncating values
if you have to?". (Or even, "can you cast yourself to ?", which would presumably trigger unit conversion.) And you'd
also get to define how things like overriding how np.add and
np.multiply work for your dtype -- it's already the case that ufuncs
have multiple implementations for different dtypes and there's
machinery to pick the best one; this would just be extending that to
these new dtypes as well.

One possible approach that I think might be particularly nice would be
to implement units as a "wrapper dtype". The idea would be that if we
have a standard interface that dtypes implement, then not only can you
implement those methods yourself to make a new dtype, but you can also
call those methods on an existing dtype. So you could do something
like:

class WithUnits(np.dtype):
    def __init__(self, inner_dtype, unit):
        self.inner_dtype = np.dtype(inner_dtype)
        self.unit = unit

    # Simple operations like bulk data copying are delegated to the inner dtype
    # (Invoked by arr.copy(), making temporary buffers for calculations, etc.)
    def copy_data(self, source, dest):
        return self.inner_dtype.copy_data(source, dest)

    # Other operations like casting can do some unit-specific stuff and then
    # delegate
    def cast_to(self, other_dtype, source, dest):
        if isinstance(other_dtype, WithUnits):
            if other_dtype.unit == self.unit:
                # Something like casting WithUnits(float64, meters) ->
                # WithUnits(float32, meters), so no unit trickiness needed;
                # delegate to the inner dtype to handle the storage
                # conversion (e.g. float64 -> float32)
                self.inner_dtype.cast_to(other_dtype.inner_dtype,
                                         source, dest)
        # ... other cases to handle unit conversion, etc. ...

And then as a user you'd use it like np.array([1, 2, 3],
dtype=WithUnits(float, meters)) or whatever. (Or some convenience
function that ultimately does this.)

This is obviously a hand-wavey sketch, I'm sure the actual details
will look very different. But hopefully it gives some sense of the
kind of possibilities here?

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] numpy grant update

2017-10-26 Thread Nathaniel Smith
On Thu, Oct 26, 2017 at 2:11 PM, Nathan Goldbaum  wrote:
> My understanding of this is that the dtype will only hold the unit metadata.
> So that means units would propogate through calculations automatically, but
> the dtype wouldn't be able to manipulate the array data (in an in-place unit
> conversion for example).

I think that'd be fine actually... dtypes have methods[1] that are
invoked to do any operation that involves touching the actual array
data. For example, when you copy array data from one place to another
(because someone called arr.copy(), or did x[...] = y, or because the
ufunc internals need to copy part of the array into a temporary bounce
buffer, etc.), you have to let the dtype do that, because only the
dtype knows how to safely copy entries of this dtype. (For many dtypes
it's just a simple (strided) memmove, but then for the object dtype
you have to take care of refcounting...)

Similarly, if your unit dtype implemented casting, then array(...,
dtype=WithUnits(float, meters)).astype(WithUnits(float, feet)) would
Just Work.

It looks like we don't currently expose a user-level API for doing
in-place dtype conversions, but there's no reason we can't add one;
all the underlying casting machinery already exists and works on
arbitrary memory buffers. (And in the mean time there's a cute trick
here [2] you could use to implement it yourself.) And if we do add
one, then you could use it equally well to do in-place conversion from
float64->int64 as for float64-in-meters to float64-in-feet.
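
Roughly the kind of trick I mean (an illustration only -- it relies on the
two dtypes having the same itemsize, and afterwards you keep using the view
rather than the original array):

import numpy as np

a = np.arange(4, dtype=np.float64)
b = a.view(np.int64)   # reinterpret the same buffer as int64
b[:] = a               # cast float64 -> int64 element-wise, written in place
print(b)               # [0 1 2 3], stored in the memory that `a` owns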

[1] Well, technically right now they're not methods, but instead a
bunch of instance attributes holding C level function pointers that
act like methods. But basically this is just an obfuscated way of
implementing methods; it made sense at the time, but in retrospect
making them use the more usual Python machinery for this will make
things easier.
[2] https://stackoverflow.com/a/4396247/

> In this world, astropy quantities and yt's YTArray would become containers
> around an ndarray that would make use of the dtype metadata but also
> implement all of the unit semantics that they already implement. Since they
> would become container classes and would no longer be ndarray subclasses,
> that avoids most of the pitfalls one encounters these days.

I don't think you'd need a container class for basic functionality,
but it might turn out to be useful for some kind of
convenience/backwards-compatibility issues. For example, right now
with Quantity you can do 'arr.unit' to get the unit and 'arr.value' to
get the raw values with units stripped. It should definitely be
possible to support these with spellings like 'arr.dtype.unit' and
'asarray(arr, dtype=float)' (or 'astropy.quantities.value(arr)'), but
maybe not the short array attribute based spellings? We'll have to
have the discussion about whether we want to provide some mechanism
for *dtypes* to add new attributes to the *ndarray* namespace.
(There's some precedent in numpy's built-in .real and .imag, but OTOH
this is a kind of 'import *' feature that can easily be confusing and
create backwards compatibility issues -- what if ndarray and the dtype
have a name clash? Keeping in mind that it could be a clash between a
third-party dtype we don't even know about and a new ndarray attribute
that didn't exist when the third-party dtype was created...)

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] is __array_ufunc__ ready for prime-time?

2017-11-07 Thread Nathaniel Smith
On Nov 6, 2017 4:19 PM, "Chris Barker"  wrote:

On Sat, Nov 4, 2017 at 6:47 AM, Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

>
> You just summarized excellently why I'm on a quest to change `asarray`
> to `asanyarray` within numpy


+1 -- we should all be using asanyarray() most of the time.


The problem is that if you use 'asanyarray', then you're claiming that your
code works correctly for:
- regular ndarrays
- np.matrix
- np.ma masked arrays
- and every third party subclass, regardless of their semantics, regardless
of whether you've heard of them or not

If subclasses followed the Liskov substitution principle, and had different
internal implementations but the same public ("duck") API, then this would
be fine. But in practice, numpy limitations mean that ndarray subclasses
have to have the same internal implementation, so the only reason to make
an ndarray subclass is if you want to make something with a different
public API. Basically the whole system is designed for subclasses to be
incompatible.

The end result is that if you use asanyarray, your code is definitely
wrong, because there's no way you're actually doing the right thing for
arbitrary ndarray subclasses. But if you don't use asanyarray, then yeah,
that's also wrong, because it won't work on mostly-compatible subclasses
like astropy's. Given this, different projects reasonably make different
choices -- it's not just legacy code that uses asarray. In the long run we
obviously need to come up with new options that don't have these tradeoffs
(that's why we want to let units to to dtypes, implement methods like
__array_ufunc__ to enable duck arrays, etc.) let's try to be sympathetic to
other projects that are doing their best :-).
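
To illustrate with a toy example (np.matrix is just the most convenient
subclass with different semantics):

import numpy as np

def norm(v):
    # "subclass-friendly": passes any ndarray subclass straight through
    v = np.asanyarray(v)
    return np.sqrt((v * v).sum())

print(norm(np.array([3.0, 4.0])))     # 5.0
print(norm(np.ma.array([3.0, 4.0])))  # 5.0 -- this subclass happens to work
print(norm(np.matrix([3.0, 4.0])))    # ValueError: '*' means matmul here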

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal of timeline for dropping Python 2.7 support

2017-11-07 Thread Nathaniel Smith
On Nov 7, 2017 2:15 PM, "Chris Barker"  wrote:

On Mon, Nov 6, 2017 at 6:14 PM, Charles R Harris 
wrote:

> Also -- if py2.7 continues to see the use I expect it will well past when
>>> python.org officially drops it, I wouldn't be surprised if a Python2.7
>>> Windows build based on a newer compiler would come along -- perhaps by
>>> Anaconda or conda-forge, or ???
>>>
>>
>> I suspect that this will indeed happen. I am aware of multiple companies
>> following this path already (building python + numpy themselves with a
>> newer MS compiler).
>>
>
> I think Anaconda is talking about distributing a compiler, but what that
> will be on windows is anyone's guess. When we drop 2.7, there is a lot of
> compatibility crud that it would be nice to get rid of, and if we do that
> then NumPy will no longer compile against 2.7. I suspect some companies
> have just been putting off the task of upgrading to Python 3, which should
> be pretty straight forward these days apart from system code that needs to
> do a lot of work with bytes.
>

I agree, and if there is a compelling reason to upgrade, folks WILL do it.
But I've been amazed over the years at folks' desire to stick with what
they have! And I'm guilty too, anything new I start with py3, but older
larger codebases are still py2, I just can't find the energy to spend a the
week or so it would probably take to update everything...

But in the original post, the Windows Compiler issue was mentioned, so
there seems to be two reasons to drop py2:

A) wanting to use py3 only features.
B) wanting to use newer C (C++?) compiler features.

I suggest we be clear about which of these is driving the decisions, and
explicit about the goals. That is, if (A) is critical, we don't even have
to talk about (B)

But we could choose to do (B) without doing (A) -- I suspect there will be
a user base for that


The problem is it's hard to predict the future. Right now neither PyPI nor
conda provide any way to distribute binaries for py27-but-with-a-newer-ABI,
and maybe they never will; or maybe they will eventually, but not enough
people use them to justify keeping py2 support given the other overheads;
or... who knows, really.

Right now, the decision in front of us is what to tell people who ask about
numpy's py2 support plans, so that they can make their own plans. Given
what we know right now, I don't think we should promise to keep support
past 2018. If we get there and the situation's changed, and there's both
desire and means to extend support we can revisit that. But's better to
under-promise and possibly over-deliver, instead of promising to support
py2 until after it becomes a millstone around our necks and then realizing
we haven't warned anyone and are stuck supporting it another year beyond
that...

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] deprecate updateifcopy in nditer operand, flags?

2017-11-08 Thread Nathaniel Smith
At a higher level:

The issue here is that we need to break the nditer API. This might
affect you if you np.nditer (in Python) or the NpyIter_* APIs (in C).
The exact cases affected are somewhat hard to describe because
nditer's flag processing is complicated [1], but basically it's cases
where you are writing to one of the arrays being iterated over and
then something else non-trivial happens.

The problem is that the API currently uses NumPy's odd UPDATEIFCOPY
feature. What it does is give you an "output" array which is not your
actual output array, but instead some other temporary array which you
can modify freely, and whose contents are later written back to your
actual output array.

When does this copy happen? Since this is an iterator, then most of
the time we can do the writeback for iteration N when we start
iteration N+1. However, this doesn't work for the final iteration. On
the final iteration, currently the writeback happens when the
temporary is garbage collected. *Usually* this happens pretty
promptly, but this is dependent on some internal details of how
CPython's garbage collector works that are explicitly not part of the
Python language spec, and on PyPy you silently and
non-deterministically get incorrect results. Plus it's error-prone
even on CPython -- if you accidentally have a dangling reference to
one array, then suddenly another array will have the wrong contents.

So we have two options:

- We could stop supporting this mode entirely. Unfortunately, it's
hard to know if anyone is using this, since the conditions to trigger
it are so complicated, and not necessarily very exotic (e.g. it can
happen if you have a function that uses nditer to read one array and
write to another, and then someone calls your function with two arrays
whose memory overlaps).

- We could adjust the API so that there's some explicit operation to
trigger the final writeback. At the Python level this would probably
mean that we start supporting the use of nditer as a context manager,
and eventually start raising an error if you're in one of the "unsafe"
cases and not using the context manager form. At the C level we
probably need some explicit "I'm done with this iterator now" call.

One question is which cases exactly should produce warnings/eventually
errors. At the Python level, I guess the simplest rule would be that
if you have any write/readwrite arrays in your iterator, then you have
to use a 'with' block. At the C level, it's a little trickier, because
it's hard to tell up-front whether someone has updated their code to
call a final cleanup function, and it's hard to emit a warning/error
on something that *doesn't* happen. (You could print a warning when
the nditer object is GCed if the cleanup function wasn't called, but
you can't raise an error there.) I guess the only reasonable option is
to deprecate NPY_ITER_READWRITE and NPY_ITER_WRITEONLY, and make people
switch to passing new flags that have the same semantics but also
promise that the user has updated their code to call the new cleanup
function.

Does that work? Any objections?
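
For concreteness, under the context-manager form the snippet quoted at the
bottom of this mail would become something like this (a sketch of the
proposed API, not something current releases accept):

import numpy as np

a = np.arange(24, dtype='f8').reshape(2, 3, 4).T
with np.nditer(a, [], [['readwrite', 'updateifcopy']],
               casting='same_kind', op_dtypes=[np.dtype('f4')]) as i:
    i.operands[0][2, 1, 1] = -12.5
# writeback happens deterministically when the 'with' block exits,
# instead of whenever the iterator happens to be garbage collected
assert a[2, 1, 1] == -12.5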

-n

[1] The affected cases are the ones that reach this line:

   
https://github.com/numpy/numpy/blob/c276f326b29bcb7c851169d34f4767da0b4347af/numpy/core/src/multiarray/nditer_constr.c#L2926

So it's something like
- all of these things are true:
  - you have a writable array (nditer flags "write" or "readwrite")
  - one of these things is true:
- you passed the "forcecopy" flag
- all of these things are true:
  - you requested casting
  - you requested updateifcopy
- there's a memory overlap between this array and another of the
arrays being iterated over

On Wed, Nov 8, 2017 at 11:31 AM, Matti Picus  wrote:
>
> Date: Wed, 8 Nov 2017 18:41:03 +0200
> From: Matti Picus 
> To: numpy-discussion@python.org
> Subject: [Numpy-discussion] deprecate updateifcopy in nditer operand
> flags?
> Message-ID: 
> Content-Type: text/plain; charset=utf-8; format=flowed
>
> I filed issue 9714 https://github.com/numpy/numpy/issues/9714 and wrote
> a mail in September trying to get some feedback on what to do with
> updateifcopy semantics and user-exposed nditer.
> It garnered no response, so I am trying again.
> For those who are unfamiliar with the issue see below for a short
> summary and issue 7054 for a lengthy discussion.
> Note that pull request 9639 which should be merged very soon changes the
> magical UPDATEIFCOPY into WRITEBACKIFCOPY, and hopefully will appear in
> NumPy 1.14.
>
> As I mention in the issue, there is a magical update done in this
> snippet in the next-to-the-last line:
>
> a = np.arange(24, dtype='f8').reshape(2, 3, 4).T
> i = np.nditer(a, [], [['readwrite', 'updateifcopy']],
>               casting='same_kind', op_dtypes=[np.dtype('f4')])
> # Check that UPDATEIFCOPY is activated
> i.operands[0][2, 1, 1] = -12.5
> assert a[2, 1, 1] != -12.5
> i = None  # magic!!!
> assert a[2, 1, 1] == -12.5
> 

Re: [Numpy-discussion] Proposal of timeline for dropping Python 2.7 support

2017-11-08 Thread Nathaniel Smith
On Nov 8, 2017 16:51, "Matthew Brett"  wrote:

Hi,

On Wed, Nov 8, 2017 at 7:08 PM, Julian Taylor
 wrote:
> On 06.11.2017 11:10, Ralf Gommers wrote:
>>
>>
>> On Mon, Nov 6, 2017 at 7:25 AM, Charles R Harris
>> mailto:charlesr.har...@gmail.com>> wrote:
>>
>> Hi All,
>>
>> Thought I'd toss this out there. I'm tending towards better sooner
>> than later in dropping Python 2.7 support as we are starting to run
>> up against places where we would like to use Python 3 features. That
>> is particularly true on Windows where the 2.7 compiler is really old
>> and lacks C99 compatibility.
>>
>>
>> This is probably the most pressing reason to drop 2.7 support. We seem
>> to be expending a lot of effort lately on this stuff. I was previously
>> advocating being more conservative than the timeline you now propose,
>> but this is the pain point that I think gets me over the line.
>
>
> Would dropping python2 support for windows earlier than the other
> platforms be a reasonable approach?
> I am not a big fan of dropping python2 support before 2020, but I
> have no issue with dropping python2 support on windows earlier as it is
> our largest pain point.

I wonder about this too.  I can imagine there are a reasonable number
of people using older Linux distributions on which they cannot upgrade
to a recent Python 3,


My impression is that this is increasingly rare, actually. I believe RHEL
is still shipping 2.6 by default, which we've already dropped support for,
and if you want RH python then they provide supported 2.7 and 3.latest
through exactly the same channels. Ubuntu 14.04 is end-of-life in April
2019, so pretty irrelevant if we're talking about 2019 for dropping
support, and 16.04 ships with 3.5. Plus with docker, conda, PPAs, etc.,
getting a recent python is easier than its ever been.

> but is that likely to be true for Windows?
>
> We'd have to make sure we could persuade pypi to give the older
> version for Windows, by default - I don't know if that is possible.


Currently it's not – if pip doesn't see a Windows wheel, it'll try
downloading and building an sdist. There's a mechanism for sdists to
declare what version of python they support (thanks to the jupyter
folks for implementing this), but that's all. The effect is that if we
release a version that drops support for py2 entirely, then 'pip install'
on py2 will continue to work and give the last supported version, but if we
release a version that drops py2 on Windows but keeps it on other platforms
then 'pip install' on py2 on Windows will just stop working entirely.

This is possible to fix – it's just software – but I'm not volunteering...

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal of timeline for dropping Python 2.7 support

2017-11-09 Thread Nathaniel Smith
On Nov 8, 2017 23:59, "Ralf Gommers"  wrote:

Regarding http://www.python3statement.org/: I'd say that as long as there
are people who want to spend their energy on the LTS release (contributors
*and* enough maintainer power to review/merge/release), we should not
actively prevent them from doing that.


Yeah, agreed. I don't feel like this is incompatible with the spirit of
python3statement.org, though looking at the text I can see how it's not
clear. My guess is they'd be happy to adjust the text, especially if it
lets them add numpy :-). CC'ing Thomas and Matthias.

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal of timeline for dropping Python 2.7 support

2017-11-09 Thread Nathaniel Smith
See Thomas's reply quoted below (it was rejected by the mailing list since
he's not subscribed):

On Nov 9, 2017 01:24, "Thomas Kluyver"  wrote:

On Thu, Nov 9, 2017, at 08:52 AM, Nathaniel Smith wrote:

On Nov 8, 2017 23:59, "Ralf Gommers"  wrote:

Regarding http://www.python3statement.org/: I'd say that as long as there
are people who want to spend their energy on the LTS release (contributors
*and* enough maintainer power to review/merge/release), we should not
actively prevent them from doing that.


Yeah, agreed. I don't feel like this is incompatible with the spirit of
python3statement.org, though looking at the text I can see how it's not
clear. My guess is they'd be happy to adjust the text, especially if it
lets them add numpy :-). CC'ing Thomas and Matthias.


Thanks Nathaniel. We have (IMO) left a degree of deliberate ambiguity
around precisely what 'drop support' means, because it's not going to be
the same for all projects. The nature of open source also means that there
can be ambiguity over what 'support' entails and who is considered part of
the project.

I would say that the idea of the statement is compatible with an LTS
release series receiving critical bugfixes beyond 2020, while the main
energy of the project is focused on Py3-only feature releases.

[If numpy-discussion doesn't allow non-member posts, feel free to pass this
on or quote it in on-list messages]

Thomas
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal of timeline for dropping Python 2.7 support

2017-11-09 Thread Nathaniel Smith
Fortunately we can wait until we're a bit closer before we have to
make any final decision on the version numbering :-)

Right now though it would be good to start communicating to
users/downstreams about whatever our plans are, so they can
make plans. Here's a first attempt at some text we can put in the
documentation and point people to -- any thoughts, on either the plan
or the wording?

 DRAFT TEXT - NOT FINAL - DO NOT POST THIS TO HACKERNEWS OK? OK 

The Python core team plans to stop supporting Python 2 in 2020. The
NumPy project has supported both Python 2 and Python 3 in parallel
since 2010, and has found that supporting Python 2 is an increasing
burden on our limited resources; thus, we plan to eventually drop
Python 2 support as well. Now that we're entering the final years of
community-supported Python 2, the NumPy project wants to clarify our
plans, with the goal of helping our downstream ecosystem make plans
and accomplish the transition with as little disruption as possible.

Our current plan is as follows:

Until **December 31, 2018**, all NumPy releases will fully support
both Python 2 and Python 3.

Starting on **January 1, 2019**, any new feature releases will support
only Python 3.

The last Python-2-supporting release will be designated as a long-term
support (LTS) release, meaning that we will continue to merge
bug-fixes and make bug-fix releases for a longer period than usual.
Specifically, it will be supported by the community until **December
31, 2019**.

On **January 1, 2020** we will raise a toast to Python 2, and
community support for the last Python-2-supporting release will come
to an end. However, it will continue to be available on PyPI
indefinitely, and if any commercial vendors wish to extend the LTS
support past this point then we are open to letting them use the LTS
branch in the official NumPy repository to coordinate that.

If you are a NumPy user who requires ongoing Python 2 support in 2020
or later, then please contact your vendor. If you are a vendor who
wishes to continue to support NumPy on Python 2 in 2020+, please get
in touch; ideally we'd like you to get involved in maintaining the LTS
before it actually hits end-of-life, so we can make a clean handoff.

To minimize disruption, running 'pip install numpy' on Python 2 will
continue to give the last working release in perpetuity; but after
January 1, 2019 it may not contain the latest features, and after
January 1, 2020 it may not contain the latest bug fixes.

For more information on the scientific Python ecosystem's transition
to Python-3-only, see: http://www.python3statement.org/

For more information on porting your code to run on Python 3, see:
https://docs.python.org/3/howto/pyporting.html



Thoughts?

-n

On Thu, Nov 9, 2017 at 12:53 PM, Marten van Kerkwijk
 wrote:
> In astropy we had a similar discussion about version numbers, and
> decided to make 2.0 the LTS that still supports python 2.7 and 3.0 the
> first that does not.  If we're discussing jumping a major number, we
> could do the same for numpy.  (Admittedly, it made a bit more sense
> with the numbering scheme astropy had adopted anyway.) -- Marten
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion



-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] deprecate updateifcopy in nditer operand, flags?

2017-11-10 Thread Nathaniel Smith
On Wed, Nov 8, 2017 at 2:13 PM, Allan Haldane  wrote:
> On 11/08/2017 03:12 PM, Nathaniel Smith wrote:
>> - We could adjust the API so that there's some explicit operation to
>> trigger the final writeback. At the Python level this would probably
>> mean that we start supporting the use of nditer as a context manager,
>> and eventually start raising an error if you're in one of the "unsafe"
>> case and not using the context manager form. At the C level we
>> probably need some explicit "I'm done with this iterator now" call.
>>
>> One question is which cases exactly should produce warnings/eventually
>> errors. At the Python level, I guess the simplest rule would be that
>> if you have any write/readwrite arrays in your iterator, then you have
>> to use a 'with' block. At the C level, it's a little trickier, because
>> it's hard to tell up-front whether someone has updated their code to
>> call a final cleanup function, and it's hard to emit a warning/error
>> on something that *doesn't* happen. (You could print a warning when
>> the nditer object is GCed if the cleanup function wasn't called, but
>> you can't raise an error there.) I guess the only reasonable option is
>> to deprecate NPY_ITER_READWRITE and NP_ITER_WRITEONLY, and make people
>> switch to passing new flags that have the same semantics but also
>> promise that the user has updated their code to call the new cleanup
>> function.
> Seems reasonable.
>
> When people use the Nditer C-api, they (almost?) always call
> NpyIter_Dealloc when they're done. Maybe that's a place to put a warning
> for C-api users. I think you can emit a warning there since that
> function calls the GC, not the other way around.
>
> It looks like you've already discussed the possibilities of putting
> things in NpyIter_Dealloc though, and it could be tricky, but if we only
> need a warning maybe there's a way.
> https://github.com/numpy/numpy/pull/9269/files/6dc0c65e4b2ea67688d6b617da3a175cd603fc18#r127707149

Oh, hmm, yeah, on further examination there are some more options here.

I had missed that for some reason NpyIter isn't actually a Python
object, so actually it's never subject to GC and you always need to
call NpyIter_Deallocate when you are finished with it. So that's a
natural place to perform writebacks. We don't even need a warning.
(Which is good, because warnings can be set to raise errors, and while
the docs say that NpyIter_Deallocate can fail, in fact it never has
been able to in the past and none of the code in numpy or the examples
in the docs actually check the return value. Though I guess in theory
writeback can also fail so I suppose we need to start returning
NPY_FAIL in that case. But it should be vanishingly rare in practice,
and it's not clear if anyone is even using this API outside of numpy.)

And for the Python-level API, there is the option of performing the
final writeback when the iterator is exhausted. The downside to this
is that if someone only goes half-way through the iteration and then
aborts (e.g. by raising an exception), then the last round of
writeback won't happen. But maybe that's fine, or at least better than
forcing the use of 'with' blocks everywhere? If we do this then I
think we'd at least want to make sure that the writeback really never
happens, as opposed to happening at some random later point when the
Python iterator object is GCed. But I'd appreciate if anyone would
express a preference between these :-)
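
For reference, the 'with' form being discussed would look something like
this at the Python level (a sketch of the proposed API, not something
that exists yet):

    import numpy as np

    a = np.arange(6.0).reshape(2, 3)
    with np.nditer(a, op_flags=['readwrite']) as it:
        for x in it:
            x[...] = 2 * x
    # exiting the 'with' block triggers the final writeback into 'a',
    # even if the iterator used a buffered copy internally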

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal of timeline for dropping Python 2.7 support

2017-11-12 Thread Nathaniel Smith
On Nov 12, 2017 1:12 PM, "Todd"  wrote:


Might it make sense to do this in a synchronized manner with scipy?  So
both numpy and scipy drop support for python 2 on the first release after
December 31 2018, and numpy's first python3-only release comes before (or
simultaneously with) scipy's. Then scipy can set is minimum supported numpy
version to be the first python3-only version.

That allows scipy to have a clean, obvious point where scipy supports only
the latest numpy. This will diverge later, but it seems to be a relatively
safe place to bring them back into sync.


That's really a question for the scipy devs on the scipy mailing list.
There's substantial overlap between the numpy and scipy communities, but
not everyone is on both lists and they're distinct projects that sometimes
have unique issues to worry about.

I'd like to see numpy's downstream projects become more aggressive about
dropping support for old numpy versions in general, but there's no
technical reason that scipy's first 3-only release couldn't continue to
support one or more numpy 2+3 releases.

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal of timeline for dropping Python 2.7 support

2017-11-13 Thread Nathaniel Smith
On Nov 13, 2017 12:03, "Gael Varoquaux" 
wrote:

On Mon, Nov 13, 2017 at 10:26:31AM -0800, Matthias Bussonnier wrote:
> This behavior is "new" (Nov/Dec 2016). [snip]
> It _does_ require to have a version of pip which is not decades old

Just to check that I am not misunderstanding: the version of pip should
not be more than a year old; "decades old" is just French hyperbole? Do I
understand right?


Right, the requirement is pip 9, which is currently one year old and will
be >2 years old by the time this matters for numpy.

It does turn out that there's a bimodal distribution in the wild, where
people tend to either use an up to date pip, or else use some truly ancient
pip that some Linux LTS distro shipped 5 years ago. Numpy isn't the only
project that will be forcing people to upgrade, though, so I think this
will work itself out. Especially since in the broken case what happens is
that users end up running our setup.py on an unsupported version of python,
so we'll be able to detect that and print some loud and informative message.
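
For concreteness, the guard would be something along these lines at the
top of setup.py (just a sketch of the kind of message meant here; the
exact version cutoff is hypothetical):

    import sys

    if sys.version_info[:2] < (3, 4):
        raise RuntimeError(
            "This release of NumPy supports Python 3.4+ only. On Python 2, "
            "upgrade to pip >= 9 so that it automatically selects an older, "
            "compatible NumPy release."
        )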

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] numpy grant update

2017-11-13 Thread Nathaniel Smith
On Thu, Oct 26, 2017 at 12:40 PM, Nathaniel Smith  wrote:
> On Wed, Oct 18, 2017 at 10:24 PM, Nathaniel Smith  wrote:
>> I'll also be giving a lunch talk at BIDS tomorrow to let folks locally
>> know about what's going on, which I think will be recorded – I'll send
>> around a link after in case others are interested.
>
> Here's that link: https://www.youtube.com/watch?v=fowHwlpGb34

Still no update on that job ad (though we're learning interesting
things about Berkeley's HR system!), but we did make a little scratch
repo to start brainstorming. This is mostly for getting our own
thoughts in order, but if anyone's curious then here it is:

https://github.com/njsmith/numpy-grant-planning/

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal of timeline for dropping Python 2.7 support

2017-11-14 Thread Nathaniel Smith
Apparently this is actually uncontroversial, the discussion's died
down (see also the comments on Chuck's PR [1]), and anyone who wanted
to object has had more than a week to do so, so... I guess we can say
this is what's happening and start publicizing it to our users!

A direct link to the rendered NEP in the repo is:
https://github.com/numpy/numpy/blob/master/doc/neps/dropping-python2.7-proposal.rst

(I guess that at some point it will also show up on docs.scipy.org.)

-n

[1] https://github.com/numpy/numpy/pull/10006

On Thu, Nov 9, 2017 at 5:52 PM, Nathaniel Smith  wrote:
> Fortunately we can wait until we're a bit closer before we have to
> make any final decision on the version numbering :-)
>
> Right now though it would be good to start communicating to
> users/downstreams about whatever our plans are, so they can
> make plans. Here's a first attempt at some text we can put in the
> documentation and point people to -- any thoughts, on either the plan
> or the wording?
>
>  DRAFT TEXT - NOT FINAL - DO NOT POST THIS TO HACKERNEWS OK? OK 
>
> The Python core team plans to stop supporting Python 2 in 2020. The
> NumPy project has supported both Python 2 and Python 3 in parallel
> since 2010, and has found that supporting Python 2 is an increasing
> burden on our limited resources; thus, we plan to eventually drop
> Python 2 support as well. Now that we're entering the final years of
> community-supported Python 2, the NumPy project wants to clarify our
> plans, with the goal of helping our downstream ecosystem make plans
> and accomplish the transition with as little disruption as possible.
>
> Our current plan is as follows:
>
> Until **December 31, 2018**, all NumPy releases will fully support
> both Python 2 and Python 3.
>
> Starting on **January 1, 2019**, any new feature releases will support
> only Python 3.
>
> The last Python-2-supporting release will be designated as a long-term
> support (LTS) release, meaning that we will continue to merge
> bug-fixes and make bug-fix releases for a longer period than usual.
> Specifically, it will be supported by the community until **December
> 31, 2019**.
>
> On **January 1, 2020** we will raise a toast to Python 2, and
> community support for the last Python-2-supporting release will come
> to an end. However, it will continue to be available on PyPI
> indefinitely, and if any commercial vendors wish to extend the LTS
> support past this point then we are open to letting them use the LTS
> branch in the official NumPy repository to coordinate that.
>
> If you are a NumPy user who requires ongoing Python 2 support in 2020
> or later, then please contact your vendor. If you are a vendor who
> wishes to continue to support NumPy on Python 2 in 2020+, please get
> in touch; ideally we'd like you to get involved in maintaining the LTS
> before it actually hits end-of-life, so we can make a clean handoff.
>
> To minimize disruption, running 'pip install numpy' on Python 2 will
> continue to give the last working release in perpetuity; but after
> January 1, 2019 it may not contain the latest features, and after
> January 1, 2020 it may not contain the latest bug fixes.
>
> For more information on the scientific Python ecosystem's transition
> to Python-3-only, see: http://www.python3statement.org/
>
> For more information on porting your code to run on Python 3, see:
> https://docs.python.org/3/howto/pyporting.html
>
> 
>
> Thoughts?
>
> -n
>
> On Thu, Nov 9, 2017 at 12:53 PM, Marten van Kerkwijk
>  wrote:
>> In astropy we had a similar discussion about version numbers, and
>> decided to make 2.0 the LTS that still supports python 2.7 and 3.0 the
>> first that does not.  If we're discussing jumping a major number, we
>> could do the same for numpy.  (Admittedly, it made a bit more sense
>> with the numbering scheme astropy had adopted anyway.) -- Marten
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>
>
> --
> Nathaniel J. Smith -- https://vorpus.org



-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Upcoming revision of the BLAS standard

2017-11-14 Thread Nathaniel Smith
Hi NumPy and SciPy developers,

Apparently there is some work afoot to update the BLAS standard, with
a working document here:

https://docs.google.com/document/d/1DY4ImZT1coqri2382GusXgBTTTVdBDvtD5I14QHp9OE/edit

This seems like something where we might want to get involved in, so
that the new standard works for us, and James Demmel (the first author
on that proposal and a professor here at Berkeley) suggested they'd be
interested to hear our thoughts.

I'm not sure exactly what the process is here -- apparently there have
been some workshops, and there was going to be a BoF today at
Supercomputing, but I don't know what the schedule is or how they'll
be making decisions. It's possible for anyone interested to click on
that google doc above and make "suggestions", but it seems like maybe
it would be useful for the NumPy/SciPy teams to come up with some sort
of shared document on what we want?

I'm really, really not the biggest linear algebra expert on these
lists, so I'm hoping those with more experience will jump in, but to
get started here are some initial ideas for things we might want to
ask for:

- Support for arbitrary strided memory layout
- Replacing xerbla with proper error codes (already in that proposal)
- There's some discussion about NaN handling where I think we might
have opinions. (Am I remember right that currently we have to check
for NaNs ourselves all the time because there are libraries that blow
up if we don't, and we don't know which ones those are?)
- Where the spec ends up giving implementors flexibility, some way to
detect at compile time what options they chose.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Type annotations for NumPy

2017-11-25 Thread Nathaniel Smith
On Sat, Nov 25, 2017 at 3:09 PM, Juan Nunez-Iglesias  wrote:
> This is a complete outsider’s perspective but
>
> (a) it would be good if NumPy type annotations could include an “array_like”
> type that allows lists, tuples, etc.

I'm sure this will exist.

> (b) I’ve always thought (since PEP561) that it would be cool for type
> annotations to replace compiler type annotations for e.g. Cython and Numba.
> Is this in the realm of possibility for the future?

It turns out that the PEP 484 type system is *mostly* not useful for
this. It's really designed for checking consistency across a large
code-base, not for enabling compiler speedups. For example, if you
annotate something as an int, that means "this object is an instance of
int (or a subclass)". This is enough to let mypy catch your mistake if you
accidentally pass in a float instead, but it's not enough to tell you
anything at all about the object's behavior -- you could make a wacky
int subclass that acts like a string or something.
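
To illustrate (a toy sketch, nothing to do with numpy's actual stubs):

    class WackyInt(int):
        def __add__(self, other):
            return str(self) + str(other)   # "acts like a string"

    def add_one(x: int) -> int:
        return x + 1

    # The annotation only promises isinstance(x, int); WackyInt satisfies
    # that while behaving nothing like a machine integer -- at runtime
    # add_one(WackyInt(41)) returns the string "411".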

Probably there are some benefits that compilers can get from PEP 484
annotations, but you should think of them as largely an orthogonal
thing.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Deprecate matrices in 1.15 and remove in 1.17?

2017-11-30 Thread Nathaniel Smith
On Thu, Nov 30, 2017 at 11:39 AM, Charles R Harris
 wrote:
>
>
> On Thu, Nov 30, 2017 at 11:43 AM, Ralf Gommers 
> wrote:
>> I'd suggest any release in the next couple of years is fine,but the one
>> where we drop Python 2 support is probably the worst choice. That's one of
>> the few things the core Python devs got 100% right with the Python 3 move:
>> advocate that in the 2->3 transition packages would not make any API changes
>> in order to make porting the least painful.
>
>
> Agree, we don't want to pile in too many changes at once. I think the big
> sticking point is the sparse matrices in SciPy, even issuing a
> DeprecationWarning could be problematic as long as there are sparse
> matrices. May I suggest that we put together an NEP for the NumPy side of
> things? Ralf, does SciPy have a mechanism for proposing such changes?

Agreed here as well... while I want to get rid of np.matrix as much as
anyone, doing that anytime soon would be *really* disruptive.

- There are tons of little scripts out there written by people who
didn't know better; we do want them to learn not to use np.matrix but
breaking all their scripts is a painful way to do that

- There are major projects like scikit-learn that simply have no
alternative to using np.matrix, because of scipy.sparse.

So I think the way forward is something like:

- Now or whenever someone gets together a PR: issue a
PendingDeprecationWarning in np.matrix.__init__ (unless it kills
performance for scikit-learn and friends), and put a big warning box
at the top of the docs. The idea here is to not actually break
anyone's code, but start to get out the message that we definitely
don't think anyone should use this if they have any alternative. (A
minimal sketch of what this could look like follows after this list.)

- After there's an alternative to scipy.sparse: ramp up the warnings,
possibly all the way to FutureWarning so that existing scripts don't
break but they do get noisy warnings

- Eventually, if we think it will reduce maintenance costs: split it
into a subpackage
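
To make that first step concrete, the user-visible effect would be
roughly this (a sketch; note that PendingDeprecationWarning is silenced
by default, so only people who opt in would actually see it):

    import warnings
    import numpy as np

    warnings.simplefilter("always", PendingDeprecationWarning)
    m = np.matrix([[1, 2], [3, 4]])   # under this plan, would warn here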

I expect that one way or another we'll be maintaining matrix for quite
some time, and I agree with whoever said that most of the burden seems
to be in keeping the rest of numpy working sensibly with it, so I
don't think moving it into a subpackage is itself going to make a big
different either way. To me the logic is more like, if/when we decide
to actually break everyone's code by making `np.matrix` raise
AttributeError, then we should probably provide some package they can
import to get their code limping along again, and if we're going to do
that anyway then probably we should split it out first and shake out
any bugs before we make `np.matrix` start raising errors. But it's
going to be quite some time until we reach the "break everyone's code"
stage, given just how much code is out there using matrix, so there's
no point in making detailed plans right now.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Type annotations for NumPy

2017-12-05 Thread Nathaniel Smith
On Tue, Dec 5, 2017 at 10:04 AM, Stephan Hoyer  wrote:
> This discussion has died down, but I don't want to lose momentum .
>
> It sounds like there is at least strong interest from a subset of our
> community in type annotations. Are there any objections to the first part of
> my plan, to start developing type stubs for NumPy in separate repository?

I think there's been plenty of time for folks to object to this if
they wanted, so we can assume consensus until we hear otherwise.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP process update

2017-12-05 Thread Nathaniel Smith
On Tue, Dec 5, 2017 at 4:12 PM, Ralf Gommers  wrote:
> On Wed, Dec 6, 2017 at 12:31 PM, Jarrod Millman 
> wrote:
>> Assuming that sounds good, my tentative next steps are:
>>
>> - I'll draft a purpose and process NEP based on PEP 1 and a few other
>> projects.
>> - I'll also create a draft NEP template.
>
>
> sounds good
>
>> - I'll move the NEPs into their own repo (something like numpy/neps),
>
> This doesn't sound ideal to me - NEPs are important pieces of documentation,
> so I'd rather keep them included in the main docs.
>
>>   and set up an automated system (RTD or Github pages) to
>>   render and publish them with some useful index.
>
>
> If you could copy over the scipy method to rebuild the docs on each merge
> into master, that would achieve the same purpose. Compare
> https://docs.scipy.org/doc/numpy-dev/reference/ (outdated) vs
> https://docs.scipy.org/doc/scipy-dev/reference/ (redirects to
> http://scipy.github.io/devdocs/, always up-to-date).

Yeah, we were debating back and forth on this -- I can see arguments
either way. The reasons we were leaning towards splitting them out
are:

- it would be great to make our regular docs auto-generated, but we
didn't necessarily want to block this on that
- part of the idea is to auto-generate the NEP index out of the
metadata inside each NEP file, which is going to involve writing some
code and integrating it into the NEP build. This seems easier if we
don't have to integrate it into the overall doc build process too,
which already has a lot of custom code.
- NEPs are really part of the development process, not an output for
end-users -- they're certainly useful to have available as a
reference, but if we're asking end-users to look at them on a regular
basis then I think we've messed up and should improve our actual
documentation :-)
- NEPs have a different natural life-cycle than numpy itself. Right
now, if I google "numpy neps", the first hit is the 1.13 version of
the NEPs, and the third hit is someone else's copy of the 1.9 version
of the NEPs. What you actually want in every case is the latest
development version of the NEPs, and the idea of "numpy 1.13 NEPs"
doesn't even make sense, because NEPs are not describing a specific
numpy release.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP process update

2017-12-05 Thread Nathaniel Smith
On Tue, Dec 5, 2017 at 5:32 PM, Ralf Gommers  wrote:
>
>
> On Wed, Dec 6, 2017 at 1:49 PM, Nathaniel Smith  wrote:
>> - NEPs are really part of the development process, not an output for
>> end-users -- they're certainly useful to have available as a
>> reference, but if we're asking end-users to look at them on a regular
>> basis then I think we've messed up and should improve our actual
>> documentation :-)
>> - NEPs have a different natural life-cycle than numpy itself. Right
>> now, if I google "numpy neps", the first hit is the 1.13 version of
>> the NEPs, and the third hit is someone else's copy of the 1.9 version
>> of the NEPs. What you actually want in every case is the latest
>> development version of the NEPs, and the idea of "numpy 1.13 NEPs"
>> doesn't even make sense, because NEPs are not describing a specific
>> numpy release.
>
>
> The last two points are good arguments, I agree that they shouldn't serve as
> documentation. A separate repo has downsides though (discoverability etc.),
> we also keep our dev docs within the numpy repo and you can make exactly the
> same argument about those as about NEPs. So I'd still suggest keeping them
> where they are. Or otherwise move all development related docs.

Are these the dev docs you're thinking of?
https://docs.scipy.org/doc/numpy-dev/dev/index.html

Regarding discoverability, right now it looks like the only way to
find the latest NEPs on google is by searching for something like
"numpy-dev neps", which is pretty obscure. (It took me 4 tries to find
something that worked. "numpy neps" seemed to work, but actually sent
me to an out-of-date snapshot.) In Python, the PEP web pages are
rebuilt on something like a 6 hour timer, and it's actually super
annoying, because it means that when someone posts to the list like
"hey, I just pushed a new version, tell me what you think", everyone
goes and finds the old stale version, sometimes people start
critiquing it, ... it's just confusing all around. So I do think we
want to make sure there's some simple way to find them, and that it
leads to the latest version, not a stale build or an old snapshot.

Moving NEPs + development docs to their own dedicated repo would
resolve this and seems like a plausible option to me. We could
probably do better than we are now with the regular docs too. Though
the experience with PEPs does make me a bit nervous about having
versioned snapshots of the NEPs in all our old versioned manuals
(which have tons of google-juice).

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Which rule makes x[np.newaxis, :] and x[np.newaxis] equivalent?

2017-12-12 Thread Nathaniel Smith
On Tue, Dec 12, 2017 at 12:02 AM, Joe  wrote:
> Hi,
>
> question says it all. I looked through the basic and advanced indexing,
> but I could not find the rule that is applied to make
> x[np.newaxis,:] and x[np.newaxis] the same.

I think it's the general rule that all indexing expressions have an
invisible "..." on the right edge. For example, x[i][j][k] is an
inefficient and IMO somewhat confusing way to write x[i, j, k],
because x[i][j][k] is interpreted as:

-> x[i, ...][j, ...][k, ...]
-> x[i, :, :][j, :][k]

That this also applies to newaxis is a little surprising, but I guess
consistent.
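
A quick way to see it (just a sketch of the behavior described above):

    import numpy as np

    x = np.arange(3)
    x[np.newaxis, :].shape    # (1, 3)
    x[np.newaxis].shape       # (1, 3) too, i.e. x[np.newaxis, ...]
    x[np.newaxis, ...].shape  # (1, 3)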

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] building numpy with python3.7

2017-12-15 Thread Nathaniel Smith
Try upgrading cython?

On Fri, Dec 15, 2017 at 2:11 AM, Hannes Breytenbach  wrote:
> Hi devs!
>
> This is my first post to the discussion list!
>
> Has anyone tried to build numpy with python3.7.0a3?
>
> I get the following gcc errors during compile:
>
> .
> .
> .
>
> compiling C sources
> C compiler: gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g 
> -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC
>
> compile options: '-D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1 
> -D_LARGEFILE64_SOURCE=1 -Inumpy/core/include 
> -Ibuild/src.linux-x86_64-3.7/numpy/core/include/numpy 
> -Inumpy/core/src/private -Inumpy/core/src -Inumpy/core 
> -Inumpy/core/src/npymath -Inumpy/core/src/multiarray -Inumpy/core/src/umath 
> -Inumpy/core/src/npysort -I/usr/local/include/python3.7m 
> -Ibuild/src.linux-x86_64-3.7/numpy/core/src/private 
> -Ibuild/src.linux-x86_64-3.7/numpy/core/src/npymath 
> -Ibuild/src.linux-x86_64-3.7/numpy/core/src/private 
> -Ibuild/src.linux-x86_64-3.7/numpy/core/src/npymath 
> -Ibuild/src.linux-x86_64-3.7/numpy/core/src/private 
> -Ibuild/src.linux-x86_64-3.7/numpy/core/src/npymath -c'
> gcc: numpy/random/mtrand/mtrand.c
> numpy/random/mtrand/mtrand.c: In function ‘__Pyx__ExceptionSave’:
> numpy/random/mtrand/mtrand.c:40970:19: error: ‘PyThreadState {aka struct 
> _ts}’ has no member named ‘exc_type’; did you mean ‘curexc_type’?
>  *type = tstate->exc_type;
>^~
> .
> .
> .
>
> error: Command "gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG 
> -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -D_FILE_OFFSET_BITS=64 
> -D_LARGEFILE_SOURCE=1 -D_LARGEFILE64_SOURCE=1 -Inumpy/core/include 
> -Ibuild/src.linux-x86_64-3.7/numpy/core/include/numpy 
> -Inumpy/core/src/private -Inumpy/core/src -Inumpy/core 
> -Inumpy/core/src/npymath -Inumpy/core/src/multiarray -Inumpy/core/src/umath 
> -Inumpy/core/src/npysort -I/usr/local/include/python3.7m 
> -Ibuild/src.linux-x86_64-3.7/numpy/core/src/private 
> -Ibuild/src.linux-x86_64-3.7/numpy/core/src/npymath 
> -Ibuild/src.linux-x86_64-3.7/numpy/core/src/private 
> -Ibuild/src.linux-x86_64-3.7/numpy/core/src/npymath 
> -Ibuild/src.linux-x86_64-3.7/numpy/core/src/private 
> -Ibuild/src.linux-x86_64-3.7/numpy/core/src/npymath -c 
> numpy/random/mtrand/mtrand.c -o 
> build/temp.linux-x86_64-3.7/numpy/random/mtrand/mtrand.o -MMD -MF 
> build/temp.linux-x86_64-3.7/numpy/random/mtrand/mtrand.o.d" failed with exit 
> status 1
>
>
>
> Version info:
> -
> gcc --version: gcc (Ubuntu 6.3.0-12ubuntu2) 6.3.0 2017040
> numpy version: 1.15.0.dev0+d233e1f
>
>
> The same error comes up when building via pip.  I don't know enough about the 
> underlying C code to know how to debug this. Any help would be greatly 
> appreciated!
>
> Cheers,
>
> --
> Hannes Breytenbach
>
> PhD Candidate
> South African Astronomical Observatory
> +27 82 726 9311
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion



-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] building numpy with python3.7

2017-12-15 Thread Nathaniel Smith
On Fri, Dec 15, 2017 at 2:42 AM, Hannes Breytenbach  wrote:
>
> I don't think this is a cython version issue - cloned the latest version from 
> git yesterday...
>
> python3.7 -c "import cython; print(cython.__version__)"
> 0.28a0

It is a cython version issue: https://github.com/cython/cython/issues/1955

It's supposed to be fixed though, so I don't know why it isn't
working for you. Are you sure that cython is installed in the same
virtualenv as you're using to build numpy? If you were using a numpy
sdist then it would make sense because we include the pre-generated .c
files in the sdists instead of running cython, but given that you're
building from a numpy checkout I dunno.
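
One quick check (a sketch): ask the same interpreter you're building with
which Cython it actually imports, e.g.

    python3.7 -c "import Cython; print(Cython.__version__, Cython.__file__)"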

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Another grant update, & numpy job ad is up!

2017-12-21 Thread Nathaniel Smith
Hi all,

Two exciting bits of news:

1) We just posted the announcement of a second grant to BIDS for
NumPy, this time from the Sloan Foundation:

https://bids.berkeley.edu/news/bids-receives-sloan-foundation-grant-contribute-numpy-development

This is for $659,359 over two years, very similar to the
previously-announced Moore Foundation grant. These two grants were
originally written together as one, then split in half between the two
foundations, then the schedules drifted apart... I'm excited to
finally have this all sorted out so we can move ahead with the
original plan!

2) We have successfully navigated UC Berkeley's administrative systems
and posted an actual job opening, which you can apply to *right now*:

https://jobsprod.is.berkeley.edu/psp/jobsprod/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?Page=HRS_CE_JOB_DTL&Action=A&JobOpeningId=24142&SiteId=1&PostingSeq=1

We're hoping to hire 2-3 people under that job description, so please
tell your friends, relatives, enemies, whoever. This is a fully open
search and we want our candidate pool to be as large and diverse (in
all senses) as possible, so if you're on the fence about applying, why
not give it a shot? Or if you want to know more, feel free to send any
questions to me (n...@berkeley.edu) and/or Stéfan van der Walt
(stef...@berkeley.edu) -- we'd love to chat.

We'll also do another round of publicity/reminders in the new year
after everyone's had a chance to recover from the holidays, but wanted
to at least get this out now so those who are interested can start
putting their application together...

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Another grant update, & numpy job ad is up!

2017-12-22 Thread Nathaniel Smith
On Dec 22, 2017 8:35 AM, "Charles R Harris" 
wrote:



On Thu, Dec 21, 2017 at 6:38 PM, Nathaniel Smith  wrote:

> Hi all,
>
> Two exciting bits of news:
>
> 1) We just posted the announcement of a second grant to BIDS for
> NumPy, this time from the Sloan Foundation:
>
> https://bids.berkeley.edu/news/bids-receives-sloan-foundatio
> n-grant-contribute-numpy-development
>
> This is for $659,359 over two years, very similar to the
> previously-announced Moore Foundation grant. These two grants were
> originally written together as one, then split in half between the two
> foundations, then the schedules drifted apart... I'm excited to
> finally have this all sorted out so we can move ahead with the
> original plan!
>

Are the two grants concurrent, overlapping, or consecutive?


Concurrent. We're in the process now of sorting out the dates so that they
match, even though we got them at different times.

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy 1.14.0 release

2018-01-07 Thread Nathaniel Smith
On Sun, Jan 7, 2018 at 12:59 PM, Allan Haldane  wrote:
> On 01/07/2018 12:37 PM, Ralf Gommers wrote:
>>
>>
>>
>> On Sun, Jan 7, 2018 at 2:00 PM, Charles R Harris
>> mailto:charlesr.har...@gmail.com>> wrote:
>>
>> Hi All,
>>
>> On behalf of the NumPy team, I am pleased to announce NumPy 1.14.0.
>>
>>
>> Thanks for doing the heavy lifting to get this release out the door Chuck!
>>
>> Ralf
>
>
> Yes, I am always very impressed and appreciative of all the work Chuck does
> for Numpy. Thank you very much!

+1

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] RFC: comments to BLAS committee from numpy/scipy devs

2018-01-09 Thread Nathaniel Smith
Hi all,

As mentioned earlier [1][2], there's work underway to revise and
update the BLAS standard -- e.g. we might get support for strided
arrays and lose xerbla! There's a draft at [3]. They're interested in
feedback from users, so I've written up a first draft of comments
about what we would like as NumPy/SciPy developers. This is very much
a first attempt -- I know we have lots of people who are more expert
on BLAS than me on these lists :-). Please let me know what you think.

-n

[1] https://mail.python.org/pipermail/numpy-discussion/2017-November/077420.html
[2] https://mail.python.org/pipermail/scipy-dev/2017-November/022267.html
[3] 
https://docs.google.com/document/d/1DY4ImZT1coqri2382GusXgBTTTVdBDvtD5I14QHp9OE/edit

-

# Comments from NumPy / SciPy developers on "A Proposal for a
Next-Generation BLAS"

These are comments on [A Proposal for a Next-Generation
BLAS](https://docs.google.com/document/d/1DY4ImZT1coqri2382GusXgBTTTVdBDvtD5I14QHp9OE/edit#)
(version as of 2017-12-13), from the perspective of the developers of
the NumPy and SciPy libraries. We hope this feedback is useful, and
welcome further discussion.

## Who are we?

NumPy and SciPy are the two foundational libraries of the Python
numerical ecosystem, and one of their duties is to wrap BLAS and
expose it for the use of other Python libraries. (NumPy primarily
provides a GEMM wrapper, while SciPy exposes more specialized
operations.) It's unclear how many users we have exactly, but we
certainly ship multiple million copies of BLAS every month, and
provide one of the most popular numerical toolkits for both novice and
expert users.

Looking at the original BLAS and LAPACK interfaces, it often seems
that their imagined user is something like a classic supercomputer
consumer, who writes code directly in Fortran or C against the BLAS
API, and where the person writing the code and running the code are
the same. NumPy/SciPy are coming from a very different perspective:
our users generally know nothing about the details of the underlying
BLAS; they just want to describe their problem in some high-level way,
and the library is responsible for making it happen as efficiently as
possible, and is often integrated into some larger system (e.g. a
real-time analytics platform embedded in a web server).

When it comes to our BLAS usage, we mostly use only a small subset of
the routines. However, as "consumer software" used by a wide variety
of users with differing degrees of technical expertise, we're expected
to Just Work on a wide variety of systems, and with as many different
vendor BLAS libraries as possible. On the other hand, the fact that
we're working with Python means we don't tend to worry about small
inefficiencies that will be lost in the noise in any case, and are
willing to sacrifice some performance to get more reliable operation
across our diverse userbase.

## Comments on specific aspects of the proposal

### Data Layout

We are **strongly in favor** of the proposal to support arbitrary
strided data layouts. Ideally, this would support strides *specified
in bytes* (allowing for unaligned data layouts), and allow for truly
arbitrary strides, including *zero or negative* values. However, we
think it's fine if some of the weirder cases suffer a performance
penalty.

Rationale: NumPy – and thus, most of the scientific Python ecosystem –
only has one way of representing an array: the `numpy.ndarray` type,
which is an arbitrary dimensional tensor with arbitrary strides. It is
common to encounter matrices with non-trivial strides. For example::

    # Make a 3-dimensional tensor, 10 x 9 x 8
    t = np.zeros((10, 9, 8))
    # Considering this as a stack of eight 10x9 matrices, extract the first:
    mat = t[:, :, 0]

Now `mat` has non-trivial strides on both axes. (If running this in a
Python interpreter, you can see this by looking at the value of
`mat.strides`.) Another case where interesting strides arise is when
performing 
["broadcasting"](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html),
which is the name for NumPy's rules for stretching arrays to make
their shapes match. For example, in an expression like::

    np.array([1, 2, 3]) + 1

the scalar `1` is "broadcast" to create a vector `[1, 1, 1]`. This is
accomplished without allocating memory, by creating a vector with
length = 3 and strides = 0 – so all the elements share a single
location in memory. Similarly, by using negative strides we can
reverse an array without allocating memory::

    a = np.array([1, 2, 3])
    a_flipped = a[::-1]

Now `a_flipped` has the value `[3, 2, 1]`, while sharing storage with
the array `a = [1, 2, 3]`. Misaligned data is also possible (e.g. an
array of 8-byte doubles with a 9-byte stride), though it arises more
rarely. (An example of when it might occur is in an on-disk data
format that alternates between storing a double value and then a
single byte value, which is then memory-mapped.)
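
In NumPy terms, such a misaligned layout can also arise in memory from a
packed record array; a small sketch::

    # A 9-byte record: an 8-byte double followed by a 1-byte flag
    records = np.zeros(10, dtype=[("value", "<f8"), ("flag", "u1")])
    values = records["value"]
    # values.strides == (9,), so most of these doubles are not stored at
    # their natural 8-byte alignment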

While this array representati

Re: [Numpy-discussion] [SciPy-Dev] RFC: comments to BLAS committee from numpy/scipy devs

2018-01-09 Thread Nathaniel Smith
On Tue, Jan 9, 2018 at 3:40 AM, Ilhan Polat  wrote:
> I couldn't find an item to place this but I think ilaenv and also calling
> the function twice (one with lwork=-1 and reading the optimal block size and
> then call the function again properly with lwork=) in LAPACK needs to
> be gotten rid of.
>
> That's a major annoyance during the wrapping of LAPACK routines for SciPy.
>
> I don't know if this is realistic but the values ilaenv needed can be
> computed once (or again if hardware is changed) at the install and can be
> read off by the routines.

Unfortunately I think this effort is just to revise BLAS, not LAPACK.
Maybe you should try starting a conversation with the LAPACK
developers though – I don't know much about how they work but maybe
they'd be interested in feedback.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [SciPy-Dev] RFC: comments to BLAS committee from numpy/scipy devs

2018-01-09 Thread Nathaniel Smith
On Tue, Jan 9, 2018 at 12:53 PM, Tyler Reddy  wrote:
> One common issue in computational geometry is the need to operate rapidly on
> arrays with "heterogeneous shapes."
>
> So, an array that has rows with different numbers of columns -- shape (1,3)
> for the first polygon and shape (1, 12) for the second polygon and so on.
>
> This seems like a particularly nasty scenario when the loss of "homogeneity"
> in shape precludes traditional vectorization -- I think numpy effectively
> converts these to dtype=object, etc. I don't
> think is necessarily a BLAS issue since wrapping comp. geo. libraries does
> happen in a subset of cases to handle this, but if there's overlap in
> utility you could pass it along I suppose.

You might be interested in this discussion of "Batch BLAS":
https://docs.google.com/document/d/1DY4ImZT1coqri2382GusXgBTTTVdBDvtD5I14QHp9OE/edit#heading=h.pvsif1mxvaqq

I didn't get into it in the draft response, because it didn't seem
like something where NumPy/SciPy have any useful experience to offer,
but it sounds like there are people worrying about this case.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Moving NumPy's PRNG Forward

2018-01-19 Thread Nathaniel Smith
On Fri, Jan 19, 2018 at 6:55 AM, Robert Kern  wrote:
[...]
> There seems to be a lot of pent-up motivation to improve on the random
> number generation, in particular the distributions, that has been blocked by
> our policy. I think we've lost a few potential first-time contributors that
> have run up against this wall. We have been pondering ways to allow for
> adding new core PRNGs and improve the distribution methods while maintaining
> stream-compatibility for existing code. Kevin Sheppard, in particular, has
> been working hard to implement new core PRNGs with a common API.
>
>   https://github.com/bashtage/ng-numpy-randomstate
>
> Kevin has also been working to implement the several proposals that have
> been made to select different versions of distribution implementations. In
> particular, one idea is to pass something to the RandomState constructor to
> select a specific version of distributions (or switch out the core PRNG).
> Note that to satisfy the policy, the simplest method of seeding a
> RandomState will always give you the oldest version: what we have now.
>
> Kevin has recently come to the conclusion that it's not technically feasible
> to add the version-selection at all if we keep the stream-compatibility
> policy.
>
>   https://github.com/numpy/numpy/pull/10124#issuecomment-350876221
>
> I would argue that our current policy isn't providing the value that it
> claims to.

I agree that relaxing our policy would be better than the status quo.
Before making any decisions, though, I'd like to make sure we
understand the alternatives and their trade-offs. Specifically, I
think the main alternative would be the following approach to
versioning:

1) make RandomState's state be a tuple (underlying RNG algorithm,
underlying RNG state, distribution version)
2) zero-argument initialization/seeding, like RandomState() or
rstate.seed(), sets the state to: (our recommended RNG algorithm,
os.urandom(...), version=LATEST_VERSION)
3) for backcompat, single-argument seeding like RandomState(123) or
rstate.seed(123), sets the state to: (mersenne twister,
expand_mt_seed(123), version=0)
4) also allow seeding to explicitly control all the parameters, like
RandomState(PCG_XSL_RR(123), version=12) or whatever
5) the distribution functions are implemented like:

def normal(*args, **kwargs):
    if self.version < 3:
        return self._normal_box_muller(*args, **kwargs)
    elif self.version < 8:
        return self._normal_ziggurat_v1(*args, **kwargs)
    else:  # version >= 8
        return self._normal_ziggurat_v2(*args, **kwargs)

Advantages: fully backwards compatible; preserves the compatibility
guarantee (such as it is); users who use the default seeding
automatically get the highest speed and quality
Disadvantages: users who specify seeds explicitly get old/slow
distributions (but of course that's the point of compatibility); we
have to keep the old distribution code around forever (but this is not
too hard; it just sits in some function and we never touch it).

Kevin, is this the version that you think is non-viable? Is the above
a good description of the advantages/disadvantages?

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] New NEP: merging multiarray and umath

2018-03-08 Thread Nathaniel Smith
Hi all,

Well, this is something that we've discussed for a while and I think
generally has consensus already, but I figured I'd write it down
anyway to make sure.

There's a rendered version here:
https://github.com/njsmith/numpy/blob/nep-0015-merge-multiarray-umath/doc/neps/nep-0015-merge-multiarray-umath.rst

-


Merging multiarray and umath


:Author: Nathaniel J. Smith 
:Status: Draft
:Type: Standards Track
:Created: 2018-02-22


Abstract


Let's merge ``numpy.core.multiarray`` and ``numpy.core.umath`` into a
single extension module, and deprecate ``np.set_numeric_ops``.


Background
--

Currently, numpy's core C code is split between two separate extension
modules.

``numpy.core.multiarray`` is built from
``numpy/core/src/multiarray/*.c``, and contains the core array
functionality (in particular, the ``ndarray`` object).

``numpy.core.umath`` is built from ``numpy/core/src/umath/*.c``, and
contains the ufunc machinery.

These two modules each expose their own separate C API, accessed via
``import_multiarray()`` and ``import_umath()`` respectively. The idea
is that they're supposed to be independent modules, with
``multiarray`` as a lower-level layer with ``umath`` built on top. In
practice this has turned out to be problematic.

First, the layering isn't perfect: when you write ``ndarray +
ndarray``, this invokes ``ndarray.__add__``, which then calls the
ufunc ``np.add``. This means that ``ndarray`` needs to know about
ufuncs – so instead of a clean layering, we have a circular
dependency. To solve this, ``multiarray`` exports a somewhat
terrifying function called ``set_numeric_ops``. The bootstrap
procedure each time you ``import numpy`` is:

1. ``multiarray`` and its ``ndarray`` object are loaded, but
   arithmetic operations on ndarrays are broken.

2. ``umath`` is loaded.

3. ``set_numeric_ops`` is used to monkeypatch all the methods like
   ``ndarray.__add__`` with objects from ``umath``.

In addition, ``set_numeric_ops`` is exposed as a public API,
``np.set_numeric_ops``.
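
For illustration, the Python-level view of this table (a sketch; calling
``set_numeric_ops`` with no arguments simply returns the current operator
table without changing it)::

    ops = np.set_numeric_ops()
    ops["add"]   # <ufunc 'add'>, i.e. what ndarray.__add__ ends up calling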

Furthermore, even when this layering does work, it ends up distorting
the shape of our public ABI. In recent years, the most common reason
for adding new functions to ``multiarray``\'s "public" ABI is not that
they really need to be public or that we expect other projects to use
them, but rather just that we need to call them from ``umath``. This
is extremely unfortunate, because it makes our public ABI
unnecessarily large, and since we can never remove things from it,
this creates an ongoing maintenance burden. The way C works, you can
have internal API that's visible to everything inside the same
extension module, or you can have a public API that everyone can use;
you can't have an API that's visible to multiple extension modules
inside numpy, but not to external users.

We've also increasingly been putting utility code into
``numpy/core/src/private/``, which now contains a bunch of files which
are ``#include``\d twice, once into ``multiarray`` and once into
``umath``. This is pretty gross, and is purely a workaround for these
being separate C extensions.


Proposed changes


This NEP proposes three changes:

1. We should start building ``numpy/core/src/multiarray/*.c`` and
   ``numpy/core/src/umath/*.c`` together into a single extension
   module.

2. Instead of ``set_numeric_ops``, we should use some new, private API
   to set up ``ndarray.__add__`` and friends.

3. We should deprecate, and eventually remove, ``np.set_numeric_ops``.


Non-proposed changes


We don't necessarily propose to throw away the distinction between
multiarray/ and umath/ in terms of our source code organization:
internal organization is useful! We just want to build them together
into a single extension module. Of course, this does open the door for
potential future refactorings, which we can then evaluate based on
their merits as they come up.

It also doesn't propose that we break the public C ABI. We should
continue to provide ``import_multiarray()`` and ``import_umath()``
functions – it's just that now both ABIs will ultimately be loaded
from the same C library. Due to how ``import_multiarray()`` and
``import_umath()`` are written, we'll also still need to have modules
called ``numpy.core.multiarray`` and ``numpy.core.umath``, and they'll
need to continue to export ``_ARRAY_API`` and ``_UFUNC_API`` objects –
but we can make one or both of these modules be tiny shims that simply
re-export the magic API object from where-ever it's actually defined.
(See ``numpy/core/code_generators/generate_{numpy,ufunc}_api.py`` for
details of how these imports work.)


Backward compatibility
--

The only compatibility break is the deprecation of ``np.set_numeric_ops``.


Alternatives


n/a


Discussion
--

TBD


Copyright
-

This document has been placed in the public domain.


-- 
Nathaniel J. Smi

Re: [Numpy-discussion] New NEP: merging multiarray and umath

2018-03-08 Thread Nathaniel Smith
On Thu, Mar 8, 2018 at 12:47 AM, Eric Wieser
 wrote:
> This means that ndarray needs to know about ufuncs – so instead of a clean
> layering, we have a circular dependency.
>
> Perhaps we should split ndarray into a base_ndarray class with no arithmetic
> support (add, sum, etc), and then provide an ndarray subclass from umath
> instead (either the separate extension, or just a different set of files)

This just seems like adding more complexity because we can, though?

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] new NEP: np.AbstractArray and np.asabstractarray

2018-03-08 Thread Nathaniel Smith
Hi all,

Here's a more substantive NEP: trying to define how to define a
standard way for functions to say that they can accept any "duck
array".

Biggest open question for me: the name "asabstractarray" kinda sucks
(for reasons described in the NEP), and I'd love to have something
better. Any ideas?

Rendered version:
https://github.com/njsmith/numpy/blob/nep-16-abstract-array/doc/neps/nep-0016-abstract-array.rst

-n




An abstract base class for identifying "duck arrays"


:Author: Nathaniel J. Smith 
:Status: Draft
:Type: Standards Track
:Created: 2018-03-06


Abstract


We propose to add an abstract base class ``AbstractArray`` so that
third-party classes can declare their ability to "quack like" an
``ndarray``, and an ``asabstractarray`` function that performs
similarly to ``asarray`` except that it passes through
``AbstractArray`` instances unchanged.


Detailed description


Many functions, in NumPy and in third-party packages, start with some
code like::

   def myfunc(a, b):
       a = np.asarray(a)
       b = np.asarray(b)
       ...

This ensures that ``a`` and ``b`` are ``np.ndarray`` objects, so
``myfunc`` can carry on assuming that they'll act like ndarrays both
semantically (at the Python level), and also in terms of how they're
stored in memory (at the C level). But many of these functions only
work with arrays at the Python level, which means that they don't
actually need ``ndarray`` objects *per se*: they could work just as
well with any Python object that "quacks like" an ndarray, such as
sparse arrays, dask's lazy arrays, or xarray's labeled arrays.

However, currently, there's no way for these libraries to express that
their objects can quack like an ndarray, and there's no way for
functions like ``myfunc`` to express that they'd be happy with
anything that quacks like an ndarray. The purpose of this NEP is to
provide those two features.

Sometimes people suggest using ``np.asanyarray`` for this purpose, but
unfortunately its semantics are exactly backwards: it guarantees that
the object it returns uses the same memory layout as an ``ndarray``,
but tells you nothing at all about its semantics, which makes it
essentially impossible to use safely in practice. Indeed, the two
``ndarray`` subclasses distributed with NumPy – ``np.matrix`` and
``np.ma.masked_array`` – do have incompatible semantics, and if they
were passed to a function like ``myfunc`` that doesn't check for them
as a special-case, then it may silently return incorrect results.
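
For example, a small sketch of how this can go wrong::

    def mean_of_products(a, b):
        a = np.asanyarray(a)
        b = np.asanyarray(b)
        return (a * b).mean()

    x = np.ones((2, 2))
    mean_of_products(x, x)             # 1.0 -- elementwise multiplication
    mean_of_products(np.matrix(x), x)  # 2.0 -- silently a matrix product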


Declaring that an object can quack like an array


There are two basic approaches we could use for checking whether an
object quacks like an array. We could check for a special attribute on
the class::

  def quacks_like_array(obj):
      return bool(getattr(type(obj), "__quacks_like_array__", False))

Or, we could define an `abstract base class (ABC)
`__::

  def quacks_like_array(obj):
      return isinstance(obj, AbstractArray)

If you look at how ABCs work, this is essentially equivalent to
keeping a global set of types that have been declared to implement the
``AbstractArray`` interface, and then checking it for membership.

Between these, the ABC approach seems to have a number of advantages:

* It's Python's standard, "one obvious way" of doing this.

* ABCs can be introspected (e.g. ``help(np.AbstractArray)`` does
  something useful).

* ABCs can provide useful mixin methods.

* ABCs integrate with other features like mypy type-checking,
  ``functools.singledispatch``, etc.

One obvious thing to check is whether this choice affects speed. Using
the attached benchmark script on a CPython 3.7 prerelease (revision
c4d77a661138d, self-compiled, no PGO), on a Thinkpad T450s running
Linux, we find::

np.asarray(ndarray_obj)  330 ns
np.asarray([])  1400 ns

Attribute check, success  80 ns
Attribute check, failure  80 ns

ABC, success via subclass340 ns
ABC, success via register()  700 ns
ABC, failure 370 ns

Notes:

* The first two lines are included to put the other lines in context.

* This used 3.7 because both ``getattr`` and ABCs are receiving
  substantial optimizations in this release, and it's more
  representative of the long-term future of Python. (Failed
  ``getattr`` doesn't necessarily construct an exception object
  anymore, and ABCs were reimplemented in C.)

* The "success" lines refer to cases where ``quacks_like_array`` would
  return True. The "failure" lines are cases where it would return
  False.

* The first measurement for ABCs is subclasses defined like::

  class MyArray(AbstractArray):
  ...

  The second is for subclasses defined like::

  class MyArray:
  ...

  AbstractArray.register(MyArray)
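
A minimal sketch of this kind of timing script (an illustration only,
not the actual attached benchmark script) might look like::

  import abc
  import timeit

  class AbstractArray(abc.ABC):
      pass

  class ViaSubclass(AbstractArray):
      pass

  class ViaRegister:
      pass

  AbstractArray.register(ViaRegister)

  class Plain:
      pass

  def attr_check(obj):
      return bool(getattr(type(obj), "__quacks_like_array__", False))

  def abc_check(obj):
      return isinstance(obj, AbstractArray)

  cases = [
      ("Attribute check, failure", attr_check, Plain()),
      ("ABC, success via subclass", abc_check, ViaSubclass()),
      ("ABC, success via register()", abc_check, ViaRegister()),
      ("ABC, failure", abc_check, Plain()),
  ]
  for label, check, obj in cases:
      # timeit returns total seconds for 10**6 calls; convert to ns/call
      per_call = timeit.timeit(lambda: check(obj), number=10**6) / 10**6
      print("%-30s %6.0f ns" % (label, per_call * 1e9))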

Re: [Numpy-discussion] New NEP: merging multiarray and umath

2018-03-08 Thread Nathaniel Smith
On Thu, Mar 8, 2018 at 1:52 AM, Gregor Thalhammer
 wrote:
>
> Hi,
>
> A long time ago I wrote a wrapper to use optimised and parallelized math
> functions from Intel's vector math library
> geggo/uvml: Provide vectorized math function (MKL) for numpy
>
> I found it useful to inject (some of) the fast methods into numpy via
> np.set_numeric_ops(), to gain more performance without changing my programs.
>
> While this original project is outdated, I can imagine that a centralised
> way to swap the implementation of math functions is useful. Therefore I
> suggest keeping np.set_numeric_ops(), but admittedly I do not understand all the
> technical implications of the proposed change.

The main part of the proposal is to merge the two libraries; the
question of whether to deprecate set_numeric_ops is a bit separate.
There's no technical obstacle to keeping it, except the usual issue of
having more cruft to maintain :-).

It's usually true that any monkeypatching interface will be useful to
someone under some circumstances, but we usually don't consider this a
good enough reason on its own to add and maintain these kinds of
interfaces. And an unfortunate side-effect of these kinds of hacky
interfaces is that they can end up removing the pressure to solve
problems properly. In this case, better solutions would include:

- Adding support for accelerated vector math libraries to NumPy
directly (e.g. MKL, yeppp)

- Overriding the inner loops inside ufuncs like numpy.add that
np.ndarray.__add__ ultimately calls. This would speed up all addition
(whether or not it uses Python + syntax), would be a more general
solution (e.g. you could monkeypatch np.exp to use MKL's fast
vectorized exp), would let you skip reimplementing all the tricky
shared bits of the ufunc logic, etc. Conceptually it's not even very
hacky, because we allow you add new loops to existing ufuncs; making
it possible to replace existing loops wouldn't be a big stretch. (In
fact it's possible that we already allow this; I haven't checked.)

So I still lean towards deprecating set_numeric_ops. It's not the most
crucial part of the proposal though; if it turns out to be too
controversial then I'll take it out.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Where to discuss NEPs (was: Re: new NEP: np.AbstractArray and np.asabstractarray)

2018-03-08 Thread Nathaniel Smith
On Thu, Mar 8, 2018 at 7:06 AM, Marten van Kerkwijk
 wrote:
> Hi Nathaniel,
>
> Overall, hugely in favour!  For detailed comments, it would be good to
> have a link to a PR; could you put that up?

Well, there's a PR here: https://github.com/numpy/numpy/pull/10706

But, this raises a question :-). (One which also came up here:
https://github.com/numpy/numpy/pull/10704#issuecomment-371684170)

There are two sensible workflows we could use (or at least, two that I
can think of):

1. We merge updates to the NEPs as we go, so that whatever's in the
repo is the current draft. Anyone can go to the NEP webpage at
http://numpy.org/neps (WIP, see #10702) to see the latest version of
all NEPs, whether accepted, rejected, or in progress. Discussion
happens on the mailing list, and line-by-line feedback can be done by
quote-replying and commenting on individual lines. From time to time,
the NEP author takes all the accumulated feedback, updates the
document, and makes a new post to the list to let people know about
the updated version.

This is how python-dev handles PEPs.

2. We use Github itself to manage the review. The repo only contains
"accepted" NEPs; draft NEPs are represented by open PRs, and rejected
NEPs are represented by PRs that were closed-without-merging.
Discussion uses Github's commenting/review tools, and happens in the
PR itself.

This is roughly how Rust handles their RFC process, for example:
https://github.com/rust-lang/rfcs

Trying to do some hybrid version of these seems like it would be
pretty painful, so we should pick one.

Given that historically we've tried to use the mailing list for
substantive features/planning discussions, and that our NEP process
has been much closer to workflow 1 than workflow 2 (e.g., there are
already a bunch of old NEPs in the repo that are effectively
rejected/withdrawn), I think we should maybe continue that way, and
keep discussions here?

So my suggestion is discussion should happen on the list, and NEP
updates should be merged promptly, or just self-merged. Sound good?

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Where to discuss NEPs (was: Re: new NEP: np.AbstractArray and np.asabstractarray)

2018-03-09 Thread Nathaniel Smith
On Thu, Mar 8, 2018 at 10:26 PM, Ralf Gommers  wrote:
>
>
> On Thu, Mar 8, 2018 at 8:22 PM, Nathaniel Smith  wrote:
>>
>> On Thu, Mar 8, 2018 at 7:06 AM, Marten van Kerkwijk
>>  wrote:
>> > Hi Nathaniel,
>> >
>> > Overall, hugely in favour!  For detailed comments, it would be good to
>> > have a link to a PR; could you put that up?
>>
>> Well, there's a PR here: https://github.com/numpy/numpy/pull/10706
>>
>> But, this raises a question :-). (One which also came up here:
>> https://github.com/numpy/numpy/pull/10704#issuecomment-371684170)
>>
>> There are two sensible workflows we could use (or at least, two that I
>> can think of):
>>
>> 1. We merge updates to the NEPs as we go, so that whatever's in the
>> repo is the current draft. Anyone can go to the NEP webpage at
>> http://numpy.org/neps (WIP, see #10702) to see the latest version of
>> all NEPs, whether accepted, rejected, or in progress. Discussion
>> happens on the mailing list, and line-by-line feedback can be done by
>> quote-replying and commenting on individual lines. From time to time,
>> the NEP author takes all the accumulated feedback, updates the
>> document, and makes a new post to the list to let people know about
>> the updated version.
>>
>> This is how python-dev handles PEPs.
>>
>> 2. We use Github itself to manage the review. The repo only contains
>> "accepted" NEPs; draft NEPs are represented by open PRs, and rejected
>> NEPs are represented by PRs that were closed-without-merging.
>> Discussion uses Github's commenting/review tools, and happens in the
>> PR itself.
>>
>> This is roughly how Rust handles their RFC process, for example:
>> https://github.com/rust-lang/rfcs
>>
>> Trying to do some hybrid version of these seems like it would be
>> pretty painful, so we should pick one.
>>
>> Given that historically we've tried to use the mailing list for
>> substantive features/planning discussions, and that our NEP process
>> has been much closer to workflow 1 than workflow 2 (e.g., there are
>> already a bunch of old NEPs in the repo that are effectively
>> rejected/withdrawn), I think we should maybe continue that way, and
>> keep discussions here?
>>
>> So my suggestion is discussion should happen on the list, and NEP
>> updates should be merged promptly, or just self-merged. Sound good?
>
>
> Agreed that overall (1) is better than (2), rejected NEPs should be visible.
> However there's no need for super-quick self-merge, and I think it would be
> counter-productive.
>
> Instead, just send a PR, leave it open for some discussion, and update for
> detailed comments (as well as long in-depth discussions that only a couple
> of people care about) in the Github UI and major ones on the list. Once it's
> stabilized a bit, then merge with status "Draft" and update once in a while.
> I think this is also much more in line with what python-dev does, I have
> seen substantial discussion on Github and have not seen quick self-merges.

Not sure what you mean about python-dev. Are you looking at the peps
repository? https://github.com/python/peps

From a quick skim, it looks like of the last 37 commits, only 8 came
in through PRs and the other 29 were pushed directly by committers
without any review. 3 of the 8 PRs were self-merged immediately after
submission, and of the remaining 5 PRs, 4 of them were from external
contributors who didn't have commit rights, and the 1 other was a fix
to the repo README, rather than an actual PEP change. I don't think
I've ever seen any kind of substantive discussion in that repo -- any
discussion is mostly restricted to helping new contributors with
procedural stuff, maybe formatting issues or fixes to the PEP tooling.

Anyway, just because python-dev does it that way doesn't mean that we
have to too.

But if we split discussions between GH and the mailing list, then
we're definitely going to end up discussing substantive issues there
(how do we know which discussions only a couple of people care
about?), and trying to juggle that seems confusing to me, plus makes
it harder to track down what happened later, after we've had multiple
PRs each with their own comments...

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] new NEP: np.AbstractArray and np.asabstractarray

2018-03-09 Thread Nathaniel Smith
On Thu, Mar 8, 2018 at 7:06 AM, Marten van Kerkwijk
 wrote:
> A larger comment: you state that you think `np.asanyarray` is a
> mistake since `np.matrix` and `np.ma.MaskedArray` would pass through
> and that those do not strictly mimic `NDArray`. Here, I agree with
> `matrix` (but since we're deprecating it, let's remove that from the
> discussion), but I do not see how your proposed interface would not
> let `MaskedArray` pass through, nor really that one would necessarily
> want that.

We can discuss whether MaskedArray should be an AbstractArray.
Conceptually it probably should be; I think that was a goal of the
MaskedArray authors (even if they wouldn't have put it that way). In
practice there are a lot of funny quirks in MaskedArray, so I'd want
to look more carefully in case there are weird incompatibilities that
would cause problems. Note that we can figure this out after the NEP
is finished, too.

I wonder if the matplotlib folks have any thoughts on this? I know
they're one of the more prominent libraries that tries to handle both
regular and masked arrays, so maybe they could comment on how often
they run

> I think it may be good to distinguish two separate cases:
> 1. Everything has exactly the same meaning as for `ndarray` but the
> data is stored differently (i.e., only `view` does not work). One can
> thus expect that for `output = function(inputs)`, at the end all
> `duck_output == ndarray_output`.
> 2. Everything is implemented but operations may give different output
> (depending on masks for masked arrays, units for quantities, etc.), so
> generally `duck_output != ndarray_output`.
>
> Which one of these are you aiming at? By including
> `NDArrayOperatorsMixin`, it would seem option (2), but perhaps not? Is
> there a case for both separately?

Well, (1) is much easier to design around, because it's well-defined
:-). And I'm not sure that there's a principled difference between
regular arrays and masked arrays/quantity arrays; these *could* be
ndarray objects with special dtypes and extra methods, neither of
which would disqualify you from being a "case 1" array.

(I guess one issue is that because MaskedArray ignores the mask by
default, you could get weird results from things like mean
calculations: np.sum(masked_arr) / np.prod(masked_arr.shape) does not
give the right result. This isn't an issue for quantities, though, or
for an R-style NA that propagated by default.)
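
A quick illustration of that mean problem with a small masked array:

    import numpy as np

    m = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, False, True])
    np.sum(m) / np.prod(m.shape)   # -> 1.0: the sum skips the masked value, the size doesn't
    m.mean()                       # -> 1.5: the mask-aware mean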

> Smaller general comment: at least in the NEP I would not worry about
> deprecating `NDArrayOperatorsMixin` - this may well be handy in itself
> (for things that implement `__array_ufunc__` but do not have shape,
> etc. (I have been doing some work on creating ufunc chains that would
> use this -- but they definitely are not array-like). Similarly, I
> think there is room for an `NDArrayShapeMixin` which might help with
> `concatenate` and friends.

Fair enough.

> Finally, on the name: `asarray` and `asanyarray` are just shims over
> `array`, so one option would be to add an argument in `array` (or
> broaden the scope of `subok`).

We definitely don't want to broaden the scope of 'subok', because one
of the goals here is to have something that projects like sklearn can
use, and they won't use subok :-). (In particular, np.matrix is
definitely not a duck array of any kind.)

And supporting array() is tricky, because then you have to figure out
what to do with the copy=, order=, subok=, ndmin= arguments. copy= in
particular is tricky given that we don't know the object's type! I
guess we could call obj.copy() or something... but for this first
iteration it seemed simplest to make a new function that just has the
most important stuff for writing generic functions that accept duck
arrays.
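
For concreteness, a minimal sketch of that kind of function (exact details
still up for discussion; the AbstractArray ABC here is the one proposed in
the NEP, stubbed out so the snippet stands alone):

    import abc
    import numpy as np

    class AbstractArray(abc.ABC):   # stand-in for the ABC proposed in the NEP
        pass

    def asabstractarray(a, dtype=None):
        # Sketch only: pass duck arrays through untouched, coerce everything else.
        if isinstance(a, AbstractArray):
            if dtype is None or np.dtype(dtype) == a.dtype:
                return a
            return a.astype(dtype)
        return np.asarray(a, dtype=dtype)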

What we could do is, in addition to adding some kind of
asabstractarray() function, *also* make it so asanyarray() starts
accepting abstract/duck arrays, on the theory that anyone who's
willing to put up with asanyarrays()'s weak guarantees won't notice if
we weaken them a bit more. Honestly though I'd rather just not touch
asanyarray at all, and maybe even deprecate it someday.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] new NEP: np.AbstractArray and np.asabstractarray

2018-03-09 Thread Nathaniel Smith
On Thu, Mar 8, 2018 at 9:45 PM, Stephan Hoyer  wrote:
> On Thu, Mar 8, 2018 at 5:54 PM Juan Nunez-Iglesias 
> wrote:
>>
>> On Fri, Mar 9, 2018, at 5:56 AM, Stephan Hoyer wrote:
>>
>> Marten's case 1: works exactly like ndarray, but stores data differently:
>> parallel arrays (e.g., dask.array), sparse arrays (e.g.,
>> https://github.com/pydata/sparse), hypothetical non-strided arrays (e.g.,
>> always C ordered).
>>
>>
>> Two other "hypotheticals" that would fit nicely in this space:
>> - the Open Connectome folks (https://neurodata.io) proposed linearising
>> indices using space-filling curves, which minimizes cache misses (or IO
>> reads) for giant volumes. I believe they implemented this but can't find it
>> currently.
>> - the N5 format for chunked arrays on disk:
>> https://github.com/saalfeldlab/n5
>
>
> I think these fall into another important category of duck arrays.
> "Indexable" arrays the serve as storage, but that don't support computation.
> These sorts of arrays typically support operations like indexing and define
> handful of array-like properties (e.g., dtype and shape), but not
> arithmetic, reductions or reshaping.
>
> This means you can't quite use them as a drop-in replacement for NumPy
> arrays in all cases, but that's OK. In contrast, both dask.array and sparse
> do aspire to do fill out nearly the full numpy.ndarray API.

I'm not sure if these particular formats fall into that category or
not (isn't the point of the space-filling curves to support
cache-efficient computation?). But I suppose you're also thinking of
things like h5py.Dataset? My impression is that these are mostly
handled pretty well already by defining __array__ and/or providing
array operations that implicitly convert to ndarray -- do you agree?
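
(For context, the __array__ hook is what lets np.asarray convert such a
storage-only object -- a toy sketch:)

    import numpy as np

    class DiskDataset:
        """Toy storage-only object; np.asarray() works because of __array__."""

        def __init__(self, values):
            self._values = list(values)

        def __array__(self, dtype=None):
            # np.array/np.asarray call this to get a real ndarray out.
            return np.array(self._values, dtype=dtype)

    np.asarray(DiskDataset([1, 2, 3]))   # -> array([1, 2, 3])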

This does raise an interesting point: maybe we'll eventually want an
__abstract_array__ method that asabstractarray tries calling if
defined, so e.g. if your object isn't itself an array but can be
efficiently converted into a *sparse* array, you have a way to declare
that? I think this is something to file under "worry about later,
after we have the basic infrastructure", but it's not something I'd
thought of before so mentioning here.
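
(Purely hypothetical sketch of that hook -- nothing like it exists today:)

    import numpy as np

    def asabstractarray(a, dtype=None):
        hook = getattr(type(a), "__abstract_array__", None)
        if hook is not None:
            # The object picks its own duck-array representation,
            # e.g. converting itself into a sparse array.
            return hook(a, dtype=dtype)
        # (The ABC pass-through from the NEP is omitted here for brevity.)
        return np.asarray(a, dtype=dtype)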

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Where to discuss NEPs (was: Re: new NEP: np.AbstractArray and np.asabstractarray)

2018-03-09 Thread Nathaniel Smith
On Fri, Mar 9, 2018 at 11:51 AM, Stefan van der Walt
 wrote:
> On Fri, 09 Mar 2018 17:00:43 +, Stephan Hoyer wrote:
>> I'll note that we basically used GitHub for revising __array_ufunc__ NEP,
>> and I think that worked out better for everyone involved. The discussion
>> was a little too specialized and high volume to be well handled on the
>> mailing list.
>
> A disadvantage of GitHub PR comments is that they do not track
> sub-threads of conversation, so you cannot "reply to" a previous concern
> directly.

Yeah, I actually find email much easier for this kind of complex
high-volume discussion. Even if lots of people don't use traditional
threaded mail clients anymore [1], archives are still threaded, and
the tools that make line-by-line responses easy and the ability to
split off conversations are both really helpful. (E.g., the way I
split this thread off from the original one :-).) The __array_ufunc__
discussion was almost impenetrable on GH, I think.

I admit though that some of this is probably just that I'm more used
to the email-based discussion workflow. Honestly none of these tools
are particularly amazing, and the __array_ufunc__ conversation would
have been difficult and inaccessible to outsiders no matter what
medium we used. It's much more important that we just pick something
and use it consistently than that we pick the Most Optimal Solution.

[1] Meaning this, not gmail's threads:
https://en.wikipedia.org/wiki/Conversation_threading#/media/File:Nntp.jpg

> PRs also mix inline comments (that become much less visible after
> rebases and updates) and "story line" comments.  These two "modes" of
> commenting, substantive discussion around ideas, v.s. concerns about
> specific phrasing, usage of words, typos, content of code snippets,
> etc., may require different approaches.  It would be quite easy to
> redirect the prior to the mailing list and the latter to the GitHub PR.

I don't think we should worry about this. Fiddly detail comments are,
by definition, not super important, and generally make up a tiny
volume of the discussion around a proposal. Also in practice reviewers
are no good at splitting up substantive comments from fiddly details:
the review workflow is that you read through and as thoughts occur you
write them down, so even if you start out thinking "okay, I'm only
going to comment on typos", then half-way through some paragraph
sparks a thought and suddenly you're writing something substantive
(and I'm as guilty of this as anyone, maybe more so...). Asking people
to classify their comments and then chiding them for putting them in
the wrong place etc. isn't a good use of time. Let's just pick one
place for everything and stick with it.

> I'm also not too keen on repeated PR creation and merging (it splits up
> the PR discussion even further).  Why not simply hold off until the PEP
> is ready, and view the documents on GitHub?  The rendering there is just
> as good.

Well, if we aren't using PRs for discussion then multiple PRs are fine
:-). And merging changes quickly is helpful because it makes the
rendered NEPs page a single one-stop-shop to see all the latest NEPs,
no matter what their current status.

If we do use PRs for discussion, then I agree that we should try to
keep the PR open until the NEP is "done", to minimize the splitting of
discussion. This does create a bit of extra friction because it turns
out that "is this done?" is not something you can really ever answer
for certain :-). Even after PEPs are accepted they usually end up
getting some further tweaks once people start implementing them.
Sometimes PEPs get abandoned in "Draft" state without ever being
accepted/rejected, and sometimes a PEP that had been abandoned for
years gets picked up and finished. You can see this in the Rust RFC
guidelines too [2]; they specifically address the issue of post-merge
changes, and it sounds like their solution is that if a substantive
issue is discovered in an accepted RFC, then you have to create a new
"fixup" RFC, which then gets its own PR for discussion. I guess if
this were our process then __array_ufunc__ would have ended up with ~3
NEPs :-).

This is all doable -- every approach has trade-offs. But we should
pick one, so we can adapt to those trade-offs.

[2] https://github.com/rust-lang/rfcs#the-rfc-life-cycle

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] New NEP: merging multiarray and umath

2018-03-09 Thread Nathaniel Smith
On Fri, Mar 9, 2018 at 3:33 AM, Julian Taylor
 wrote:
> As the functions of the different libraries have vastly different
> accuracies you want to be able to exchange numeric ops at runtime or at
> least during load time (like our cblas) and not limit yourself to one
> compile-time-defined set of functions.
> Keeping set_numeric_ops would be preferable to me.
>
> Though I am not clear on why the two things are connected?
> Why can't we keep set_numeric_ops and merge multiarray and umath into
> one shared object?

I think I addressed both of these topics here?
https://mail.python.org/pipermail/numpy-discussion/2018-March/07.html

Looking again now, I see that we actually *do* have an explicit API
for monkeypatching ufuncs:

https://docs.scipy.org/doc/numpy/reference/c-api.ufunc.html#c.PyUFunc_ReplaceLoopBySignature

So this seems to be a strictly more general/powerful/useful version of
set_numeric_ops...

I added some discussion to the NEP:
https://github.com/numpy/numpy/pull/10704/commits/4c4716ee0b3bc51d5be9baa891d60473f480d1f2

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] new NEP: np.AbstractArray and np.asabstractarray

2018-03-09 Thread Nathaniel Smith
On Thu, Mar 8, 2018 at 5:51 PM, Juan Nunez-Iglesias  wrote:
>> Finally for the name, what about `asduckarray`? Thought perhaps that could
>> be a source of confusion, and given the gradation of duck array like types.
>
> I suggest that the name should *not* use programmer lingo, so neither
> "abstract" nor "duck" should be in there. My humble proposal is "arraylike".
> (I know that this term has included things like "list-of-list" before but
> only in text, not code, as far as I know.)

I agree with your point about avoiding programmer lingo. My first
draft actually used 'asduckarray', but that's like an in-joke; it
works fine for us, but it's not really something I want teachers to
have to explain on day 1...

Array-like is problematic too though, because we still need a way to
say "thing that can be coerced to an array", which is what array-like
has been used to mean historically. And with the new type hints stuff,
it is actually becoming code. E.g. what should the type hints here be:

asabstractarray(a: X) -> Y

Right now "X" is "ArrayLike", but if we make "Y" be "ArrayLike" then
we'll need to come up with some other name for "X" :-).

Maybe we can call duck arrays "py arrays", since the idea is that they
implement the standard Python array API (but not necessarily the
C-level array API)? np.PyArray, np.aspyarray()?

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] New NEP: merging multiarray and umath

2018-03-12 Thread Nathaniel Smith
On Mar 12, 2018 12:02, "Charles R Harris"  wrote:


If we accept this NEP, I'd like to get it done soon, preferably in the
next few months, so that it is finished before we drop Python 2.7 support.
That will make maintenance of the NumPy long term support release through
2019 easier.


The reason you're seeing this spurt of activity on NEPs and NEP
infrastructure from people at Berkeley is that we're preparing for the
upcoming arrival of full time devs on the numpy grant. (More announcements
there soon.) So if it's accepted then I don't think there will be any
problem getting it implemented by then.

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] PR to add a function to calculate histogram edges without calculating the histogram

2018-03-15 Thread Nathaniel Smith
Instead of an nobs argument, maybe we should have a version that accepts
multiple data sets, so that we have the full information and can improve
the algorithm over time.
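
Just to sketch what I mean (name, signature, and strategy all up for grabs;
this assumes the new edges function from the PR ends up as
np.histogram_bin_edges):

    import numpy as np

    def histogram_bin_edges_multi(datasets, bins="auto", range=None):
        # Naive version: pick one common set of edges from the pooled data.
        # Having the full list of datasets as input is what leaves room to
        # do something smarter than simple pooling later on.
        combined = np.concatenate([np.ravel(d) for d in datasets])
        return np.histogram_bin_edges(combined, bins=bins, range=range)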

On Mar 15, 2018 7:57 PM, "Thomas Caswell"  wrote:

> Yes I like the name.
>
> The primary use-case for Matplotlib is that our `hist` method can take in
> a list of arrays and produces N histograms in one shot. Currently with
> 'auto' we only use the first data set to sort out what the bins should be
> and then re-use those for the rest of the data sets.  This will let us get
> the bins on the merged input, but I take Josef's point that this is not
> actually what we want
>
> Tom
>
> On Mon, Mar 12, 2018 at 11:35 PM  wrote:
>
>> On Mon, Mar 12, 2018 at 11:20 PM, Eric Wieser
>>  wrote:
>> >> Given that the bin selection are data driven, transferring them across
>> datasets might not be so useful.
>> >
>> > The main application would be to compute bins across the union of all
>> > datasets. This is already possibly by using `np.histogram` and
>> > discarding the first result, but that's super wasteful.
>>
>> assuming "union" means a combined dataset.
>>
>> If you stack  datasets, then the number of observations will not be
>> correct for individual datasets.
>>
>> In that case an additional keyword like nobs, or whatever name would
>> be appropriate for numpy, would be useful, e.g. use the average number
>> of observations across datasets.
>> Auxiliary statistic like std could then be computed on the total
>> dataset (if that makes sense, which would not be the case if the
>> variance across datasets is larger than the variance within datasets.
>>
>> Josef
>>
>> > ___
>> > NumPy-Discussion mailing list
>> > NumPy-Discussion@python.org
>> > https://mail.python.org/mailman/listinfo/numpy-discussion
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] PR to add a function to calculate histogram edges without calculating the histogram

2018-03-16 Thread Nathaniel Smith
Oh sure, I'm not suggesting it be impossible to calculate for a single data
set. If nothing else, if we had a version that accepted a list of data
sets, then you could always pass in a single-element list :-).

On Mar 15, 2018 22:10, "Eric Wieser"  wrote:

> That sounds like a reasonable extension - but I think there still exist
> cases where you want to treat the data as one uniform set when computing
> bins (toggling between orthogonal subsets of data) so isn't really a useful
> replacement.
>
> I suppose this becomes relevant when `density` is passed to the individual
> histogram invocations. Does matplotlib handle that correctly for stacked
> histograms?
>
> On Thu, Mar 15, 2018, 20:14 Nathaniel Smith  wrote:
>
>> Instead of an nobs argument, maybe we should have a version that accepts
>> multiple data sets, so that we have the full information and can improve
>> the algorithm over time.
>>
>> On Mar 15, 2018 7:57 PM, "Thomas Caswell"  wrote:
>>
>>> Yes I like the name.
>>>
>>> The primary use-case for Matplotlib is that our `hist` method can take
>>> in a list of arrays and produces N histograms in one shot. Currently with
>>> 'auto' we only use the first data set to sort out what the bins should be
>>> and then re-use those for the rest of the data sets.  This will let us get
>>> the bins on the merged input, but I take Josef's point that this is not
>>> actually what we want
>>>
>>> Tom
>>>
>>> On Mon, Mar 12, 2018 at 11:35 PM  wrote:
>>>
>>>> On Mon, Mar 12, 2018 at 11:20 PM, Eric Wieser
>>>>  wrote:
>>>> >> Given that the bin selection are data driven, transferring them
>>>> across datasets might not be so useful.
>>>> >
>>>> > The main application would be to compute bins across the union of all
>>>> > datasets. This is already possibly by using `np.histogram` and
>>>> > discarding the first result, but that's super wasteful.
>>>>
>>>> assuming "union" means a combined dataset.
>>>>
>>>> If you stack  datasets, then the number of observations will not be
>>>> correct for individual datasets.
>>>>
>>>> In that case an additional keyword like nobs, or whatever name would
>>>> be appropriate for numpy, would be useful, e.g. use the average number
>>>> of observations across datasets.
>>>> Auxiliary statistic like std could then be computed on the total
>>>> dataset (if that makes sense, which would not be the case if the
>>>> variance across datasets is larger than the variance within datasets.
>>>>
>>>> Josef
>>>>
>>>> > ___
>>>> > NumPy-Discussion mailing list
>>>> > NumPy-Discussion@python.org
>>>> > https://mail.python.org/mailman/listinfo/numpy-discussion
>>>> ___
>>>> NumPy-Discussion mailing list
>>>> NumPy-Discussion@python.org
>>>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>>>
>>>
>>> ___
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion@python.org
>>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>>
>>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] new NEP: np.AbstractArray and np.asabstractarray

2018-03-22 Thread Nathaniel Smith
On Sat, Mar 10, 2018 at 4:27 AM, Matthew Rocklin  wrote:
> I'm very glad to see this discussion.
>
> I think that coming up with a single definition of array-like may be
> difficult, and that we might end up wanting to embrace duck typing instead.
>
> It seems to me that different array-like classes will implement different
> mixtures of features.  It may be difficult to pin down a single definition
> that includes anything except for the most basic attributes (shape and
> dtype?).  Consider two extreme cases of restrictive functionality:
>
> LinearOperators (support dot in a numpy-like way)
> Storage objects like h5py (support getitem in a numpy-like way)
>
> I can imagine authors of both groups saying that they should qualify as
> array-like because downstream projects that consume them should not convert
> them to numpy arrays in important contexts.

I think this is an important point -- there are a lot of subtleties in
the interfaces that different objects might want to provide. Some
interesting ones that haven't been mentioned:

- a "duck array" that has everything except fancy indexing
- xarray's arrays are just like numpy arrays in most ways, but they
have incompatible broadcasting semantics
- immutable vs. mutable arrays

When faced with this kind of situation, it's always tempting to try to
write down some classification system to capture every possible
configuration of interesting behavior. In fact, this is one of the
most classic nerd snipes; it's been catching people for literally
thousands of years [1]. Most of these attempts fail though :-).

So let's back up -- I probably erred in not making this more clear in
the NEP, but I actually have a fairly concrete use case in mind here.
What happened is, I started working on a NEP for
__array_concatenate__, and my thought pattern went as follows:

1) Cool, this should work for np.concatenate.
2) But what about all the other variants, like np.row_stack. We don't
want __array_row_stack__; we want to express row_stack in terms of
concatenate.
3) Ok, what's row_stack? It's:
  np.concatenate([np.atleast_2d(arr) for arr in arrs], axis=0)
4) So I need to make atleast_2d work on duck arrays. What's
atleast_2d? It's: asarray + some shape checks and indexing with
newaxis
5) Okay, so I need something atleast_2d can call instead of asarray [2].
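
(A rough sketch of steps 4/5; "asabstractarray" is the proposed
duck-array-preserving coercion, stubbed out with np.asarray here so the
snippet runs as-is:)

    import numpy as np

    asabstractarray = np.asarray   # stand-in; the real thing would pass duck arrays through

    def duck_atleast_2d(*arys):
        # Same shape logic as np.atleast_2d, but coercing with
        # asabstractarray instead of np.asarray.
        res = []
        for ary in arys:
            ary = asabstractarray(ary)
            if ary.ndim == 0:
                ary = ary.reshape(1, 1)
            elif ary.ndim == 1:
                ary = ary[np.newaxis, :]
            res.append(ary)
        return res[0] if len(res) == 1 else res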

And this kind of pattern shows up everywhere inside numpy, e.g. it's
the first thing inside lots of functions in np.linalg b/c they do some
futzing with dtypes and shape before delegating to ufuncs, it's the
first thing the mean() function does b/c it needs to check arr.dtype
before proceeding, etc. etc.

So, we need something we can use in these functions as a first step
towards unlocking the use of duck arrays in general. But we can't
realistically go through each of these functions, make an exact list
of all the operations/attributes it cares about, and then come up with
exactly the right type constraint for it to impose at the top. And
these functions aren't generally going to work on LinearOperators or
h5py datasets anyway.

We also don't want to go through every function in numpy and add new
arguments to control this coercion behavior.

What we can do, at least to start, is to have a mechanism that passes
through objects that aspire to be "complete" duck arrays, like dask
arrays or sparse arrays or astropy's unit arrays, and then if it turns
out that in practice people find uses for finer-grained distinctions,
we can iteratively add those as a second pass. Notice that if a
function starts out requiring a "complete" duck array, and then later
relaxes that to accept "partial" duck arrays, that's actually
increasing the domain of objects that it can act on, so it's a
backwards-compatible change that we can do later.

So I think we should start out with a concept of "duck array" that's
fairly strong but a bit vague on the exact details (e.g.,
dask.array.Array is currently missing some weird things like arr.ptp()
and arr.tolist(), I guess because no-one has ever noticed or cared?).



Thinking things through like this, I also realized that this proposal
jumps through hoops to avoid changing np.asarray itself, because I was
nervous about changing the rule that its output is always an
ndarray... but actually, this is currently the rule for most functions
in numpy, and the whole point of this proposal is to relax that rule
for most functions, in cases where the user is explicitly passing in a
duck-array object. So maybe I'm being overparanoid? I'm genuinely
unsure here.

Instead of messing about with ABCs, an alternative mechanism would be
to add a new method __arrayish__ (hat tip to Tom Caswell for the name
:-)), that essentially acts as an override for Python-level calls to
np.array / np.asarray, in much the same way that __array_ufunc__
overrides ufuncs, etc. (C level calls to PyArray_FromAny and similar
would of course continue to return ndarray objects, and I assume we'd
add some argument like require_nda

Re: [Numpy-discussion] round(numpy.float64(0.0)) is a numpy.float64

2018-03-26 Thread Nathaniel Smith
Even knowing that, it's still confusing that round(np.float64(0.0))
isn't the same as round(0.0). The reason is a Python 2 / Python 3
thing: in Python 2, round returns a float, while on Python 3, it
returns an integer – but numpy still uses the python 2 behavior
everywhere.

I'm not sure if it's possible or worthwhile to change this. If we'd
changed it when we first added python 3 support then it would have
been easy (and obviously a good idea), but at this point it might be
tricky?

-n

On Thu, Mar 22, 2018 at 12:32 PM, Nathan Goldbaum  wrote:
> numpy.float is an alias to the python float builtin.
>
> https://github.com/numpy/numpy/issues/3998
>
>
> On Thu, Mar 22, 2018 at 2:26 PM Olivier  wrote:
>>
>> Hello,
>>
>>
>> Is it normal, expected and desired that :
>>
>>
>>   round(numpy.float64(0.0)) is a numpy.float64
>>
>>
>> while
>>
>>   round(numpy.float(0.0)) is an integer?
>>
>>
>> I find it disturbing and misleading. What do you think? Has it already
>> been
>> discussed somewhere else?
>>
>>
>> Best regards,
>>
>>
>> Olivier
>>
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>



-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] round(numpy.float64(0.0)) is a numpy.float64

2018-03-26 Thread Nathaniel Smith
On Mon, Mar 26, 2018 at 6:24 PM, Nathaniel Smith  wrote:
> Even knowing that, it's still confusing that round(np.float64(0.0))
> isn't the same as round(0.0). The reason is a Python 2 / Python 3
> thing: in Python 2, round returns a float, while on Python 3, it
> returns an integer – but numpy still uses the python 2 behavior
> everywhere.
>
> I'm not sure if it's possible or worthwhile to change this. If we'd
> changed it when we first added python 3 support then it would have
> been easy (and obviously a good idea), but at this point it might be
> tricky?

Oh right, and I forgot: part of the reason it's tricky is that it
really would have to return a Python 'int', *not* any of numpy's
integer types, because floats have a much larger range than numpy
integers, e.g.:

In [4]: round(1e50)
Out[4]: 17629769841091887003294964970946560

In [5]: round(np.float64(1e50))
Out[5]: 1e+50

In [6]: np.uint64(round(np.float64(1e50)))
Out[6]: 0

(Actually that last case illustrates another weird inconsistency:
np.uint64(1e50) -> OverflowError, but np.uint64(np.float64(1e50)) ->
0. I have no idea what's going on there.)

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Turn numpy.ones_like into a ufunc

2018-05-18 Thread Nathaniel Smith
I would like to see a plan for how we're going to handle zeros_like,
empty_like, ones_like, and full_like before we start making changes to any
of them.

On Fri, May 18, 2018, 05:33 Matthew Rocklin  wrote:

> Hi All,
>
> I would like to see the numpy.ones_like function operate as a ufunc.
> This is currently done in np.core.umath._ones_like.  This was recently
> raised and discussed in https://github.com/numpy/numpy/issues/11074 .  It
> was suggested that I raise the topic here instead.
>
> My understanding is that this was considered some time ago, but that the
> current numpy.ones_like function was implemented instead.  No one on that
> issue seems to fully remember why.  Perhaps someone here has a longer
> memory?
>
> My objective for defaulting to the ufunc implementation is that it makes
> it compatible with other projects that implement numpy-like interfaces
> (dask.array, sparse, cupy) so that downstream projects can use a subset of
> numpy code that is valid across a few projects.  More broadly I would like
> to see ufuncs and other protocol-enabled functions start to become more
> common within numpy, ones_like being one specific case.
>
> Best,
> -matt
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Efficiency of Numpy wheels and simple way to benchmark Numpy installation?

2018-05-27 Thread Nathaniel Smith
Performance is an incredibly multi-dimensional thing. Modern computers are
incredibly complex, with layers of interacting caches, different
microarchitectural features (do you have AVX2? does your cpu's branch
predictor interact in a funny way with your workload?), compiler
optimizations that vary from version to version, ... and different parts of
numpy are affected differently by all these things.

So, the only really reliable answer to a question like this is, always,
that you need to benchmark the application you actually care about in the
contexts where it will actually run (or as close as you can get to that).

That said, as a general rule of thumb, the main difference between
different numpy builds is which BLAS library they use, which primarily
affects the speed of numpy's linear algebra routines. The wheels on pypi
use either OpenBLAS (on Windows and Linux) or Accelerate (on macOS). The
conda packages provided as part of the Anaconda distribution normally use
Intel's MKL.

All three of these libraries are generally pretty good. They're all serious
attempts to make a blazing fast linear algebra library, and much much
faster than naive implementations. Generally MKL has a reputation for being
somewhat faster than the others, when there's a difference. But again,
whether this happens, or is significant, for *your* app is impossible to
say without trying it.
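
If you just want a quick sanity check of which BLAS your numpy is linked
against, and roughly how fast its linear algebra is, something like this is
usually enough:

    import time
    import numpy as np

    np.show_config()            # shows which BLAS/LAPACK numpy was built with

    a = np.random.rand(2000, 2000)
    start = time.perf_counter()
    a @ a                       # a BLAS-bound matrix multiply
    print("matmul took", time.perf_counter() - start, "seconds")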

-n

On Sun, May 27, 2018, 08:32 PIERRE AUGIER <
pierre.aug...@univ-grenoble-alpes.fr> wrote:

> Hello,
>
> I don't know if it is a good place to ask such questions. As advised here
> https://www.scipy.org/scipylib/mailing-lists.html#stackoverflow, I first
> posted a question on stackoverflow:
>
>
> https://stackoverflow.com/questions/50475989/efficiency-of-numpy-wheels-and-simple-benchmark-for-numpy-installations
>
> Since I got no feedback, I try here. My questions are:
>
> - When we care about performance, is it a good practice to rely on wheels
> (especially for Numpy)? Will it be slower than using (for example) a conda
> built Numpy?
>
> - Are there simple commands to benchmark Numpy installations and get a
> good idea of their overall performance?
>
> I explain a little bit more in the stackoverflow question...
>
> Pierre Augier
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] matmul as a ufunc

2018-05-28 Thread Nathaniel Smith
On Mon, May 28, 2018 at 4:26 PM, Stephan Hoyer  wrote:
> On Mon, May 21, 2018 at 5:42 PM Matti Picus  wrote:
>>
>> - create a wrapper that can convince the ufunc mechanism to call
>> __array_ufunc__ even on functions that are not true ufuncs
>
>
> I am somewhat opposed to this approach, because __array_ufunc__ is about
> overloading ufuncs, and as soon as we relax this guarantee the set of
> invariants __array_ufunc__ implementors rely on becomes much more limited.
>
> We really should have another mechanism for arbitrary function overloading
> in NumPy (NEP to follow shortly!).
>
>>
>> - expand the semantics of core signatures so that a single matmul ufunc
>> can implement matrix-matrix, vector-matrix, matrix-vector, and
>> vector-vector multiplication.
>
>
> I was initially concerned that adding optional dimensions for gufuncs would
> introduce additional complexity for only the benefit of a single function
> (matmul), but I'm now convinced that it makes sense:
> 1. All other arithmetic overloads use __array_ufunc__, and it would be nice
> to keep @/matmul in the same place.
> 2. There's a common family of gufuncs for which optional dimensions like
> np.matmul make sense: matrix functions where 1D arrays should be treated as
> 2D row- or column-vectors.
>
> One example of this class of behavior would be np.linalg.solve, which could
> support vectors like Ax=b and matrices like Ax=B with the signature
> (m,m),(m,n?)->(m,n?). We couldn't immediately make np.linalg.solve a gufunc
> since it uses a subtly different dispatching rule, but it's the same
> use-case.

Specifically, np.linalg.solve uses a unique rule where

   solve(a, b)

assumes that b is a stack of vectors if (a.ndim - 1 == b.ndim), and
otherwise assumes that it's a stack of matrices. This is pretty
confusing. You'd think that solve(a, b) should be equivalent to
(inv(a) @ b), but it isn't.

Say a.shape == (10, 3, 3) and b.shape == (3,). Then inv(a) @ b works,
and does what you'd expect: for each of the ten 3x3 matrices in a, it
computes the inverse and multiplies it by the 1-d vector in b (treated
as a column vector). But solve(a, b) is an error, because the
dimensions aren't lined up to trigger the special handling for 1-d
vectors.
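
In code, that first case looks like:

    import numpy as np

    a = np.random.rand(10, 3, 3)
    b = np.random.rand(3)
    (np.linalg.inv(a) @ b).shape    # -> (10, 3), broadcasting as you'd expect
    np.linalg.solve(a, b)           # error: the 1-d vector case doesn't kick in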

Or, say a.shape == (10, 3, 3) and b.shape == (3, 3). Then again inv(a)
@ b works, and does what you'd expect: for each of the ten 3x3
matrices in a, it computes the inverse and multiplies it by the 3x3
matrix in b. But again solve(a, b) is an error -- this time because
the special handling for 1-d vectors *does* kick in, even though it
doesn't make sense: it tries to match up the ten 3x3 matrices in a
against the three one-dimensional vectors in b, and 10 != 3 so the
broadcasting fails.

This also points to even more confusing possibilities: if a.shape ==
(3, 3) or (3, 3, 3, 3) and b.shape == (3, 3), then inv(a) @ b and
solve(a, b) both work and do the same thing. But if a.shape == (3, 3,
3), then inv(a) @ b and solve(a, b) both work, and do totally
*different* things.

I wonder if we should deprecate these corner cases, and eventually
migrate to making inv(a) @ b and solve(a, b) the same in all
situations. If we did, then solve(a, b) would actually be a gufunc
with signature (m,m),(m,n?)->(m,n?).

I think the cases that would need to be changed are those where
(a.ndim - 1 == b.ndim and b.ndim > 1). My guess is that this happens
very rarely in existing code, especially since (IIRC) this behavior
was only added a few years ago, when we gufunc-ified numpy.linalg.

> Another example would be the "matrix transpose" function that has been
> occasionally proposed, to swap the last two dimensions of an array. It could
> have the signature (m?,n)->(n,m?), which ensure that it is still well
> defined (as the identity) on 1d arrays.

Unfortunately I don't think we could make "broadcasting matrix
transpose" be literally a gufunc, since it should return a view. But I
guess there'd still be some value in having the notation available
just when talking about it, so we could say "this operation is *like*
a gufunc with signature (m?,n)->(n,m?), except that it returns a
view".

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] matmul as a ufunc

2018-05-29 Thread Nathaniel Smith
On Mon, May 28, 2018, 20:41 Stephan Hoyer  wrote:

> On Mon, May 28, 2018 at 7:36 PM Eric Wieser 
> wrote:
>
>> which ensures that it is still well defined (as the identity) on 1d
>> arrays.
>>
>> This strikes me as a bad idea. There’s already enough confusion from
>> beginners that array_1d.T is a no-op. If we introduce a
>> matrix-transpose, it should either error on <1d inputs with a useful
>> message, or insert the extra dimension. I’d favor the former.
>>
> To be clear: matrix transpose is an example use-case rather than a serious
> proposal in this discussion.
>
> But given that idiomatic NumPy code uses 1D arrays in favor of explicit
> row/column vectors with shapes (1,n) and (n,1), I do think it does make
> sense for matrix transpose on 1D arrays to be the identity, because matrix
> transpose should convert back and forth between row and column vectors
> representations.
>

More concretely, I think the idea is that if you write code like

  a.T @ a

then it's nice if that automatically works for both 2d and 1d arrays.
Especially, say, if this is embedded inside a larger function so you have
some responsibilities to your users to handle different inputs
appropriately, and your users expect that to include both 2d matrices and
1d vectors. It reduces special cases.

But, on the other hand, if you write

a @ a.T

then you'll be in for a surprise... So maybe it's not a great idea after
all.

(Note that here I'm using .T as a placeholder for a hypothetical
"broadcasting matrix transpose". I don't think anyone proposes that .T
itself should be changed to do this; I just needed some notation.)

-n

>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs

2018-05-30 Thread Nathaniel Smith
On Wed, May 30, 2018 at 11:14 AM, Marten van Kerkwijk
 wrote:
> Hi All,
>
> Following on a PR combining the ability to provide fixed and flexible
> dimensions [1] (useful for, e.g., 3-vector input with a signature like
> `(3),(3)->(3)`, and for `matmul`, resp.; based on earlier PRs by Jaime
> [2] and Matt (Picus) [3]), I've now made a PR with a further
> enhancement, which allows one to indicate that a core dimension can
> be broadcast [4].
>
> A particular use case is `all_equal`, a new function suggested in a
> stalled PR by Matt (Harrigan) [5], which compares two arrays
> axis-by-axis, but short-circuits if a non-equality is found (unlike
> what is the case if one does `(a==b).all(axis)`). One thing that would
> be obviously useful for a routine like `all_equal` is to be able to
> provide an array as one argument and a constant as another, i.e., if
> the core dimensions can be broadcast if needed, just like they are in
> `(a==b).all(axis)`. This is currently not possible: with its signature
> of `(n),(n)->()`, the two arrays have to have the same trailing size.
>
> My PR provides the ability to indicate in the signature that a core
> dimension can be broadcast, by using a suffix of "|1". Thus, the
> signature of `all_equal` would become:
>
> ```
> (n|1),(n|1)->()
> ```
>
> Comments most welcome (yes, even on the notation - though I think it
> is fairly self-explanatory)!

I'm currently -0.5 on both fixed dimensions and this broadcasting
dimension idea. My reasoning is:

- The use cases seem fairly esoteric. For fixed dimensions, I guess
the motivating example is cross-product (are there any others?). But
would it be so bad for a cross-product gufunc to raise an error if it
receives the wrong number of dimensions? For this broadcasting case...
well, obviously we've survived this long without all_equal :-). And
there's something funny about all_equal, since it's really smushing
together two conceptually separate gufuncs for efficiency. Should we
also have all_less_than, sum_square, ...? If this is a big problem,
then wouldn't it be better to solve it in a general way, like dask or
Numba or numexpr do? To be clear, I'm not saying these features are
necessarily *bad* ideas, in isolation -- just that the benefits aren't
very convincing, and there are trade-offs, like:

- When it comes to the core ufunc machinery, we have a limited
complexity budget. I'm nervous that if we add too many bells and
whistles, we'll end up writing ourselves into a corner where we have
trouble maintaining it, where it becomes difficult to predict how
different features interact, it becomes increasingly difficult for
third-parties to handle all the different features in their
__array_ufunc__ methods...

- And, we have a lot of other demands on the core ufunc machinery,
that might be better places to spend our limited complexity budget.
For example, can we come up with an extension to make np.sort a
gufunc? That seems like a much higher priority than figuring out how
to make all_equal a gufunc. What about refactoring the ufunc machinery
to support user-defined dtypes? That'll need some serious work, and
again, it's probably higher priority than supporting cross-product or
all_equal directly (or at least it seems that way to me).

Maybe there are more compelling use cases that I'm missing, but as it
is, I feel like trying to add too many features to the current ufunc
machinery is pretty risky for future maintainability, and we shouldn't
do it without really solid use cases.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs

2018-05-31 Thread Nathaniel Smith
On Thu, May 31, 2018 at 4:20 AM, Marten van Kerkwijk
 wrote:
> Hi Nathaniel,
>
> I think the case for frozen dimensions is much more solid than just
> `cross1d` - there are many operations that work on size-3 vectors.
> Indeed, as I noted in the PR, I have just been wrapping a
> Standards-of-Astronomy library in gufuncs, and many of its functions
> require size-3 vectors or 3x3 matrices [1]. Of course, I can put
> checks on the sizes, and I've now done that in a custom type resolver
> (which I needed anyway since, as you say, user dtypes is currently not
> easy), but there is a real problem for functions that take scalars and
> produce vectors: with a signature like `(),()->(n)`, I am forced to
> pass in an output with size 3, which is very inconvenient (especially
> if I then also want to override with `__array_ufunc__` - now my
> Quantity implementation also has to start changing an output already
> put in. So, having frozen dimensions is definitely helpful for
> developers of new gufuncs.

Ah, this does sound like I'm missing something. I suspect this is a
situation where we have two problems:

- For some people the use cases are everyday and obvious; for others
they're things we've never heard of (what's a "standard of
astronomy"?)
- The discussion is scattered around mailing list posts, old comments
on random github issues, etc.

This makes it hard for everyone to be on the same page. But this is
exactly the situation where NEPs are useful. Maybe you could write up
a short NEP for frozen dimensions? It doesn't need to be fancy or take
long, but I think it'd be useful to have a single piece of text we can
all look at that describes the use cases and how frozen dimensions
help.

BTW, regarding output shape: as you hint, there's a similar problem
with parametrized dtypes in general. Consider defining a loop for
np.add that lets it concatenate strings. If the inputs are S4 and S5,
then the output should be S9 – but how does the ufunc machinery know
that? This suggests that when we do the big refactor to ufuncs to
support user-defined and parametrized dtypes in general, one of the
things we'll need is a way for an individual loop to select the output
dtype. One natural way to do this would be to have two callbacks per
loop: one that receives the input dtypes, and returns the output
dtypes, and then the other that's like the current loop callback that
actually performs the operation. Output shape feels very similar to
output dtype to me, so maybe the general way to handle this would be
to make the first callback take the input shapes+dtypes and return the
desired output shapes+dtypes? Maybe frozen dimensions are a good idea
regardless, but just wanted to put that out there since it might be a
more general solution.
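
(To make the two-callback idea a bit more concrete, here is a rough Python
sketch; the registration layout and the names string_add_resolve and
string_add_loop are purely hypothetical, not an existing NumPy API:)

    import numpy as np

    def string_add_resolve(in_dtypes, in_shapes):
        # First callback: pick the output dtype and shape from the inputs.
        # Concatenating S4 and S5 strings needs an S9 output.
        out_dtype = np.dtype('S%d' % sum(dt.itemsize for dt in in_dtypes))
        out_shape = np.broadcast_shapes(*in_shapes)
        return (out_dtype,), (out_shape,)

    def string_add_loop(inputs, outputs):
        # Second callback: the element-wise work, once dtypes and shapes
        # have been settled.
        a, b = inputs
        outputs[0][...] = np.char.add(a, b)

    # Hypothetical per-loop registration: two callbacks instead of one.
    hypothetical_loops = {('S', 'S'): (string_add_resolve, string_add_loop)}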

> Furthermore, with frozen dimensions, the signature is not just
> immediately clear, `(),()->(3)` for the example above, it is also
> better in telling users about what a function does.
>
> Indeed, I think this addition has much more justification than the `?`
> which is much more complex than the fixed size, yet neither
> particularly clear nor useful beyond the single purpose of matmul. (It
> is just that that single purpose has fairly high weight...)

Yeah, that's why I'm not 100% happy with '?' either (even though I
proposed it in the first place :-)). But matmul is like, arguably the
single most important operation in numpy, so it can justify a lot
more...

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP: Random Number Generator Policy

2018-06-11 Thread Nathaniel Smith
On Sun, Jun 10, 2018 at 11:53 PM, Ralf Gommers  wrote:
>
> On Sun, Jun 10, 2018 at 11:15 PM, Robert Kern  wrote:
>>
>> The intention of this code is to shuffle two same-length sequences in the
>> same way. So now if I write my code well to call np.random.seed() once at
>> the start of my program, this function comes along and obliterates that with
>> a fixed seed just so it can reuse the seed again to replicate the shuffle.
>
>
> Yes, that's a big no-no. There are situations conceivable where a library
> has to set a seed, but I think the right pattern in that case would be
> something like
>
> old_state = np.random.get_state()
> np.random.seed(some_int)
> do_stuff()
> np.random.set_state(old_state)

This will seem to work fine in testing, and then when someone tries to
use your library in a multithreaded program everything will break in
complicated and subtle ways :-(. I really don't think there's any
conceivable situation where a library (as opposed to an application)
can correctly use the global random state.
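
(For completeness, the usual way for a library to get a reproducible paired
shuffle without ever touching the global state is a local generator, roughly
along these lines:)

    import numpy as np

    def shuffle_together(a, b, seed=None):
        # A local RandomState: seeding or drawing from it never affects the
        # global np.random state, so it stays safe in multithreaded programs.
        rng = np.random.RandomState(seed)
        perm = rng.permutation(len(a))
        return a[perm], b[perm]

    x, y = np.arange(5), np.arange(5) * 10
    print(shuffle_together(x, y, seed=42))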

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Proposal to accept NEP 15: Merging multiarray and umath

2018-06-29 Thread Nathaniel Smith
Hi all,

I propose that we accept NEP 15: Merging multiarray and umath:

http://www.numpy.org/neps/nep-0015-merge-multiarray-umath.html

The core part of this proposal was uncontroversial. The main point of
discussion was whether it was OK to deprecate set_numeric_ops, or
whether it had some legitimate use cases. The conclusion was that in
all the cases where set_numeric_ops is useful,
PyUFunc_ReplaceLoopBySignature is a strictly better alternative, so
there's no reason not to deprecate set_numeric_ops. So at this point I
think the whole proposal is uncontroversial, and we can go ahead and
accept it.

If there are no substantive objections within 7 days from this email,
then the NEP will be accepted; see NEP 0 for more details:
http://www.numpy.org/neps/nep-.html

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal to accept NEP 15: Merging multiarray and umath

2018-06-29 Thread Nathaniel Smith
Note that this is the first formal proposal to accept a NEP using our
new process (yay!). While writing it I realized that the current text
about this in NEP 0 is a bit terse, so I've also just submitted a PR
to expand that section:

https://github.com/numpy/numpy/pull/11459

-n

On Fri, Jun 29, 2018 at 3:18 PM, Nathaniel Smith  wrote:
> Hi all,
>
> I propose that we accept NEP 15: Merging multiarray and umath:
>
> http://www.numpy.org/neps/nep-0015-merge-multiarray-umath.html
>
> The core part of this proposal was uncontroversial. The main point of
> discussion was whether it was OK to deprecate set_numeric_ops, or
> whether it had some legitimate use cases. The conclusion was that in
> all the cases where set_numeric_ops is useful,
> PyUFunc_ReplaceLoopBySignature is a strictly better alternative, so
> there's no reason not to deprecate set_numeric_ops. So at this point I
> think the whole proposal is uncontroversial, and we can go ahead and
> accept it.
>
> If there are no substantive objections within 7 days from this email,
> then the NEP will be accepted; see NEP 0 for more details:
> http://www.numpy.org/neps/nep-.html
>
> -n
>
> --
> Nathaniel J. Smith -- https://vorpus.org



-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal to accept NEP 15: Merging multiarray and umath

2018-06-29 Thread Nathaniel Smith
On Fri, Jun 29, 2018 at 3:28 PM, Marten van Kerkwijk
 wrote:
> Agreed on accepting the NEP! But it is not the first proposal to accept
> under the new rules - that goes to the broadcasting NEP (though perhaps I
> wasn't sufficiently explicit in stating that I was starting a
> count-down...). -- Marten

Oh sorry, I missed that! (Which I guess is some evidence in favor of
starting a new thread :-).)

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: Allowing broadcasting of core dimensions in generalized ufuncs

2018-07-03 Thread Nathaniel Smith
On Sat, Jun 30, 2018 at 6:51 AM, Marten van Kerkwijk
 wrote:
> Hi All,
>
> In case it was missed because people have tuned out of the thread: Matti and
> I proposed last Tuesday to accept NEP 20 (on coming Tuesday, as per NEP 0),
> which introduces notation for generalized ufuncs allowing fixed, flexible
> and broadcastable core dimensions. For one thing, this will allow Matti to
> finish his work on making matmul a gufunc.
>
> See http://www.numpy.org/neps/nep-0020-gufunc-signature-enhancement.html

So I still have some of the same concerns as before...

For the possibly missing dimensions: matmul is really important, and
making it a gufunc solves the problem of making it overridable by duck
arrays (via __array_ufunc__). Also, it will help later when we rework
dtypes: new dtypes will be able to implement matmul by the normal
ufunc loop registration mechanism, which is much nicer than the
current system where every dtype has a special-case method just for
handling matmul. The ? proposal isn't the most elegant idea ever, but
we've been tossing around ideas for solving these problems for a
while, and so far this seems to be the least-bad one, so... sure,
let's do it.
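
(For readers following along, the notation in question is the one NEP 20
proposes for matmul:)

    # '?' marks core dimensions that may be absent, so a single gufunc
    # covers matrix @ matrix, matrix @ vector, vector @ matrix and
    # vector @ vector with one signature.
    matmul_signature = '(n?,k),(k,m?)->(n?,m?)'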

For the fixed-size dimensions: this makes me nervous. It's aimed at a
real use case, which is a major point in its favor. But a few things
make me wary. For input dimensions, it's sugar – the gufunc loop can
already raise an error if it doesn't like the size it gets. For output
dimensions, it does solve a real problem. But... only part of it. It's
awkward that right now you only have a few limited ways to choose
output dimensions, but this just extends the list of special cases,
rather than solving the underlying problem. For example,
'np.linalg.qr' needs a much more generic mechanism to choose output
shape, and parametrized dtypes will need a much more generic mechanism
to choose output dtype, so we're definitely going to end up with some
phase where arbitrary code gets to describe the output array. Are we
going to look back on fixed-size dimensions as a quirky, redundant
thing?
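
(For reference, the frozen-dimension notation being discussed; the
cross-product signature is the example NEP 20 itself gives, the second is a
sketch along the lines of the angles-to-vector case mentioned earlier:)

    # Frozen core dimensions: the integer is a required size that the gufunc
    # machinery checks up front, rather than each inner loop re-checking it.
    cross1d_signature       = '(3),(3)->(3)'   # vector cross product (NEP 20 example)
    angles_to_xyz_signature = '(),()->(3)'     # two angles in, unit 3-vector out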

Also, as currently proposed, it seems to rule out the possibility of
using name-based axis specification in the future, right? (See
https://github.com/numpy/numpy/pull/8819#issuecomment-366329325) Are
we sure we want to do that?

If everyone else is comfortable with all these things then I won't
block it though.

For broadcasting: I'm sorry, but I think I'm -1 on this. I feel like
it falls into a classic anti-pattern in numpy, where someone sees a
cool thing they could do and then goes looking for problems to justify
it. (A red flag for me is that "it's easy to implement" keeps being
mentioned as justification for doing it.) The all_equal and
weighted_mean examples both feel pretty artificial -- traditionally
we've always implemented these kinds of functions as regular functions
that use (g)ufuncs internally, and it's worked fine (cf. np.allclose,
ndarray.mean). In fact in some sense the whole point of numpy is to
help people implement functions like this, without having to write
their own gufuncs. Is there some reason these need to be gufuncs? And
if there is, are these the only things that need to be gufuncs, or is
there a broader class we're missing? The design just doesn't feel
well-justified to me.
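
(For comparison, the kind of plain function built on existing ufuncs alluded
to here might look roughly like this:)

    import numpy as np

    def all_equal(x, y, axis=-1):
        # A regular function layered on existing ufuncs; broadcasting of the
        # non-core dimensions comes for free from np.equal and reduce.
        return np.logical_and.reduce(np.equal(x, y), axis=axis)

    a = np.arange(12).reshape(3, 4)
    print(all_equal(a, a))       # [ True  True  True]
    print(all_equal(a, a[:1]))   # broadcasts a single row against all rows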

And in the past, when we've implemented things like this, where the
use cases are thin but hey why not it's easy to do, it's ended up
causing two problems: first people start trying to force it into cases
where it doesn't quite work, which makes everyone unhappy... and then
when we eventually do try to solve the problem properly, we end up
having to do elaborate workarounds to keep the old not-quite-working
use cases from breaking.

I'm pretty sure we're going to end up rewriting most of the ufunc code
over the next few years as we ramp up duck array and user dtype
support, and it's already going to be very difficult, both to design
in the first place and then to implement while carefully keeping shims
to keep all the old stuff working. Adding features has a very real
cost, because it adds extra constraints that all this future work will
have to work around. I don't think this meets the bar.

By the way, I also think we're getting well past the point where we
should be switching from a string-based DSL to a more structured
representation. (This is another trap that numpy tends to fall into...
the dtype "language" is also a major offender.) This isn't really a
commentary on any part of this in particular, but just something that
I've been noticing and wanted to mention :-).

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion

