Re: [Numpy-discussion] Fwd: [numfocus] Grants up to $3k available to NumFOCUS projects (sponsored & affiliated)

2017-03-27 Thread Chris Barker
On Mon, Mar 27, 2017 at 3:33 AM, Julian Taylor <
jtaylor.deb...@googlemail.com> wrote:

> - add ascii/latin1 dtype to support a compact python3 string array,
> deprecate 's' dtype which has different meaning in python2 and 3
> This one is probably too big for 3k though.
>

probably -- but not THAT big -- it seems pretty straightforward to me.

The bigger challenge is deciding what to do -- the bikeshedding -- and the
backward incompatibility issues. IIRC, when this came up on the list, there
was nothing like consensus on exactly what to do and how to do it.

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: [numfocus] Grants up to $3k available to NumFOCUS projects, (sponsored & affiliated)

2017-03-27 Thread Chris Barker
On Mon, Mar 27, 2017 at 12:14 PM, Pauli Virtanen  wrote:

> > The bigger challenge is deciding what to do -- the bikeshedding -- and
> > the backward incompatibility issues. IIRC, when this came up on the
> > list, there was nothing like consensus on exactly what to do and how
> > to do it.
>
> TBH, I don't see why 's' should be deprecated --- the operation is
> well-specified (byte strings + null stripping) and has the same meaning in
> python2 and 3.
>

exactly -- I don't think there was a consensus on this.


> Of course, a true 1-byte unicode subset string may be more useful type for
> some applications, so it could indeed be added.
>

That's the idea -- scientists tend to use a lot of ascii text (or at least
one-byte-per-char text), and numpy requires each element to be the same number
of bytes, so the unicode dtype is 4 bytes per char -- seemingly very
wasteful.

But if you use 's' on py3, you get bytestrings back -- not "text" from a
py3 perspective.

And aside from backwards compatibility, I see no reason for an 's' dtype
that returns a bytes object on py3 -- if it's really binary data, you can
use the 'b' dtype.

-CHB




Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Chris Barker
Thanks so much for reviving this conversation -- we really do need to
address this.

My thoughts:

> What people apparently want is a string type for Python3 which uses less
> memory for the common science use case which rarely needs more than
> latin1 encoding.
>

Yes -- I think there is a real demand for that.





> To please everyone I think we need to go with a dtype that supports
> multiple encodings via metadata, similar to how datetime supports
> multiple units.
> E.g.: 'U10[latin1]' are 10 characters in latin1 encoding
>

I wonder if we really need that -- as you say, there is real demand for a
compact string type, but for many use cases, 1 byte per character is
enough. So to keep things really simple, I think a single 1-byte-per-char
encoding would meet most people's needs.
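
(For reference, a quick sketch of the datetime64 precedent the quote refers
to -- the bracketed string-encoding syntax is hypothetical and only shown in
a comment, it is not an existing numpy API)::

    import numpy as np

    # datetime64 already parameterizes one dtype with a unit kept in metadata:
    per_ns = np.dtype('datetime64[ns]')   # nanosecond resolution
    per_day = np.dtype('M8[D]')           # day resolution, same 64-bit storage
    print(per_ns == per_day)              # False -- the unit is part of the dtype

    # The quoted proposal would do the same for strings, e.g. (hypothetical):
    #   np.dtype('U10[latin1]')   # 10 characters, one byte each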

What should that encoding be?

latin-1 is obvious (and has the very nice property of being able to
round-trip arbitrary bytes -- at least with Python's implementation) and
scientific data sets tend to use the latin alphabet (with its ascii roots
and all).

But there is now latin-9:

https://en.wikipedia.org/wiki/ISO/IEC_8859-15

Maybe a better option?

> Encodings we should support are:
> - latin1 (1 byte):
> it is compatible with ascii and adds extra characters used in the
> western world.
> - utf-32 (4 bytes):
> can represent every character, equivalent with np.unicode
>

IIUC, datetime64 is, well, always 64 bits. So it may be better to have a
given dtype always be the same bitwidth.

So the utf-32 dtype would be a different dtype, which also keeps it really
simple: we have a latin-* dtype and a full-on unicode dtype -- that's it.

> Encodings we should maybe support:
> - utf-16 with explicitly disallowing surrogate pairs (2 bytes):
> this covers a very large range of possible characters in a reasonably
> compact representation
>

I think UTF-16 is, quite simply, the worst of both worlds. If we want a
two-byte character set, then it should be UCS-2 -- i.e. explicitly
rejecting any code point that takes more than two bytes to represent (or
maybe that's what you mean by explicitly disallowing surrogate pairs). In
any case, it should certainly give you an encoding error if you try to pass
in a unicode character that cannot fit into two bytes.

So: is there actually a demand for this? If so, then I think it should be a
separate 2-byte string type, with the encoding always the same.


> - utf-8 (4 bytes):
> variable length encoding with minimum size of 1 bytes, but we would need
> to assume the worst case of 4 bytes so it would not save anything
> compared to utf-32 but may allow third parties replace an encoding step
> with trailing null trimming on serialization.
>

yeach -- utf-8 is great for interchange and streaming data, but not for
internal storage, particularly with numpy's every-item-has-the-same-number-of-bytes
requirement. So if someone wants to work with utf-8 they can store
it in a byte array, and encode and decode as they pass it to/from python.
That's going to have to happen anyway, even if under the hood. And it's
risky business -- if you truncate a utf-8 bytestring, you may get invalid
data -- it really does not belong in numpy.
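
(A tiny sketch of that risk -- chopping a utf-8 byte string at an arbitrary
byte offset can land in the middle of a multi-byte character)::

    s = "résumé"                  # 6 characters, 8 bytes in utf-8
    b = s.encode('utf-8')
    try:
        b[:7].decode('utf-8')     # cut in the middle of the final 'é'
    except UnicodeDecodeError as err:
        print("naive truncation produced invalid utf-8:", err)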


> - Add a new dtype, e.g. npy.realstring
>

I think that's the way to go. Backwards compatibility is really key. Though
could we make the existing string dtype an always-latin-1 type without
breaking too much? Or maybe deprecate it and get there in the future?

> It has the cosmetic disadvantage that it makes the np.unicode dtype
> obsolete and is more busywork to implement.
>

I think the np.unicode type should remain as the 4-bytes per char encoding.
But that only makes sense if you follow my idea that we don't have a
variable number of bytes per char dtype.

So my proposal is:

 - Create a new one-byte-per-char dtype that is always latin-9 encoded.
   - in python3 it would map to a string (i.e. unicode)
 - Keep the 4-byte-per-char unicode string type

Optionally (if there is really demand):
 - Create a new two-byte-per-char dtype that is always UCS-2 encoded.


Is there any way to leverage Python3's nifty string type? I'm thinking not.
At least not for numpy arrays that can play well with C code, etc.

All that being said, an encoding-specified string dtype would be nice too --
I just think it's more complex than it needs to be. Numpy is not the tool
for text processing...

-CHB





Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Chris Barker
On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald wrote:

> Is there any reason not to support all Unicode encodings that python does,
> with the same names and semantics? This would surely be the simplest to
> understand.
>

I think it should support all fixed-length encodings, but not the non-fixed
length ones -- they just don't fit well into the numpy data model.


> Also, if latin1 is going to be the only practical 8-bit encoding, maybe
> check with some non-Western users to make sure it's not going to wreck
> their lives? I'd have selected ASCII as an encoding to treat specially, if
> any, because Unicode already does that and the consequences are familiar.
> (I'm used to writing and reading French without accents because it's passed
> through ASCII, for example.)
>

latin-1 (or latin-9) only makes things better than ASCII -- it buys most of
the accented characters for the European languages and some symbols that are
nice to have (I use the degree symbol a lot...). And it is ASCII-compatible
-- so there is NO reason to choose ASCII over latin-*.

Which does no good for non-latin languages -- so we need to hear from the
community -- is there a substantial demand for a non-latin one-byte per
character encoding?


> Variable-length encodings, of which UTF-8 is obviously the one that makes
> good handling essential, are indeed more complicated. But is it strictly
> necessary that string arrays hold fixed-length *strings*, or can the
> encoding length be fixed instead? That is, currently if you try to assign a
> longer string than will fit, the string is truncated to the number of
> characters in the data type.
>

We could do that, yes, but an improperly truncated "string" becomes invalid
-- it just seems like a recipe for bugs that won't be found in testing.

Memory is cheap, compression is fast -- we really shouldn't get hung up on
this!

Note: if you are storing a LOT of text (though I have no idea why you would
use numpy for that anyway), then the memory size might matter, but then
semi-arbitrary truncation would probably matter, too.

I expect most text storage in numpy arrays is things like names of
datasets, ids, etc. -- not massive amounts of text -- so storage space
really isn't critical. But having an id or something unexpectedly truncated
could be bad.

I think practical experience has shown us that people do not handle "mostly
fixed length but once in a while not" text well -- see the nightmare of
UTF-16 on Windows. Granted, utf-8 is multi-byte far more often, so errors
are far more likely to be found in tests (why would you use utf-8 if all
your data are in ascii?). But still -- why invite hard-to-test-for errors?

Final point -- as Julian suggests, one reason to support utf-8 is for
interoperability with other systems -- but that makes errors more of an
issue -- if it doesn't pass through the numpy truncation machinery, invalid
data could easily get put in a numpy array.

-CHB

> it would allow UTF-8 to be used just the way it usually is - as an
> encoding that's almost 8-bit.
>

ouch! that perception is the route to way too many errors! it is by no
means almost 8-bit, unless your data are almost ascii -- in which case, use
latin-1 for pity's sake!

This highlights my point though -- if we support UTF-8, people WILL use it,
and only test it with mostly-ascii text, and not find the bugs that will
crop up later.

> All this said, it seems to me that the important use cases for string
> arrays involve interaction with existing binary formats, so people who have
> to deal with such data should have the final say. (My own closest approach
> to this is the FITS format, which is restricted by the standard to ASCII.)
>

Yup -- not sure we'll get much guidance here though -- netcdf does not solve
this problem well, either.

But if you are pulling, say, a utf-8 encoded string out of a netcdf file --
it's probably better to pull it out as bytes and pass it through the python
decoding/encoding machinery than to paste the bytes straight into a numpy
array and hope that the encoding and truncation are correct.
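
(A sketch of that decode-on-the-way-in approach using numpy's existing
machinery -- the byte values here are made up for illustration)::

    import numpy as np

    # pretend these bytes came out of a file variable declared as "characters"
    raw = np.array([b'temp \xc2\xb0C', b'salinity'], dtype='S10')

    # decode explicitly through Python's codec machinery into a unicode array
    text = np.char.decode(raw, 'utf-8')
    print(text)         # ['temp °C' 'salinity']
    print(text.dtype)   # a '<U...' dtype -- a proper unicode array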

-CHB




Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Chris Barker
On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer  wrote:

> I agree with Anne here. Variable-length encoding would be great to have,
> but even fixed length UTF-8 (in terms of memory usage, not characters)
> would solve NumPy's Python 3 string problem. NumPy's memory model needs a
> fixed size per array element, but that doesn't mean we need a fixed size
> per character. Each element in a UTF-8 array would be a string with a fixed
> number of codepoints, not characters.
>

Ah, yes -- the nightmare of Unicode!

No, it would not be a fixed number of codepoints -- it would be a fixed
number of bytes (or "code units"), and an unknown number of characters.

As Julian pointed out, if you wanted to specify that a numpy element would
be able to hold, say, N characters (actually code points, combining
characters make this even more confusing) then you would need to allocate
N*4 bytes to make sure you could hold any string that long. Which would be
pretty pointless -- better to use UCS-4.

So Anne's suggestion that numpy truncates as needed would make sense --
you'd specify say N characters, numpy would arbitrarily (or user specified)
over-allocate, maybe N*1.5 bytes, and you'd truncate if someone passed in a
string that didn't fit. Then you'd need to make sure you truncated
correctly, so as not to create an invalid string (that's just code, it
could be made correct).

But how much to over-allocate? For english text, with an occasional
scientific symbol, only a little. For, say, Japanese text, you'd need a
factor of 2, maybe?

Anyway, the idea that "just use utf-8" solves your problems is really
dangerous. It simply is not the right way to handle text if:

 - you need fixed-length storage
 - you care about compactness

> In fact, we already have this sort of distinction between element size and
> memory usage: np.string_ uses null padding to store shorter strings in a
> larger dtype.
>

sure -- but it is clear to the user that the dtype can hold "up to this
many" characters.


> The only reason I see for supporting encodings other than UTF-8 is for
> memory-mapping arrays stored with those encodings, but that seems like a
> lot of extra trouble for little gain.
>

I see it the other way around -- the only reason TO support utf-8 is for
memory mapping with other systems that use it :-)

On the other hand, if we ARE going to support utf-8 -- maybe use it for
all unicode support, rather than messing around with all the multiple
encoding options.

I think a 1-byte-per-char latin-* encoded string is a good idea though --
scientific use tends to be latin-only and space-constrained.

All that being said, if the truncation code were carefully written, it
would mostly "just work"

-CHB




Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Chris Barker
On Thu, Apr 20, 2017 at 10:36 AM, Neal Becker  wrote:

> I'm no unicode expert, but can't we truncate unicode strings so that only
> valid characters are included?
>

Sure -- it's just a bit fiddly -- and you need to make sure that everything
gets passed through the proper mechanism. Numpy is all about folks using
other code to mess with the bytes in a numpy array, so we can't expect that
all numpy string arrays will have been created with numpy code.

Does python's string have a truncated-encode option? I.e., you don't want to
encode to utf-8 and then just chop it off.
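
(As far as I know there is no single "encode at most N bytes" call; a common
workaround -- just a sketch -- is to encode, slice, and decode with
errors='ignore' so any trailing partial character is dropped)::

    def truncate_utf8(s, max_bytes):
        """Longest prefix of s whose utf-8 encoding fits in max_bytes."""
        b = s.encode('utf-8')[:max_bytes]
        # errors='ignore' silently drops the partial multi-byte sequence,
        # if any, left dangling at the end of the slice
        return b.decode('utf-8', errors='ignore')

    print(truncate_utf8("résumé", 7))   # 'résum' -- 6 valid bytes, not 7 broken ones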

-CHB




Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-21 Thread Chris Barker
I just re-read the "UTF-8 Everywhere" manifesto, and it helped me clarify my thoughts:

1) most of it is focused on utf-8 vs utf-16. And that is a strong argument
-- utf-16 is the worst of both worlds.

2) it isn't really addressing how to deal with fixed-size string storage as
needed by numpy.

It does bring up Python's current approach to Unicode:

"""
This lead to software design decisions such as Python’s string O(1) code
point access. The truth, however, is that Unicode is inherently more
complicated and there is no universal definition of such thing as *Unicode
character*. We see no particular reason to favor Unicode code points over
Unicode grapheme clusters, code units or perhaps even words in a language
for that.
"""

My thoughts on that -- it's technically correct, but practicality beats
purity, and the character concept is pretty darn useful for at least some
(commonly used in the computing world) languages.

In any case, whether the top-level API is character focused doesn't really
have a bearing on the internal encoding, which is very much an
implementation detail in py 3 at least.

And Python has made its decision about that.

So what are the numpy use-cases?

I see essentially two:

1) Use with/from Python -- both creating and working with numpy arrays.

In this case, we want something compatible with Python's string (i.e. full
Unicode supporting) and I think it should be as transparent as possible.
Python's string has made the decision to present a character-oriented API
to users (despite what the manifesto says...).

However, there is a challenge here: numpy requires fixed-number-of-bytes
dtypes. And full unicode support with fixed number of bytes matching fixed
number of characters is only possible with UCS-4 -- hence the current
implementation. And this is actually just fine! I know we all want to be
efficient with data storage, but really -- in the early days of Unicode,
when folks thought 16 bits were enough, doubling the memory usage for
western language storage was considered fine -- how long in computer
lifetimes does it take to double your memory? But now, when memory, disk space,
bandwidth, etc, are all literally orders of magnitude larger, we can't
handle a factor of 4 increase in "wasted" space?

Alternatively, Robert's suggestion of having essentially an object array,
where the objects were known to be python strings is a pretty nice idea --
it gives the full power of python strings, and is a perfect one-to-one
match with the python text data model.

But as scientific text data often is 1-byte compatible, a one-byte-per-char
dtype is a fine idea, too -- and we pretty much have that already with the
existing string type -- that could simply be enhanced by enforcing the
encoding to be latin-9 (or latin-1, if you don't want the Euro symbol).
This would get us what scientists expect from strings in a way that is
properly compatible with Python's string type. You'd get encoding errors if
you tried to stuff anything else in there, and that's that.

Yes, it would have to be a "new" dtype for backwards compatibility.

2) Interchange with other systems: passing the raw binary data back and
forth between numpy arrays and other code, written in C, Fortran, or binary
file formats.

This is a key use-case for numpy -- I think the key to its enormous
success. But how important is it for text? Certainly any data set I've ever
worked with has had gobs of binary numerical data, and a small smattering
of text. So in that case, if, for instance, h5py had to encode/decode text
when transferring between HDF files and numpy arrays, I don't think I'd
ever see the performance hit. As for code complexity -- it would mean more
complex code in interface libs, and less complex code in numpy itself.
(though numpy could provide utilities to make it easy to write the
interface code)

If we do want to support direct binary interchange with other libs, then we
should probably simply go for it, and support any encoding that Python
supports -- as long as you are dealing with multiple encodings, why try to
decide up front which ones to support?

But how do we expose this to numpy users? I still don't like having
non-fixed-width encoding under the hood, but what can you do? Other than
that, having the encoding be a selectable part of the dtype works fine --
and in that case the number of bytes should be the "length" specifier.

This, however, creates a bit of an impedance mismatch with the
"character-focused" approach of the python string type. And it requires the
user to understand something about the encoding in order to even know how
many bytes they need -- a utf-8-100 string will hold a different "length"
of string than a utf-16-100 string.

So -- I think we should address the use-cases separately -- one for
"normal" python use and simple interoperability with python strings, and
one for interoperability at the binary level. And an easy way to convert
between the two.

For Python use -- a pointer to a Python string would be nice.

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Chris Barker
On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer  wrote:


> In this case, we want something compatible with Python's string (i.e. full
>> Unicode supporting) and I think should be as transparent as possible.
>> Python's string has made the decision to present a character oriented API
>> to users (despite what the manifesto says...).
>>
>
> Yes, but NumPy doesn't really implement string operations, so fortunately
> this is pretty irrelevant to us -- except for our API for specifying dtype
> size.
>

Exactly -- the character-orientation of python strings means that people
are used to thinking that strings have a length that is the number of
characters in the string. I think there will be a cognitive dissonance if
someone does:

arr[i] = a_string

Which then raises a ValueError, something like:

String too long for a string[12] dtype array.

When len(a_string) <= 12

AND that will only occur if there are non-ascii characters in the string,
and maybe only if there are more than N non-ascii characters. I.e., it is
very likely to be a run-time error that may not have shown up in tests.

So folks need to do something like:

len(a_string.encode('utf-8')) to see if their string will fit. If not, they
need to truncate it, and THAT is non-obvious how to do, too -- you don't
want to truncate the encoded bytes naively, you could end up with an
invalid bytestring. But you don't know how many characters to truncate,
either.
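
(An illustration of that character-count vs. byte-count mismatch)::

    name = "Ångström célèbre"            # 16 characters
    print(len(name))                     # 16
    print(len(name.encode('utf-8')))     # 20 -- would not fit a 16-byte utf-8 field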


> We already have strong precedence for dtypes reflecting number of bytes
> used for storage even when Python doesn't: consider numeric types like
> int64 and float32 compared to the Python equivalents. It's an intrinsic
> aspect of NumPy that users need to think about how their data is actually
> stored.
>

Sure, but a float64 is 64 bits forever and always, and the defaults
perfectly match what python is doing under its hood -- even if users don't
think about it. So the default behaviour of numpy matches python's built-in
types.


>> Storage cost is always going to be a concern. Arguably, it's even more of a
>> concern today than it used to be, because compute has been improving
>> faster than storage.
>>
>
Sure -- but again, what is the use-case for numpy arrays with a s#$)load of
text in them? Common? I don't think so. And as you pointed out, numpy
doesn't do text processing anyway, so cache performance and all that are
not important. So having UCS-4 as the default, but allowing folks to select
a more compact format if they really need it, is a good way to go. Just like
numpy generally defaults to float64 and int64 (or 32, depending on
platform) -- users can select a smaller size if they have a reason to.

I guess that's my summary -- just like with numeric values, numpy should
default to Python-like behavior as much as possible for strings, too --
with an option for a knowledgeable user to do something more performant.


> I still don't understand why a latin encoding makes sense as a preferred
> one-byte-per-char dtype. The world, including Python 3, has standardized on
> UTF-8, which is also one-byte-per-char for (ASCII) scientific data.
>

utf-8 is NOT a one-byte-per-char encoding. If you want to assure that your
data are one byte per char, then you could use ASCII, and it would be
binary-compatible with utf-8, but I'm not sure what the point of that is in
this context.

latin-1 or latin-9 buys you (over ASCII):

- A bunch of accented characters -- sure it only covers the latin
languages, but does cover those much better.

- A handful of other characters, including scientifically useful ones. (a
few greek characters, the degree symbol, etc...)

- round-tripping of binary data (at least with Python's encoding/decoding)
-- ANY string of bytes can be decoded as latin-1 and re-encoded to get the
same bytes back (see the sketch just below). You may get garbage, but you
won't get an EncodingError.
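
(A minimal sketch of that round-trip property)::

    # every possible byte value decodes under latin-1 and re-encodes unchanged
    blob = bytes(range(256))
    assert blob.decode('latin-1').encode('latin-1') == blob
    print("latin-1 round-trips all 256 byte values")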

For Python use -- a pointer to a Python string would be nice.
>>
>
> Yes, absolutely. If we want to be really fancy, we could consider a
> parametric object dtype that allows for object arrays of *any* homogeneous
> Python type. Even if NumPy itself doesn't do anything with that
> information, there are lots of use cases for that information.
>

Hmm -- that's a nifty idea -- though I think strings could/should be
special-cased.


> Then use a native flexible-encoding dtype for everything else.
>>
>
> No opposition here from me. Though again, I think utf-8 alone would also
> be enough.
>

maybe so -- the major reason for supporting others is binary data exchange
with other libraries -- but maybe most of them have gone to utf-8 anyway.

>> One more note: if a user tries to assign a value to a numpy string array
>> that doesn't fit, they should get an error:
>>
>
>> EncodingError if it can't be encoded into the defined encoding.
>>
>> ValueError if it is too long -- it should not be silently truncated.
>>
>
> I think we all agree here.
>

I'm actually having second thoughts -- see above -- if the encoding is
utf-8, then truncating is non-trivial -- mayb

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Chris Barker
On Mon, Apr 24, 2017 at 10:51 AM, Stephan Hoyer  wrote:

>> - round-tripping of binary data (at least with Python's encoding/decoding)
>> -- ANY string of bytes can be decodes as latin-1 and re-encoded to get the
>> same bytes back. You may get garbage, but you won't get an EncodingError.
>>
>
> For a new application, it's a good thing if a text type breaks when you to
> stuff arbitrary bytes in it
>

maybe, maybe not -- the application may be new, but the data it works with
may not be.


> (see Python 2 vs Python 3 strings).
>

this is exactly why py3 strings needed to add the "surrogateescape" error
handler:

https://www.python.org/dev/peps/pep-0383

sometimes text and binary data are mixed, sometimes encoded text is broken.
It is very useful to be able to pass such data through strings losslessly.
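
(A small sketch of that escape hatch -- a byte that is invalid utf-8 is
smuggled through the str and comes back out unchanged)::

    raw = b'caf\xe9 au lait'         # latin-1 bytes, NOT valid utf-8
    s = raw.decode('utf-8', errors='surrogateescape')
    print(repr(s))                   # 'caf\udce9 au lait' -- the bad byte survives
    assert s.encode('utf-8', errors='surrogateescape') == raw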

> Certainly, I would argue that nobody should write data in latin-1 unless
> they're doing so for the sake of a legacy application.
>

or you really want that 1-byte per char efficiency


> I do understand the value in having some "string" data type that could be
> used by default by loaders for legacy file formats/applications (i.e.,
> netCDF3) that support unspecified "one byte strings." Then you're a few
> short calls away from viewing (i.e., array.view('text[my_real_encoding]'),
> if we support arbitrary encodings) or decoding (i.e.,
> np.char.decode(array.view(bytes), 'my_real_encoding') ) the data in the
> proper encoding. It's not realistic to expect users to know the true
> encoding for strings from a file before they even look at the data.
>

except that you really should :-(

> On the other hand, if this is the use-case, perhaps we really want an
> encoding closer to "Python 2" string, i.e, "unknown", to let this be
> signaled more explicitly. I would suggest that "text[unknown]" should
> support operations like a string if it can be decoded as ASCII, and
> otherwise error. But unlike "text[ascii]", it will let you store arbitrary
> bytes.
>

I _think_ that is what using latin-1 (or latin-9) gets you -- if it really
is ascii, then it's perfect. If it really is latin-*, then you get some
extra useful stuff, and if it's corrupted somehow, you still get the ascii
text correct, and the rest won't barf and can be passed on through.


> So far, we have real use cases for at least UTF-8, UTF-32, ASCII and
> "unknown".
>

hmm -- "unknown" should be bytes, not text. If the user needs to look at it
first, then load it as bytes, run chardet or something on it, then cast to
the right encoding.

>> The current 'S' dtype truncates silently already:
>>
>
> One advantage of a new (non-default) dtype is that we can change this
> behavior.
>

Yeah -- still on the edge about that, at least with variable-size
encodings. It's hard to know when it's going to happen and it's hard to
know what to do when it does.

At least if it truncates silently, numpy can have the code to do the
truncation properly. Maybe an option?
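
(For reference, the current silent truncation with 'S')::

    import numpy as np

    a = np.zeros(1, dtype='S3')
    a[0] = b'truncated'      # no error, no warning
    print(a[0])              # b'tru' -- silently chopped to fit the 3-byte field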

And the numpy numeric types truncate (or overflow) already. Again:

If the default string handling matches expectations from python strings,
then the specialized ones can be more buyer-beware.

>> Also -- if utf-8 is the default -- what do you get when you create an array
>> from a python string sequence? Currently with the 'S' and 'U' dtypes, the
>> dtype is set to the longest string passed in. Are we going to pad it a bit?
>> stick with the exact number of bytes?
>>
>
> It might be better to avoid this for now, and force users to be explicit
> about encoding if they use the dtype for encoded text.
>

yup.

And we really should have a bytes type for py3 -- which we do, it's just
called 'S', which is pretty confusing :-)

-CHB




Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Chris Barker
On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas <
aldcr...@head.cfa.harvard.edu> wrote:

> BTW -- maybe we should keep the pathological use-case in mind: really
>> short strings. I think we are all thinking in terms of longer strings,
>> maybe a name field, where you might assign 32 bytes or so -- then someone
>> has an accented character in their name, and then get 30 or 31 characters --
>> no big deal.
>>
>
> I wouldn't call it a pathological use case, it doesn't seem so uncommon to
> have large datasets of short strings.
>

It's pathological for using a variable-length encoding.


> I personally deal with a database of hundreds of billions of 2 to 5
> character ASCII strings.  This has been a significant blocker to Python 3
> adoption in my world.
>

I agree -- it is a VERY common case for scientific data sets. But a
one-byte-per-char encoding would handle it nicely, or UCS-4 if you want
Unicode. The wasted space is not that big a deal with short strings...

> BTW, for those new to the list or with a short memory, this topic has been
> discussed fairly extensively at least 3 times before.  Hopefully the
> *fourth* time will be the charm!
>

yes, let's hope so!

The big difference now is that Julian seems to be committed to actually
making it happen!

Thanks Julian!

Which brings up a good point -- if you need us to stop the damn
bike-shedding so you can get it done -- say so.

I have strong opinions, but would still rather see any of the ideas on the
table implemented than nothing.

-Chris




Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Chris Barker
On Mon, Apr 24, 2017 at 11:36 AM, Robert Kern  wrote:

> > I agree -- it is a VERY common case for scientific data sets. But a
> one-byte-per-char encoding would handle it nicely, or UCS-4 if you want
> Unicode. The wasted space is not that big a deal with short strings...
>
> Unless if you have hundreds of billions of them.
>

Which is why a one-byte-per char encoding is a good idea.

> Solve the HDF5 problem (i.e. fixed-length UTF-8 strings)
>

I agree -- binary compatibility with utf-8 is a core use case -- though is
it so bad to go through python's encoding/decoding machinery to do it? Do
numpy arrays HAVE to store utf-8 natively?
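
(A sketch of what that encode/decode-at-the-boundary could look like with
numpy's existing helpers; h5py is just the example interface library from
earlier in this thread)::

    import numpy as np

    names = np.array(['Ångström', 'Kelvin'])      # a '<U8' unicode array
    as_utf8 = np.char.encode(names, 'utf-8')      # fixed-width 'S' bytes array
    print(as_utf8.dtype)                          # 'S10' -- wide enough for the longest encoding
    # hand as_utf8 to the library that wants raw utf-8 bytes, and decode on
    # the way back with np.char.decode(..., 'utf-8')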


> or leave it be until someone else is willing to solve that problem. I
> don't think we're at the bikeshedding stage yet; we're still disagreeing
> about fundamental requirements.
>

Yeah -- though I've seen projects get stuck in the sorting-out-what-to-do,
so-nothing-gets-done stage before -- I don't want Julian to get too
frustrated and end up doing nothing.

So here I'll lay out what I think are the fundamental requirements:

1) The default behaviour for numpy arrays of strings is compatible with
Python3's string model: i.e. fully unicode supporting, and with a character
oriented interface. i.e. if you do:

arr = np.array(("this", "that",))

you get an array that can store ANY unicode string with 4 or less characters

and arr[1] will return a native Python string object.

2) There be some way to store mostly ascii-compatible strings in a single
byte-per-character array -- so not be wasting space for "typical
european-oriented data".

arr = np.array(("this", "that",), dtype=np.single_byte_string)

(name TBD)

and arr[1] would return a python string.

Attempting to put in a string not compatible with the encoding would
raise an EncodingError.

I highly recommend that ISO 8859-15 (latin-9) or latin-1 be the encoding
in this case.

3) There be a dtype that could store strings in null-terminated utf-8
binary format -- for interchange with other systems (netcdf, HDF, others???)

4) a fixed length bytes dtype -- pretty much what 'S' is now under python
three -- settable from a bytes or bytearray object, and returns a bytes
object.
 - you could use astype() to convert between bytes and a specified encoding
with no change in binary representation.

2) and 3) could be fully covered by a dtype with a settable encoding that
might as well support all python built-in encodings -- though I think an
alias to the common cases would be good -- latin, utf-8. If so, the length
would have to be specified in bytes.

1) could be covered with the existing 'U' type -- only downside being some
wasted space -- or with a pointer-to-a-python-string dtype -- which would
also waste space, though less for long-ish strings, and maybe give us some
better access to the nifty built-in string features.

> +1.  The key point is that there is a HUGE amount of legacy science data
> in the form of FITS (astronomy-specific binary file format that has been
> the primary file format for 20+ years) and HDF5 which uses a character data
> type to store data which can be bytes 0-255.  Getting an decoding/encoding
> error when trying to deal with these datasets is a non-starter from my
> perspective.


> That says to me that these are properly represented by `bytes` objects, not
> `unicode/str` objects encoding to and decoding from a hardcoded latin-1
> encoding.


Well, yes -- BUT: that strictness in python3 -- "data is either text or
bytes, and text in an unknown (or invalid) encoding HAS to be bytes" -- bit
Python3 in the butt for a long time. Folks that deal in the messy real
world of binary data that is kinda-mostly text, but may have a bit of
binary data, or be in an unknown encoding, or be corrupted, were very, very
adamant about how this model DID NOT work for them. Very influential people
were seriously critical of python 3. Eventually, py3 added bytes string
formatting, surrogateescape, and other features that facilitate working
with messy almost-text.

Practicality beats purity -- if you have one-byte-per-char data that is
mostly european, then latin-1 or latin-9 lets you work with it, have it
mostly work, and never crash out with an encoding error.

>> - round-tripping of binary data (at least with Python's
>> encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and
>> re-encoded to get the same bytes back. You may get garbage, but you won't
>> get an EncodingError.
> But what if the format I'm working with specifies another encoding? Am I
> supposed to encode all of my Unicode strings in the specified encoding,
> then decode as latin-1 to assign into my array? HDF5's UTF-8 arrays are a
> really important use case for me.


latin-1 would be only for the special case of mostly-ascii (or true latin)
one-byte-per-char encodings (which is a common use-case in scientific data
sets). I think it has only upside over ascii. It would be a fine idea to
support any one-byte-per-char encoding, too.

As 

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
On Mon, Apr 24, 2017 at 4:08 PM, Robert Kern  wrote:

> Chris, you've mashed all of my emails together, some of them are in reply
> to you, some in reply to others. Unfortunately, this dropped a lot of the
> context from each of them, and appears to be creating some
> misunderstandings about what each person is advocating.
>

Sorry about that -- I was trying to keep an already really long thread from
getting even longer.

And I'm not sure it matters who's doing the advocating, but rather *what*
is being advocated -- I hope I didn't screw that up too badly.

Anyway, I think I made the mistake of mingling possible solutions in with
the use-cases, so I'm not sure if there is any consensus on the use cases
-- which I think we really do need to nail down first -- as Robert has made
clear.

So I'll try again -- use-cases only! We'll keep the possible solutions
separate.

Do we need to write up a NEP for this? It seems we are going a bit in
circles, and we really do want to capture the final decision process.

1) The default behaviour for numpy arrays of strings is compatible with
Python3's string model: i.e. fully unicode supporting, and with a character
oriented interface. i.e. if you do::

  arr = np.array(("this", "that",))

you get an array that can store ANY unicode string with 4 or less
characters.

and arr[1] will return a native Python3 string object.

This is the use-case for "casual" numpy users -- not the folks writing H5py
and the like, or the ones writing Cython bindings to C++ libs.


2) There be some way to store mostly ascii-compatible strings in a single
byte-per-character array -- so not to be wasting space for "typical
european-language-oriented data". Note: this should ALSO be compatible with
Python's character-oriented string model. i.e. a Python String with length
N will fit into a dtype of size N.

arr = np.array(("this", "that",), dtype=np.single_byte_string)

(name TBD)

and arr[1] would return a python string.

Attempting to put in a string not compatible with the encoding would raise
an EncodingError.

This is also a use-case primarily for "casual" users -- but ones concerned
with the size of the data storage who know they are using european text.

3) dtypes that support storage in particular encodings:

   Python strings would be encoded appropriately when put into the array. A
Python string would be returned when indexing.

   a) There be a dtype that could store strings in null-terminated utf-8
binary format -- for interchange with other systems (netcdf, HDF,
others???) at the binary level.

   b) There be a dtype that could store data in any encoding supported by
Python -- to facilitate bytes-level interchange with other systems. If we
need more than utf-8, then we might as well have the full set.

4) a fixed length bytes dtype -- pretty much what 'S' is now under python
three -- settable from a bytes or bytearray object (or other memoryview?),
and returns a bytes object.

You could use astype() to convert between bytes and a specified encoding
with no change in binary representation. This could be used to store any
binary data, including encoded text or anything else. This should map
directly to the Python bytes model -- thus NOT null-terminated.

This is a little different than 'S' behaviour on py3 -- it appears that
with 'S', if ALL the trailing bytes are null, then they are truncated, but
if there is a null byte in the middle, then it is preserved. I suspect that
this is a legacy from Py2's use of "strings" as both text and binary data.
But in py3, a "bytes" type should be about bytes, and not text, and thus
null-valued bytes are simply another value a byte can hold.
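
(The behaviour being described, for reference)::

    import numpy as np

    a = np.array([b'ab\x00', b'a\x00b'], dtype='S3')
    print(a[0])    # b'ab'     -- a trailing null byte is dropped
    print(a[1])    # b'a\x00b' -- an embedded null byte is preserved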

There are multiple ways to address these use cases -- please try to make
your comments clear about whether you think the use-case is unimportant, or
ill-defined, or if you think a given solution is a poor choice.

To facilitate that, I will put my comments on possible solutions in a
separate note, too.

-CHB




Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
This is essentially my rant about use-case (2):

A compact dtype for mostly-ascii text:

On Mon, Apr 24, 2017 at 4:09 PM, Stephan Hoyer  wrote:

> On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker 
> wrote:
>
>> On the other hand, if this is the use-case, perhaps we really want an
>>> encoding closer to "Python 2" string, i.e, "unknown", to let this be
>>> signaled more explicitly. I would suggest that "text[unknown]" should
>>> support operations like a string if it can be decoded as ASCII, and
>>> otherwise error. But unlike "text[ascii]", it will let you store arbitrary
>>> bytes.
>>>
>>
>> I _think_ that is what using latin-1 (Or latin-9) gets you -- if it
>> really is ascii, then it's perfect. If it really is latin-*, then you get
>> some extra useful stuff, and if it's corrupted somehow, you still get the
>> ascii text correct, and the rest won't  barf and can be passed on through.
>>
>
> I am totally in agreement with Thomas that "We are living in a messy
> world right now with messy legacy datasets that have character type data
> that are *mostly* ASCII, but not infrequently contain non-ASCII characters."
>
> My question: What are those non-ASCII characters? How often are they truly
> latin-1/9 vs. some other text encoding vs. non-string binary data?
>

I am totally euro-centric, but as I understand it, that is the whole point
of the desire for a compact one-byte-per character encoding. If there is a
strong need for other 1-byte encodings (shift-JIS, maybe?) then maybe we
should support that. But this all started with "mostly ascii". My take on
that is:

We don't want to use pure-ASCII -- that is the hell that python2's default
encoding approach led to -- it is MUCH better to pass garbage through than
crash out with an EncodingError -- data are messy, and people are really
bad at writing comprehensive tests.

So we need something that handles ASCII properly, and can pass through
arbitrary bytes as well without crashing. Options are:

* ASCII with errors='ignore' or 'replace'

I think that is a very bad idea -- it is tossing away information that
_may_ have some use elsewhere::

  s = arr[i]
  arr[i] = s

should put the same bytes back into the array.

* ASCII with errors='surrogateescape'

This would preserve bytes and not crash out, so meets the key criteria.


* latin-1

This would do exactly the correct thing for ASCII, preserve the bytes, and
not crash out. But it would also allow additional symbols useful to
european languages and scientific computing. Seems like a win-win to me
(see the comparison sketch below).
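
(A small comparison sketch of the three options above, using one latin-1
byte that is not valid ascii)::

    raw = b'caf\xe9'                                    # not valid ascii

    replaced = raw.decode('ascii', 'replace')           # 'caf\ufffd' -- the byte is gone
    escaped = raw.decode('ascii', 'surrogateescape')    # 'caf\udce9' -- preserved, ugly to show
    as_latin = raw.decode('latin-1')                    # 'café'      -- preserved AND readable

    assert escaped.encode('ascii', 'surrogateescape') == raw
    assert as_latin.encode('latin-1') == raw
    # the 'replace' version can never give the original bytes back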

As for my use-cases:

 - Messy data:

I have had a lot of data sets with european text in them, mostly ASCII and
an occasional non-ASCII accented character or symbol -- most of these come
from legacy systems, and have an ugly arbitrary combination of MacRoman,
Win-something-or-other, and who knows what -- i.e. mojibake, though at
least mostly ascii.

The only way to deal with it "properly" is to examine each string and try
to figure out which encoding it is in, hope that at least each single string
is in one encoding, and then decode/encode it properly. So numpy should
support that -- which would be handled by a 'bytes' type, just like in
Python itself.
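
(A sketch of that examine-first route, assuming the third-party chardet
package is available -- any detector would do)::

    import chardet   # pip install chardet

    raw = b'probably Win-1252 or MacRoman: caf\xe9'
    guess = chardet.detect(raw)          # e.g. {'encoding': ..., 'confidence': ...}
    text = raw.decode(guess['encoding'] or 'latin-1', errors='replace')
    print(guess['encoding'], repr(text))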

But sometimes that isn't practical, and still doesn't work 100% -- in which
case, we can go with latin-1, and there will be some weird, incorrect
characters in there, and that is OK -- we fix them later when QA/QC or
users notice it -- really just like a typo.

But stripping the non-ascii characters out would be a worse solution. As
would "replace", as sometimes it IS the correct symbol! (european encodings
aren't totally incompatible...). And surrogateescape is worse, too -- any
"weird" character is the same to my users, and at least sometimes it will
be the right character -- however surrogateescape gets printed, it will
never look right. (And can it even be handled by a non-python system?)

 - filenames

File names are one of the key reasons folks struggled with the python3 data
model (particularly on *nix) and why 'surrogateescape' was added. It's
pretty common to store filenames in with our data, and thus in numpy arrays
-- we need to preserve them exactly and display them mostly right. Again,
euro-centric, but if you are euro-centric, then latin-1 is a good choice
for this.

Granted, I should probably simply use a proper unicode type for filenames
anyway, but sometimes the data comes in already encoded as latin-something.

In the end I still see no downside to latin-1 over ascii-only -- only an
upside.

> I don't think that silently (mis)interpreting non-ASCII characters as
> latin-1/9 is a good idea, which is why I think it would be a mistake to use
>

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
On Mon, Apr 24, 2017 at 4:23 PM, Robert Kern  wrote:

> > My question: What are those non-ASCII characters? How often are they
> truly latin-1/9 vs. some other text encoding vs. non-string binary data?
>
> I don't know that we can reasonably make that accounting relevant. Number
> of such characters per byte of text? Number of files with such characters
> out of all existing files?
>

I have a lot of mostly-english text -- usually not latin-1, but usually
mostly latin-1 -- the non-ascii characters are a handful of accented
characters (usually from spanish, some french), then a few "scientific"
characters: the degree symbol, the "micro" symbol.

I suspect that this is not an unusual pattern for mostly-english scientific
text.

if it's non-string binary data, I know it -- and I'd use a bytes type.

I have two options -- try to detect the encoding properly or use
_something_ and fix it up later. latin-1 is a great choice for the latter
option -- most of the text displays fine, and the wrong stuff is untouched,
so I can figure it out.

What I can say with assurance is that every time I have decided, as a
> developer, to write code that just hardcodes latin-1 for such cases, I have
> regretted it. While it's just personal anecdote, I think it's at least
> measuring the right thing. :-)
>

I've had the opposite experience -- so that's two anecdotes :-)

If it were, say, shift-jis, then yes, using latin-1 would be a bad idea --
but not really much worse than any other option, other than properly decoding
it. In a way, using latin-1 is like the old py2 string -- it can be used as
text, even if it has arbitrary non-text garbage in it...

-CHB



Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
OK -- onto proposals:

1) The default behaviour for numpy arrays of strings is compatible with
> Python3's string model: i.e. fully unicode supporting, and with a character
> oriented interface. i.e. if you do::
>
>   arr = np.array(("this", "that",))
>
> you get an array that can store ANY unicode string with 4 or less
> characters.
>
> and arr[1] will return a native Python3 string object.
>
> This is the use-case for "casual" numpy users -- not the folks writing
> H5py and the like, or the ones writing Cython bindings to C++ libs.
>

I see two options here:

a) The current 'U' dtype -- fully meets the specs, and is already there.

b) Having a pointer-to-a-python string dtype:

   - I take it that's what Pandas does and people seem happy.

   - That would get us variable-length strings, and potentially other nifty
string-processing.

   - It would lose the ability to interact at the binary level with other
systems -- but do any other systems use UCS-4 anyway?

   - How would it work with pickle and numpy zip storage?

Personally, I'm fine with (a), but (b) seems like it could be a nice
addition. As the 'U' type already exists, the choice to add a python-string
type is really orthogonal to the rest of this discussion.

Note that I think using utf-8 internally to fit this need is a mistake -- it
does not match well with the Python string model.

That's it for use-case (1)

-CHB




Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
On Tue, Apr 25, 2017 at 9:57 AM, Ambrose LI  wrote:

> 2017-04-25 12:34 GMT-04:00 Chris Barker :
> > I am totally euro-centric,
>


> But Shift-JIS is not one-byte; it's two-byte (unless you allow only
> half-width characters and nothing else). :-)


Bad example then -- are there other non-euro-centric one-byte-per-char
encodings worth worrying about? I have no clue :-)


> This I don't understand. As far as I can tell non-Western-European
> filenames are not unusual. If filenames are a reason, even if you're
> euro-centric (think Eastern Europe, say) I don't see how latin1 is a
> good choice.
>

right -- this is the age of Unicode -- Unicode is the correct choice.

But many of us have data in old files that are not proper Unicode -- and
that includes filenames.

-CHB



Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
Now my proposal for the other use cases:

2) There be some way to store mostly ascii-compatible strings in a single
> byte-per-character array -- so not to be wasting space for "typical
> european-language-oriented data". Note: this should ALSO be compatible with
> Python's character-oriented string model. i.e. a Python String with length
> N will fit into a dtype of size N.
>
> arr = np.array(("this", "that",), dtype=np.single_byte_string)
>
> (name TBD)
>
> and arr[1] would return a python string.
>
> attempting to put in a not-compatible with the encoding String  would
> raise an EncodingError.
>
> This is also a use-case primarily for "casual" users -- but ones concerned
> with the size of the data storage and know that are using european text.
>

More detail elsewhere -- but either ascii with surrogateescape or latin-1
would be a good option here. I prefer latin-1 (I really see no downside),
but others disagree...

But then we get to:


> 3) dtypes that support storage in particular encodings:
>

We need utf-8. We may need others. We may need a 1-byte-per-char compact
encoding that isn't close enough to ascii or latin-1 to be useful (say,
shift-jis). And I don't think we are going to come to a consensus on what
"single" encoding to use for 1-byte-per-char.

So really -- going back to Julian's earlier proposal:

 - a dtype with an encoding specified
 - "size" in bytes

Once defined, numpy would encode/decode to/from python strings "correctly".

We might need "null-terminated utf-8" as a special case.

That would support all the other use cases.

Even the one-byte-per-char encoding. I'd like to see a clean alias to a
latin-1 encoding, but it's not a big deal.

That leaves a couple of decisions:

 - error out or truncate if the passed-in string is too long?

 - error out or surrogateescape if there are invalid bytes in the data?

 - error out or something else if there are characters that can't be
encoded in the specified encoding.

And we still need a proper bytes type:

4) a fixed length bytes dtype -- pretty much what 'S' is now under python
> three -- settable from a bytes or bytearray object (or other memoryview?),
> and returns a bytes object.
>
> You could use astype() to convert between bytes and a specified encoding
> with no change in binary representation. This could be used to store any
> binary data, including encoded text or anything else. this should map
> directly to the Python bytes model -- thus NOT null-terminted.
>
> This is a little different than 'S' behaviour on py3 -- it appears that
> with 'S', a if ALL the trailing bytes are null, then it is truncated, but
> if there is a null byte in the middle, then it is preserved. I suspect that
> this is a legacy from Py2's use of "strings" as both text and binary data.
> But in py3, a "bytes" type should be about bytes, and not text, and thus
> null-values bytes are simply another value a byte can hold.
>




Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 11:31 AM, Nathaniel Smith  wrote:

> UTF-8 does not match the character-oriented Python text model. Plenty
> of people argue that that isn't the "correct" model for Unicode text
> -- maybe so, but it is the model python 3 has chosen. I wrote a much
> longer rant about that earlier.
>
> So I think the easy to access, and particularly defaults, numpy string
> dtypes should match it.
>
>
> This seems a little vague?
>

sorry -- that's what I get for trying to be concise...


> The "character-oriented Python text model" is just that str supports O(1)
> indexing of characters.
>

not really -- I think the performance characteristics are an implementation
detail (though it did influence the design, I'm sure)

I'm referring to the fact that a python string appears (to the user -- also
under the hood, but again, implementation detail) to be a sequence of
characters, not a sequence of bytes, not a sequence of glyphs, or
graphemes, or anything else. Every Python string has a length, and that
length is the number of characters, and if you index you get a string of
length 1, and it has one character in it, and that character matches a
single code point.

Someone could implement a python string using utf-8 under the hood, and
none of that would change (and I think micropython may have done that...)

Sure, you might get two characters when you really expect a single
grapheme, but it's at least a consistent oddity. (well, not always, as some
graphemes can be represented by either a single code point or two combined
-- human language really sucks!)

The UTF-8 Manifesto (http://utf8everywhere.org/) makes the very good point
that a character-oriented interface is not the only one that makes sense,
and may not make sense at all. However:

1) Python has chosen that interface

2) It is a good interface (probably the best for computer use) if you need
to choose only one

utf8everywhere is mostly arguing for utf-8 over utf16 -- and secondarily
for utf-8 everywhere as the best option for working at the C level. That's
probably true.

(I also think the utf-8 fans are in a bit of a fantasy world -- this would
all be easier, yes, if one encoding was used for everything, all the time,
but other than that, utf-8 is not a panacea -- we are still going to have
encoding headaches no matter how you slice it)

So where does numpy fit? well, it does operate at the C level, but people
work with it from python, so exposing the details of the encoding to the
user should be strictly opt-in.

When a numpy user wants to put a string into a numpy array, they should
know how long a string they can fit -- with "length" defined how python
strings define it.

Using utf-8 for the default string in numpy would be like using float16 for
default float--not a good idea!

I believe Julian said there would be no default -- you would need to
specify, but I think there does need to be one:

np.array(["a string", "another string"])

needs to do something.

if we make a parameterized dtype that accepts any encoding, then we could
do:

np.array(["a string", "another string"], dtype=no.stringtype["utf-8"])

If folks really want that.

I'm afraid that that would lead to errors -- "cool, utf-8 is just like
ascii, but with full Unicode support!"

But... Numpy doesn't. If you want to access individual characters inside a
> string inside an array, you have to pull out the scalar first, at which
> point the data is copied and boxed into a Python object anyway, using
> whatever representation the interpreter prefers.
>


> So AFAICT it makes literally no difference to the user whether numpy's
> internal representation allows for fast character access.
>

agreed - unless someone wants to do a view that makes an N-D array of
strings look like a 1-D array of characters. Which seems odd, but there
was recently a big debate on the netcdf CF conventions list about that very
issue...

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 11:38 AM, Sebastian Berg  wrote:

> I remember talking with a colleague about something like that. And
> basically an annoying thing there was that if you strip the zero bytes
> in a zero padded string, some encodings (UTF16) may need one of the
> zero bytes to work right.


I think it's really clear that you don't want to mess with the bytes in any
way without knowing the encoding -- for UTF-16, the code unit is two bytes,
so a "null" is two zero bytes in a row.

So generic "null padded" or "null terminated" is dangerous -- it would have
to be "Null-padded utf-8" or whatever.

> Though I
> think it might have been something like "make everything in
> hdf5/something similar work"


That would be nice :-), but I suspect HDF-5 is the same as everything else
-- there are files in the wild where someone jammed the wrong thing into a
text array 

-CHB



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 10:45 AM, Robert Kern  wrote:

> >>> > The maximum length of an UTF-8 character is 4 bytes, so we could use
> that to size arrays by character length. The advantage over UTF-32 is that
> it is easily compressible, probably by a factor of 4 in many cases.
>

isn't UTF-32 pretty compressible also? lots of zeros in there

here's an example with pure ascii  Lorem Ipsum text:

In [17]: len(text)
Out[17]: 446


In [18]: len(utf8)
Out[18]: 446

# the same -- it's pure ascii

In [20]: len(utf32)
Out[20]: 1788

# four times as big -- of course.

In [22]: len(bz2.compress(utf8))
Out[22]: 302

# so from 446 to 302, not that great -- probably it would be better for
longer text
# -- but are we compressing whole arrays or individual strings?

In [23]: len(bz2.compress(utf32))
Out[23]: 319

# almost as good as the compressed utf-8

And I'm guessing it would be even closer with more non-ascii characters.

OK -- turns out I'm wrong -- here it is with greek -- not a lot of ascii
characters:

In [29]: len(text)
Out[29]: 672

In [30]: utf8 = text.encode("utf-8")

In [31]: len(utf8)
Out[31]: 1180

# not bad, really -- still smaller than utf-16 :-)

In [33]: len(bz2.compress(utf8))
Out[33]: 495

# pretty good then -- better than 50%

In [34]: utf32 = text.encode("utf-32")
In [35]: len(utf32)

Out[35]: 2692


In [36]: len(bz2.compress(utf32))
Out[36]: 515

# still not quite as good as utf-8, but close.

So: utf-8 compresses better than utf-32, but only by a little bit -- at
least with bz2.

But it is a lot smaller uncompressed.

>>> The major use case that we have for a UTF-8 array is HDF5, and it
> specifies the width in bytes, not Unicode characters.
> >>
> >> It's not just HDF5. Counting bytes is the Right Way to measure the size
> of UTF-8 encoded text:
> >> http://utf8everywhere.org/#myths
>

It's really the only way with utf-8 -- which is why it is an impedance
mismatch with python strings.
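
A tiny example of the mismatch -- character count and byte count diverge as
soon as there is any non-ascii text:

s = "naïve"
len(s)                    # 5 -- python counts characters
len(s.encode("utf-8"))    # 6 -- utf-8 counts bytes

so a width specified in one can't be checked in the other without encoding
first.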


>> I also firmly believe (though clearly this is not universally agreed
> upon) that UTF-8 is the Right Way to encode strings for *non-legacy*
> applications.
>

fortunately, we don't need to agree to that to agree that:


> So if we're adding any new string encodings, it needs to be one of them.
>

Yup -- the most important one to add -- I don't think it is "The Right Way"
for all applications -- but it is "The Right Way" for text interchange.

And regardless of what any of us think -- it is widely used.

> (1) object arrays of strings. (We have these already; whether a
> strings-only specialization would permit useful things like string-oriented
> ufuncs is a question for someone who's willing to implement one.)
>

This is the right way to get variable length strings -- but I'm concerned
that it doesn't mesh well with numpy uses like npz files, raw dumping of
array data, etc. It should not be the only way to get proper Unicode
support, nor the default when you do:

array(["this", "that"])


> > (2) a dtype for fixed byte-size, specified-encoding, NULL-padded data.
> All python encodings should be permitted. An additional function to
> truncate encoded data without mangling the encoding would be handy.
>

I think necessary -- at least when you pass in a python string...


> I think it makes more sense for this to be NULL-padded than
> NULL-terminated but it may be necessary to support both; note that
> NULL-termination is complicated for encodings like UCS4.
>

is it, if you know it's UCS4? Or even if you just know the size of the code
unit (I think that's the term)?


> This also includes the legacy UCS4 strings as a special case.
>

what's special about them? I think the only thing should be that they are
the default.
>

> > (3) a dtype for fixed-length byte strings. This doesn't look very
> different from an array of dtype u8, but given we have the bytes type,
> accessing the data this way makes sense.
>
> The void dtype is already there for this general purpose and mostly works,
> with a few niggles.
>

I'd never noticed that! And if I had I never would have guessed I could use
it that way.


> If it worked more transparently and perhaps rigorously with `bytes`, then
> it would be quite suitable.
>

Then we should fix a bit of those things -- and call it something like
"bytes", please.

-CHB

>
> --

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 4:30 PM, Stephan Hoyer  wrote:

>
> Sorry, I remain unconvinced (for the reasons that Robert, Nathaniel and
> myself have already given), but we seem to be talking past each other here.
>

yeah -- I think it's not clear what the use cases we are talking about are.


> I am still -1 on any new string encoding support unless that includes at
> least UTF-8, with length indicated by the number of bytes.
>

I've said multiple times that utf-8 support is key to any "exchange binary
data" use case (memory mapping?) -- so yes, absolutely.

I _think_ this may be some of the source for the confusion:

The name of this thread is: "proposal: smaller representation of string
arrays".

And I got the impression, maybe mistaken, that folks were suggesting that
internally encoding strings in numpy as "UTF-8, with length indicated by
the number of bytes." was THE solution to the

" the 'U' dtype takes up way too much memory, particularly  for
mostly-ascii data" problem.

I do not think it is a good solution to that problem.

I think a good solution to that problem is latin-1 encoding. (bear with me
here...)

But a bunch of folks have brought up that while we're messing around with
string encoding, let's solve another problem:

* Exchanging unicode text at the binary level with other systems that
generally don't use UCS-4.

For THAT -- utf-8 is critical.

But if I understand Julian's proposal -- he wants to create a parameterized
text dtype that you can set the encoding on, and then numpy will use the
encoding (and python's machinery) to encode / decode when passing to/from
python strings.

It seems this would support all our desires:

I'd get a latin-1 encoded type for compact representation of mostly-ascii
data.

Thomas would get latin-1 for binary interchange with mostly-ascii data

The HDF-5 folks would get utf-8 for binary interchange (If we can workout
the null-padding issue)

Even folks that had weird JAVA or Windows-generated UTF-16 data files could
do the binary interchange thing

I'm now lost as to what the hang-up is.

-CHB

PS: null padding is a pain, python strings seem to preserve the zeros, which
is odd -- is there a unicode code point at \x00?

But you can use it to strip properly with the unicode sandwich:

In [63]: ut16 = text.encode('utf-16') + b'\x00\x00\x00\x00\x00\x00'

In [64]: ut16.decode('utf-16')
Out[64]: 'some text\x00\x00\x00'

In [65]: ut16.decode('utf-16').strip('\x00')
Out[65]: 'some text'

In [66]: ut16.decode('utf-16').strip('\x00').encode('utf-16')
Out[66]: b'\xff\xfes\x00o\x00m\x00e\x00 \x00t\x00e\x00x\x00t\x00'

-CHB



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 5:17 PM, Robert Kern  wrote:

> The proposal is for only latin-1 and UTF-32 to be supported at first, and
> the eventual support of UTF-8 will be constrained by specification of the
> width in terms of characters rather than bytes, which conflicts with the
> use cases of UTF-8 that have been brought forth.
>
>   https://mail.python.org/pipermail/numpy-discussion/
> 2017-April/076668.html
>

thanks -- I had forgotten (clearly) it was that limited.

But my question now is -- if there is a encoding-parameterized string
dtype, then is it much more effort to have it support all the encodings in
the stdlib?

It seems that would solve everyone's issue.

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-27 Thread Chris Barker
On Thu, Apr 27, 2017 at 4:10 AM, Francesc Alted  wrote:

> I remember advocating for UCS-4 adoption in the HDF5 library many years
> ago (2007?), but I had no success and UTF-8 was decided to be the best
> candidate.  So, the boat with HDF5 using UTF-8 sailed many years ago, and I
> don't think there is a go back
>

This is the key point -- we can argue all we want about the best encoding
for fixed-length unicode-supporting strings (I think numpy and HDF have
very similar requirements), but that is not our decision to make -- many
other systems have chosen utf-8, so it's a really good idea for numpy to be
able to deal with that cleanly and easily and consistently.

I have made many anti utf-8 points in this thread because while we need to
deal with utf-8 for interplay with other systems, I am very sure that it is
not the best format for a default, naive-user-of-numpy unicode-supporting
dtype. Nor is it the best encoding for a mostly-ascii compact in memory
format.

So I think numpy needs to support at least:

utf-8
latin-1
UCS-4

And it maybe should support a one-byte encoding suitable for non-european
languages, and maybe utf-16 for Java and Windows compatibility, and ...

So that seems to point to "support as many encodings as possible". And
python has the machinery to do so -- so why not?

(I'm taking Julian's word for it that having a parameterized dtype would
not have a major impact on current code)

If we go with a parameterized by encoding string dtype, then we can pick
sensible defaults, and let users use what they know best fits their
use-cases.

As for python2 -- it is on the way out, I think we should keep the 'U' and
'S' dtypes as they are for backward compatibility and move forward with the
new one(s) in a way that is optimized for py3. And it would map to a py2
Unicode type.

The only catch I see in that is what to do with bytes -- we should have a
numpy dtype that matches the bytes model -- fixed length bytes that map to
python bytes objects. (this is almost what the void type is, yes?) but then
under py2, would a bytes object (py2 string) map to numpy 'S' or numpy
bytes objects??

@Francesc: -- one more question for you:

How important is it for pytables to match the numpy storage to the hdf
storage byte for byte? i.e. would it be a killer if encoding / decoding
happened every time at the boundary? I'm guessing yes, as this would have
been solved long ago if not.

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Array and string interoperability

2017-06-05 Thread Chris Barker
Just a few notes:

However, the fact that this works for bytestrings on Python 3 is, in my
> humble opinion, ridiculous:
>
> >>> np.array(b'100', 'u1') # b'100' IS NOT TEXT
> array(100, dtype=uint8)
>

Yes, that is a mis-feature -- I think due to bytes and string being the
same type in py2 -- so on py3, numpy continues to treat a bytes object
as also a 1-byte-per-char string, depending on context. And users want to
be able to write numpy code that will run the same on py2 and py3, so we
kinda need this kind of thing.

Makes me think that an optional "pure-py-3" mode for numpy might be a good
idea. If that flag is set, your code will only run on py3 (or at least
might run differently).


> > Further thoughts:
> > If trying to create "u1" array from a Pyhton 3 string, question is,
> > whether it should throw an error, I think yes,


well, you can pass numbers > 255 into a u1 already:

In [96]: np.array(456, dtype='u1')
Out[96]: array(200, dtype=uint8)

and it does the wrap-around overflow thing... so why not?


> and in this case
> > "u4" type should be explicitly specified by initialisation, I suppose.
> > And e.g. translation from unicode to extended ascii (Latin1) or whatever
> > should be done on Python side  or with explicit translation.
>

absolutely!

If you ask me, passing a unicode string to fromstring with sep='' (i.e.
> to parse binary data) should ALWAYS raise an error: the semantics only
> make sense for strings of bytes.
>

exactly -- we really should have a "frombytes()" alias for fromstring() and
it should only work for actual bytes objects (strings on py2, naturally).

and overloading fromstring() to mean both "binary dump of data" and "parse
the text" due to whether the sep argument is set was always a bad idea :-(

.. and fromstring(s, sep=a_sep_char)

has been semi broken (or at least not robust) forever anyway.

Currently, there appears to be some UTF-8 conversion going on, which
> creates potentially unexpected results:
>
> >>> s = 'αβγδ'
> >>> a = np.fromstring(s, 'u1')
> >>> a
> array([206, 177, 206, 178, 206, 179, 206, 180], dtype=uint8)
> >>> assert len(a) * a.dtype.itemsize  == len(s)
> Traceback (most recent call last):
>   File "", line 1, in 
> AssertionError
> >>>
>
> This is, apparently (https://github.com/numpy/numpy/issues/2152), due to
> how the internals of Python deal with unicode strings in C code, and not
> due to anything numpy is doing.
>

exactly -- py3 strings are pretty nifty implementation of unicode text --
they have nothing to do with storing binary data, and should not be used
that way. There is essentially no reason you would ever want to pass the
actual binary representation to any other code.

fromstring should be re-named frombytes, and it should raise an exception
if you pass something other than a bytes object (or maybe a memoryview or
other binary container?)
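
Something like this, just as a sketch of the suggested behaviour (frombytes()
is not an existing numpy function -- today you'd reach for np.frombuffer()):

import numpy as np

def frombytes(buf, dtype=float, count=-1):
    if isinstance(buf, str):
        raise TypeError("frombytes() wants a bytes-like object, not str -- "
                        "encode the text first, or use a text parser")
    return np.frombuffer(buf, dtype=dtype, count=count)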

we might want to keep fromstring() for parsing strings, but only if it were
fixed...

IMHO calling fromstring(..., sep='') with a unicode string should be
> deprecated and perhaps eventually forbidden. (Or fixed, but that would
> break backwards compatibility)


agreed.

> Python3 assumes 4-byte strings but in reality most of the time
> > we deal with 1-byte strings, so there is huge waste of resources
> > when dealing with 4-bytes. For many serious projects it is just not
> needed.
>
> That's quite enough anglo-centrism, thank you. For when you need byte
> strings, Python 3 has a type for that. For when your strings contain
> text, bytes with no information on encoding are not enough.
>

There was a big thread about this recently -- it seems to have not quite
come to a conclusion. But anglo-centrism aside, there is substantial demand
for a "smaller" way to store mostly-ascii text.

I _think_ the conversation was steering toward an encoding-specified string
dtype, so us anglo-centric folks could use latin-1 or utf-8.

But someone would need to write the code.

-CHB

> There can be some convenience methods for ascii operations,
> > like eg char.toupper(), but currently they don't seem to work with
> integer
> > arrays so why not make those potentially useful methots usable
> > and make them work on normal integer arrays?
> I don't know what you're doing, but I don't think numpy is normally the
> right tool for text manipulation...
>

I agree here. But if one were to add such a thing (vectorized string
operations) -- I'd think the thing to do would be to wrap (or port) the
python string methods. But it shoudl only work for actual string dtypes, of
course.

note that another part of the discussion previously suggested that we have
a dtype that wraps a native python string object -- then you'd get all for
free. This is essentially an object array with strings in it, which you can
do now.

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6

Re: [Numpy-discussion] Array and string interoperability

2017-06-05 Thread Chris Barker
On Mon, Jun 5, 2017 at 1:51 PM, Thomas Jollans  wrote:

> > and overloading fromstring() to mean both "binary dump of data" and
> > "parse the text" due to whether the sep argument is set was always a
> > bad idea :-(
> >
> > .. and fromstring(s, sep=a_sep_char)
>
> As it happens, this is pretty much what stdlib bytearray does since 3.2
> (http://bugs.python.org/issue8990)


I'm not sure that the array.array.fromstring() ever parsed the data string
as text, did it?

Anyway, This is what array.array now has:
array.frombytes(s)

Appends items from the string, interpreting the string as an array of
machine values (as if it had been read from a file using the
fromfile() method).
New in version 3.2: fromstring() is renamed to frombytes() for clarity.

array.fromfile(f, n)

Read n items (as machine values) from the file object f and append them to
the end of the array. If less than n items are available, EOFError is
raised, but the items that were available are still inserted into the
array. f must be a real built-in file object; something else with a read()
method won’t do.

array.fromstring()

Deprecated alias for frombytes().

I think numpy should do the same. And frombytes() should remove the "sep"
parameter. If someone wants to write a fast, efficient, simple text parser,
then it should get a new name: fromtext() maybe??? And the fromfile() sep
argument should be deprecated as well, for the same reasons. array also has:

array.fromunicode(s)

Extends this array with data from the given unicode string. The array must
be a type 'u' array; otherwise a ValueError is raised.
Use array.frombytes(unicodestring.encode(enc)) to append Unicode data to an
array of some other type.

which I think would be better supported by:

np.frombytes(str.encode('UCS-4'), dtype=uint32)
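
(np.frombytes doesn't exist yet, of course -- with today's numpy the same
thing is spelled with frombuffer, using an explicit-endian codec so there is
no BOM to strip:

np.frombuffer("abc".encode("utf-32-le"), dtype='<u4')
# -> array([97, 98, 99], dtype=uint32)
)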
-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Array and string interoperability

2017-06-06 Thread Chris Barker
On Mon, Jun 5, 2017 at 3:59 PM, Mikhail V  wrote:

> -- classify by "forward/backward" conversion:
> For this time consider only forward, i.e. I copy data from string
> to numpy array
>
> -- classify by " bytes  vs  ordinals ":
>
> a)  bytes:  If I need raw bytes - in this case e.g.
>
>   B = bytes(s.encode())
>

no need to call "bytes" -- encode() returns a bytes object:

In [1]: s = "this is a simple ascii-only string"

In [2]: b = s.encode()

In [3]: type(b)

Out[3]: bytes

In [4]: b

Out[4]: b'this is a simple ascii-only string'


>
> will do it. then I can copy data to array. So currently there are methods
> coverings this. If I understand correctly the data extracted corresponds
> to utf-??  byte feed, i.e. non-constant byte-length of chars (1 up to
> 4 bytes per char for
> the 'wide' unicode, correct me if I am wrong).
>

In [5]: s.encode?
Docstring:
S.encode(encoding='utf-8', errors='strict') -> bytes

So the default is utf-8, but you can set any encoding you want (that python
supports)

 In [6]: s.encode('utf-16')

Out[6]: b'\xff\xfet\x00h\x00i\x00s\x00 \x00i\x00s\x00 \x00a\x00
\x00s\x00i\x00m\x00p\x00l\x00e\x00
\x00a\x00s\x00c\x00i\x00i\x00-\x00o\x00n\x00l\x00y\x00
\x00s\x00t\x00r\x00i\x00n\x00g\x00'



> b):  I need *ordinals*
>   Yes, I need ordinals, so for the bytes() method, if a Python 3
> string contains only
>   basic ascii, I can so or so convert to bytes then to integer array
> and the length will
>   be the same 1byte for each char.
>   Although syntactically seen, and with slicing, this will look e.g. like:
>
> s= "012 abc"
> B = bytes(s.encode())  # convert to bytes
> k  = len(s)
> arr = np.zeros(k,"u1")   # init empty array length k
> arr[0:2] = list(B[0:2])
> print ("my array: ", arr)
> ->
> my array:  [48 49  0  0  0  0  0]
>

This can be done more cleanly:

In [15]: s= "012 abc"

In [16]: b = s.encode('ascii')

# you want to use the ascii encoding so you don't get utf-8 cruft if there
# are non-ascii characters
# you could use latin-1 too (or any other one-byte per char encoding)

In [17]: arr = np.fromstring(b, np.uint8)
# this is using fromstring() in its old py definition -- treat the
# contents as bytes
# -- it really should be called "frombytes()"
# you could also use:

In [22]: np.frombuffer(b, dtype=np.uint8)
Out[22]: array([48, 49, 50, 32, 97, 98, 99], dtype=uint8)

In [19]: print(arr)
[48 49 50 32 97 98 99]

# you got the ordinals

In [20]: "".join([chr(i) for i in arr])
Out[20]: '012 abc'

# yes, they are the right ones...



> Result seems correct. Note that I also need to use list(B), otherwise
> the slicing does not work (fills both values with 1, no idea where 1
> comes from).
>

that is odd -- I can't explain it right now either...


> Or I can write e.g.:
> arr[0:2] = np.fromstring(B[0:2], "u1")
>
> But looks indeed like a 'hack' and not so simple.
>

is the above OK?


> -- classify "what is maximal ordinal value in the string"
> Well, say, I don't know what is maximal ordinal, e.g. here I take
> 3 Cyrillic letters instead of 'abc':
>
> s= "012 АБВ"
> k  = len(s)
> arr = np.zeros(k,"u4")   # init empty 32 bit array length k
> arr[:] = np.fromstring(np.array(s),"u4")
> ->
> [  48   49   50   32 1040 1041 1042]
>

so this is making a numpy string, which is UCS-4 encoded unicode -- i.e.
4 bytes per character. Then you are converting that to a 4-byte unsigned
int. But no need to do it with fromstring:

In [52]: s
Out[52]: '012 АБВ'

In [53]: s_arr.reshape((1,)).view(np.uint32)
Out[53]: array([  48,   49,   50,   32, 1040, 1041, 1042], dtype=uint32)

we need the reshape() because .view does not work with array scalars -- not
sure why not?

> This gives correct results indeed. So I get my ordinals as expected.
> So this is better/preferred way, right?
>

I would maybe do it more "directly" -- i.e. use python's string to do the
encoding:

In [64]: s
Out[64]: '012 АБВ'

In [67]: np.fromstring(s.encode('U32'), dtype=np.uint32)
Out[67]: array([65279,48,49,50,32,  1040,  1041,  1042],
dtype=uint32)

that first value is the byte-order mark (I think...), you  can strip it off
with:

In [68]: np.fromstring(s.encode('U32')[4:], dtype=np.uint32)
Out[68]: array([  48,   49,   50,   32, 1040, 1041, 1042], dtype=uint32)

or, probably better simply specify the byte order in the encoding:

In [69]: np.fromstring(s.encode('UTF-32LE'), dtype=np.uint32)
Out[69]: array([  48,   49,   50,   32, 1040, 1041, 1042], dtype=uint32)

arr = np.ordinals(s)
> arr[0:2] = np.ordinals(s[0:2])  # with slicing
>
> or, e.g. in such format:
>
> arr = np.copystr(s)
> arr[0:2] = np.copystr(s[0:2])
>

I don't think any of this is necessary -- the UCS4 (Or UTF-32) "encoding"
is pretty much the ordinals anyway.

As you notices, if you make a numpy unicode string array, and change the
dtype to unsigned int32, you get what you want.

You really don't want to mess with any of this unless you understand
unicode and encodings anyway

Though it is a bit awkward -- why 

Re: [Numpy-discussion] Array and string interoperability

2017-06-06 Thread Chris Barker
On Mon, Jun 5, 2017 at 4:06 PM, Mikhail V  wrote:

> Likely it was about some new string array type...


yes, it was.

> Obviously there is demand. Terror of unicode touches many aspects

> of programmers life.


I don't know that I'd call it Terror, but frankly, the fact that you need
up to 4 bytes for a single character is really not the big issues. Given
that computer memory has grown by literally orders of magnitude since
Unicode was introduced, I don't know why there is such a hang up about it.

But we're scientific programmers -- we like to be efficient!


> Foremost, it comes down to the question of defining this "optimal
> 8-bit character table".
> And "Latin-1", (exactly as it is)  is not that optimal table,


there is no such thing as a single "optimal" set of characters when you are
limited to 255 of them...

latin-1 is pretty darn good for the, well, latin-based languages


> But, granted, if define most accented letters as
> "optional", i.e . delete them
> then it is quite reasonable basic char table to start with.
>

Then you are down to ASCII, no?

but anyway, I don't think a new encoding is really the topic at hand
here

>> I don't know what you're doing, but I don't think numpy is normally the
> >> right tool for text manipulation...
> >
> >
> > I agree here. But if one were to add such a thing (vectorized string
> > operations) -- I'd think the thing to do would be to wrap (or port) the
> > python string methods. But it shoudl only work for actual string dtypes,
> of
> > course.
> >
> > note that another part of the discussion previously suggested that we
> have a
> > dtype that wraps a native python string object -- then you'd get all for
> > free. This is essentially an object array with strings in it, which you
> can
> > do now.
> >
>
> Well here I must admit I don't quite understand the whole idea of
> "numpy array of string type". How used? What is main bebefit/feature...?
>

here you go -- you can do this now:

In [74]: s_arr = np.array([s, "another string"], dtype=np.object)

In [75]: s_arr
Out[75]: array(['012 АБВ', 'another string'], dtype=object)

In [76]: s_arr.shape
Out[76]: (2,)

You now have an array with python string object in it -- thus access to all
the string functionality:

In [81]: s_arr[1] = s_arr[1].upper()
In [82]: s_arr
Out[82]: array(['012 АБВ', 'ANOTHER STRING'], dtype=object)

and the ability to have each string be a different length.

If numpy were to know that those were string objects, rather than arbitrary
python objects, it could do vectorized operations on them, etc.

You can do that now with numpy.vectorize, but it's pretty klunky.

In [87]: np_upper = np.vectorize(str.upper)
In [88]: np_upper(s_arr)

Out[88]:
array(['012 АБВ', 'ANOTHER STRING'],
      dtype='<U14')

> Example integer array usage in context of textual data in my case:
> - holding data in a text editor (mutability+indexing/slicing)
>

you really want to use regular old python data structures for that...


> - filtering, transformations (e.g. table translations, cryptography, etc.)
>

that may be something to do with ordinals and numpy -- but then you need to
work with ascii or latin-1 and uint8 dtypes, or full Unicode and uint32
dtype -- that's that.

String type array? Will this be a string array you describe:
>
> s= "012 abc"
> arr = np.array(s)
> print ("type ", arr.dtype)
> print ("shape ", arr.shape)
> print ("my array: ", arr)
> arr = np.roll(arr[0],2)
> print ("my array: ", arr)
> ->
> type  <U7
> shape  ()
> my array:  012 abc
> my array:  012 abc
>
>
> So what it does? What's up with shape?
>

shape is an empty tuple, meaning this is a numpy scalar, containing a
single string

type '<U7'

> e.g. here I wanted to 'roll' the string.
> How would I replace chars? or delete?
> What is the general idea behind?
>

the numpy string type (unicode type) works with fixed length strings -- not
characters, but you can reshape it and make a view:

In [89]: s= "012 abc"

In [90]: arr.shape = (1,)

In [91]: arr.shape
Out[91]: (1,)

In [93]: c_arr = arr.view(dtype='<U1')

___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Scipy 2017 NumPy sprint

2017-07-05 Thread Chris Barker
On Mon, Jul 3, 2017 at 4:27 PM, Stephan Hoyer  wrote:

> If someone who does subclasses/array-likes or so (e.g. like Stefan
>> Hoyer ;)) and is interested, and also we do some
>> teleconferencing/chatting (and I have time) I might be interested
>> in discussing and possibly trying to develop the new indexer ideas,
>> which I feel are pretty far, but I got stuck on how to get subclasses
>> right.
>
>
> I am off course very happy to discuss this (online or via teleconference,
> sadly I won't be at scipy), but to be clear I use array likes, not
> subclasses. I think Marten van Kerkwijk is the last one who thinks that is
> still a good idea :).
>

Indeed -- I thought the community more or less had decided that duck-typing
was THE way to make something that could be plugged in where a numpy array
is expected.

Along those lines, there was some discussion of having a set of utilities
(or maybe even an ABC?) that would make it easier to create an ndarray-like
object.

That is, the boilerplate needed for multi-dimensional indexing and slicing,
etc...

That could be a nice little sprint-able project.

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] record data previous to Numpy use

2017-07-06 Thread Chris Barker
OK, you have two performance "issues"

1) memory use: If you need to read a file to build a numpy array, and don't
know how big it is when you start, you need to accumulate the values
first, and then make an array out of them. And numpy arrays are fixed size,
so they can not efficiently accumulate values.

The usual way to handle this is to read the data into a list with .append()
or the like, and then make an array from it. This is quite fast -- lists
are fast and efficient for extending arrays. However, you are then storing
(at least) a pointer and a python float object for each value, which is a
lot more memory than a single float value in a numpy array, and you need to
make the array from it, which means you have the full list and all its
python floats AND the array in memory at once.
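
The basic pattern is just this -- the file name and float parsing are
stand-ins for whatever your real data looks like:

import numpy as np

values = []
with open("data.txt") as f:
    for line in f:
        values.append(float(line))
arr = np.array(values, dtype=np.float64)   # one copy at the very end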

Frankly, computers have a lot of memory these days, so this is a non-issue
in most cases.

Nonetheless, a while back I wrote an extendable numpy array object to
address just this issue. You can find the code on gitHub here:

https://github.com/PythonCHB/NumpyExtras/blob/master/numpy_extras/accumulator.py

I have not tested it with recent numpy's but I expect it still works fine.
It's also py2, but wouldn't take much to port.

In practice, it uses less memory than the "build a list, then make it into
an array" approach, but isn't any faster, unless you add (.extend) a bunch of values
at once, rather than one at a time. (if you do it one at a time, the whole
python float to numpy float conversion, and function call overhead takes
just as long).

But it will generally be as fast or faster than using a list, and use
less memory, so a fine basis for a big ascii file reader.

However, it looks like while your files may be huge, they hold a number of
arrays, so each array may not be large enough to bother with any of this.

2) parsing and converting overhead -- for the most part, python/numpy text
file reading code read the text into a python string, converts it to python
number objects, then puts them in a list or converts them to native numbers
in an array. This whole process is a bit slow (though reading files is slow
anyway, so usually not worth worrying about, which is why the built-in file
reading methods do this). To improve this, you need to use code that reads
the file and parses it in C, and puts it straight into a numpy array
without passing through python. This is what the pandas (and I assume
astropy) text file readers do.

But if you don't want those dependencies, there is the "fromfile()"
function in numpy -- it is not very robust, but if your files are
well-formed, then it is quite fast. So your code would look something like:

with open(the_filename) as infile:
    while True:
        line = infile.readline()
        if not line:
            break
        # work with line to figure out the next block
        if ready_to_read_a_block:
            arr = np.fromfile(infile, dtype=np.int32, count=num_values,
                              sep=' ')
            # sep specifies that you are reading text, not binary!
            arr.shape = the_shape_it_should_be


But Robert is right -- get it to work with the "usual" methods -- i.e. put
numbers in a list, then make an array out of it -- first, and then worry about
making it faster.

-CHB


On Thu, Jul 6, 2017 at 1:49 AM,  wrote:

> Dear All
>
>
> First of all thanks for the answers and the information’s (I’ll ding into
> it) and let me trying to add comments on what I want to :
>
>1. My asci file mainly contains data (float and int) in a single column
>2. (it is not always the case but I can easily manage it – as well I
>saw I can use ‘spli’ instruction if necessary)
>3. Comments/texts indicates the beginning of a bloc immediately
>followed by the number of sub-blocs
>4. So I need to read/record all the values in order to build a matrix
>before working on it (using Numpy & vectorization)
>   - The columns 2 and 3 have been added for further treatments
>   - The ‘0’ values will be specifically treated afterward
>
>
> Numpy won’t be a problem I guess (I did some basic tests and I’m quite
> confident) on how to proceed, but I’m really blocked on data records … I
> trying to find a way to efficiently read and record data in a matrix:
>
>- avoiding dynamic memory allocation (here using ‘append’ in python
>meaning, not np),
>- dealing with huge asci file: the latest file I get contains more
>than *60 million of lines*
>
>
> Please find in attachment an extract of the input format
> (‘example_of_input’), and the matrix I’m trying to create and manage with
> Numpy
>
>
> Thanks again for your time
>
> Paul
>
>
> ###
>
> ##BEGIN *-> line number x in the original file*
>
> 42   *-> indicates the number of sub-blocs*
>
> 1 *-> number of the 1rst sub-bloc*
>
> 6 *-> gives how many value belong to the sub bloc*
>
> 12
>
> 47
>
> 2
>
> 46
>
> 3
>
> 51
>
> ….
>
> 13  * -> another type of sub-bloc with 25 values*
>
> 25
>
> 15
>
> 88

Re: [Numpy-discussion] Making a 1.13.2 release

2017-07-06 Thread Chris Barker
On Thu, Jul 6, 2017 at 6:10 AM, Charles R Harris 
wrote:

> I've delayed the NumPy 1.13.2 release hoping for Python 3.6.2 to show up
> fixing #29943   so we can close #9272
> , but the Python release has
> been delayed to July 11 (expected). The Python problem means that NumPy
> compiled with Python 3.6.1 will not run in Python 3.6.0.
>

If it's compiled against 3.6.0 will it work fine with 3.6.1? and probably
3.6.2 as well?

If so, it would be nice to do it that way, if Matthew doesn't mind :-)

But either way, it'll be good to get it out.

Thanks!

-CHB



> However, I've also been asked to have a bugfixed version of 1.13 available
> for Scipy 2017 next week. At this point it looks like the best thing to do
> is release 1.13.1 compiled with Python 3.6.1 and ask folks to upgrade
> Python if they have a problem, and then release 1.13.2 as soon as 3.6.2 is
> released.
>
> Thoughts?
>
> Chuck
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Scipy 2017 NumPy sprint

2017-07-06 Thread Chris Barker
On Wed, Jul 5, 2017 at 11:05 AM, Stephan Hoyer  wrote:

> That is, the boilerplate needed for multi-dimensional indexing and
>> slicing, etc...
>>
>> That could be a nice little sprint-able project.
>>
>
> Indeed. Let me highlight a few mixins
> 
>  that
> I wrote for xarray that might be more broadly useful.
>

At a quick glance, that is exactly the kind of thing I had in mind.

The challenge here is that there are quite a few different meanings to
> "ndarray-like", so mixins really need to be mix-and-match-able.
>

exactly!


> But at least defining a base list of methods to implement/override would
> be useful.
>

With sample implementations, even... at least of parts of it -- I'm thinking
things like parsing out the indexes/slices in __getitem__ -- that sort of
thing.



> In NumPy, this could go along with NDArrayOperatorsMixins in
> numpy/lib/mixins.py
> 
>

Yes! I had no idea that existed.

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] record data previous to Numpy use

2017-07-06 Thread Chris Barker
On Thu, Jul 6, 2017 at 10:55 AM,  wrote:
>
> It's is just a reflexion, but for huge files one solution might be to
> split/write/build first the array in a dedicated file (2x o(n) iterations -
> one to identify the blocks size - additional one to get and write), and
> then to load it in memory and work with numpy -
>

I may have your use case confused, but if you have a huge file with
multiple "blocks" in it, there shouldn't be any problem with loading it in
one go -- start at the top of the file and load one block at a time
(accumulating in a list) -- then you only have the memory overhead issues
for one block at a time, should be no problem.

at this stage the dimension is known and some packages will be fast and
> more adapted (pandas or astropy as suggested).
>
pandas at least is designed to read variations of CSV files, not sure you
could use the optimized part to read an array out of part of an open file
from a particular point or not.

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] pytest and degrees of separation.

2017-07-11 Thread Chris Barker
On Tue, Jul 11, 2017 at 5:04 PM, Thomas Caswell  wrote:

> Going with option 2 is probably the best option so that you can use pytest
> fixtures and parameterization.
>

I agree -- those are worth a lot!

-CHB



> Might be worth looking at how Matplotlib re-arranged things on our master
> branch to maintain back-compatibility with nose-specific tools that were
> used by down-stream projects.
>
> Tom
>
> On Tue, Jul 11, 2017 at 4:22 PM Sebastian Berg 
> wrote:
>
>> On Tue, 2017-07-11 at 14:49 -0600, Charles R Harris wrote:
>> > Hi All,
>> >
>> > Just looking for opinions and feedback on the need to keep NumPy from
>> > having a hard nose/pytest dependency. The options as I see them are:
>> >
>> > pytest is never imported until the tests are run -- current practice
>> > with nose
>> > pytest is never imported unless the testfiles are imported -- what I
>> > would like
>> > pytest is imported together when numpy is -- what we need to avoid.
>> > Currently the approach has been 1), but I think 2) makes more sense
>> > and allows more flexibility.
>>
>>
>> I am not quite sure about everything here. My guess is we can do
>> whatever we want when it comes to our own tests, and I don't mind just
>> switching everything to pytest (I for one am happy as long as I can run
>> `runtests.py` ;)).
>> When it comes to the utils we provide, those should keep working
>> without nose/pytest if they worked before without it I think.
>>
>> My guess is that all your options do that, so I think we should take
>> the one that gives the nicest maintainable code :). Though can't say I
>> looked enough into it to really make a well educated decision, that
>> probably means your option 2.
>>
>> - Sebastian
>>
>>
>>
>> > Thoughts?
>> > Chuck
>> > ___
>> > NumPy-Discussion mailing list
>> > NumPy-Discussion@python.org
>> > https://mail.python.org/mailman/listinfo/numpy-discussion
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] quantile() or percentile()

2017-08-14 Thread Chris Barker
+1 on quantile()

-CHB


On Sun, Aug 13, 2017 at 6:28 AM, Charles R Harris  wrote:

>
>
> On Thu, Aug 10, 2017 at 3:08 PM, Eric Wieser 
> wrote:
>
>> Let’s try and keep this on topic - most replies to this message has been
>> about #9211, which is an orthogonal issue.
>>
>> There are two main questions here:
>>
>>1. Would the community prefer to use np.quantile(x, 0.25) instead of 
>> np.percentile(x,
>>25), if they had the choice
>>2. Is this desirable enough to justify increasing the API surface?
>>
>> The general consensus on the github issue answers yes to 1, but is
>> neutral on 2. It would be good to get more opinions.
>>
>
> I think a quantile function would be natural and desirable.
>
> 
>
> Chuck
>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Tensor Contraction (HPTT) and Tensor Transposition (TCL)

2017-08-17 Thread Chris Barker
On Thu, Aug 17, 2017 at 12:55 AM, Sebastian Berg  wrote:

> > How would the process look like if NumPY is distributed as a
> > precompiled binary?
>
>
> Well, numpy is BSD, and the official binaries will be BSD, someone else
> could do less free binaries of course.


Indeed, if you want it to be distributed as a binary with numpy, then the
license needs to be compatible -- do you have a substantial objection to
BSD? The BSD family is pretty much the standard for Python -- Python (and
numpy) are very broadly used in proprietary software.

I doubt we can have a hard
> dependency unless it is part of the numpy source


and no reason to -- if it is a hard dependency, it HAS to be compatible
licensed, and it's a lot easier to keep the source together.

However, it _could_ be a soft dependency, like LAPACK/BLAS -- I've honestly
lost track, but numpy used to come with a lapack-lite (or some such), so that
it could be compiled and work with no external LAPACK implementation -- you
wouldn't get the best performance, but it would work.

 I doubt including the source
> itself is going to happen quickly since we would first have to decide
> to actually use a modern C++ compiler (I have no idea if that is
> problematic or not).
>

could it be there as a conditional compilation? There is a lot of push to
support C++11 elsewhere, so a compiled-with-a-modern-compiler numpy is not
SO far off..

(for py3 anyway...)


* Use TCL if you need faster einsum(like) operations
>

That is, of course, the other option -- distribute it on its own or maybe
in scipy, and then users can use it as an optimization for those few core
functions where speed matters to them -- honestly, it's a pretty small
fraction of numpy code.

But it sure would be nice if it could be built in, and then folks would get
better performance without even thinking about it.


> Just a few thoughts, did not think about details really. But yes, it is
> sounds reasonable to me to re-add support for optional dependencies
> such as fftw or your TCL. But packagers have to make use of that or I
> fear it is actually less available than a standalone python module.
>

true -- though I expect Anaconda / conda forge at least would be likely to
pick it up if it works well.

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Why are empty arrays False?

2017-08-22 Thread Chris Barker
On Mon, Aug 21, 2017 at 7:34 AM, Benjamin Root  wrote:

> I've long ago stopped doing any "emptiness is false"-type tests on any
> python containers when iterators and generators became common, because they
> always return True.
>

good point.

Personally, I've thought for years that Python's "Truthiness" concept is a
wart. Sure, empty sequences, and zero values are often "False" in nature,
but truthiness really is application-dependent -- in particular, sometimes
a value of zero is meaningful, and sometimes not.

Is it really so hard to write:

if len(seq) == 0:

or

if x == 0:

or

if arr.size == 0:

or

if arr.shape == (0, 0):

And then you are being far more explicit about what the test really is.

And thanks Ben, for pointing out the issue with iterables. One more example
of how Python has really changed its focus:

Python 2 (or maybe, Python1.5) was all about sequences. Python 3 is all
about iterables -- and the "empty is False" concept does not map well to
iterables

As to the topic at hand, if we had it to do again, I would NOT make an
array that happens to hold a single value act like a scalar for bool() -- a
1-D array that happens to be length-1 really is a different beast than a
scalar.

But we don't have it to do again -- so we probably need to keep it as it is
for backward compatibility.
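
For reference, the current behavior being discussed:

bool(np.array([]))        # False -- empty array
bool(np.array([0]))       # False -- length-1 array defers to its single value
bool(np.array([1, 2]))    # raises ValueError: truth value is ambiguous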

-CHB



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Why are empty arrays False?

2017-08-22 Thread Chris Barker
On Tue, Aug 22, 2017 at 11:04 AM, Michael Lamparski <
diagonaldev...@gmail.com> wrote:

> I think truthiness is easily a wart in any dynamically-typed language (and
> yet ironically, every language I can think of that has truthiness is
> dynamically typed except for C++).  And yet for some reason it seems to be
> pressed forward as idiomatic in python, and for that reason alone, I use
> it.
>

me too :-)


> Meanwhile, for an arbitrary iterator taken as an argument, if you want it
> to have at least one element for some reason, then good luck; truthiness
> will not help you.
>

of course, nor will len()

And this is mostly OK, as if you are taking an arbitrary iterable, then you
are probably going to, well, iterate over it, and:

for this in an_empty_iterable:
...

works fine.

But bringing it  back OT -- it's all a bit messy, but there is logic for
the existing conventions in numpy -- and I think backward compatibility is
more important than a slightly cleaner API.

-CHB



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Interface numpy arrays to Matlab?

2017-08-29 Thread Chris Barker
On Tue, Aug 29, 2017 at 4:08 AM, Neal Becker  wrote:

> Transplant sounds interesting, I think I could use this.  I don't
> understand though why nobody has used a more direct approach?  Matlab has
> their python API https://www.mathworks.com/help/matlab/matlab-engine-for-
> python.html.  This will pass Matlab arrays to/from python as some kind of
> opaque blob.  I would guess that inside every Matlab array is a numpy array
> crying to be freed - in both cases an array is a block of memory together
> with shape and stride information.  So I would hope a direct conversion
> could be done, at least via C API if not directly with python numpy API.
>

I agree -- it is absolutely bizarre that they haven't built in a numpy
array <-> matlab array mapping!

Maybe they don't want MATLAB users to realize that numpy provides most of
what MATLAB does (but better :-) ) -- and want people to use Python with
MATLAB for other pythonic stuff that MATLAB doesn't do well

but they do provide a mapping for array.array:

https://www.mathworks.com/help/matlab/matlab_external/use-python-array-array-types.html

which is a buffer you can wrap a numpy array around efficiently.
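
e.g. on the Python side, once you have the array.array back, wrapping it is
one call (the values here are just stand-ins):

import array
import numpy as np

buf = array.array('d', [1.0, 2.0, 3.0])
a = np.frombuffer(buf, dtype=np.float64)   # zero-copy view of the buffer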

odd that you'd have to write that code.

-CHB



> But it seems nobody has done this, so maybe it's not that simple?
>
>
> On Mon, Aug 28, 2017 at 5:32 PM Gregory Lee  wrote:
>
>> I have not used Transplant, but it sounds fairly similar to
>> Python-matlab-bridge.  We currently optionally call Matlab via
>> Python-matlab-bridge in some of the the tests for the PyWavelets package.
>>
>> https://arokem.github.io/python-matlab-bridge/
>> https://github.com/arokem/python-matlab-bridge
>>
>> I would be interested in hearing about the benefits/drawbacks relative to
>> Transplant if there is anyone who has used both.
>>
>>
>> On Mon, Aug 28, 2017 at 4:29 PM, CJ Carey 
>> wrote:
>>
>>> Looks like Transplant can handle this use-case.
>>>
>>> Blog post: http://bastibe.de/2015-11-03-matlab-engine-performance.html
>>> GitHub link: https://github.com/bastibe/transplant
>>>
>>> I haven't given it a try myself, but it looks promising.
>>>
>>> On Mon, Aug 28, 2017 at 4:21 PM, Stephan Hoyer  wrote:
>>>
 If you can use Octave instead of Matlab, I've had a very good
 experience with Oct2Py:
 https://github.com/blink1073/oct2py

 On Mon, Aug 28, 2017 at 12:20 PM, Neal Becker 
 wrote:

> I've searched but haven't found any decent answer.  I need to call
> Matlab from python.  Matlab has a python module for this purpose, but it
> doesn't understand numpy AFAICT.  What solutions are there for efficiently
> interfacing numpy arrays to Matlab?
>
> Thanks,
> Neal
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@python.org
 https://mail.python.org/mailman/listinfo/numpy-discussion


>>>
>>> ___
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion@python.org
>>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>>
>>>
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] converting list of int16 values to bitmask and back to list of int32\float values

2017-09-19 Thread Chris Barker
not sure what you are getting from:

Modbus.read_input_registers()

but if it is a binary stream then you can put it all in one numpy array
(probably type uint8 (byte)).

then you can manipulate the type with arr.view(), arr.astype() and
arr.byteswap().

view() will tell numpy to interpret the same block of data as a different
type (astype() converts the values instead).

You also may be able to create the array with np.fromstring() or
np.frombuffer() in the first place.
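
For instance, something along these lines (a sketch only -- the register
values are made up, and you'll need to check your device's word order):

import numpy as np

# made-up register values -- four 16-bit registers holding two 32-bit values
registers = [16457, 3021, 17096, 49152]

regs = np.array(registers, dtype='>u2')   # registers as big-endian 16-bit words

# adjacent register pairs reinterpreted as 32-bit values (big-endian word order)
as_int32 = regs.view('>i4')
as_float32 = regs.view('>f4')

# for little-endian word order, swap each pair of registers first
swapped = np.ascontiguousarray(regs.reshape(-1, 2)[:, ::-1])
as_float32_le = swapped.view('>f4').ravel()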

-CHB





On Thu, Sep 14, 2017 at 10:11 AM, Nissim Derdiger 
wrote:

> Hi all!
>
> I'm writing a Modbus TCP client using *pymodbus3* library.
>
> When asking for some parameters, the response is always a list of int16.
>
> In order to make the values usable, I need to transfer them into 32bit
> bites, than put them in the correct order (big\little endian wise), and
> then to cast them back to the desired format (usually int32 or float)
>
> I've solved it with a pretty naïve code, but I'm guessing there must be a
> more elegant and fast way to solve it with NumPy.
>
> Your help would be very much appreciated!
>
> Nissim.
>
>
>
> My code:
>
> def Read(StartAddress, NumOfRegisters, FunctionCode,ParameterType,
> BitOrder):
>
> # select the Parameters format
>
> PrmFormat = 'f' # default is float
>
> if ParameterType == 'int':
>
> PrmFormat = 'i'
>
> # select the endian state - maybe move to the connect
> function?
>
> endian = '<I'
> if BitOrder == 'little':
>
> endian = '>I'
>
> # start asking for the payload
>
> payload = None
>
> while payload == None:
>
> payload = Modbus.read_input_registers(StartAddress,
> NumOfRegisters)
>
> # parse the answer
>
> ResultRegisters = []
>
> # convert the returned registers from list of int16 to
> list of 32 bits bitmaks
>
> for reg in range(int(NumOfRegisters / 2)):
>
> ResultRegisters[reg] =
> struct.pack(endian, payload.registers[2 * reg]) +
> struct.pack(endian,payload.registers[2 * reg + 1])
>
> # convert this list to a list with the real parameter
> format
>
> for reg in range(len(ResultRegisters)):
>
> ResultRegisters[reg]=
> struct.unpack(PrmFormat,ResultRegisters(reg))
>
> # return results
>
> return ResultRegisters
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] is __array_ufunc__ ready for prime-time?

2017-11-06 Thread Chris Barker
On Sat, Nov 4, 2017 at 6:47 AM, Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

>
> You just summarized excellently why I'm on a quest to change `asarray`
> to `asanyarray` within numpy


+1 -- we should all be using asanyarray() most of the time. However, a
couple of notes:

asarray() pre-dates asanyarray() by a LOT. asanyarray was added to better
handle subclasses, but there is a lot of legacy code out there.

And legacy coders -- I know that I still usually use asarray without
thinking about it -- sorry!
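
For anyone who hasn't looked at the difference recently, it is just whether
subclasses get passed through -- e.g. (np.matrix used here only as a handy
subclass to demonstrate):

import numpy as np

m = np.matrix([[1, 2], [3, 4]])

type(np.asarray(m))     # numpy.ndarray -- subclass stripped
type(np.asanyarray(m))  # numpy.matrix  -- subclass passed through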

Obviously, this covers only ndarray
> subclasses, not duck types, though I guess in principle one could use
> the ABC registration mechanism mentioned above to let those types pass
> through.
>

The trick there is that what does it mean to be duck-typed to an ndarray?
For many applications it's critical that the C API be the same, so
duck-typing doesn't really apply.

And in other cases, it only needs to support a small portion of the numpy
API. In essence, there are an almost infinite number of possible ABCs for
an ndarray...

For my part, I've been known to write custom "array_like" code -- it checks
for the handful of methods I know I need to use, and I test it against the
small handful of duck-typed arrays that I know I want my code to work with.

Klunky, and maybe we could come up with a standard way to do it and include
that in numpy, but I'm not sure that ABCs are the way to do it.


-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal of timeline for dropping Python 2.7 support

2017-11-06 Thread Chris Barker
On Sun, Nov 5, 2017 at 10:25 AM, Charles R Harris  wrote:


>  the timeline I've been playing with is to keep Python 2.7 support through
> 2018, which given our current pace, would be for NumPy 1.15 and 1.16. After
> that 1.16 would become a long term support release with backports of
> critical bug fixes
>

+1

I think py2.7 is going to be around for a long time yet -- which means we
really do want to keep the long term support -- which may be quite some
time. But that doesn't mean people insisting on not upgrading Python need
to get the latest and greatest numpy.

Also -- if py2.7 continues to see the use I expect it will well past when
python.org officially drops it, I wouldn't be surprised if a Python 2.7
Windows build based on a newer compiler would come along -- perhaps by
Anaconda or conda-forge, or ???

If that happens, I suppose we could re-visit 2.7 support. Though it sure
would be nice to clean up the dang Unicode stuff for good, too!

In short, if it makes it easier for numpy to move forward, let's do it!

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal of timeline for dropping Python 2.7 support

2017-11-07 Thread Chris Barker
On Mon, Nov 6, 2017 at 6:14 PM, Charles R Harris 
wrote:

> Also -- if py2.7 continues to see the use I expect it will well past when
>>> pyton.org officially drops it, I wouldn't be surprised if a Python2.7
>>> Windows build based on a newer compiler would come along -- perhaps by
>>> Anaconda or conda-forge, or ???
>>>
>>
>> I suspect that this will indeed happen. I am aware of multiple companies
>> following this path already (building python + numpy themselves with a
>> newer MS compiler).
>>
>
> I think Anaconda is talking about distributing a compiler, but what that
> will be on windows is anyone's guess. When we drop 2.7, there is a lot of
> compatibility crud that it would be nice to get rid of, and if we do that
> then NumPy will no longer compile against 2.7. I suspect some companies
> have just been putting off the task of upgrading to Python 3, which should
> be pretty straight forward these days apart from system code that needs to
> do a lot of work with bytes.
>

I agree, and if there is a compelling reason to upgrade, folks WILL do it.
But I've been amazed over the years at folks' desire to stick with what
they have! And I'm guilty too, anything new I start with py3, but older,
larger codebases are still py2 -- I just can't find the energy to spend the
week or so it would probably take to update everything...

But in the original post, the Windows Compiler issue was mentioned, so
there seems to be two reasons to drop py2:

A) wanting to use py3 only features.
B) wanting to use newer C (C++?) compiler features.

I suggest we be clear about which of these is driving the decisions, and
explicit about the goals. That is, if (A) is critical, we don't even have
to talk about (B)

But we could choose to do (B) without doing (A) -- I suspect there will be
a user base for that

-CHB




-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] is __array_ufunc__ ready for prime-time?

2017-11-07 Thread Chris Barker
On Mon, Nov 6, 2017 at 4:28 PM, Stephan Hoyer  wrote:

>
>> What's needed, though, is not just a single ABC. Some thought and design
>> needs to go into segmenting the ndarray API to declare certain behaviors,
>> just like was done for collections:
>>
>> https://docs.python.org/3/library/collections.abc.html
>>
>> You don't just have a single ABC declaring a collection, but rather "I am
>> a mapping" or "I am a mutable sequence". It's more of a pain for developers
>> to properly specify things, but this is not a bad thing to actually give
>> code some thought.
>>
>
> I agree, it would be nice to nail down a hierarchy of duck-arrays, if
> possible. Although, there are quite a few options, so I don't know how
> doable this is.
>

Exactly -- there are an exponential number of options...


> Well, to get the ball rolling a bit, the key thing that matplotlib needs
> to know is if `shape`, `reshape`, 'size', broadcasting, and logical
> indexing is respected. So, I see three possible abc's here: one for
> attribute access (things like `shape` and `size`) and another for shape
> manipulations (broadcasting and reshape, and assignment to .shape).


I think we're going to get into a string of ABCs:

ArrayLikeForMPL_ABC

etc, etc.


> And then a third abc for indexing support, although, I am not sure how
> that could get implemented...


This is the really tricky one -- all ABCs really check is the existence of
methods -- making sure they behave the same way is up to the developer of
the ducktype.

which is OK, but will require discipline.

But indexing, specifically fancy indexing, is another matter -- I'm not
sure if there is even a way with an ABC to check for what types of indexing
are supported, but we'd still have the problem of whether the semantics are
the same!

For example, I work with netcdf variable objects, which are partly
duck-typed as ndarrays, but I think n-dimensional fancy indexing works
differently... how in the world do you detect that with an ABC???

For the shapes and reshaping, I wrote an ShapedLikeNDArray mixin/ABC
> for astropy, which may be a useful starting point as it also provides
> a way to implement the methods ndarray uses to reshape and get
> elements: see
> https://github.com/astropy/astropy/blob/master/astropy/utils/misc.py#L863


Sounds like a good starting point for discussion.
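
To make the discussion a bit more concrete, a minimal version of one such
ABC might look something like this (a sketch only -- the names are made up,
and, as noted above, it can only check that the attributes exist, not that
they behave like ndarray's):

from abc import ABC, abstractmethod

class ShapedLike(ABC):
    """Objects that have a shape and can be reshaped (existence check only)."""

    @property
    @abstractmethod
    def shape(self): ...

    @property
    @abstractmethod
    def size(self): ...

    @abstractmethod
    def reshape(self, *shape): ...

    @classmethod
    def __subclasshook__(cls, C):
        if cls is ShapedLike:
            return all(hasattr(C, name) for name in ('shape', 'size', 'reshape'))
        return NotImplemented

# isinstance(np.ones(3), ShapedLike) -> True, as would any class
# defining those three attributes -- whatever they actually do.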

-CHB



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal of timeline for dropping Python 2.7 support

2017-11-08 Thread Chris Barker
On Wed, Nov 8, 2017 at 11:08 AM, Julian Taylor <
jtaylor.deb...@googlemail.com> wrote:

>
> Would dropping python2 support for windows earlier than the other
> platforms a reasonable approach?
>

no. I'm not a Windows fan myself, but it is a HUGE fraction of the userbase.

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal of timeline for dropping Python 2.7 support

2017-11-13 Thread Chris Barker
On Fri, Nov 10, 2017 at 2:03 PM, Robert McLeod  wrote:

> Pip repo names and actual module names don't have to be the same.  One
> potential work-around would be to make a 'numpylts' repo on PyPi which is
> the 1.17 version with support for Python 2.7 and bug-fix releases as
> required.  This will still cause regressions but it's a matter of modifying
> `requirements.txt` in downstream Python 2.7 packages and not much else.
>
> E.g. in `requirements.txt`:
>
> numpy;python_version>"3.0"
> numpylts; python_version<"3.0"
>


Can't we handle this with numpy versioning?

IIUC, numpy (py3 only) and numpy (LTS) will not only support different
platforms, but also be different versions. So if you have py2 or py2+3 code
that uses numpy, it will have to specify a <= version number anyway.

Also -- I think Nathaniel's point was that wheels have the python version
baked in, so pip, when run from py2, should find the latest py2 compatible
numpy automagically.

And thanks for writing this up -- LGTM

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal of timeline for dropping Python 2.7 support

2017-11-17 Thread Chris Barker
On Fri, Nov 17, 2017 at 4:35 AM, Peter Cock 
wrote:

> Since Konrad Hinsen no longer follows the NumPy discussion list
> for lack of time, he has not posted here - but he has commented
> about this on Twitter and written up a good blog post:
>
> http://blog.khinsen.net/posts/2017/11/16/a-plea-for-
> stability-in-the-scipy-ecosystem/
>
> In a field where scientific code is expected to last and be developed
> on a timescale of decades, the change of pace with Python 2 and 3
> is harder to handle.
>

sure -- but I do not get what the problem is here!

from his post:

"""
The disappearance of Python 2 will leave much scientific software orphaned,
and many published results irreproducible.
"""

This is an issue we should all be concerned about, and, in fact, the scipy
community has been particularly active in the reproducibility realm.

BUT: that statement makes NO SENSE. dropping Python2 support in numpy (or
any other package) means that newer versions of numpy will not run on py2
-- but if you want to reproduce results, you need to run the code WITH THE
VERSION THAT WAS USED IN THE  FIRST PLACE.

So if someone publishes something based on code written in python2.7 and
numpy 1.13, then it is not helpful for reproducibility at all for numpy
1.18 (or 2.*, or whatever we call it) to run on python2. So there is no
issue here.

Potential issues will arise post 2020, when maybe python2.7 (and numpy
1.13) will no longer run on an up to date OS. But the OS vendors do a
pretty good job of backward compatibility -- so we've got quite a few years
to go on that.

And it will also be important that older versions of packages are available
-- but as long as we don't delete the archives, that should be the case for
a good long while.

So not sure what the problem is here.

Not relevant for reproducibility, but I have always been puzzled that folks
often desperately want to run the very latest numpy on an old Python (2.6,
1.5, ...) -- if you can update your numpy, update your darn Python too!

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Deprecate matrices in 1.15 and remove in 1.17?

2017-12-01 Thread Chris Barker
On Thu, Nov 30, 2017 at 11:58 AM, Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> Your point about not doing things in the python 2->3 move makes sense;
>

But this is NOT the 2->3 move -- numpy has been py3 compatible for years. At
some point, it is a really good idea to deprecate some things.

Personally, I think Matrix should have been deprecated a good while ago --
it never really worked well, and folks have been advised not to use it for
years. But anyway, once we can count on having @ then there really is no
reason to have Matrix, so it happens that dropping py2 support is the first
time we can count on that. But this is really deprecating something when we
stop support for py < 3.5, not the py2 to py3 transition.

Remember that deprecating is different than dropping. If we want to keep
Matrix around for one release after py2 is dropped, so that people can use
it once they are "forced" to move to py3, OK, but let's get clear
deprecation plan in place.

Also -- we aren't requiring people to move to py3 -- we are only requiring
people to move to py3 if they want the latest numpy features.

One last note: Guido's suggestion that libraries not take py3 as an
opportunity to change APIs was a good one, but it was also predicated on
the fact that py2+p3 support was going to be needed for a good while. So
this is really a different case. It's really a regular old deprecation --
you aren't going to have this feature in future numpy releases -- py2/3 has
nothing to do with it.

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP process update

2017-12-07 Thread Chris Barker
Great idea -- thanks for pushing this forward all.

In the end, you can have the NEPs in a separate repo, and still publish
them closely with the main docs (intersphinx is pretty cool), or have them
in the same repo and publish them separately.

So I say let the folks doing the work decide what workflow works best for
them.

Comments on a couple other points:

I find myself going back to PEPs quite a bit -- mostly to understand the
hows and whys of a feature, rather than the how-to-use-it.

And yes -- we should keep NEPs updated -- they certainly should be edited
for typos and minor clarifications, but it's particularly important if the
implementation ends up differing a bit from what was expected when the NEP
was written.

I'm not sure what the PEP policy is about this, but they are certainly
maintained with regard to typos and the like.

-CHB


On Wed, Dec 6, 2017 at 10:43 AM, Charles R Harris  wrote:

>
>
> On Wed, Dec 6, 2017 at 7:23 AM, Marten van Kerkwijk <
> m.h.vankerkw...@gmail.com> wrote:
>
>> Would be great to have structure, and especially a template - ideally,
>> the latter is enough for someone to create a NEP, i.e., has lots of
>> in-template documentation.
>>
>> One thing I'd recommend thinking a little about is to what extend a
>> NEP is "frozen" after acceptance. In astropy we've seen situations
>> where it helps to clarify details later, and it may be good to think
>> beforehand what one wants. In my opinion, one should allow
>> clarifications of accepted NEPs, and major editing of ones still
>> pending (as happened for __[numpy|array]_ufunc__).
>>
>> I think the location is secondary, but for what it is worth, I'm not
>> fond of the astropy APEs being in a separate repository, mostly
>> because I like detailed discussion of everything related in the
>> project to happen in one place on github. Also, having to clone a
>> repository is yet another hurdle for doing stuff. So, I'd suggest to
>> keep the NEPs in the main repository.
>
>
> +1
>
> Chuck
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Setting custom dtypes and 1.14

2018-01-25 Thread Chris Barker
Hi all,

I'm pretty sure this is the same thing as recently discussed on this list
about 1.14, but to confirm:

I had failures in my code with an upgrade for 1.14 -- turns out it was a
single line in a single test fixture, so no big deal, but a regression just
the same, with no deprecation warning.

I was essentially doing this:

In [48]: dt

Out[48]: dtype([('time', '<f8'), ('uv', '<f8', (2,))])

In [49]: full = np.array(zip(time, uv), dtype=dt)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-49-...> in <module>()
----> 1 full = np.array(zip(time, uv), dtype=dt)

ValueError: setting an array element with a sequence.


It took some poking, but the solution was to do:

full = np.array(zip(time, (tuple(w) for w in uv)), dtype=dt)

That is, convert the values to nested tuples, rather than an array in a
tuple, or a list in a tuple.

As I said, my problem is solved, but to confirm:

1) This is a known change with good reason?

2) My solution was the best (only) one -- the only way to set a nested
dtype like that is with tuples?

If so, then I think we should:

A) improve the error message.

"ValueError: setting an array element with a sequence."

Is not really clear -- I spent a while trying to figure out how I could set
a nested dtype like that without a sequence? and I was actually using a
ndarray, so it wasn't even a generic sequence. And a tuple is a sequence,
too...

I had a vague recollection that in some circumstances, numpy treats tuples
and lists (and arrays) differently (fancy indexing??), so I tried the tuple
thing and that worked. But I've been around numpy a long time -- that could
have been very very confusing to many people.

So could the message be changed to something like:

"ValueError: setting an array element with a generic sequence. Only the
tuple type can be used in this context."

or something like that -- I'm not sure where else this same error message
might pop up, so that could be totally inappropriate.


2) maybe add a .totuple() method to ndarray, much like the .tolist() method?
that would have been handy here.
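
Something like this, say (a rough sketch of what a .totuple() could do --
not a proposal for the exact semantics):

import numpy as np

def totuple(arr):
    # like arr.tolist(), but building nested tuples
    # (elements are left as numpy scalars in this sketch)
    if isinstance(arr, np.ndarray):
        return tuple(totuple(a) for a in arr)
    return arr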


-Chris


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Setting custom dtypes and 1.14

2018-01-26 Thread Chris Barker
On Fri, Jan 26, 2018 at 10:48 AM, Allan Haldane 
wrote:

> > What do folks think about a totuple() method — even before this I’ve
> > wanted that. But in this case, it seems particularly useful.
>


> Two thoughts:
>
> 1. `totuple` makes most sense for 2d arrays. But what should it do for
> 1d or 3+d arrays? I suppose it could make the last dimension a tuple, so
> 1d arrays would give a list of tuples of size 1.
>

I was thinking it would be exactly like .tolist() but with tuples -- so
you'd get tuples all the way down (or is that turtles?)

In this use case, it would have saved me the generator expression:

(tuple(r) for r in arr)

not a huge deal, but it would be nice to not  have to write that, and to
have the looping be in C with no intermediate array generation.

2. structured array's .tolist() already returns a list of tuples. If we
> have a 2d structured array, would it add one more layer of tuples?


no -- why? it would return a tuple of tuples instead.


> That
> would raise an exception if read back in by `np.array` with the same dtype.
>

Hmm -- indeed, if the top-level structure is a tuple, the array constructor
gets confused:

This works fine -- as it should:


In [84]: new_full = np.array(full.tolist(), full.dtype)


But this does not:


In [85]: new_full = np.array(tuple(full.tolist()), full.dtype)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-85-...> in <module>()
----> 1 new_full = np.array(tuple(full.tolist()), full.dtype)

ValueError: could not assign tuple of length 4 to structure with 2 fields.

I was hoping it would dig down to the inner structures looking for a match
to the dtype, rather than looking at the type of the top level. Oh well.

So yeah, not sure where you would go from tuple to list -- probably at the
bottom level, but that may not always be unambiguous.

These points make me think that instead of a `.totuple` method, this
> might be more suitable as a new function in np.lib.recfunctions.


I don't seem to have that module -- and I'm running 1.14.0 -- is this a new
idea?


> If the
> goal is to help manipulate structured arrays, that submodule is
> appropriate since it already has other functions do manipulate fields in
> similar ways. What about calling it `pack_last_axis`?
>
> def pack_last_axis(arr, names=None):
>     if arr.names:
>         return arr
>     names = names or ['f{}'.format(i) for i in range(arr.shape[-1])]
>     return arr.view([(n, arr.dtype) for n in names]).squeeze(-1)
>
> Then you could do:
>
> >>> pack_last_axis(uv).tolist()
>
> to get a list of tuples.
>

not sure what the idea is here -- in my example, I had a regular 2-d array,
so no names:

In [90]: pack_last_axis(uv)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-90-...> in <module>()
----> 1 pack_last_axis(uv)

<ipython-input-...> in pack_last_axis(arr, names)
      1 def pack_last_axis(arr, names=None):
----> 2     if arr.names:
      3         return arr
      4     names = names or ['f{}'.format(i) for i in range(arr.shape[-1])]
      5     return arr.view([(n, arr.dtype) for n in names]).squeeze(-1)

AttributeError: 'numpy.ndarray' object has no attribute 'names'


So maybe you meant something like:


In [95]: def pack_last_axis(arr, names=None):
    ...:     try:
    ...:         arr.names
    ...:         return arr
    ...:     except AttributeError:
    ...:         names = names or ['f{}'.format(i) for i in range(arr.shape[-1])]
    ...:         return arr.view([(n, arr.dtype) for n in names]).squeeze(-1)

which does work, but seems like a convoluted way to get tuples!

However, I didn't actually need tuples, I needed something I could pack
into a structured array, and this does work, without the tolist:

full = np.array(zip(time, pack_last_axis(uv)), dtype=dt)


So maybe that is the way to go.

I'm not sure I'd have thought to look for this function, but what can you
do?

Thanks for your attention to this,

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Setting custom dtypes and 1.14

2018-01-26 Thread Chris Barker
On Fri, Jan 26, 2018 at 2:35 PM, Allan Haldane 
wrote:

> As I remember, numpy has some fairly convoluted code for array creation
> which tries to make sense of various nested lists/tuples/ndarray
> combinations. It makes a difference for structured arrays and object
> arrays. I don't remember the details right now, but I know in some cases
> the rule is "If it's a Python list, recurse, otherwise assume it is an
> object array".
>

that's at least explainable, and the "try to figure out what the user
means" array creation is pretty much an impossible problem, so what we've
got is probably about as good as it can get.

> > These points make me think that instead of a `.totuple` method, this
> > might be more suitable as a new function in np.lib.recfunctions.
> >
> > I don't seem to have that module -- and I'm running 1.14.0 -- is this a
> > new idea?
>
> Sorry, I didn't specify it correctly. It is "numpy.lib.recfunctions".
>

thanks -- found it.


> Also, the functions in that module encourage "pandas-like" use of
> structured arrays, but I'm not sure they should be used that way. I've
> been thinking they should be primarily used for binary interfaces
> with/to numpy, eg to talk to C programs or to read complicated binary
> files.
>

that's my use-case. And I agree -- if you really want to do that kind of
thing, pandas is the way to go.

I thought recarrays were pretty cool back in the day, but pandas is a much
better option.

So I pretty much only use structured arrays for data exchange with C
code

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Setting custom dtypes and 1.14

2018-01-29 Thread Chris Barker
On Sat, Jan 27, 2018 at 8:50 PM, Allan Haldane 
wrote:

> On 01/26/2018 06:01 PM, josef.p...@gmail.com wrote:
>
>> I thought recarrays were pretty cool back in the day, but pandas is
>> a much better option.
>>
>> So I pretty much only use structured arrays for data exchange with C
>> code
>>
>> My impression is that this turns into a deprecate recarrays and
>> supporting recfunction issue.
>>
>>

> *should* we have any dataframe-like functionality in numpy?
>
> We get requests every once in a while about how to sort rows, or about
> adding a "groupby" function. I myself have used recarrays in a
> dataframe-like way, when I wanted a quick multiple-array object that
> supported numpy indexing. So there is some demand to have minimal
> "dataframe-like" behavior in numpy itself.
>
> recarrays play part of this role currently, though imperfectly due to
> padding and cache issues. I think I'm comfortable with supporting some
> minor use of structured/recarrays as dataframe-like, with a warning in docs
> that the user should really look at pandas/xarray, and that structured
> arrays are primarily for data exchange.
>

Well, I think we should either:

deprecate recarrays -- i.e. explicitly not support DataFrame-like
functionality in numpy, keeping only the data-exchange functionality as
maintained.

or

Properly support it -- which doesn't mean re-implementing Pandas or xarray,
but it would mean addressing any bug-like issues like not dealing properly
with padding.

Personally, I don't need/want it enough to contribute, but if someone does,
great.

This reminds me a bit of the old numpy.Matrix issue -- it was ALMOST there,
but not quite, with issues, and there was essentially no overlap between
the people that wanted it and the people that had the time and skills to
really make it work.

(If we want to dream, maybe one day we should make a minimal multiple-array
> container class. I imagine it would look pretty similar to recarray, but
> stored as a set of arrays instead of a structured array. But maybe
> recarrays are good enough, and let's not reimplement pandas either.)
>

Exactly -- we really don't need to re-implement Pandas

(except its CSV reading capability :-) )

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Setting custom dtypes and 1.14

2018-01-30 Thread Chris Barker
On Mon, Jan 29, 2018 at 7:44 PM, Allan Haldane 
wrote:

> I suggest that if we want to allow either means over fields, or conversion
> of a n-D structured array to an n+1-D regular ndarray, we should add a
> dedicated function to do so in numpy.lib.recfunctions
> which does not depend on the binary representation of the array.
>

IIUC, the core use-case of structured dtypes is binary compatibility with
external systems (arrays of C structs, mostly) -- at least that's how I use
them :-)

In which case, "conversion of a n-D structured array to an n+1-D regular
ndarray" is an important feature -- actually even more important if you
don't use recarrays

So yes, let's have a utility to make that easy.
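
For the common case where all the fields share one dtype and the struct has
no padding, the usual trick is a view plus a reshape (a sketch -- the field
names and dtype here are made up):

import numpy as np

a = np.zeros(3, dtype=[('x', '<f8'), ('y', '<f8')])

# n-D structured array -> (n+1)-D regular array, no copy
regular = a.view('<f8').reshape(a.shape + (-1,))   # shape (3, 2)

A proper utility in recfunctions could also handle padded structs and mixed
field dtypes, which the view trick can't.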

as for recarrays -- are we that far from having them be robust and useful?
in which case, why not keep them around, fix the few issues, but explicitly
not try to extend them into more dataframe-like domains

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Extending C with Python

2018-01-31 Thread Chris Barker
I'm guessing you could use Cython to make this easier. It's usually used
for calling C from Python, but can do the sandwich in both directions...

Just a thought -- it will help with some of that boilerplate code...

-CHB




On Tue, Jan 30, 2018 at 10:57 PM, Jialin Liu  wrote:

> Amazing! It works! Thank you Robert.
>
> I've been stuck with this many days.
>
> Best,
> Jialin
> LBNL/NERSC
>
> On Tue, Jan 30, 2018 at 10:52 PM, Robert Kern 
> wrote:
>
>> On Wed, Jan 31, 2018 at 3:25 PM, Jialin Liu  wrote:
>>
>>> Hello,
>>> I'm extending C with python (which is opposite way of what people
>>> usually do, extending python with C), I'm currently stuck in passing a C
>>> array to python layer, could anyone plz advise?
>>>
>>> I have a C buffer in my C code and want to pass it to a python function.
>>> In the C code, I have:
>>>
>>> npy_intp  dims [2];
 dims[0] = 10;
 dims[1] = 20;
 import_array();
 npy_intp m=2;
 PyObject * py_dims = PyArray_SimpleNewFromData(1, &m, NPY_INT16 ,(void
 *)dims ); // I also tried NPY_INT
 PyObject_CallMethod(pInstance, method_name, "O", py_dims);
>>>
>>>
>>> In the Python code, I want to just print that array:
>>>
>>> def f(self, dims):
>>>
>>>print ("np array:%d,%d"%(dims[0],dims[1]))
>>>
>>>
>>>
>>> But it only prints the first number correctly, i.e., dims[0]. The second
>>> number is always 0.
>>>
>>
>> The correct typecode would be NPY_INTP.
>>
>> --
>> Robert Kern
>>
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
>>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] improving arange()? introducing fma()?

2018-02-09 Thread Chris Barker
On Wed, Feb 7, 2018 at 12:09 AM, Ralf Gommers 
wrote:
>
>  It is partly a plea for some development of numerically accurate
>> functions for computing lat/lon grids from a combination of inputs: bounds,
>> counts, and resolutions.
>>
>
Can you be more specific about what problems you've run into -- I work with
lat-lon grids all the time, and have never had a problem.

float32 degrees gives you about 1 meter accuracy or better, so I can see
how losing a few digits might be an issue, though I would argue that you
maybe shouldn't use float32 if you are worried about anything close to 1m
accuracy... -- or shift to a relative coordinate system of some sort.

I have been playing around with the decimal package a bit lately,
>>
>
sigh. decimal is so often looked at as a solution to a problem it isn't
designed for. lat-lon is natively sexagesimal -- maybe we need that dtype
:-)

what you get from decimal is variable precision -- maybe a binary variable
precision lib is a better answer -- that would be a good thing to have easy
access to in numpy, but in this case, if you want better accuracy in a
computation that will end up in float32, just use float64.

and I discovered the concept of "fused multiply-add" operations for
>> improved accuracy. I have come to realize that fma operations could be used
>> to greatly improve the accuracy of linspace() and arange().
>>
>
arange() is problematic for non-integer use anyway, by its very definition
(getting the "end point" correct requires the right step, even without FP
error).

and would it really help with linspace? it's computing a delta with one
division in fp, then multiplying it by an integer (represented in fp --
why? why not keep that an integer till the multiply?).

In particular, I have been needing improved results for computing
>> latitude/longitude grids, which tend to be done in float32's to save memory
>> (at least, this is true in data I come across).
>>
>
> If you care about saving memory *and* accuracy, wouldn't it make more
> sense to do your computations in float64, and convert to float32 at the
> end?
>

that does seem to be the easy option :-)


> Now, to the crux of my problem. It is next to impossible to generate a
>> non-trivial numpy array of coordinates, even in double precision, without
>> hitting significant numerical errors.
>>
>
I'm confused, the example you posted doesn't have significant errors...


> Which has lead me down the path of using the decimal package (which
>> doesn't play very nicely with numpy because of the lack of casting rules
>> for it). Consider the following:
>> ```
>> $ cat test_fma.py
>> from __future__ import print_function
>> import numpy as np
>> res = np.float32(0.01)
>> cnt = 7001
>> x0 = np.float32(-115.0)
>> x1 = res * cnt + x0
>> print("res * cnt + x0 = %.16f" % x1)
>> x = np.arange(-115.0, -44.99 + (res / 2), 0.01, dtype='float32')
>> print("len(arange()): %d  arange()[-1]: %16f" % (len(x), x[-1]))
>> x = np.linspace(-115.0, -44.99, cnt, dtype='float32')
>> print("linspace()[-1]: %.16f" % x[-1])
>>
>> $ python test_fma.py
>> res * cnt + x0 = -44.9900015648454428
>> len(arange()): 7002  arange()[-1]:   -44.975044
>> linspace()[-1]: -44.9900016784667969
>> ```
>> arange just produces silly results (puts out an extra element... adding
>> half of the resolution is typically mentioned as a solution on mailing
>> lists to get around arange()'s limitations -- I personally don't do this).
>>
>
The real solution is "don't do that" -- arange is not the right tool for
the job.

Then there is this:

res * cnt + x0 = -44.9900015648454428
linspace()[-1]: -44.9900016784667969

that's as good as you are ever going to get with 32 bit floats...

Though I just noticed something about your numbers -- there should be a
nice even base ten delta if you have 7001 gaps -- but linspace produces N
points, not N gaps -- so maybe you want:


In [17]: l = np.linspace(-115.0, -44.99, 7002)


In [18]: l[:5]

Out[18]: array([-115.  , -114.99, -114.98, -114.97, -114.96])


In [19]: l[-5:]

Out[19]: array([-45.03, -45.02, -45.01, -45.  , -44.99])


or, in float32 -- not as pretty:


In [20]: l = np.linspace(-115.0, -44.99, 7002, dtype=np.float32)


In [21]: l[:5]

Out[21]:
array([-115., -114.98999786, -114.98000336, -114.97000122,
       -114.9508], dtype=float32)


In [22]: l[-5:]

Out[22]: array([-45.02999878, -45.0246, -45.00999832, -45.,
       -44.99000168], dtype=float32)


but still as good as you get with float32, and exactly the same result as
computing in float64 and converting:



In [25]: l = np.linspace(-115.0, -44.99, 7002).astype(np.float32)


In [26]: l[:5]

Out[26]:
array([-115., -114.98999786, -114.98000336, -114.97000122,
       -114.9508], dtype=float32)


In [27]: l[-5:]

Out[27]: array([-45.02999878, -45.0246, -45.00999832, -45.,
       -44.99000168], dtype=float32)



>> So, does it make any sense to improve arange by utilizing fma() under the
>> hood?
>>
>

Re: [Numpy-discussion] improving arange()? introducing fma()?

2018-02-12 Thread Chris Barker
I think it's all been said, but a few comments:

On Sun, Feb 11, 2018 at 2:19 PM, Nils Becker  wrote:

> Generating equidistantly spaced grids is simply not always possible.
>

exactly -- and linspace gives pretty much the best possible result,
guaranteeing that the start and end points are exact, and the spacing is
within an ULP or two (maybe we could make that within 1 ULP always, but not
sure that's worth it).


> The reason is that the absolute spacing of the possible floating point
> numbers depends on their magnitude [1].
>

Also that the exact spacing may not be exactly representable in FP -- so
you have to have at least one space that's a bit off to get the end points
right (or have the endpoints not exact).


> If you - for some reason - want the same grid spacing everywhere you may
> choose an appropriate new spacing.
>

well, yeah, but usually you are trying to fit to some other constraint. I'm
still confused as to where these couple of ULPs actually cause problems,
unless you are doing inappropriate FP comparisons elsewhere.

> Curiously, either by design or accident, arange() seems to do something
> similar as was mentioned by Eric. It creates a new grid spacing by adding
> and subtracting the starting point of the grid. This often has similar
> effect as adding and subtracting N*dx (e.g. if the grid is symmetric around
> 0.0). Consequently, arange() seems to trade keeping the grid spacing
> constant for a larger error in the grid size and consequently in the end
> point.
>

interesting -- but it actually makes sense -- that is the definition of
arange(), borrowed from range(), which was designed for integers, and, in
fact, pretty much mirrored the classic C index for loop:


for (int i=0; i<N; i++){
    ...

or in python:

i = start
while i < stop:
    i += step

> 1. Comparison to calculations with decimal can be difficult as not all
> simple decimal step sizes are exactly representable as finite floating
> point numbers.

yeah, this is what I mean by inappropriate use of Decimal -- decimal is not
inherently "more accurate" than fp -- it just can represent _decimal_
numbers exactly, which we are all used to -- we want 1 / 10 to be exact,
but don't mind that 1 / 3 isn't.

Decimal also provides variable precision -- so it can be handy for that. I
kinda wish Python had an arbitrary precision binary floating point built
in...

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] improving arange()? introducing fma()?

2018-02-22 Thread Chris Barker
@Ben: Have you found a solution to your problem? Are there things we could
do in numpy to make it better?

-CHB


On Mon, Feb 12, 2018 at 9:33 AM, Chris Barker  wrote:

> I think it's all been said, but a few comments:
>
> On Sun, Feb 11, 2018 at 2:19 PM, Nils Becker 
> wrote:
>
>> Generating equidistantly spaced grids is simply not always possible.
>>
>
> exactly -- and linspace gives pretty much the best possible result,
> guaranteeing that the start and end points are exact, and the spacing is
> within an ULP or two (maybe we could make that within 1 ULP always, but not
> sure that's worth it).
>
>
>> The reason is that the absolute spacing of the possible floating point
>> numbers depends on their magnitude [1].
>>
>
> Also that the exact spacing may not be exactly representable in FP -- so
> you have to have at least one space that's a bit off to get the end points
> right (or have the endpoints not exact).
>
>
>> If you - for some reason - want the same grid spacing everywhere you may
>> choose an appropriate new spacing.
>>
>
> well, yeah, but usually you are trying to fit to some other constraint.
> I'm still confused as to where these couple of ULPs actually cause
> problems, unless you are doing inappropriate FP comparisons elsewhere.
>
> Curiously, either by design or accident, arange() seems to do something
>> similar as was mentioned by Eric. It creates a new grid spacing by adding
>> and subtracting the starting point of the grid. This often has similar
>> effect as adding and subtracting N*dx (e.g. if the grid is symmetric around
>> 0.0). Consequently, arange() seems to trade keeping the grid spacing
>> constant for a larger error in the grid size and consequently in the end
>> point.
>>
>
> interesting -- but it actually makes sense -- that is the definition of
> arange(), borrowed from range(), which was designed for integers, and, in
> fact, pretty much mirrored the classic C index for loop:
>
>
> for (int i=0; i<N; i++){
>     ...
>
>
> or in python:
>
> i = start
> while i < stop:
>     i += step
>
> The problem here is that termination criteria -- i < stop -- that is the
> definition of the function, and works just fine for integers (where it came
> from), but with FP, even with no error accumulation, stop may not be
> exactly representable, so you could end up with a value for your last item
> that is about (stop-step), or you could end up with a value that is a
> couple ULPs less than step -- essentially including the end point when you
> weren't supposed to.
>
> The truth is, making a floating point range() was simply a bad idea to
> begin with -- it's not the way to define a range of numbers in floating
> point. Which is why the docs now say "When using a non-integer step, such
> as 0.1, the results will often not
> be consistent.  It is better to use ``linspace`` for these cases."
>
> Ben wants a way to define grids in a consistent way -- make sense. And
> yes, sometimes, the original source you are trying to match (like GDAL)
> provides a starting point and step. But with FP, that is simply
> problematic. If:
>
> start + step*num_steps != stop
>
> exactly in floating point, then you'll need to do the math one way or
> another to get what you want -- and I'm not sure anyone but the user knows
> what they want -- do you want step to be as exact as possible, or do you
> want stop to be as exact as possible?
>
> All that being said -- if arange() could be made a tiny bit more accurate
> with fma or other numerical technique, why not? it won't solve the
> problem, but if someone writes and tests the code (and it does not require
> compiler or hardware features that aren't supported everywhere numpy
> compiles), then sure. (Same for linspace, though I'm not sure it's possible)
>
> There is one other option: a new function (or option) that makes a grid
> from a specification of: start, step, num_points. If that is really a
> common use case (that is, you don't care exactly what the end-point is),
> then it might be handy to have it as a utility.
>
> We could also have an arange-like function that, rather than < stop, would
> do "close to" stop. Someone that understands FP better than I might be able
> to compute what the expected error might be, and find the closest end point
> within that error. But I think that's a bad specification -- (stop - start)
> / step may be nowhere near an integer -- then what is the function supposed
> to do??
>
>
> BTW: I kind of wish that linspace specified the number of steps, rather
> than the number of points, that is (num+p

Re: [Numpy-discussion] improving arange()? introducing fma()?

2018-02-22 Thread Chris Barker
On Thu, Feb 22, 2018 at 11:57 AM, Sebastian Berg  wrote:

> > First, you are right...Decimal is not the right module for this. I
> > think instead I should use the 'fractions' module for loading grid
> > spec information from strings (command-line, configs, etc). The
> > tricky part is getting the yaml reader to use it instead of
> > converting to a float under the hood.
>

I'm not sure fractions is any better (or necessary, anyway) -- in the end, you
need floats, so the inherent limitations of floats aren't the problem. In
your original use-case, you wanted a 32 bit float grid in the end, so doing
the calculations is 64 bit float and then downcasting is as good as you're
going to get, and easy and fast. And I suppose you could use 128 bit float
if you want to get to 64 bit in the end -- not as easy, and python itself
doesn't have it.

>  The
> tricky part is getting the yaml reader to use it instead of
> converting to a float under the hood.

64 bit floats support about 15 decimal digits -- are your string-based
sources providing more than that?? If not, then the 64 bit float version is
as good as it's going to get.

> Second, what has been pointed out about the implementation of arange
> > actually helps to explain some oddities I have encountered. In some
> > situations, I have found that it was better for me to produce the
> > reversed sequence, and then reverse that array back and use it.
>

interesting -- though I'd still say "don't use arange" is the "correct"
answer.


> > Third, it would be nice to do what we can to improve arange()'s
> > results. Would we be ok with a PR that uses fma() if it is available,
> > but then falls back on a regular multiply and add if it isn't
> > available, or are we going to need to implement it ourselves for
> > consistency?
>

I would think calling fma() if supported would be fine -- if there is an
easy macro to check if it's there. I don't know if numpy has a policy about
this sort of thing, but I'm pretty sure everywhere else, the final details
of computation fall back to the hardware/compiler/library (i.e. Intel uses
extended precision fp, other platforms don't, etc) so I can't see that
having a slightly more accurate computation in arange on some platforms and
not others would cause a problem. If any of the tests are checking to that
level of accuracy, they should be fixed :-)

  2. It sounds *not* like a fix, but rather a
>  "make it slightly less bad, but it is still awful"
>

exactly -- better accuracy is a good thing, but it's not the source of the
problem here -- the source of the problem is inherent to FP, and/or poorly
specified goal. having arrange or linspace lose a couple ULPs fewer isn't
going to change anything.


> Using fma inside linspace might make linspace a bit more exact
> possible, and would be a good thing, though I am not sure we have a
> policy yet for something that is only used sometimes,


see above -- nor do I, but it seems like a fine idea to me.


> It also would be nice to add stable summation to numpy in general (in
> whatever way), which maybe is half related but on nobody's specific
> todo list.


I recall a conversation on this list a (long) while back about compensated
summation (Kahan summation) -- I guess nothing ever came of it?
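
(For reference, a sketch of the pure-python version of compensated
summation is tiny:)

def kahan_sum(values):
    # compensated (Kahan) summation: carry the rounding error along
    total = 0.0
    comp = 0.0
    for x in values:
        y = x - comp
        t = total + y
        comp = (t - total) - y
        total = t
    return total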

> Lastly, there definitely needs to be a better tool for grid making.
> > The problem appears easy at first, but it is fraught with many
> > pitfalls and subtle issues. It is easy to say, "always use
> > linspace()", but if the user doesn't have the number of pixels, they
> > will need to calculate that using --- gasp! -- floating point
> > numbers, which could result in the wrong answer.


agreed -- this tends to be an inherently over-specified problem:

min_value
max_value
spacing
number_of_grid_spaces

That is four values, and only three independent ones.

arange() looks like it uses: min_value, max_value, spacing -- but it
doesn't really (see previous discussion) so not the right tool for anything.

linspace() uses:  min_value, max_value, (number_of_grid_spaces + 1), which
is about as good as you can get (except for that annoying 1).

But what if you are given min_value, spacing, number_of_grid_spaces?

Maybe we need a function for that?? (which I think would simply be:

min_value + np.arange(number_of_grid_spaces + 1) * spacing

Which is why we probably don't need a function :-) (note that that's only
the error of one multiplication and one addition per grid point)

Or maybe a way to take all four values, and return a "best fit" grid. The
problem with that is that it's over specified, and it may not be only
fp error that makes it not fit. What should a code do???

So Ben:

What is the problem you are trying to solve?  -- I'm still confused. What
information do you have to define the grid? Maybe all we need are docs for
how to best compute a grid with given specifications? And point to them in
the arange() and linspace() docstrings.

-CHB

I once wanted to add a "step" argument to linspace, but didn't in the
> end, largely 

Re: [Numpy-discussion] improving arange()? introducing fma()?

2018-02-23 Thread Chris Barker
On Fri, Feb 9, 2018 at 1:16 PM, Matthew Harrigan  wrote:

> I apologize if I'm missing something basic, but why are floats being
> accumulated in the first place?  Can't arange and linspace operations with
> floats be done internally similar to `start + np.arange(num_steps) *
> step_size`?  I.e. always accumulate (really increment) integers to limit
> errors.
>

I haven't looked at the arange() code, but linspace does not
accumulate floats -- which is why it's already almost as good as it can be.
As regards to a fused-multiply-add, it does have to do a single
multiply_add operation for each value (as per your example code), so we may
be able to save a ULP there.

The problem with arange() is that the definition is poorly specified:

start + (step_num * step) while value < stop.

Even without fp issues, it's weird if (stop - start) / step is not an
integer -- the "final" step will not be the same as the rest.

Say you want a "grid" with fully integer values. If the step is just right,
all is easy:

In [72]: np.arange(0, 11, 2)

Out[72]: array([ 0,  2,  4,  6,  8, 10])

(this is assuming you want 10 as the end point.)

but then:

In [73]: np.arange(0, 11, 3)

Out[73]: array([0, 3, 6, 9])

but I wanted 10 as an end point. so:

In [74]: np.arange(0, 13, 3)

Out[74]: array([ 0,  3,  6,  9, 12])

hmm, that's not right either. Of course it's not -- you can't get 10 as an
end point, 'cause it's not a multiple of the step. With integers, you CAN
require that the end point be a multiple of the step, but with fp, you
can't required that it be EXACTLY a multiple, because either the end point
or the step may not be exactly representable, even if you do the math with
no loss of precision. And now you are stuck with the user figuring out for
themselves whether the closest fp representation of the end point is
slightly larger or smaller than the real value, so the < check will work.
NOT good.

This is why arange is simply not the tool to use.

Making a grid, you usually want to specify the end points and the number of
steps which is almost what linspace does. Or, _maybe_ you want to specify
the step and the number of steps, and accept that the end point may not be
exactly what you "expect". There is no built-in function for this in numpy.
maybe there should be, but it's pretty easy to write, as you show above.

Anyone that understands FP better than I do:

In the above code, you are multiplying the step by an integer -- is there
any precision loss when you do that??

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] new NEP: np.AbstractArray and np.asabstractarray

2018-03-10 Thread Chris Barker
On Sat, Mar 10, 2018 at 1:27 PM, Matthew Rocklin  wrote:

> I'm very glad to see this discussion.
>

me too, but


> I think that coming up with a single definition of array-like may be
> difficult, and that we might end up wanting to embrace duck typing instead.
>

exactly -- I think there is a clear line between "uses the numpy memory
layout" and the Python API. But the python API is pretty darn big, and many
"array_ish" classes implement only part of it, and may even implement some
parts a bit differently. So it's really hard to have "one" definition, except
"Python API exactly like an ndarray" -- and I'm wondering how useful that is.

It seems to me that different array-like classes will implement different
> mixtures of features.  It may be difficult to pin down a single definition
> that includes anything except for the most basic attributes (shape and
> dtype?).
>

or a minimum set -- but again, how useful??


> Storage objects like h5py (support getitem in a numpy-like way)
>

Exactly -- though I don't know about h5py, but netCDF4 variables support a
useful subset of ndarray, but do "fancy indexing" differently -- so are they
ndarray_ish? -- sorry to coin yet another term :-)


> I can imagine authors of both groups saying that they should qualify as
> array-like because downstream projects that consume them should not convert
> them to numpy arrays in important contexts.
>

indeed. My solution so far is to define my own duck-typing helper,
"asarraylike", that checks for the actual methods I need:

https://github.com/NOAA-ORR-ERD/gridded/blob/master/gridded/utilities.py

which has:

must_have = ['dtype', 'shape', 'ndim', '__len__', '__getitem__',
             '__getattribute__']


def isarraylike(obj):
    """
    tests if obj acts enough like an array to be used in gridded.

    This should catch netCDF4 variables and numpy arrays, at least, etc.

    Note: these won't check if the attributes required actually work right.
    """
    for attr in must_have:
        if not hasattr(obj, attr):
            return False
    return True


def asarraylike(obj):
    """
    If it satisfies the requirements of pyugrid the object is returned as is.
    If not, then numpy's array() will be called on it.

    :param obj: The object to check if it's like an array
    """
    return obj if isarraylike(obj) else np.array(obj)
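
Usage then looks something like this (the netCDF4 variable is just a
placeholder for "anything array-like enough"):

arr = asarraylike([1.0, 2.0, 3.0])        # a plain list gets run through np.array()
var = asarraylike(nc.variables['depth'])  # placeholder: already array-like, returned as is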

It's possible that we could come up with semi-standard "groupings" of
attributes to produce "levels" of compatibility, or maybe not levels, but
independent groupings, so you could specify which groupings you need in this
instance.


> The name "duck arrays" that we sometimes use doesn't necessarily mean
> "quack like an ndarray" but might actually mean a number of different
> things in different contexts.  Making a single class or predicate for duck
> arrays may not be as effective as we want.  Instead, it might be that we
> need a number of different protocols like `__array_mat_vec__` or 
> `__array_slice__`
> that downstream projects can check instead.  I can imagine cases where I
> want to check only "can I use this thing to multiply against arrays" or
> "can I get numpy arrays out of this thing with numpy slicing" rather than
> "is this thing array-like" because I may genuinely not care about most of
> the functionality in a blessed definition of "array-like".
>

exactly.

but maybe we won't know until we try.

-CHB



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] best way of speeding up a filtering-like algorithm

2018-03-29 Thread Chris Barker
sorry, not enough time to look closely, but a couple general comments:

On Wed, Mar 28, 2018 at 5:56 PM, Moroney, Catherine M (398E) <
catherine.m.moro...@jpl.nasa.gov> wrote:

> I have the following sample code (pretty simple algorithm that uses a
> rolling filter window) and am wondering what the best way is of speeding it
> up.  I tried rewriting it in Cython by pre-declaring the variables but that
> didn’t buy me a lot of time.  Then I rewrote it in Fortran (and compiled it
> with f2py) and now it’s lightning fast.
>

if done right, Cython should be almost as fast as Fortran, and just as fast
if you use the "restrict" correctly (which I hope can be done in Cython):

https://en.wikipedia.org/wiki/Pointer_aliasing


> But I would still like to know if I could rewrite it in pure
> python/numpy/scipy
>

you can use stride_tricks to make arrays "appear" to be N+1 D, to implement
windows without actually duplicating the data, and then use array
operations on them. This can buy a lot of speed, but will not be as fast
(by a factor of 10 or so) as Cython or Fortran

see:

https://github.com/PythonCHB/IRIS_Python_Class/blob/master/Numpy/code/filter_example.py
for an example in 1D
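
and here is a bare-bones sketch of the same trick (sliding_windows_1d is
just an illustrative name; note that as_strided views should be treated as
read-only):

import numpy as np
from numpy.lib.stride_tricks import as_strided

def sliding_windows_1d(a, width):
    # view a 1D array as overlapping windows -- no copy of the data
    n = a.shape[0] - width + 1
    return as_strided(a, shape=(n, width),
                      strides=(a.strides[0], a.strides[0]))

a = np.arange(10.0)
win = sliding_windows_1d(a, 3)   # shape (8, 3)
win.mean(axis=1)                 # one array operation gives all the window means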



> or in Cython and get a similar speedup.
>
>
see above -- a direct port of your Fortran code to Cython should get you
within a factor of two or so of the Fortran, and then using "restrict" to
let the compiler know your pointers aren't aliased should get you the rest
of the way.

Here is an example of an Automatic Gain Control filter in 1D, implemented in
numpy with stride_tricks, and in C, Cython, and Fortran.

https://github.com/PythonCHB/IRIS_Python_Class/tree/master/Interfacing_C/agc_example

Note that in that example, I never got C or Cython as fast as Fortran --
but I think using "restrict" in the C would do it.

HTH,

-CHB



>
> Here is the raw Python code:
>
>
>
> def mixed_coastline_slow(nsidc, radius, count, mask=None):
>
>
>
> nsidc_copy = numpy.copy(nsidc)
>
>
>
> if (mask is None):
>
> idx_coastline = numpy.where(nsidc_copy == NSIDC_COASTLINE_MIXED)
>
> else:
>
> idx_coastline = numpy.where(mask & (nsidc_copy ==
> NSIDC_COASTLINE_MIXED))
>
>
>
> for (irow0, icol0) in zip(idx_coastline[0], idx_coastline[1]):
>
>
>
> rows = ( max(irow0-radius, 0), min(irow0+radius+1,
> nsidc_copy.shape[0]) )
>
> cols = ( max(icol0-radius, 0), min(icol0+radius+1,
> nsidc_copy.shape[1]) )
>
> window = nsidc[rows[0]:rows[1], cols[0]:cols[1]]
>
>
>
> npoints = numpy.where(window != NSIDC_COASTLINE_MIXED, True,
> False).sum()
>
> nsnowice = numpy.where( (window >= NSIDC_SEAICE_LOW) & (window <=
> NSIDC_FRESHSNOW), \
>
> True, False).sum()
>
>
>
> if (100.0*nsnowice/npoints >= count):
>
>  nsidc_copy[irow0, icol0] = MISR_SEAICE_THRESHOLD
>
>
>
> return nsidc_copy
>
>
>
> and here is my attempt at Cython-izing it:
>
>
>
> import numpy
>
> cimport numpy as cnumpy
>
> cimport cython
>
>
>
> cdef int NSIDC_SIZE  = 721
>
> cdef int NSIDC_NO_SNOW = 0
>
> cdef int NSIDC_ALL_SNOW = 100
>
> cdef int NSIDC_FRESHSNOW = 103
>
> cdef int NSIDC_PERMSNOW  = 101
>
> cdef int NSIDC_SEAICE_LOW  = 1
>
> cdef int NSIDC_SEAICE_HIGH = 100
>
> cdef int NSIDC_COASTLINE_MIXED = 252
>
> cdef int NSIDC_SUSPECT_ICE = 253
>
>
>
> cdef int MISR_SEAICE_THRESHOLD = 6
>
>
>
> def mixed_coastline(cnumpy.ndarray[cnumpy.uint8_t, ndim=2] nsidc, int
> radius, int count):
>
>
>
>  cdef int irow, icol, irow1, irow2, icol1, icol2, npoints, nsnowice
>
>  cdef cnumpy.ndarray[cnumpy.uint8_t, ndim=2] nsidc2 \
>
> = numpy.empty(shape=(NSIDC_SIZE, NSIDC_SIZE), dtype=numpy.uint8)
>
>  cdef cnumpy.ndarray[cnumpy.uint8_t, ndim=2] window \
>
> = numpy.empty(shape=(2*radius+1, 2*radius+1), dtype=numpy.uint8)
>
>
>
>  nsidc2 = numpy.copy(nsidc)
>
>
>
>  idx_coastline = numpy.where(nsidc2 == NSIDC_COASTLINE_MIXED)
>
>
>
>  for (irow, icol) in zip(idx_coastline[0], idx_coastline[1]):
>
>
>
>   irow1 = max(irow-radius, 0)
>
>   irow2 = min(irow+radius+1, NSIDC_SIZE)
>
>   icol1 = max(icol-radius, 0)
>
>   icol2 = min(icol+radius+1, NSIDC_SIZE)
>
>   window = nsidc[irow1:irow2, icol1:icol2]
>
>
>
>   npoints = numpy.where(window != NSIDC_COASTLINE_MIXED, True,
> False).sum()
>
>   nsnowice = numpy.where( (window >= NSIDC_SEAICE_LOW) & (window
> <= NSIDC_FRESHSNOW), \
>
>   True, False).sum()
>
>
>
>   if (100.0*nsnowice/npoints >= count):
>
>nsidc2[irow, icol] = MISR_SEAICE_THRESHOLD
>
>
>
>  return nsidc2
>
>
>
> Thanks in advance for any advice!
>
>
>
> Catherine
>
>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/

Re: [Numpy-discussion] best way of speeding up a filtering-like algorithm

2018-03-29 Thread Chris Barker
one other note:

As a rule, using numpy array operations from Cython doesn't buy you much,
as you discovered. You need to use numpy arrays as n-d containers, and
write the loops yourself.

You may want to check out numba as another alternative -- it DOES optimize
numpy operations.
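
A rough, untested sketch of what that might look like for your loop (same
constants as in your code, mask handling left out):

import numpy as np
from numba import njit

NSIDC_SEAICE_LOW = 1
NSIDC_FRESHSNOW = 103
NSIDC_COASTLINE_MIXED = 252
MISR_SEAICE_THRESHOLD = 6

@njit
def mixed_coastline_numba(nsidc, radius, count):
    # plain nested loops -- numba compiles them to machine code
    out = nsidc.copy()
    nrows, ncols = nsidc.shape
    for irow in range(nrows):
        for icol in range(ncols):
            if nsidc[irow, icol] != NSIDC_COASTLINE_MIXED:
                continue
            r1, r2 = max(irow - radius, 0), min(irow + radius + 1, nrows)
            c1, c2 = max(icol - radius, 0), min(icol + radius + 1, ncols)
            npoints = 0
            nsnowice = 0
            for i in range(r1, r2):
                for j in range(c1, c2):
                    v = nsidc[i, j]
                    if v != NSIDC_COASTLINE_MIXED:
                        npoints += 1
                    if v >= NSIDC_SEAICE_LOW and v <= NSIDC_FRESHSNOW:
                        nsnowice += 1
            if npoints > 0 and 100.0 * nsnowice / npoints >= count:
                out[irow, icol] = MISR_SEAICE_THRESHOLD
    return out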

-CHB



On Wed, Mar 28, 2018 at 5:56 PM, Moroney, Catherine M (398E) <
catherine.m.moro...@jpl.nasa.gov> wrote:

> Hello,
>
>
>
> I have the following sample code (pretty simple algorithm that uses a
> rolling filter window) and am wondering what the best way is of speeding it
> up.  I tried rewriting it in Cython by pre-declaring the variables but that
> didn’t buy me a lot of time.  Then I rewrote it in Fortran (and compiled it
> with f2py) and now it’s lightning fast.  But I would still like to know if
> I could rewrite it in pure python/numpy/scipy or in Cython and get a
> similar speedup.
>
>
>
> Here is the raw Python code:
>
>
>
> def mixed_coastline_slow(nsidc, radius, count, mask=None):
>
>
>
> nsidc_copy = numpy.copy(nsidc)
>
>
>
> if (mask is None):
>
> idx_coastline = numpy.where(nsidc_copy == NSIDC_COASTLINE_MIXED)
>
> else:
>
> idx_coastline = numpy.where(mask & (nsidc_copy ==
> NSIDC_COASTLINE_MIXED))
>
>
>
> for (irow0, icol0) in zip(idx_coastline[0], idx_coastline[1]):
>
>
>
> rows = ( max(irow0-radius, 0), min(irow0+radius+1,
> nsidc_copy.shape[0]) )
>
> cols = ( max(icol0-radius, 0), min(icol0+radius+1,
> nsidc_copy.shape[1]) )
>
> window = nsidc[rows[0]:rows[1], cols[0]:cols[1]]
>
>
>
> npoints = numpy.where(window != NSIDC_COASTLINE_MIXED, True,
> False).sum()
>
> nsnowice = numpy.where( (window >= NSIDC_SEAICE_LOW) & (window <=
> NSIDC_FRESHSNOW), \
>
> True, False).sum()
>
>
>
> if (100.0*nsnowice/npoints >= count):
>
>  nsidc_copy[irow0, icol0] = MISR_SEAICE_THRESHOLD
>
>
>
> return nsidc_copy
>
>
>
> and here is my attempt at Cython-izing it:
>
>
>
> import numpy
>
> cimport numpy as cnumpy
>
> cimport cython
>
>
>
> cdef int NSIDC_SIZE  = 721
>
> cdef int NSIDC_NO_SNOW = 0
>
> cdef int NSIDC_ALL_SNOW = 100
>
> cdef int NSIDC_FRESHSNOW = 103
>
> cdef int NSIDC_PERMSNOW  = 101
>
> cdef int NSIDC_SEAICE_LOW  = 1
>
> cdef int NSIDC_SEAICE_HIGH = 100
>
> cdef int NSIDC_COASTLINE_MIXED = 252
>
> cdef int NSIDC_SUSPECT_ICE = 253
>
>
>
> cdef int MISR_SEAICE_THRESHOLD = 6
>
>
>
> def mixed_coastline(cnumpy.ndarray[cnumpy.uint8_t, ndim=2] nsidc, int
> radius, int count):
>
>
>
>  cdef int irow, icol, irow1, irow2, icol1, icol2, npoints, nsnowice
>
>  cdef cnumpy.ndarray[cnumpy.uint8_t, ndim=2] nsidc2 \
>
> = numpy.empty(shape=(NSIDC_SIZE, NSIDC_SIZE), dtype=numpy.uint8)
>
>  cdef cnumpy.ndarray[cnumpy.uint8_t, ndim=2] window \
>
> = numpy.empty(shape=(2*radius+1, 2*radius+1), dtype=numpy.uint8)
>
>
>
>  nsidc2 = numpy.copy(nsidc)
>
>
>
>  idx_coastline = numpy.where(nsidc2 == NSIDC_COASTLINE_MIXED)
>
>
>
>  for (irow, icol) in zip(idx_coastline[0], idx_coastline[1]):
>
>
>
>   irow1 = max(irow-radius, 0)
>
>   irow2 = min(irow+radius+1, NSIDC_SIZE)
>
>   icol1 = max(icol-radius, 0)
>
>   icol2 = min(icol+radius+1, NSIDC_SIZE)
>
>   window = nsidc[irow1:irow2, icol1:icol2]
>
>
>
>   npoints = numpy.where(window != NSIDC_COASTLINE_MIXED, True,
> False).sum()
>
>   nsnowice = numpy.where( (window >= NSIDC_SEAICE_LOW) & (window
> <= NSIDC_FRESHSNOW), \
>
>   True, False).sum()
>
>
>
>   if (100.0*nsnowice/npoints >= count):
>
>nsidc2[irow, icol] = MISR_SEAICE_THRESHOLD
>
>
>
>  return nsidc2
>
>
>
> Thanks in advance for any advice!
>
>
>
> Catherine
>
>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] A roadmap for NumPy - longer term planning

2018-06-01 Thread Chris Barker
On Fri, Jun 1, 2018 at 4:43 AM, Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:


>  one thing that always slightly annoyed me is that numpy math is way
> slower for scalars than python math
>

numpy is also quite a bit slower than raw python for math with (very) small
arrays:

In [31]: % timeit t2 = (t[0] * 10, t[1] * 10)
162 ns ± 0.79 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [32]: a
Out[32]: array([ 3.4,  5.6])

In [33]: % timeit a2 = a * 10
941 ns ± 7.95 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)


(I often want to do this sort of thing, not for performance, but for ease
of computation -- say you have 2 or 3 coordinates that represent a
point -- it's really nice to be able to scale or shift with array
operations, rather than all that indexing -- but it is pretty slow with
numpy.)

I've wondered if numpy could be optimized for small 1D arrays, and maybe
even 2d arrays with a small fixed second dimension (N x 2, N x 3), by
special-casing / short-cutting those cases.

It would require some careful profiling to see if it would help, but it
sure seems possible.

And maybe scalars could be fit into the same system.

-CHB



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] A roadmap for NumPy - longer term planning

2018-06-01 Thread Chris Barker
On Fri, Jun 1, 2018 at 9:46 AM, Chris Barker  wrote:

> numpy is also quite a bit slower than raw python for math with (very)
> small arrays:
>

Doing a bit more experimentation, the advantage is with pure python for
over 10 elements (I got bored...). But I noticed that the time for numpy
computation is pretty much constant for 2 up to around 100 elements, which
implies that the bulk of the issue is with "startup" costs, rather than
fancy indexing or anything like that. So maybe a short cut wouldn't be
helpful.

Note that if you use a list comp (the pythonic translation of an array
operation) the crossover point is about 15 elements (in my tests, on my
machine...)

In [90]: % timeit t2 = [x * 10 for x in t]

920 ns ± 4.88 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

-CHB




> In [31]: % timeit t2 = (t[0] * 10, t[1] * 10)
> 162 ns ± 0.79 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>
> In [32]: a
> Out[32]: array([ 3.4,  5.6])
>
> In [33]: % timeit a2 = a * 10
> 941 ns ± 7.95 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
>
>
> (I often want to so this sort of thing, not for performance, but for ease
> of computation -- say you have 2 or three coordinates that represent a
> point -- it's really nice to be able to scale or shift with array
> operations, rather than all that indexing -- but it is pretty slo with
> numpy.
>
> I've wondered if numpy could be optimized for small 1D arrays, and maybe
> even 2d arrays with a small fixed second dimension (N x 2, N x 3), by
> special-casing / short-cutting those cases.
>
> It would require some careful profiling to see if it would help, but it
> sure seems possible.
>
> And maybe scalars could be fit into the same system.
>
> -CHB
>
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R(206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115   (206) 526-6317   main reception
>
> chris.bar...@noaa.gov
>



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Dropping Python 3.4 support for NumPy 1.16

2018-06-13 Thread Chris Barker
>
> I think NumPy 1.16 would be a good time to drop Python 3.4 support.
>>
>
+1

Using python3 before 3.5 was still kinda "bleeding edge" -- so projects are
more likely to be actively upgrading.

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Adoption of a Code of Conduct

2018-08-02 Thread Chris Barker
On Thu, Aug 2, 2018 at 3:35 AM, Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> If we do end up with a different version, I'd prefer a really short one,
> like just stating the golden rule (treat others as one would like to be
> treated oneself).
>

Unfortunately, the golden rule is sometimes used as a justification for bad
behaviour -- "I don't mind it if someone calls my ideas stupid -- I can
defend my ideas just fine, thank you"

So a more specific CoC is worthwhile.

As for the issue at hand:

A code of conduct is about, well, conduct -- the whole idea is that we are
defining appropriate conduct, but NOT discriminating at all based on who or
what you are or how you identify yourself.

So is wearing a MAGA hat any different than wearing a cross, or a yarmulke,
or a hijab ?

Or a Bernie T-shirt?

(Sorry for the US-centric examples)

Note the debate in France about burkinis:

https://www.washingtonpost.com/world/europe/frances-burkini-debate-about-a-bathing-suit-and-a-countrys-peculiar-secularism/2016/08/26/48ec273e-6bad-11e6-91cb-ecb5418830e9_story.html?utm_term=.50789110d06a

So it's not an easy answer.

In the end, though, wearing a particular item of clothing is a behavior,
not an identity per se --- so saying that people of any political
persuasion are welcome is not the same as saying you can express any
political opinion you like publicly in this forum.

But it's a really slippery slope:

Some religions require the devout to wear particular items of clothing (or
hair styles, or...)

And there are a lot of issues with people saying: "I don't care if you are
gay, just don't talk about it at work" -- but straight people get to talk
about their personal lives at work -- so of course everyone should be able
to.

This is why (in the US anyway) there is the legal concept of a "protected
class" -- it needs to be clear exactly what one can can't "discriminate"
based on. If an employee is required to wear a particular item of clothing
in order to adhere to their religion, then you can't ban that type of
clothing -- but you can ban other types of clothing.

For this issue, maybe we could get some guidance from the "Hatch Act" -- it
is a law that regulates what types of political activity a US federal
employee can participate in. It bans some activities even when off the job,
but the part that might be relevant is what is banned while on the job.
That is, as a US federal employee, you can belong to any political party
you like, you can hold any political opinion you like, but you can't freely
express those on the job -- i.e. "engage in political activity".

Hmm -- I found this: " Hatch Act regulations define political activity as
one “directed toward the success or failure of a political party, candidate
for partisan political office, or partisan political group.”

Interesting -- I'm pretty sure I'm not allowed to promote white supremacy
on the job -- though that's not a partisan political group per se (can't
find the definition of partisan, either)


TL;DR:

I think "political beliefs" should be included, but it should be clear
somehow that that doesn't mean you can express any political belief in the
context of the project.

Honestly, the really horrible people often can't help themselves -- they
will actually *do* something inappropriate in the context of the project.
And if they don't, then how do we even know what horrible ideas they may
promote elsewhere?

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Adoption of a Code of Conduct

2018-08-03 Thread Chris Barker
On Fri, Aug 3, 2018 at 8:59 AM, Hameer Abbasi 
wrote
>
>
> I’ve created a PR, and I’ve kept the language “not too stern”.
> https://github.com/scipy/scipy/pull/9109
>

Thanks -- for ease of this thread, the sentence Hameer added is:

"We expect that you will extend the same courtesy and open-mindedness
towards other members of the SciPy community."

LGTM

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Adoption of a Code of Conduct

2018-08-03 Thread Chris Barker
One other thought:

Given Jupyter, numpy, scipy, matplotlib?, etc, are all working on a CoC --
maybe we could have NumFocus take a lead on this for the whole community?

I think most (all?) of the NumFocus projects have essentially the same
goals in this regard.

-CHB





On Fri, Aug 3, 2018 at 9:44 AM, Chris Barker  wrote:

> On Fri, Aug 3, 2018 at 8:59 AM, Hameer Abbasi 
> wrote
>>
>>
>> I’ve created a PR, and I’ve kept the language “not too stern”.
>> https://github.com/scipy/scipy/pull/9109
>>
>
> Thanks -- for ease of this thread, the sentence Hameer added is:
>
> "We expect that you will extend the same courtesy and open-mindedness
> towards other members of the SciPy community."
>
> LGTM
>
> -CHB
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R(206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115   (206) 526-6317   main reception
>
> chris.bar...@noaa.gov
>



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Adoption of a Code of Conduct

2018-08-03 Thread Chris Barker
On Fri, Aug 3, 2018 at 11:20 AM, Chris Barker  wrote:

> Given Jupyter, numpy, scipy, matplotlib?, etc, are all working on a CoC --
> maybe we could have NumFocus take a lead on this for the whole community?
>

or adopt an existing one, like maybe:

The Contributor Covenant <http://www.contributor-covenant.org/> was adopted
by several prominent open source projects, including Atom, AngularJS,
Eclipse, and even Rails. According to Github, total adoption of the
Contributor Covenant is nearing an astounding ten thousand open source
projects.

I'm trying to figure out why numpy (Or any project, really) has either
unique needs or people better qualified to write a CoC than any other
project or community. So much like OSS licences -- it's much better to pick
an established one than write your own.

For the record, the Covenant does have a laundry list of "classes", that
does not include political belief, but does mention "political" here:

"""
Examples of unacceptable behavior by participants include:
...
Trolling, insulting/derogatory comments, and personal or political attacks
 ...
"""

-CHB


Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Adoption of a Code of Conduct

2018-08-03 Thread Chris Barker
On Fri, Aug 3, 2018 at 11:33 AM, Nelle Varoquaux 
wrote:

I think what matters in code of conduct is community buy-in and the
> discussions around it, more than the document itself.
>

This is a really good point. Though I think a community could still have
that discussion around whether and which CoC to adopt, rather than the
bike-shedding of the document itself.

And the reality is that a small sub-fraction of the community takes part in
the conversation anyway.

I'm very much on the fence about whether this thread has been truly
helpful, for instance, though it's certainly got me trolling the web
reading about the issue -- which I probably would not have done if this were
simply a "should we adopt the NumFocus CoC" thread...

By off-loading the discussion and writing process to someone else, you are
> missing most of the benefits of codes of conducts.
>

well, when reading about CoCs, it seems a large part of their benefit is not
to the existing community, but rather what they project to the rest of the
world, particularly possible new contributors.


> This is also the reason why I think codes of conduct should be revisited
> regularly.
>

That is a good idea, yes.

I'll note that at least the Contributor Covenant is pretty vague about
enforcement:

"""
All complaints will be reviewed and investigated and will result in a
response that is deemed necessary and appropriate to the circumstances.
"""

I'd think refining THAT part for the project may provide the benefits of
discussion...

-CHB



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Adoption of a Code of Conduct

2018-08-03 Thread Chris Barker
On Fri, Aug 3, 2018 at 12:45 PM, Stefan van der Walt 
wrote:

> I'll note that at least the Contributor Covenant is pretty vague about
>> enforcement:
>>
>> """
>> All complaints will be reviewed and investigated and will result in a
>> response that is deemed necessary and appropriate to the circumstances.
>> """
>>
>> I'd think refining THAT part for the project may provide the benefits of
>> discussion...
>>
>
> But the SciPy CoC has a whole additional document that goes into further
> detail on this specific issue, so let's not concern ourselves with the
> weaknesses of the Covenant (there are plenty),
>

Actually, I did not intend that to highlight a limitation in the
Covenant, but rather to point out that there is plenty to discuss, even if
one does adopt an existing CoC.

But as Ralf points out, that discussion has been had in the context of
scipy, so I agree -- numpy should adopt scipy's CoC and be done with it.

In fact, if someone still feels strongly that "political beliefs" should be
removed, then it's probably better to bring that up in the context of
scipy, rather than numpy -- as has been said, it is practically the same
community.

To the point where the scipy developers guide and the numpy developers
guide are published on the same web site.

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Adoption of a Code of Conduct

2018-08-06 Thread Chris Barker
> On August 4, 2018 00:23:44 Matthew Harrigan 
> wrote:
>
>> One concern I have is the phrase "explicitly honour" in "we explicitly
>> honour diversity in: age, culture, ...".  Honour is a curious word choice.
>> honour  is defined as, among
>> other things, "to worship", "high public esteem; fame; glory", and "a
>> source of credit or distinction".  I would object to some of those
>> interpretations.  Also its not clear to me how honouring diversity relates
>> to conduct.  I would definitely agree to follow the other parts of the
>> CoC and also to welcome others regardless of where they fall on the various
>> axes of diversity.  "Explicitly welcome" is better and much more closely
>> related to conduct IMO.
>>
>
> While honor may be a slightly strange choice, I don't think it is as
> strange as this specific definition makes it out to be. You also say "I
> honor my promise", i.e., I take it seriously, and it has meaning to me.
>
> Diversity has meaning to our community (it enriches us, both
> intellectually and otherwise) and should be cherished.
>

It's also key to note the specific phrasing -- it is *diversity* that is
honored, whereas we would (and do) welcome diverse individuals.

So I like the phasing as it is.

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Adoption of a Code of Conduct

2018-08-07 Thread Chris Barker
On Mon, Aug 6, 2018 at 5:30 PM, Matthew Harrigan  wrote:

> It's also key to note the specific phrasing -- it is *diversity* that is
>> honored, whereas we would (and do) welcome diverse individuals.
>>
>
> I'm afraid I miss your point.  I understand that diversity is what is
> being honoured in the current CoC, and that is my central issue.  My issue
> is not so much diversity, but more that honour is not the right word.  We
> all agree (I think/hope) that we should and do welcome diverse
> individuals.  That actually paraphrases my suggested edit:
>
> Though no list can hope to be comprehensive, we explicitly *welcome*
> diversity in: age, culture, ethnicity, genotype, gender identity or
> expression, language, national origin, neurotype, phenotype, political
> beliefs, profession, race, religion, sexual orientation, socioeconomic
> status, subculture and technical ability.
>

I think the authors were explicitly using a stronger word: diversity is not
just welcome, it is more than welcome -- it is honored -- that is, it's a
good thing that we explicitly want to support.


> Practically speaking I don't think my edit means much.  I can't think of a
> situation where someone is friendly, welcoming, and respectful to everyone
> yet should be referred referred to CoC committee for failing to honour
> diversity.  One goal of the CoC should be to make sure that diverse people
> from potentially marginalized or targeted groups feel welcome and my edit
> addresses that more directly than the original.  But in principle the
> difference, to me at least, is stark.  Thank you for considering my view.
>
>
> On Mon, Aug 6, 2018 at 1:58 PM, Chris Barker 
> wrote:
>
>>
>> On August 4, 2018 00:23:44 Matthew Harrigan 
>>> wrote:
>>>
>>>> One concern I have is the phrase "explicitly honour" in "we explicitly
>>>> honour diversity in: age, culture, ...".  Honour is a curious word choice.
>>>> honour <https://www.dictionary.com/browse/honour> is defined as, among
>>>> other things, "to worship", "high public esteem; fame; glory", and "a
>>>> source of credit or distinction".
>>>>
>>>
I think that last one is, in fact, the point.

Anyway, I for one think it's fine either way, but would suggest that any
minor changes like this be made to the SciPy CoC (if at all), and that
numpy uses the same one.

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] pytest, fixture and parametrize

2018-08-08 Thread Chris Barker
On Wed, Aug 8, 2018 at 9:38 AM, Evgeni Burovski 
wrote:

> Stdlib unittest supports self.assertRaises context manager from python 3.1
>

but that requires using unittest :-)

On Wed, Aug 8, 2018, 7:30 PM Eric Wieser 
> wrote:
>
>> You forget that we already have that feature in our testing layer,
>>
>> with np.testing.assert_raises(AnException):
>> pass
>>
>>
fair enough -- I wouldn't re-write that now, but as it's there already, it
may make sense to use it.

Perhaps we need a doc that lays out the preferred testing utilities.

Worthy of a NEP? Or is it just a README or something in the code base?

Personally, I think a commitment to pytest is the best way to go -- but
there are a lot of legacy tests, so there will be a jumble.
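
For the record, the pytest-native spelling would be something like this
(the test name is just for illustration):

import numpy as np
import pytest

def test_mismatched_matmul_raises():
    with pytest.raises(ValueError):
        np.zeros((2, 2)) @ np.zeros((3, 3))   # incompatible shapes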

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Taking back control of the #numpy irc channel

2018-08-10 Thread Chris Barker
On Wed, Aug 8, 2018 at 9:06 AM, Sebastian Berg

> If someone is unhappy with us two being the main
> contact/people who have those right on freenode,


On the contrary, thanks much for taking this on!

-CHB



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Dropping 32-bit from macOS wheels

2018-08-14 Thread Chris Barker
On Tue, Aug 14, 2018 at 2:17 AM, Matthew Brett 
wrote:

> We are planning to drop 32-bit compatibility from the numpy macOS
> wheels.


+1 -- it really is time.

I note that python.org has finally gone 64 bit only -- at least for the
default download:

"""
For 3.7.0, we provide two binary installer options for download. The
default variant is 64-bit-only and works on macOS 10.9 (Mavericks) and
later systems.
"""

granted, it'll be quite a while before everyone is running 3.7+, but still.

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Add pybind11 to docs about writing binding code

2018-08-20 Thread Chris Barker
On Mon, Aug 20, 2018 at 8:57 AM, Neal Becker  wrote:

> I'm confused, do you have a link or example showing how to use
> xtensor-python without pybind11?
>

I think you may have it backwards:

"""
The Python bindings for xtensor are based on the pybind11 C++ library,
which enables seemless interoperability between C++ and Python.
"""

So no, you can't use xtensor-python without pybind11 -- I think what was
suggested was that you *could* use xtensor-python without using xtensor on
the C++ side. I.e. xtensor-python is a higher level binding system than
pybind11 alone, rather than just bindings for xtensor. And thus it belongs in
the docs about binding tools.

Which makes me want to take a closer look at it...

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] A minor milestone

2018-09-08 Thread Chris Barker
There are probably a LOT of Windows users getting numpy from conda as well.

(I know my CI's and users do...)

It'd be nice if there was some way to track real usage!

-CHB


On Sat, Sep 8, 2018 at 3:44 PM, Charles R Harris 
wrote:

>
>
> On Fri, Sep 7, 2018 at 11:16 PM Andrew Nelson  wrote:
>
>> >  but on Travis I install it half a dozen times every day.
>>
>> Good point. I wonder if there's any way to take that into account when
>> considering whether to drop versions.
>>
>> On Sat, 8 Sep 2018 at 15:14, Nathaniel Smith  wrote:
>>
>>> On Fri, Sep 7, 2018 at 6:33 PM, Charles R Harris
>>>  wrote:
>>> > Thanks for the link. It would be nice to improve the Windows numbers,
>>> Linux
>>> > is still very dominant. I suppose that might be an artifact of the
>>> systems
>>> > used by developers as opposed to end users. It would be a different
>>> open
>>> > source world if Microsoft had always released their compilers for free
>>> and
>>> > kept them current with the evolving ISO specs.
>>>
>>> Well, keep in mind also that it's counting installs, not users...
>>> people destroy and reinstall Linux systems a *lot* more often than
>>> they do Windows/macOS systems, what with clouds and containers and CI
>>> systems and all. On my personal laptop I install numpy maybe once per
>>> release, but on Travis I install it half a dozen times every day.
>>>
>>>
> Would be interesting if the travisCI and appveyor downloads could be
> separated out.
>
> Chuck
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] asanyarray vs. asarray

2018-10-29 Thread Chris Barker
On Fri, Oct 26, 2018 at 7:12 PM, Travis Oliphant 
wrote:


>  agree that we can stop bashing subclasses in general.   The problem with
> numpy subclasses is that they were made without adherence to SOLID:
> https://en.wikipedia.org/wiki/SOLID.  In particular the Liskov
> substitution principle:  https://en.wikipedia.org/wiki/
> Liskov_substitution_principle .
>

...


> did not properly apply them in creating np.matrix which clearly violates
> the substitution principle.
>

So -- could a matrix subclass be made "properly"? or is that an example of
something that should not have been a subclass?

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] asanyarray vs. asarray

2018-10-30 Thread Chris Barker
On Tue, Oct 30, 2018 at 2:22 PM, Stephan Hoyer  wrote:

> The Liskov substitution principle (LSP) suggests that the set of
> reasonable ndarray subclasses are exactly those that could also in
> principle correspond to a new dtype. Of np.ndarray subclasses in
> wide-spread use, I think only the various "array with units" types come
> close satisfying this criteria. They only fall short insofar as they
> present a misleading dtype (without unit information).
>

How about subclasses that only add functionality? My only use case of
subclassing is exactly that:

I have a "bounding box" object (probably could have been called a
rectangle) that is a subclass of ndarray, is always shape (2,2), and has
various methods for merging two such boxes, etc, adding a point, etc.

I did it that way, 'cause I had a lot of code already that simply used a
(2,2) array to represent a bounding box, and I wanted all that code to
still work.

I have had zero problems with it.

Maybe that's too trivial to be worth talking about, but this kind of use
case can be handy.
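
A stripped-down sketch of the idea (names and the [[min_x, min_y],
[max_x, max_y]] convention here are just for illustration -- this is not the
actual code):

import numpy as np

class BBox(np.ndarray):
    # a (2, 2) array with a little extra API tacked on
    def __new__(cls, data):
        return np.asarray(data, dtype=float).reshape(2, 2).view(cls)

    def merge(self, other):
        other = np.asarray(other, dtype=float).reshape(2, 2)
        return BBox([np.minimum(self[0], other[0]),
                     np.maximum(self[1], other[1])])

bb = BBox([[0, 0], [1, 1]])
bb.merge([[0.5, -1], [2, 0.5]])   # -> BBox([[ 0., -1.], [ 2.,  1.]])
bb * 2.0                          # still works anywhere a (2, 2) array does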

It is a bit awkward to write the code, though -- it would be nice to have a
cleaner API for this sort of subclassing (not that I have any idea how to
do that)

The main problem with subclassing for numpy.ndarray is that it guarantees
> too much: a large set of operations/methods along with a specific memory
> layout exposed as part of its public API.
>

This is a big deal -- we really have two concepts here:
 - a Python class (type) with certain behaviors in Python code
 - a wrapper around a strided memory block.

maybe it's possible to be clear about that distinction:

"Duck Arrays" are the Python API

Maybe a C-API object  would be useful, that shares the memory layout, but
could have completely different functionality at the Python level.

- CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [SPAM]Re: introducing Numpy.net, a pure C# implementation of Numpy

2019-03-27 Thread Chris Barker
On Mon, Mar 18, 2019 at 1:19 PM Paul Hobson  wrote:

>
>> I'm a civil engineer who adopted Python early in his career and became
> the "data guy" in the office pretty early on. Our company's IT department
> manages lots of Windows Servers running SQL Server. In my case, running
> python apps on our infrastructure just isn't feasible or supported by the
> IT department.
>

Just curious -- does it have to be C#? or could it be any CLR application
-- i.e. IronPython?

I imagine you could build a web service pretty easily in IronPython --
though AFAIK, the attempts at getting numpy support (and thus Pandas, etc)
never panned out.

The point of all of this is that in those situations, have a numpy-like
> library would be very nice indeed. I've very excited to hear that the OP's
> work has been open sourced.
>

I wonder if the OP's work could be used to make a numpy for Iron Python
native to the CLR 

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Boolean arrays with nulls?

2019-04-22 Thread Chris Barker
On Thu, Apr 18, 2019 at 10:52 AM Stuart Reynolds 
wrote:

> Is float8 a thing?
>

no, but np.float16 is -- so at least only twice as much memory as you need
:-)

array([ nan,  inf, -inf], dtype=float16)
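
i.e. something like this sketch, encoding True / False / missing as
1.0 / 0.0 / nan:

import numpy as np

vals = np.array([1.0, 0.0, np.nan], dtype=np.float16)  # True, False, missing
np.isnan(vals)    # -> [False, False,  True] : which entries are "null"
vals == 1.0       # -> [ True, False, False] : which entries are True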

I think masked arrays are going to be just as much, as they need to carry
the mask.

-CHB



>
> On Thu, Apr 18, 2019 at 9:46 AM Stefan van der Walt 
> wrote:
>
>> Hi Stuart,
>>
>> On Thu, 18 Apr 2019 09:12:31 -0700, Stuart Reynolds wrote:
>> > Is there an efficient way to represent bool arrays with null entries?
>>
>> You can use the bool dtype:
>>
>> In [5]: x = np.array([True, False, True])
>>
>>
>>
>> In [6]: x
>>
>>
>> Out[6]: array([ True, False,  True])
>>
>> In [7]: x.dtype
>>
>>
>> Out[7]: dtype('bool')
>>
>> You should note that this stores one True/False value per byte, so it is
>> not optimal in terms of memory use.  There is no easy way to do
>> bit-arrays with NumPy, because we use strides to determine how to move
>> from one memory location to the next.
>>
>> See also:
>> https://www.reddit.com/r/Python/comments/5oatp5/one_bit_data_type_in_numpy/
>>
>> > What I’m hoping for is that there’s a structure that is ‘viewed’ as
>> > nan-able float data, but backed but a more efficient structures
>> > internally.
>>
>> There are good implementations of this idea, such as:
>>
>> https://github.com/ilanschnell/bitarray
>>
>> Those structures cannot typically utilize the NumPy machinery, though.
>> With the new array function interface, you should at least be able to
>> build something that has something close to the NumPy API.
>>
>> Best regards,
>> Stéfan
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] grant proposal for core scientific Python projects (rejected)

2019-05-03 Thread Chris Barker
On Thu, May 2, 2019 at 11:51 PM Ralf Gommers  wrote:

> On Fri, May 3, 2019 at 3:49 AM Stephen Waterbury 
> wrote:
>
>> P.S.  If anyone wants to continue this discussion at SciPy 2019,
>> I will be there (on my own nickel!  ;) ...
>>
>
So will I (on NOAA's nickel, which I am grateful for)

Maybe we should hold a BoF, or even something more formal, on Government
support for SciPY Stack development?

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] grant proposal for core scientific Python projects (rejected)

2019-05-03 Thread Chris Barker
On Fri, May 3, 2019 at 9:56 AM Stephen Waterbury 
wrote:

> Sure, I would be interested to discuss, let's try to meet up there.
>
OK, that's two of us :-)

NumFocus folk: Should we take this off the list and talk about a BoF or
something at SciPy?

-CHB





> Steve
>
> On 5/3/19 12:23 PM, Chris Barker wrote:
>
> On Thu, May 2, 2019 at 11:51 PM Ralf Gommers 
> wrote:
>
>> On Fri, May 3, 2019 at 3:49 AM Stephen Waterbury 
>> wrote:
>>
>>> P.S.  If anyone wants to continue this discussion at SciPy 2019,
>>> I will be there (on my own nickel!  ;) ...
>>>
>>
> So will I (on NOAA's nickel, which I am grateful for)
>
> Maybe we should hold a BoF, or even something more formal, on Government
> support for SciPY Stack development?
>
> -CHB
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R(206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115   (206) 526-6317   main reception
>
> chris.bar...@noaa.gov
>
> ___
> NumPy-Discussion mailing 
> listNumPy-Discussion@python.orghttps://mail.python.org/mailman/listinfo/numpy-discussion
>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Style guide for numpy code?

2019-05-08 Thread Chris Barker
Hey all,

Do any of you know of a style guide for computational / numpy code?

I don't mean code that will go into numpy itself, but rather users' code
that uses numpy (and scipy, and...)

I know about (am a proponent

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] numpy finding local tests on import?!?!

2019-05-10 Thread Chris Barker
TL;DR:

This issue appears to have been fixed in numpy 1.15 (at least, I didn't
test 1.14)

However, I also had some issues in my environment that I fixed, so it
may be that numpy's behavior hasn't changed -- I don't have the energy to
test now.

And it doesn't hurt to have this in the archives in case someone else runs
into the problem.

Read on if you care about weird behaviour with the testing package in numpy
1.13

Numpy appears to be both running tests on import (or at least the
runner), and finding local tests that are not numpy's.

I found this issue (closed without a resolution):

https://github.com/numpy/numpy/issues/11457

which is related -- but it's about the import time of numpy.testing, and
not about errors/issues from that import. But maybe the import process has
been changed in newer numpys.

What I did, and what I got:

I am trying to debug what looks like a numpy-related issue in a project.

So one thing I did was try to import numpy and check __version__:

python -c "import numpy; print(numpy.__version__)"

very weird barf:


  File
"/Users/chris.barker/miniconda2/envs/gridded/lib/python2.7/unittest/runner.py",
line 4, in 
import time
  File "time.py", line 7, in 
import netCDF4 as nc4
  File
"/Users/chris.barker/miniconda2/envs/gridded/lib/python2.7/site-packages/netCDF4/__init__.py",
line 3, in 
from ._netCDF4 import *
  File "include/netCDF4.pxi", line 728, in init netCDF4._netCDF4
(netCDF4/_netCDF4.c:83784)
AttributeError: 'module' object has no attribute 'ndarray

I get the same thing if I fire up the interpreter and then import numpy

as the error seemed to come from:

unittest/runner.py

I had a hunch.

I was, in fact, running with my current working directory in the package
dir of my project, and there is a test package in that dir

I cd'd out of that, and presto! numpy imports fine:

$ python -c "import numpy; print(numpy.__version__)"
1.13.1

OK, that's a kinda old numpy -- but it's the minimum required by my
project. (Though I can probably update that -- I'll do that soon.)
So it appears that the test runner is looking in the current working dir
(or, I suppose, sys.path) for packages called tests -- this seems like a
broken system: unless you are running the tests explicitly from the command
line, it shouldn't look in the cwd, and it probably shouldn't ever look in
all of sys.path.

But my bigger confusion here is -- why the heck is the test runner being
run at ALL on a simple import?!?!

If this has been fixed / changed in newer numpys, then OK -- I'll update my
dependencies.

-CHB



-

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] defining a NumPy API standard?

2019-06-02 Thread Chris Barker
On Sun, Jun 2, 2019 at 3:45 AM Dashamir Hoxha  wrote:

>
> Would it be useful if we could integrate the documentation system with a
> discussion forum (like Discourse.org)? Each function can be linked to its
> own discussion topic, where users and developers can discuss about the
> function, upvote or downvote it etc. This kind of discussion seems to be a
> bit more structured than a mailing list discussion.
>

We could make a GitHub repo for a document, and use issues to separately
discuss each topic.

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] defining a NumPy API standard?

2019-06-04 Thread Chris Barker
One little point here:



>   * np.ndarray.cumprod: low importance -> prefer np.multiply.accumulate
>>
>
I think that's an example of something that *should* be part of the numpy
API, but should be implemented as a mixin, based on np.multiply.accumulate.

As I'm still a bit confused about the goal here, that means that:

Users should still use `.cumprod`, but implementers of numpy-like packages
should implement `.multiply.accumulate`, and not directly `cumprod`, but
rather use the numpy ABC, or however it is implemented.
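
For the record, the two spellings give the same result:

import numpy as np

a = np.array([1, 2, 3, 4])
a.cumprod()                  # -> [ 1  2  6 24]
np.multiply.accumulate(a)    # -> [ 1  2  6 24]  (the same thing)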

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] How to Capitalize numpy?

2019-09-16 Thread Chris Barker
Trivial note:

On the subject of naming things (spelling things??) -- should it be:

numpy
or
Numpy
or
NumPy
?

All three are in the draft NEP 30 (mostly "NumPy"; I noticed this when
reading/copy-editing the NEP). Is there an "official" capitalization?

My preference, would be to use "numpy", and where practicable, use a
"computer" font -- i.e. ``numpy`` in RST.

But if there is consensus already for anything else, that's fine, I'd just
like to know what it is.

-CHB



On Mon, Aug 12, 2019 at 4:02 AM Peter Andreas Entschev 
wrote:

> Apologies for the late reply. I've opened a new PR
> https://github.com/numpy/numpy/pull/14257 with the changes requested
> on clarifying the text. After reading the detailed description, I've
> decided to add a subsection "Scope" to clarify the scope where NEP-30
> would be useful. I think the inclusion of this new subsection
> complements the "Detail description" forming a complete text w.r.t.
> motivation of the NEP, but feel free to point out disagreements with
> my suggestion. I've also added a new section "Usage" pointing out how
> one would use duck array in replacement to np.asarray where relevant.
>
> Regarding the naming discussion, I must say I like the idea of keeping
> the __array_ prefix, but it seems like that is going to be difficult
> given that none of the existing ideas so far play very nicely with
> that. So if the general consensus is to go with __numpy_like__, I
> would also update the NEP to reflect that changes. FWIW, I
> particularly neither like nor dislike __numpy_like__, but I don't have
> any better suggestions than that or keeping the current naming.
>
> Best,
> Peter
>
> On Thu, Aug 8, 2019 at 3:40 AM Stephan Hoyer  wrote:
> >
> >
> >
> > On Wed, Aug 7, 2019 at 6:18 PM Charles R Harris <
> charlesr.har...@gmail.com> wrote:
> >>
> >>
> >>
> >> On Wed, Aug 7, 2019 at 7:10 PM Stephan Hoyer  wrote:
> >>>
> >>> On Wed, Aug 7, 2019 at 5:11 PM Ralf Gommers 
> wrote:
> 
> 
>  On Mon, Aug 5, 2019 at 6:18 PM Stephan Hoyer 
> wrote:
> >
> > On Mon, Aug 5, 2019 at 2:48 PM Ralf Gommers 
> wrote:
> >
> >>
> >> The NEP currently does not say who this is meant for. Would you
> expect libraries like SciPy to adopt it for example?
> >>
> >> The NEP also (understandably) punts on the question of when
> something is a valid duck array. If you want this to be widely used, that
> will need an answer or at least some rough guidance though. For example, we
> would expect a duck array to have a mean() method, but probably not a ptp()
> method. A library author who wants to use np.duckarray() needs to know,
> because she can't test with all existing and future duck array
> implementations.
> >
> >
> > I think this is covered in NEP-22 already.
> 
> 
>  It's not really. We discussed this briefly in the community call
> today, Peter said he will try to add some text.
> 
>  We should not add new functions to NumPy without indicating who is
> supposed to use this, and what need it fills / problem it solves. It seems
> pretty clear to me that it's mostly aimed at library authors rather than
> end users. And also that mature libraries like SciPy may not immediately
> adopt it, because it's too fuzzy - so it's new libraries first, mature
> libraries after the dust has settled a bit (I think).
> >>>
> >>>
> >>> I totally agree -- we definitely should clarify this in the docstring
> and elsewhere in the docs. An example in the new doc page on "Writing
> custom array containers" (
> https://numpy.org/devdocs/user/basics.dispatch.html) would also probably
> be appropriate.
> >>>
> >
> > As discussed there, I don't think NumPy is in a good position to
> pronounce decisive APIs at this time. I would welcome efforts to try, but I
> don't think that's essential for now.
> 
> 
>  There's no need to pronounce a decisive API that fully covers duck
> array. Note that RNumPy is an attempt in that direction (not a full one,
> but way better than nothing). In the NEP/docs, at least saying something
> along the lines of "if you implement this, we recommend the following
> strategy: check if a function is present in Dask, CuPy and Sparse. If so,
> it's reasonable to expect any duck array to work here. If not, we suggest
> you indicate in your docstring what kinds of duck arrays are accepted, or
> what properties they need to have". That's a spec by implementation, which
> is less than ideal but better than saying nothing.
> >>>
> >>>
> >>> OK, I agree here as well -- some guidance is better than nothing.
> >>>
> >>> Two other minor notes on this NEP, concerning naming:
> >>> 1. We should have a brief note on why we settled on the name "duck
> array". Namely, as discussed in NEP-22, we don't love the "duck" jargon,
> but we couldn't come up with anything better since NumPy already uses
> "array like" and "any array" for different purposes.
> >>> 2. The protocol should use *something* more clearly namesp

Re: [Numpy-discussion] NEP 30 - Duck Typing for NumPy Arrays - Implementation

2019-09-16 Thread Chris Barker
On Mon, Aug 12, 2019 at 4:02 AM Peter Andreas Entschev 
wrote:

> Apologies for the late reply. I've opened a new PR
> https://github.com/numpy/numpy/pull/14257 with the changes requested
>

thanks!

I've written a small PR on your PR:

https://github.com/pentschev/numpy/pull/1

Essentially, other than typos and copy editing, I'm suggesting that a
duck array may choose to implement __array__ -- it should, of course,
return an actual NumPy array.

I think this could be useful, as much code does require an actual NumPy
array, and only the class itself knows how best to convert to one.

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Add a total_seconds() method to timedelta64?

2019-09-16 Thread Chris Barker
I just noticed that there is no obvious way to convert a timedelta64 to
seconds (or some other easy unit) as a number.

The stdlib datetime.timedelta has a .total_seconds() method for doing that.
I think it's a handy thing to have.

Looking at StackOverflow (and others), I see people suggesting things like:

a_timedelta.astype(np.float) / 1e6

This seems a really bad idea, as it's assuming the timedelta is storing
microseconds.

The "proper" way to do it also suggested:

a_timedelta / np.timedelta64(1, 's')

This is, in fact, a much better way to do it, and allows you to specify
other units if you like: "ms"., "us", etc.
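
For example (made-up values, just to illustrate the unit-division approach):

import numpy as np

# a made-up 90-second difference between two datetimes
delta = np.datetime64("2019-09-16T12:01:30") - np.datetime64("2019-09-16T12:00:00")

delta / np.timedelta64(1, "s")    # 90.0 -- seconds, as a float
delta / np.timedelta64(1, "m")    # 1.5 -- minutes
delta / np.timedelta64(1, "ms")   # 90000.0 -- milliseconds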

There was semi-recently a discussion thread on python-ideas about adding
other methods to datetime (e.g. .total_hours, .total_minutes). That was
pretty much rejected (or petered out anyway), and some argued that dividing
by a timedelta of the unit you want is the "right" way to do it anyway
(some argued that .total_seconds() never should have been added).

Personally I understand the "correctness" of dividing by a unit timedelta,
but "practicality beats purity", and the discoverability of a method or two
really makes it easier on folks.

That being said, if folks don't want to add .total_seconds() and the like --
we should add a bit to the docs about this, suggesting the division
approach.

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] How to Capitalize numpy?

2019-09-16 Thread Chris Barker
got it, thanks.

I've fixed that typo in a PR I'm working on, too.

-CHB


On Mon, Sep 16, 2019 at 2:41 PM Ralf Gommers  wrote:

>
>
> On Mon, Sep 16, 2019 at 1:42 PM Peter Andreas Entschev 
> wrote:
>
>> My answer to that: "NumPy". Reference: logo at the top of
>> https://numpy.org/neps/index.html .
>>
>
> Yes, NumPy is the right capitalization
>
>
>
>> In NEP-30 [1], I've used "NumPy" everywhere, except for references to
>> code, repos, etc., where "numpy" is used. I see there's one occurrence
>> of "Numpy", which was definitely a typo and I had not noticed it until
>> now, but I will address this on a future update, thanks for pointing
>> that out.
>>
>> [1] https://numpy.org/neps/nep-0030-duck-array-protocol.html
>>
>> On Mon, Sep 16, 2019 at 9:09 PM Chris Barker 
>> wrote:
>> >
>> > Trivial note:
>> >
>> > On the subject of naming things (spelling things??) -- should it be:
>> >
>> > numpy
>> > or
>> > Numpy
>> > or
>> > NumPy
>> > ?
>> >
>> > All three are in the draft NEP 30 ( mostly "NumPy", I noticed this when
>> reading/copy editing the NEP) . Is there an "official" capitalization?
>> >
>> > My preference, would be to use "numpy", and where practicable, use a
>> "computer" font -- i.e. ``numpy`` in RST.
>> >
>> > But if there is consensus already for anything else, that's fine, I'd
>> just like to know what it is.
>> >
>> > -CHB
>> >
>> >
>> >
>> > On Mon, Aug 12, 2019 at 4:02 AM Peter Andreas Entschev <
>> pe...@entschev.com> wrote:
>> >>
>> >> Apologies for the late reply. I've opened a new PR
>> >> https://github.com/numpy/numpy/pull/14257 with the changes requested
>> >> on clarifying the text. After reading the detailed description, I've
>> >> decided to add a subsection "Scope" to clarify the scope where NEP-30
>> >> would be useful. I think the inclusion of this new subsection
>> >> complements the "Detail description" forming a complete text w.r.t.
>> >> motivation of the NEP, but feel free to point out disagreements with
>> >> my suggestion. I've also added a new section "Usage" pointing out how
>> >> one would use duck array in replacement to np.asarray where relevant.
>> >>
>> >> Regarding the naming discussion, I must say I like the idea of keeping
>> >> the __array_ prefix, but it seems like that is going to be difficult
>> >> given that none of the existing ideas so far play very nicely with
>> >> that. So if the general consensus is to go with __numpy_like__, I
>> >> would also update the NEP to reflect that changes. FWIW, I
>> >> particularly neither like nor dislike __numpy_like__, but I don't have
>> >> any better suggestions than that or keeping the current naming.
>> >>
>> >> Best,
>> >> Peter
>> >>
>> >> On Thu, Aug 8, 2019 at 3:40 AM Stephan Hoyer  wrote:
>> >> >
>> >> >
>> >> >
>> >> > On Wed, Aug 7, 2019 at 6:18 PM Charles R Harris <
>> charlesr.har...@gmail.com> wrote:
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Wed, Aug 7, 2019 at 7:10 PM Stephan Hoyer 
>> wrote:
>> >> >>>
>> >> >>> On Wed, Aug 7, 2019 at 5:11 PM Ralf Gommers <
>> ralf.gomm...@gmail.com> wrote:
>> >> >>>>
>> >> >>>>
>> >> >>>> On Mon, Aug 5, 2019 at 6:18 PM Stephan Hoyer 
>> wrote:
>> >> >>>>>
>> >> >>>>> On Mon, Aug 5, 2019 at 2:48 PM Ralf Gommers <
>> ralf.gomm...@gmail.com> wrote:
>> >> >>>>>
>> >> >>>>>>
>> >> >>>>>> The NEP currently does not say who this is meant for. Would you
>> expect libraries like SciPy to adopt it for example?
>> >> >>>>>>
>> >> >>>>>> The NEP also (understandably) punts on the question of when
>> something is a valid duck array. If you want this to be widely used, that
>> will need an answer or at least some rough guidance though. For example, we
>> would expect a duck array to have a mean() method, but

Re: [Numpy-discussion] How to Capitalize numpy?

2019-09-16 Thread Chris Barker
Thanks Joe, looks like everyone agrees:

In text, NumPy it is.

-CHB



On Mon, Sep 16, 2019 at 2:41 PM Joe Harrington  wrote:

> Here are my thoughts on textual capitalization (at first, I thought you
> wanted to raise money!):
>
> We all agree that in code, it is "numpy".  If you don't use that, it
> throws an error.  If, in text, we keep "numpy" with a forced lower-case
> letter at the start, it is just one more oddball to remember.  It is even
> weirder in titles and the beginnings of sentences.  I'd strongly like not
> to be weird that way.  A few packages are, it's annoying, and it doesn't
> much earn them any goodwill. The default among people who are not "in the
> know" will be to do what they're used to.  Let's give them what they're
> used to, a proper noun with initial (at least) capital.
>
> Likewise, I object to preferring a particular font.  What fonts to use for
> the names of things like software packages is a decision for publications
> to make.  A journal or manual might make fine distinctions and demand
> several different, specific fonts, while a popular publication might prefer
> not to do that.  Leave the typesetting to the editors of the publications.
> We can certainly adopt a standard for our publications (docs, web pages,
> etc.), but we should state explicitly that others can do as they like.
>
> It's not an acronym, so that leaves the options of "Numpy" and "NumPy".
> It would be great, easy to remember, consistent for others, etc., if NumPy
> and SciPy were capitalized the same way and were pronounced the same (I
> still occasionally hear "numpee").  So, I would favor "NumPy" to go along
> with "SciPy", and let the context choose the font.
>
> --jh--
>
>
> On 9/16/19 9:09 PM, Chris Barker 
>  wrote:
>
>
>
>
>
> Trivial note:
>
> On the subject of naming things (spelling things??) -- should it be:
>
> numpy
> or
> Numpy
> or
> NumPy
> ?
>
> All three are in the draft NEP 30 ( mostly "NumPy", I noticed this when
> reading/copy editing the NEP) . Is there an "official" capitalization?
>
> My preference, would be to use "numpy", and where practicable, use a
> "computer" font -- i.e. ``numpy`` in RST.
>
> But if there is consensus already for anything else, that's fine, I'd just
> like to know what it is.
>
> -CHB
>
>
>
> On Mon, Aug 12, 2019 at 4:02 AM Peter Andreas Entschev 
> wrote:
>
>> Apologies for the late reply. I've opened a new PR
>> https://github.com/numpy/numpy/pull/14257 with the changes requested
>> on clarifying the text. After reading the detailed description, I've
>> decided to add a subsection "Scope" to clarify the scope where NEP-30
>> would be useful. I think the inclusion of this new subsection
>> complements the "Detail description" forming a complete text w.r.t.
>> motivation of the NEP, but feel free to point out disagreements with
>> my suggestion. I've also added a new section "Usage" pointing out how
>> one would use duck array in replacement to np.asarray where relevant.
>>
>> Regarding the naming discussion, I must say I like the idea of keeping
>> the __array_ prefix, but it seems like that is going to be difficult
>> given that none of the existing ideas so far play very nicely with
>> that. So if the general consensus is to go with __numpy_like__, I
>> would also update the NEP to reflect that changes. FWIW, I
>> particularly neither like nor dislike __numpy_like__, but I don't have
>> any better suggestions than that or keeping the current naming.
>>
>> Best,
>> Peter
>>
>> On Thu, Aug 8, 2019 at 3:40 AM Stephan Hoyer  wrote:
>> >
>> >
>> >
>> > On Wed, Aug 7, 2019 at 6:18 PM Charles R Harris <
>> charlesr.har...@gmail.com> wrote:
>> >>
>> >>
>> >>
>> >> On Wed, Aug 7, 2019 at 7:10 PM Stephan Hoyer  wrote:
>> >>>
>> >>> On Wed, Aug 7, 2019 at 5:11 PM Ralf Gommers 
>> wrote:
>> >>>>
>> >>>>
>> >>>> On Mon, Aug 5, 2019 at 6:18 PM Stephan Hoyer 
>> wrote:
>> >>>>>
>> >>>>> On Mon, Aug 5, 2019 at 2:48 PM Ralf Gommers 
>> wrote:
>> >>>>>
>> >>>>>>
>> >>>>>> The NEP currently does not say who this is meant for. Would you
>> expect libraries like SciPy to adopt it for example?
>> >>>>>>
>

Re: [Numpy-discussion] NEP 30 - Duck Typing for NumPy Arrays - Implementation

2019-09-16 Thread Chris Barker
On Mon, Sep 16, 2019 at 1:46 PM Peter Andreas Entschev 
wrote:

> What would be the use case for a duck-array to implement __array__ and
> return a NumPy array?


some users need a genuine, actual numpy array (for passing to Cython code,
for example).
if __array__ is not implemented, how can they get that from an array-like
object??

Only the author of the array-like object knows how best to make a numpy
array out of it.

> Unless I'm missing something, this seems
> redundant and one should just use array/asarray functions then.


but if the object does not implement __array__, then users can't use the
array/asarray functions!


> This
> would also prevent error-handling, what if the developer intentionally
> wants a NumPy-like array (e.g., the original array passed to the
> duckarray function) or an exception (instead of coercing to a NumPy
> array)?
>

I'm really confused now -- if an end user wants a duck array, they should
call duckarray() -- if they want an actual numpy array, they should call
np.asarray().

Why would anyone want an exception? If you don't want an array, then don't
call asarray().

If you call duckarray(), and the object has not implemented __duckarray__,
then you will get an exception -- which you should.

If you call __array__, and __array__ has not been implemented, then you will
get an exception.

What is the potential problem here?

Which makes me think -- why should duck arrays ever implement an __array__
method that raises an exception? Why not just not implement it? (Unless you
want to add some helpful error message -- which I did for the example in my
PR.)

(PR to the numpy repo in progress)

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP 30 - Duck Typing for NumPy Arrays - Implementation

2019-09-16 Thread Chris Barker
On Mon, Sep 16, 2019 at 2:27 PM Stephan Hoyer  wrote:

> On Mon, Sep 16, 2019 at 1:45 PM Peter Andreas Entschev 
> wrote:
>
>> What would be the use case for a duck-array to implement __array__ and
>> return a NumPy array?
>
>

> Dask arrays are a good example. They will want to implement __duck_array__
> (or whatever we call it) because they support duck typed versions of NumPy
> operation. They also (already) implement __array__, so they can be converted
> into NumPy arrays as a fallback. This is convenient for moderately sized
> dask arrays, e.g., so you can pass one into a matplotlib function.
>

Exactly.

And I have implemented __array__ in classes that are NOT duck arrays at all
(an image class, for instance). But I also can see wanting to support both:

use me as a duck array
and
convert me into a proper numpy array.

OK -- looking again at the NEP, I see this suggested implementation:

def duckarray(array_like):
    if hasattr(array_like, '__duckarray__'):
        return array_like.__duckarray__()
    return np.asarray(array_like)

So I see the point now: if a user wants a duck array, they may not want
to accidentally coerce this object to a real array (potentially expensive).

But in this case, asarray() will only get called (and thus __array__ will
only get called) if __duckarray__ is not implemented. So the only reason
to implement __array__ and raise an exception is so that users will get
that exception if they specifically call asarray() -- why should they get
that?

I'm working on a PR with a suggestion for this.
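
For concreteness, a rough sketch of how I picture that dispatch behaving
(the class here is hypothetical, and duckarray() is the helper from the NEP
quoted above):

import numpy as np

class MyDuck:
    # implements only the duck protocol; no __array__ at all
    def __duckarray__(self):
        return self

d = MyDuck()
duckarray(d)          # returns d itself; the np.asarray fallback never runs
duckarray([1, 2, 3])  # no __duckarray__ here, so this falls back to np.asarray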

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP 30 - Duck Typing for NumPy Arrays - Implementation

2019-09-16 Thread Chris Barker
OK -- I *finally* got it:

when you pass an arbitrary object into np.asarray(), it will create a
zero-d object array wrapping the object.

So yes, I can see that you may want to raise a TypeError instead, so that
users don't get a zero-d object array when they were expecting to get an
array-like object.
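
A quick illustration (with a throwaway class, just to show the fallback):

import numpy as np

class NotAnArray:
    pass

a = np.asarray(NotAnArray())
a.shape   # () -- a zero-d array
a.dtype   # dtype('O') -- the object was wrapped, not converted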

So it's probably a good idea to recommend that when a class implements
__duckarray__, it also implements __array__, which can either raise an
exception or return an ndarray.

-CHB


On Mon, Sep 16, 2019 at 3:11 PM Chris Barker  wrote:

> On Mon, Sep 16, 2019 at 2:27 PM Stephan Hoyer  wrote:
>
>> On Mon, Sep 16, 2019 at 1:45 PM Peter Andreas Entschev <
>> pe...@entschev.com> wrote:
>>
>>> What would be the use case for a duck-array to implement __array__ and
>>> return a NumPy array?
>>
>>
>
>> Dask arrays are a good example. They will want to implement
>> __duck_array__ (or whatever we call it) because they support duck typed
>> versions of NumPy operation. They also (already) implement __array__, so
>> they can be converted into NumPy arrays as a fallback. This is convenient for
>> moderately sized dask arrays, e.g., so you can pass one into a matplotlib
>> function.
>>
>
> Exactly.
>
> And I have implemented __array__ in classes that are NOT duck arrays at
> all (an image class, for instance). But I also can see wanting to support
> both:
>
> use me as a duck array
> and
> convert me into a proper numpy array.
>
> OK -- looking again at the NEP, I see this suggested implementation:
>
> def duckarray(array_like):
>     if hasattr(array_like, '__duckarray__'):
>         return array_like.__duckarray__()
>     return np.asarray(array_like)
>
> So I see the point now, if a user wants a duck array -- they may not want
> to accidentally coerce this object to a real array (potentially expensive).
>
> but in this case, asarray() will only get called (and thus __array__ will
> only get called), if __duckarray__ is not implemented. So the only reason
> to implement __array__ and raise an exception is so that users will get
> that exception if they specifically call asarray() -- why should they get
> that??
>
> I'm working on a PR with suggestion for this.
>
> -CHB
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R(206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115   (206) 526-6317   main reception
>
> chris.bar...@noaa.gov
>


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP 30 - Duck Typing for NumPy Arrays - Implementation

2019-09-16 Thread Chris Barker
Here's a PR with a different discussion of __array__:

https://github.com/numpy/numpy/pull/14529

-CHB


On Mon, Sep 16, 2019 at 3:23 PM Chris Barker  wrote:

> OK -- I *finally* got it:
>
> when you pass an arbitrary object into np.asarray(), it will create an
> array object scalar with the object in it.
>
> So yes, I can see that you may want to raise a TypeError instead, so that
> users don't get an object array scalar when they were expecting to get an
> array-like object.
>
> So it's probably a good idea to recommend that when a class implements
> __duckarray__ that it also implements __array__, which can either raise an
> exception or return an ndarray.
>
> -CHB
>
>
> On Mon, Sep 16, 2019 at 3:11 PM Chris Barker 
> wrote:
>
>> On Mon, Sep 16, 2019 at 2:27 PM Stephan Hoyer  wrote:
>>
>>> On Mon, Sep 16, 2019 at 1:45 PM Peter Andreas Entschev <
>>> pe...@entschev.com> wrote:
>>>
>>>> What would be the use case for a duck-array to implement __array__ and
>>>> return a NumPy array?
>>>
>>>
>>
>>> Dask arrays are a good example. They will want to implement
>>> __duck_array__ (or whatever we call it) because they support duck typed
>>> versions of NumPy operation. They also (already) implement __array__, so
>>> they can be converted into NumPy arrays as a fallback. This is convenient for
>>> moderately sized dask arrays, e.g., so you can pass one into a matplotlib
>>> function.
>>>
>>
>> Exactly.
>>
>> And I have implemented __array__ in classes that are NOT duck arrays at
>> all (an image class, for instance). But I also can see wanting to support
>> both:
>>
>> use me as a duck array
>> and
>> convert me into a proper numpy array.
>>
>> OK -- looking again at the NEP, I see this suggested implementation:
>>
>> def duckarray(array_like):
>>     if hasattr(array_like, '__duckarray__'):
>>         return array_like.__duckarray__()
>>     return np.asarray(array_like)
>>
>> So I see the point now, if a user wants a duck array -- they may not want
>> to accidentally coerce this object to a real array (potentially expensive).
>>
>> but in this case, asarray() will only get called (and thus __array__ will
>> only get called), if __duckarray__ is not implemented. So the only reason
>> to implement __array__ and raise an exception is so that users will get
>> that exception if they specifically call asarray() -- why should they get
>> that??
>>
>> I'm working on a PR with suggestion for this.
>>
>> -CHB
>>
>> --
>>
>> Christopher Barker, Ph.D.
>> Oceanographer
>>
>> Emergency Response Division
>> NOAA/NOS/OR&R(206) 526-6959   voice
>> 7600 Sand Point Way NE   (206) 526-6329   fax
>> Seattle, WA  98115   (206) 526-6317   main reception
>>
>> chris.bar...@noaa.gov
>>
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R(206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115   (206) 526-6317   main reception
>
> chris.bar...@noaa.gov
>


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP 30 - Duck Typing for NumPy Arrays - Implementation

2019-09-17 Thread Chris Barker
On Tue, Sep 17, 2019 at 6:56 AM Peter Andreas Entschev 
wrote:

> I agree with your point and understand how the current text may be
> misleading, so we shall make it clearer in the NEP (as done in
> https://github.com/numpy/numpy/pull/14529) that both are valid ways:
>
> * Have a genuine implementation of __array__ (like Dask, as pointed
> out by Stephan); or
> * Raise an exception (as CuPy does).
>

great -- sounds like we're all (well, three of us anyway) on the same
page.

Just need to sort out the text.
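
Roughly, I picture the two options like this (sketches only -- the names are
mine, not actual Dask or CuPy code):

import numpy as np

class EagerDuck:
    """Option 1: a genuine __array__ implementation (Dask-style)."""
    def __init__(self, data):
        self._data = np.asarray(data)

    def __duckarray__(self):
        return self

    def __array__(self, dtype=None):
        # really convert: hand back an actual ndarray
        return self._data if dtype is None else self._data.astype(dtype)

class StrictDuck:
    """Option 2: refuse implicit conversion (CuPy-style)."""
    def __duckarray__(self):
        return self

    def __array__(self, dtype=None):
        raise TypeError("implicit conversion to a NumPy array is not allowed; "
                        "convert explicitly instead")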

-CHB




>
> Thanks for opening the PR, I will comment there as well.
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] New DTypes: Are scalars a central concept in NumPy or not?

2020-03-23 Thread Chris Barker
I've always found the duality of zero-d arrays and scalars confusing, and
I'm sure I'm not alone.

Having both is just plain weird.

But, backward compatibility aside, could we have ONLY Scalars?

When we index into an array, the dimensionality is reduced by one, so
indexing into a 1D array has to get us something: but the zero-d array is a
really weird object -- do we really need it?

There is certainly a need for more numpy-like scalars: more than the
built-in data types, and some handy attributes and methods, like .dtype,
.itemsize, etc. But could we make an enhanced scalar that had everything we
actually need from a zero-d array?

The key point would be mutability -- but do we really need mutable scalars?
I can't think of any time I've needed that, when I couldn't have used a 1-d
array of length 1.

Is there a use case for zero-d arrays that could not be met with an
enhanced scalar?
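
To make the duality concrete (a small illustration, nothing more):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
s = a[0]              # np.float64 -- a scalar: immutable and hashable
z = np.array(1.0)     # a zero-d array: shape (), ndim 0
z[...] = 5.0          # mutable in place -- a scalar can't do this
s.dtype, s.itemsize   # the scalar already carries array-ish attributes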

-CHB







On Mon, Feb 24, 2020 at 12:30 PM Allan Haldane 
wrote:

> I have some thoughts on scalars from playing with ndarray ducktypes
> (__array_function__), eg a MaskedArray ndarray-ducktype, for which I
> wanted an associated "MaskedScalar" type.
>
> In summary, the ways scalars currently work makes ducktyping
> (duck-scalars) difficult:
>
>   * numpy scalar types are not subclassable, so my duck-scalars aren't
> subclasses of numpy scalars and aren't in the type hierarchy
>   * even if scalars were subclassable, I would have to subclass each
> scalar datatype individually to make masked versions
>   * lots of code checks `isinstance(var, np.float64)`, which breaks
> for my duck-scalars
>   * it was difficult to distinguish between a duck-scalar and a duck-0d
> array. The method I used in the end seems hacky.
>
> This has led to some daydreams about how scalars should work, and also
> led me last to read through your NEPs 40/41 with specific focus on what
> you said about scalars, and was about to post there until I saw this
> discussion. I agree with what you said in the NEPs about not making
> scalars be dtype instances.
>
> Here is what ducktypes led me to:
>
> If we are able to do something like define a `np.numpy_scalar` type
> covering all numpy scalars, which has a `.dtype` attribute like you
> describe in the NEPs, then that would seem to solve the ducktype
> problems above. Ducktype implementors would need to make a "duck-scalar"
> type in parallel to their "duck-ndarray" type, but I found that to be
> pretty easy using an abstract class in my MaskedArray ducktype, since
> the MaskedArray and MaskedScalar share a lot of behavior.
>
> A numpy_scalar type would also help solve some object-array problems if
> the object scalars are wrapped in the np_scalar type. A long time ago I
> started to try to fix up various funny/strange behaviors of object
> datatypes, but there are lots of special cases, and the main problem was
> that the returned objects (eg from indexing) were not numpy types and
> did not support numpy attributes or indexing. Wrapping the returned
> object in `np.numpy_scalar` might add an extra slight annoyance to
> people who want to unwrap the object, but I think it would make object
> arrays less buggy and make code using object arrays easier to reason
> about and debug.
>
> Finally, a few random votes/comments based on the other emails on the list:
>
> I think scalars have a place in numpy (rather than just reusing 0d
> arrays), since there is a clear use in having hashable, immutable
> scalars. Structured scalars should probably be immutable.
>
> I agree with your suggestion that scalars should not be indexable. Thus,
> my duck-scalars (and proposed numpy_scalar) would not be indexable.
> However, I think they should encode their datatype though a .dtype
> attribute like ndarrays, rather than by inheritance.
>
> Also, something to think about is that currently numpy scalars satisfy
> the property `isinstance(np.float64(1), float)`, i.e they are within the
> python numerical type hierarchy. 0d arrays do not have this property. My
> proposal above would break this. I'm not sure what to think about
> whether this is a good property to maintain or not.
>
> Cheers,
> Allan
>
>
>
> On 2/21/20 8:37 PM, Sebastian Berg wrote:
> > Hi all,
> >
> > When we create new datatypes, we have the option to make new choices
> > for the new datatypes [0] (not the existing ones).
> >
> > The question is: Should every NumPy datatype have a scalar associated
> > and should operations like indexing return a scalar or a 0-D array?
> >
> > This is in my opinion a complex, almost philosophical, question, and we
> > do not have to settle anything for a long time. But, if we do not
> > decide a direction before we have many new datatypes the decision will
> > make itself...
> > So happy about any ideas, even if its just a gut feeling :).
> >
> > There are various points. I would like to mostly ignore the technical
> > ones, but I am listing them anyway here:
> >
> >   * Scalars are faster (although that c

Re: [Numpy-discussion] Good use of __dunder__ methods in numpy

2020-03-23 Thread Chris Barker
On Thu, Mar 5, 2020 at 2:15 PM Gregory Lee  wrote:

> If i can get a link to a file that shows how dunder methods help with
>> having cool coding APIs that would be great!
>>
>>
> You may want to take a look at PEP 465 as an example, then. If I recall
> correctly, the __matmul__ method described in it was added to the standard
> library largely with NumPy in mind.
> https://www.python.org/dev/peps/pep-0465/
>

and so were "rich comparisons", and in-place operators (at least in part).

numpy is VERY, VERY heavily built on the concept of overloading operators,
i.e. using dunders or magic methods.

I'm going to venture a guess that numpy arrays define custom behavior for
every single standard dunder -- and certainly most of them.
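
A few quick examples of those dunders at work (nothing exotic, just standard
ndarray behavior):

import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.ones((3, 2))

a @ b      # __matmul__ -- the PEP 465 operator
a < 3      # __lt__ -- a "rich comparison" returning a boolean array
a += 1     # __iadd__ -- in-place addition, no copy
a[1, 2]    # __getitem__ with a tuple index
len(a)     # __len__ -- the length of the first axis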

-CHB





> On Thu, Mar 5, 2020 at 10:32 PM Sebastian Berg 
>> wrote:
>>
>>> Hi,
>>>
>>> On Thu, 2020-03-05 at 11:14 +0400, Abdur-Rahmaan Janhangeer wrote:
>>> > Greetings list,
>>> >
>>> > I have a talk about dunder methods in Python
>>> >
>>> > (
>>> >
>>> https://conference.mscc.mu/speaker/67604187-57c3-4be6-987c-ea4bef388ad3
>>> > )
>>> >
>>> > and it would be nice to include Numpy in the mix. Can someone point
>>> > me to one or two use cases / file link where dunder methods help
>>> > numpy?
>>> >
>>>
>>> I am not sure in what sense you are looking for. NumPy has its own set
>>> of dunder methods (some of which should not be used super much
>>> probably), like `__array__`, `__array_interface__`, `__array_ufunc__`,
>>> `__array_function__`, `__array_finalize__`, ...
>>> So we are using `__array_*__` for numpy related dunders.
>>>
>>> Of course we use most Python defined dunders, but I am not sure that
>>> you are looking for that?
>>>
>>> Best,
>>>
>>> Sebastian
>>>
>>>
>>> > Thanks
>>> >
>>> > fun info: i am a tiny numpy contributor with a one line merge.
>>> > ___
>>> > NumPy-Discussion mailing list
>>> > NumPy-Discussion@python.org
>>> > https://mail.python.org/mailman/listinfo/numpy-discussion
>>> ___
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion@python.org
>>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>>
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion

