Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith wrote: > > On Mon, Apr 24, 2017 at 7:23 PM, Robert Kern wrote: > > On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith wrote: > > > >> That said, AFAICT what people actually want in most use cases is support > >> for arrays that can hold variable-len

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Stephan Hoyer
On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith wrote: > But also, is it important whether strings we're loading/saving to an > HDF5 file have the same in-memory representation in numpy as they > would in the file? I *know* [1] no-one is reading HDF5 files using > np.memmap :-). Of course they

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Nathaniel Smith
On Mon, Apr 24, 2017 at 7:23 PM, Robert Kern wrote: > On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith wrote: > >> That said, AFAICT what people actually want in most use cases is support >> for arrays that can hold variable-length strings, and the only place where >> the current approach is *opt

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith wrote: > That said, AFAICT what people actually want in most use cases is support for arrays that can hold variable-length strings, and the only place where the current approach is *optimal* is when we need mmap compatibility with legacy formats th

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Nathaniel Smith
On Apr 21, 2017 2:34 PM, "Stephan Hoyer" wrote: I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data. You may already know this, but
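
The point quoted above, that UTF-8 is also one byte per character for ASCII data, can be checked with a small sketch. The helper name `encoded_sizes` and the sample strings are illustrative assumptions, not taken from the thread; only the standard `str.encode` method is used.

```python
def encoded_sizes(s):
    """Return the encoded length in bytes of s for three encodings."""
    # utf-32-le is used as a stand-in for a fixed 4-bytes-per-char layout
    # (the "-le" variant avoids the 4-byte byte-order mark).
    return {enc: len(s.encode(enc)) for enc in ("latin-1", "utf-8", "utf-32-le")}

# Pure-ASCII text: latin-1 and UTF-8 both use one byte per character.
ascii_sizes = encoded_sizes("flux_density")

# Text with two non-ASCII characters (written as escapes to keep the
# source ASCII-only): UTF-8 needs two bytes for each of them.
accent_sizes = encoded_sizes("\u00c5ngstr\u00f6m")

print(ascii_sizes)
print(accent_sizes)
```

For ASCII-only data the two one-byte encodings are indistinguishable in size; they differ only once non-ASCII characters appear, which is where a fixed-width field has to choose between a byte count and a character count.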

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 5:56 PM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: > > On Mon, Apr 24, 2017 at 7:11 PM, Robert Kern wrote: >> >> On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: >> > >> > On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Aldcroft, Thomas
On Mon, Apr 24, 2017 at 7:11 PM, Robert Kern wrote: > On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas < > aldcr...@head.cfa.harvard.edu> wrote: > > > > On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern > wrote: > >> > >> I am not unfamiliar with this problem. I still work with files that > have fie

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 4:09 PM, Stephan Hoyer wrote: > > On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker wrote: >>> >>> On the other hand, if this is the use-case, perhaps we really want an encoding closer to "Python 2" string, i.e, "unknown", to let this be signaled more explicitly. I would sugg

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Stephan Hoyer
On Mon, Apr 24, 2017 at 4:08 PM, Robert Kern wrote: > Let me make a counter-proposal for your latin-1 dtype (your #2) that might > address your, Thomas's, and Julian's use cases: > > 2) We want a single-byte-per-character, NULL-terminated string dtype that > can be used to represent mostly-ASCII

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: > > On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern wrote: >> >> I am not unfamiliar with this problem. I still work with files that have fields that are supposed to be in EBCDIC but actually contain text in ASC

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Stephan Hoyer
On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker wrote: > On the other hand, if this is the use-case, perhaps we really want an >> encoding closer to "Python 2" string, i.e, "unknown", to let this be >> signaled more explicitly. I would suggest that "text[unknown]" should >> support operations like

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
Chris, you've mashed all of my emails together, some of them are in reply to you, some in reply to others. Unfortunately, this dropped a lot of the context from each of them, and appears to be creating some misunderstandings about what each person is advocating. On Mon, Apr 24, 2017 at 2:00 PM, Ch

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Aldcroft, Thomas
On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern wrote: > I am not unfamiliar with this problem. I still work with files that have > fields that are supposed to be in EBCDIC but actually contain text in > ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit > encodings. In that expe

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Chris Barker
On Mon, Apr 24, 2017 at 11:36 AM, Robert Kern wrote: > > I agree -- it is a VERY common case for scientific data sets. But a > one-byte-per-char encoding would handle it nicely, or UCS-4 if you want > Unicode. The wasted space is not that big a deal with short strings... > > Unless if you have hu

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 11:56 AM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: > > On Mon, Apr 24, 2017 at 2:47 PM, Robert Kern wrote: >> >> On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: >> > >> > On Mon, Apr 24, 2017 at 1:04 PM, Chris Barke

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker wrote: > > On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer wrote: > >>> In this case, we want something compatible with Python's string (i.e. full Unicode supporting) and I think should be as transparent as possible. Python's string has made the decis

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Aldcroft, Thomas
On Mon, Apr 24, 2017 at 2:47 PM, Robert Kern wrote: > On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < > aldcr...@head.cfa.harvard.edu> wrote: > > > > On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker > wrote: > > >> - round-tripping of binary data (at least with Python's > encoding/decoding) --

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: > > On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker wrote: >> - round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 11:21 AM, Chris Barker wrote: > > On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: >>> >>> BTW -- maybe we should keep the pathological use-case in mind: really short strings. I think we are all thinking in terms of longer strings,

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Chris Barker
On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: > BTW -- maybe we should keep the pathological use-case in mind: really >> short strings. I think we are all thinking in terms of longer strings, >> maybe a name field, where you might assign 32 bytes or so

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Chris Barker
On Mon, Apr 24, 2017 at 10:51 AM, Stephan Hoyer wrote: > - round-tripping of binary data (at least with Python's encoding/decoding) >> -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the >> same bytes back. You may get garbage, but you won't get an EncodingError. >> > > For
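
The round-trip property quoted above can be demonstrated with a minimal sketch using only Python's built-in `bytes.decode` and `str.encode`. The variable and helper names are illustrative assumptions; the exception Python actually raises on a failed decode is `UnicodeDecodeError`.

```python
# Every possible byte value, 0x00 through 0xFF.
raw = bytes(range(256))

# latin-1 maps each byte to the code point with the same number, so
# decoding never fails and re-encoding restores the input exactly.
text = raw.decode("latin-1")
roundtrip_ok = text.encode("latin-1") == raw

def decodes_as_utf8(data):
    """Return True if data is valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# The same bytes are not valid UTF-8 (0x80 alone is an invalid start byte).
utf8_ok = decodes_as_utf8(raw)

print(roundtrip_ok, utf8_ok)
```

The decoded text may be meaningless if the bytes were never latin-1 to begin with, but no information is lost, which is the property the message relies on.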

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Aldcroft, Thomas
On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker wrote: > On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer wrote: > > >> In this case, we want something compatible with Python's string (i.e. >>> full Unicode supporting) and I think should be as transparent as possible. >>> Python's string has made th

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Stephan Hoyer
On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker wrote: > latin-1 or latin-9 buys you (over ASCII): > > ... > > - round-tripping of binary data (at least with Python's encoding/decoding) > -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the > same bytes back. You may get garb

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Chris Barker
On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer wrote: > In this case, we want something compatible with Python's string (i.e. full >> Unicode supporting) and I think should be as transparent as possible. >> Python's string has made the decision to present a character oriented API >> to users (de