On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin <oscar.j.benja...@gmail.com> wrote:
> On Fri, Jan 17, 2014 at 02:30:19PM -0800, Chris Barker wrote:
> > Folks,
> >
> > I've been blathering away on the related threads a lot -- sorry if it's
> > too much. It's gotten a bit tangled up, so I thought I'd start a new one
> > to address this one question (i.e. don't bring up genfromtxt here):
> >
> > Would it be a good thing for numpy to have a one-byte-per-character
> > string type?
>
> If you mean a string type that can only hold latin-1 characters then I
> think that this is a step backwards.
>
> If you mean a dtype that holds bytes in a known, specifiable encoding and
> automatically decodes them to unicode strings when you call .item() and
> has a friendly repr() then that may be a good idea.
>
> So for example you could have dtype='S:utf-8' which would store strings
> encoded as utf-8, e.g.:
>
> >>> text = array(['foo', 'bar'], dtype='S:utf-8')
> >>> text
> array(['foo', 'bar'], dtype='|S3:utf-8')
> >>> print(text)
> ['foo', 'bar']
> >>> text[0]
> 'foo'
> >>> text.nbytes
> 6
>
> > We did have that with the 'S' type in py2, but the changes in py3 have
> > made it not quite the right thing. And it appears that enough people use
> > 'S' in py3 to mean 'bytes', so we can't change that now.
>
> It wasn't really the right thing before either. That's why Python 3 has
> changed all of this.
>
> > The only difference may be that 'S' currently auto-translates to a bytes
> > object, resulting in things like:
> >
> >     np.array(['some text',], dtype='S')[0] == 'some text'
> >
> > yielding False on py3. And you can't do all the usual text stuff with
> > the resulting bytes object, either. (And it probably used the default
> > encoding to generate the bytes, so it will barf on some inputs, though
> > that may be unavoidable.) So you need to decode the bytes that are given
> > back, and now that I think about it, I have no idea what encoding you'd
> > need to use in the general case.
>
> You should let the user specify the encoding or otherwise require them to
> use the 'U' dtype.
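For concreteness, here is what that failure and the decode-at-the-boundary
workaround look like on a current numpy under Python 3 (a minimal sketch
using only the existing np.char.decode API):

>>> import numpy as np
>>> a = np.array(['some text'], dtype='S')
>>> a[0]                     # comes back as bytes, not str
b'some text'
>>> a[0] == 'some text'      # bytes never compare equal to str on py3
False
>>> np.char.decode(a, 'ascii')[0] == 'some text'   # decode at the boundary
True
>>> a.nbytes, np.array(['some text']).nbytes       # 'S' vs 'U' memory footprint
(9, 36)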
> > So the correct solution is (particularly on py3) to use the 'U'
> > (unicode) dtype for text in numpy arrays.
>
> Absolutely. Embrace the Python 3 text model. Once you understand the how,
> what and why of it you'll see that it really is a good thing!
>
> > However, the 'U' dtype is 4 bytes per character, and that may be "too
> > big" for some use-cases. And there is a lot of text in scientific data
> > sets that is pure ascii, or at least some 1-byte-per-character encoding.
> >
> > So, in the spirit of having multiple numeric types that use different
> > amounts of memory, and can hold different ranges of values, a
> > one-byte-per-character dtype would be nice:
> >
> > (Note, this opens the door for a 2-byte-per-character (UCS-2) dtype too.
> > I personally don't think that's worth it, but maybe that's because I'm
> > an English speaker...)
>
> You could just use a 2-byte encoding with the S dtype, e.g.
> dtype='S:utf-16-le'.
>
> > It could use the 's' (lower-case s) type identifier.
> >
> > For passing to/from python built-in objects, it would:
> >
> > * Allow either Python bytes objects or Python unicode objects as input:
> >   a) bytes objects would be passed through as-is
> >   b) unicode objects would be encoded as latin-1
> >
> > [note: I'm not entirely sure that bytes objects should be allowed, but
> > it would provide a nice efficiency in a fairly common case]
>
> I think it would be a bad idea to accept bytes here. There are good
> reasons that Python 3 creates a barrier between the two worlds of text and
> bytes. Allowing implicit mixing of bytes and text is a recipe for
> mojibake. The TypeErrors in Python 3 are used to guard against conceptual
> errors that lead to data corruption. Attempting to undermine that barrier
> in numpy would be a backward step.
>
> I apologise if this is misplaced, but there seems to be an attitude that
> scientific programming isn't really affected by the issues that have led
> to the Python 3 text model. I think that's ridiculous; data corruption is
> a problem in scientific programming just as it is anywhere else.
>
> > * It would create python unicode text objects, decoded as latin-1.
>
> Don't try to bless a particular encoding, and stop trying to pretend that
> it's possible to write a sensible system where end users don't need to
> worry about and specify the encoding of their data.
>
> > Could we have a way to specify another encoding? I'm not sure how that
> > would fit into the dtype system.
>
> If the encoding cannot be specified then the whole idea is misguided.
>
> > I've explained the latin-1 thing on other threads, but the short version
> > is:
> >
> > - It will work perfectly for ascii text
> > - It will work perfectly for latin-1 text (natch)
> > - It will never give you a UnicodeEncodeError regardless of what
> >   arbitrary bytes you pass in
> > - It will preserve those arbitrary bytes through an encoding/decoding
> >   operation
>
> So what happens if I do:
>
> >>> with open('myutf-8-file.txt', 'rb') as fin:
> ...     text = numpy.fromfile(fin, dtype='s')
> >>> text[0]  # Decodes as latin-1, leading to mojibake.
>
> I would propose that it's better to be able to do:
>
> >>> with open('myutf-8-file.txt', 'rb') as fin:
> ...     text = numpy.fromfile(fin, dtype='s:utf-8')
>
> There's really no way to get around the fact that users need to specify
> the encoding of their text files.
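Both halves of that are easy to check with plain Python: latin-1 really does
round-trip arbitrary bytes, as Chris says, but decoding utf-8 data as
latin-1 silently produces mojibake, as Oscar says (a quick sketch using only
the built-in codecs):

>>> data = bytes(range(256))
>>> data.decode('latin-1').encode('latin-1') == data   # lossless for any bytes
True
>>> 'Ω'.encode('utf-8').decode('latin-1')              # utf-8 read as latin-1
'Î©'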
> > (It still wouldn't allow you to store arbitrary unicode -- but that's
> > the limitation of one byte per character...)
>
> You could if you use 'utf-8'. It would be one byte per character for text
> that only contains ascii characters, yet it would still support every
> character that the unicode consortium can dream up.
>
> The only possible advantage here is as a memory optimisation (potentially
> with a speed impact too, although it could equally be a speed regression).
> Otherwise it just adds needless complexity to numpy, and to the code that
> uses the new dtype, as well as limiting its ability to handle unicode.
>
> How significant are the performance issues? Does anyone really use numpy
> for this kind of text handling? If you really are operating on gigantic
> text arrays of ascii characters then is it so bad to just use the bytes
> dtype and handle decoding/encoding at the boundaries? If you're not
> operating on gigantic text arrays is there really a noticeable problem
> just using the 'U' dtype?

I use numpy for giga-row arrays of short text strings, so the memory and
performance issues are real. As discussed in the previous parent thread,
using the bytes dtype is a real problem, because users of a text array want
to do things like filtering (`match_rows = text_array == 'match'`),
printing, or other manipulations in a natural way, without continually using
bytestring literals or .decode('ascii') everywhere. I tried converting a few
packages while leaving the arrays as bytestrings and it just ended up as a
very big mess.
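Concretely, the kind of friction I mean (a minimal sketch on a current numpy
under Python 3; the array contents are made up):

>>> import numpy as np
>>> text_array = np.array(['match', 'other'], dtype='S')
>>> text_array == 'match'        # str vs bytes: not an elementwise match
False
>>> text_array == b'match'       # works, but needs bytestring literals everywhere
array([ True, False], dtype=bool)
>>> np.char.decode(text_array, 'ascii') == 'match'   # or decode, at the cost of a copy
array([ True, False], dtype=bool)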
From my perspective the goal here is to provide a pragmatic way to allow
numpy-based applications and end users to use Python 3. Something like this
proposal seems to be the right direction: maybe not pure and perfect, but a
sensible step to get us there given the reality of scientific computing.

- Tom