On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
> On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin
> <oscar.j.benja...@gmail.com> wrote:
> > How significant are the performance issues? Does anyone really use
> > numpy for this kind of text handling? If you really are operating on
> > gigantic text arrays of ascii characters then is it so bad to just
> > use the bytes dtype and handle decoding/encoding at the boundaries?
> > If you're not operating on gigantic text arrays is there really a
> > noticeable problem just using the 'U' dtype?
>
> I use numpy for giga-row arrays of short text strings, so memory and
> performance issues are real.
>
> As discussed in the previous parent thread, using the bytes dtype is
> really a problem because users of a text array want to do things like
> filtering (`match_rows = text_array == 'match'`), printing, or other
> manipulations in a natural way without having to continually use
> bytestring literals or `.decode('ascii')` everywhere. I tried
> converting a few packages while leaving the arrays as bytestrings and
> it just ended up as a very big mess.
>
> From my perspective the goal here is to provide a pragmatic way to
> allow numpy-based applications and end users to use python 3.
> Something like this proposal seems to be the right direction, maybe
> not pure and perfect but a sensible step to get us there given the
> reality of scientific computing.
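For concreteness, the friction Tom describes looks roughly like this (a
sketch, assuming a fixed-width bytes array under Python 3; the exact
behaviour of the str comparison depends on the numpy version):

    import numpy as np

    # A fixed-width bytes array, as you get from the 'S' dtype:
    names = np.array([b'alpha', b'match', b'gamma'], dtype='S5')

    # Comparing against a str does not do what a text user expects on
    # Python 3; depending on the numpy version it gives plain False
    # (possibly with a FutureWarning) rather than an elementwise mask:
    names == 'match'

    # You have to remember the bytestring literal instead:
    names == b'match'        # array([False,  True, False])

    # ...or pay for a decoded copy of the whole array first:
    np.char.decode(names, 'ascii') == 'match'
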
I don't really see how writing b'match' instead of 'match' is that big
a deal. And why do you need to write .decode('ascii') everywhere? If
you really do just want to work with bytes in your own known encoding
then why not just read and write in binary mode and handle the
decoding/encoding at the boundaries? (There's a sketch of what I mean
in the postscript below.)

I apologise if I'm wrong, but I suspect that much of the difficulty in
getting the bytes/unicode separation right is down to the fact that a
lot of the code you're using (or attempting to support) hasn't yet been
ported to a clean text model. When I started using Python 3 it took me
quite a few failed attempts before I understood how the text model is
supposed to be used. The problem was that I had been conflating text
and bytes in many places, and that's hard to disentangle. Having fixed
most of those problems I now understand why it is such an improvement.

In any case I don't see anything wrong with a more efficient dtype for
representing text, provided the user can specify the encoding. The
problem is that numpy arrays expose their underlying memory buffer.
Allowing them to interact directly with text strings on one side and
binary files on the other breaches Python 3's very good text model
unless the user can specify the encoding to be used. Or, if there is to
be a single blessed encoding, at least make it unicode-capable utf-8
instead of legacy ascii/latin-1.

Oscar
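P.S. Here is a quick sketch of what I mean by handling the encoding at
the boundaries. The file name and the fixed record width are made up
for the example; the point is that the bytes/text conversion happens
exactly once on the way in and once on the way out:

    import numpy as np

    # Read the raw fixed-width records in binary mode
    # ('names.dat' is a hypothetical file of 8-byte records):
    with open('names.dat', 'rb') as f:
        raw = np.fromfile(f, dtype='S8')

    # Decode once at the boundary; everything after this is real text:
    names = np.char.decode(raw, 'utf-8')

    # Natural text-mode operations, no bytestring literals needed:
    mask = names == 'match'

    # Encode once on the way back out, again in binary mode:
    with open('matched.dat', 'wb') as f:
        np.char.encode(names[mask], 'utf-8').tofile(f)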