On Fri, Jul 18, 2014 at 3:30 PM, Andrew Collette <andrew.colle...@gmail.com> wrote:
> Hi Chris,
>
> > Actually, I agree about the truncation issue, but it's a question of where
> > to put it -- I'm suggesting that I don't want it at the python<->numpy
> > interface.
>
> Yes, that's a good point. Of course, by using Latin-1 rather than
> UTF-8 we can't support all Unicode code points (hence the "?"
> replacement possible on read from HDF5).
>
> > do vlen strings support full unicode? -- then, yes, that's good.
>
> Yes, they do. It's somewhat unfortunate to immediately cast to vlen
> though, since people usually have fixed-width datasets to start with
> for efficiency reasons...
>
> > what about reading from fixed-width UTF-8 to 'U' -- that seems like the
> > natural way to go for unicode. Though a bit hard to know how long U needs to
> > be -- but <= the length of the utf-8 array (in characters).
>
> Space concerns ("U" has a 4x space penalty for ASCII-ish data). Plus,
> for similar reasons to this discussion, creating "U" datasets is
> unsupported at the moment.
>
> > note that I'm also proposing a "bytes" dtype, which might make sense for
> > grabbing utf-8 data from HDF-5. Then either h5py or the user could decode to
> > a unicode type.
>
> Sounds quite like the existing 'S' type.
>
> >> In any case, I can say that the lack of a text 'S' type in NumPy has
> >> been a significant pain point for h5py users on Python 3 over the
> >> years.
> >
> > isn't the current 'S' a pretty good map to hdf ascii?
>
> Yes; in fact, right now all fixed-width strings in h5py (ASCII and
> UTF-8) are read/written as 'S'. The problem is that on Py3, 'S' is
> treated as bytes, not text, so you can't freely mix it with str.
>
> I am about to leave for the weekend... thanks for a great discussion!
> To conclude, it strikes me that in choosing an encoding we get to pick
> at most two of the following:
>
> 1. Support for all Unicode characters
> 2. Fixed number of characters
> 3. Fixed number of storage bytes
>
> At this point, I would vote for UTF-8 in a fixed width buffer (1/3),
> but I imagine as this progresses towards a NEP others will weigh in.

At some point I'm pretty sure we will want to support utf-8, as it looks
well on its way to becoming a universal standard.

Chuck
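
For anyone following along, here is a rough sketch of the "pick at most two" trade-off from the quoted list, using only the Python 3 standard library. It is an illustration, not what h5py or NumPy actually does: UTF-32 stands in for a 4-bytes-per-character layout like the 'U' type discussed above, and the helper names (encoded_size, fit_utf8) and sample strings are made up for this example.

```python
# Illustration only: the helper names and sample strings are hypothetical.

def encoded_size(text, encoding):
    """Byte length of text in the given encoding, or None if it cannot be encoded."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None


def fit_utf8(text, nbytes):
    """Pack text into exactly nbytes of UTF-8, truncating and NUL-padding as needed."""
    raw = text.encode("utf-8")[:nbytes]
    # Slicing bytes can cut a multi-byte character in half; decoding with
    # errors="ignore" drops the dangling partial sequence.
    whole = raw.decode("utf-8", errors="ignore").encode("utf-8")
    return whole.ljust(nbytes, b"\x00")


samples = ["numpy", "caf\u00e9", "\u03b1\u03b2\u03b3"]

# Characters vs. bytes per encoding:
#   latin-1   : one byte per character, but not all of Unicode (2/3)
#   utf-32-le : four bytes per character, all of Unicode (1/2 plus fixed bytes
#               per character, at a 4x cost for ASCII-ish data)
#   utf-8     : all of Unicode, but a variable number of bytes per character
table = {
    text: {enc: encoded_size(text, enc) for enc in ("latin-1", "utf-8", "utf-32-le")}
    for text in samples
}
for text, sizes in table.items():
    print(repr(text), "chars:", len(text), sizes)

# Latin-1 with replacement turns unsupported characters into "?".
lossy = "\u03b1\u03b2\u03b3".encode("latin-1", errors="replace")

# Fixed-width UTF-8 (1/3): the byte count is fixed, the character count is not.
packed = fit_utf8("caf\u00e9", 4)  # the 2-byte e-acute no longer fits
```

With a 4-byte buffer, "café" loses its last character because the accented letter needs two bytes in UTF-8, which is the truncation issue raised earlier in the thread. Whether to truncate silently, raise, or do something else is a design choice this sketch does not try to settle.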