On Fri, Jul 18, 2014 at 12:52 PM, Andrew Collette <andrew.colle...@gmail.com> wrote:

> > What it would do is push the problem from the HDF5<->numpy interface to
> > the python<->numpy interface.
> >
> > I'm not sure that's a good trade off.
>
> Maybe I'm being too paranoid about the truncation issue.

Actually, I agree about the truncation issue, but it's a question of where to put it -- I'm suggesting that I don't want it at the python<->numpy interface.

> Here's a strawman for how a Latin-1 "a" type might be handled in h5py:
>
> 1. Creation from existing "a" data: Use vlen strings. Doesn't
> preserve the dtype, but maybe that's not so important.

do vlen strings support full unicode? -- then, yes, that's good.

> 2. Writing from "a" data to fixed-width ASCII: Copy, and replace
> bytes>127 with "?" (or don't)

I'd vote for don't, unless HDF starts enforcing pure ascii. But if it does, then yes, replacement makes more sense than exceptions.

> 3. Writing from "a" data to fixed-width UTF-8: Transcode and truncate
> (being careful not to end in the middle of a multibyte character)

yup -- buyer beware.

> 4. Reading from fixed-width ASCII to "a": Straight copy, no inspection

yup.

> 5. Reading from fixed-width UTF-8 to "a": Copy, and replace
> non-Latin-1 chars with "?"

sure

what about reading from fixed-width UTF-8 to 'U' -- that seems like the natural way to go for unicode. Though it's a bit hard to know how long U needs to be -- but <= the length of the utf-8 array (in characters).

> (The above example uses replacement rather than raising an exception,
> because an exception in the HDF5 conversion callback will leave the
> write/read half-completed).

and really -- what would you do with an exception on read? give up and throw the file away?

note that I'm also proposing a "bytes" dtype, which might make sense for grabbing utf-8 data from HDF-5. Then either h5py or the user could decode to a unicode type.

> In any case, I can say that the lack of a text 'S' type in NumPy has
> been a significant pain point for h5py users on Python 3 over the
> years.

isn't the current 'S' a pretty good map to hdf ascii?

> Whatever specific encoding ends up being used, such a type can
> only improve the situation, and I'm firmly in favor of it.

agreed.

-Chris

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
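
The replacement and truncation rules in items 2, 3 and 5 of the strawman can be sketched in plain Python. This is only an illustration under stated assumptions: the function names below are hypothetical and are not part of h5py or NumPy, and a real implementation would presumably live in the HDF5 conversion callback rather than in Python. The sketch uses only the standard `str.encode` / `bytes.decode` error handlers.

```python
def latin1_to_fixed_ascii(data: bytes) -> bytes:
    """Item 2 (hypothetical helper): copy, replacing bytes > 127 with b'?'."""
    return bytes(b if b < 128 else ord("?") for b in data)


def latin1_to_fixed_utf8(data: bytes, width: int) -> bytes:
    """Item 3 (hypothetical helper): transcode Latin-1 to UTF-8, then truncate
    to at most `width` bytes without ending inside a multibyte character."""
    encoded = data.decode("latin-1").encode("utf-8")
    if len(encoded) <= width:
        return encoded
    cut = width
    # encoded[cut] is the first byte that would be dropped. If it is a UTF-8
    # continuation byte (0b10xxxxxx), the cut falls mid-character: back up.
    while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
        cut -= 1
    return encoded[:cut]


def fixed_utf8_to_latin1(data: bytes) -> bytes:
    """Item 5 (hypothetical helper): copy, replacing non-Latin-1 chars with b'?'."""
    return data.decode("utf-8").encode("latin-1", errors="replace")


cafe = "caf\u00e9".encode("latin-1")           # 4 bytes in Latin-1
print(latin1_to_fixed_ascii(cafe))             # high byte replaced by '?'
print(latin1_to_fixed_utf8(cafe, 4))           # 5 UTF-8 bytes do not fit in 4
print(fixed_utf8_to_latin1("a\u20acb".encode("utf-8")))  # euro sign is not Latin-1
```

Note that the UTF-8 truncation drops the whole trailing character rather than leaving a dangling lead byte, which is the "buyer beware" case: the stored string is valid UTF-8 but shorter than the input. The same byte-versus-character relationship bounds the 'U' case: a UTF-8 field of n bytes decodes to at most n characters.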