On Fri, Jul 18, 2014 at 10:29 AM, Andrew Collette <andrew.colle...@gmail.com> wrote:

> The root of the issue is that HDF5 provides a limited set of
> fixed-storage-width string types, and a fixed-storage-width NumPy type
> of the same size using Latin-1 can't map to any of them without losing
> data. For example, if "a10" is a hypothetical 10-byte-wide NumPy
> dtype using Latin-1, reading/writing to an "a10" HDF5 dataset backed
> with 10-byte UTF-8 storage would risk truncation, even if the
> advertised widths are the same.

I do get this, yes.

> There is unfortunately nothing we can do in the h5py code base to
> paper over this... it's a limitation of the format.

yup. Similar limitations in numpy.

> > This is where I wonder about HDF's "ascii" type -- is it really ascii?
> > Or is it that old standby
> > one-byte-per-character-and-if-it's-ascii-we-all-know-what-it-means-but-if-it's-not-we'll-still-pass-it-around
> > type? i.e. the old char* ?
> >
> > In which case, you can just push a latin-1 type into and out of your HDF
> > ascii arrays and everything will work just fine. Unless someone stores
> > something other than latin-1 or ascii in it -- but even then, the bytes
> > would still be preserved.
>
> The encoding is explicitly ASCII (H5T_ASCII, in HDF5 lingo).
> Anecdotally, I've heard people store other encodings in it, but (1)
> I'm not eager to make things worse by mis-labelling data, and (2) the
> HDF Group has made indications that they may start checking the
> encoding at conversion time. (1) is particularly important, as a
> major focus of h5py is compatibility with the rest of the HDF5
> ecosystem.

If it were me, I'd encourage the HDF group to NOT enforce ascii. Just like
with the numpy 'S' type, I'm guessing there is a fair bit of code in the
wild that [ab]uses the ascii type by throwing other bytes in there.

In fact, this is one reason that utf-8 is so popular -- you can still use
all that code that simply takes a char* and passes it around (or maybe
compares it), without making any assumptions about what it means.
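
To make the two points above concrete, here is a minimal Python sketch. It is not from the thread and does not touch h5py or HDF5 at all; it only uses the built-in str/bytes codecs, and the helper names fits_in_utf8 and roundtrip_latin1 are made up for illustration. It shows (a) a 10-character Latin-1 string that no longer fits in 10 bytes once encoded as UTF-8, and (b) that decoding arbitrary bytes as Latin-1 and encoding them back preserves every byte.

```python
# Illustrative sketch only: plain Python, no h5py/HDF5 involved.
# fits_in_utf8 and roundtrip_latin1 are hypothetical helper names.


def fits_in_utf8(text, width):
    """Return True if `text`, encoded as UTF-8, fits in `width` bytes."""
    return len(text.encode("utf-8")) <= width


def roundtrip_latin1(raw):
    """Decode arbitrary bytes as Latin-1, then encode them back."""
    return raw.decode("latin-1").encode("latin-1")


# Ten characters, all representable in Latin-1 (two of them are e-acute).
text = "caf\u00e9 ol\u00e9!!"
latin1_bytes = text.encode("latin-1")  # one byte per character
utf8_bytes = text.encode("utf-8")      # e-acute takes two bytes each

print(len(text), len(latin1_bytes), len(utf8_bytes))  # prints: 10 10 12

# The same "width 10" string overflows a 10-byte UTF-8 field.
print(fits_in_utf8(text, 10))  # prints: False

# Every possible byte value survives a Latin-1 decode/encode round trip,
# so bytes are preserved even when they are not really Latin-1 text.
all_bytes = bytes(range(256))
print(roundtrip_latin1(all_bytes) == all_bytes)  # prints: True
```

The round trip works because Latin-1 maps each of the 256 byte values to exactly one code point, which is why a one-byte-per-character type can carry unknown bytes around unchanged even if their meaning is lost.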

> that from this particular HDF5 perspective, they provide maximum
> compatibility and minimize the chances of accidental data loss.

What it would do is push the problem from the HDF5<->numpy interface to the
python<->numpy interface. I'm not sure that's a good trade off.

-Chris

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion