Hi Chris,

> Again, they shouldn't do that, they should be pushing a 10-character string
> into something -- and utf-8 is going to (possibly) truncate that. That's an
> HDF/utf-8 limitation that people are going to have to deal with. I think
> you're suggesting that numpy follow the HDF model, so that the numpy-HDF
> transition can be clean and easy. However, I think that utf-8 is an
> inappropriate model for numpy, and that the mess of bytes to utf-8 is
> pyHDF's problem, not numpy's.
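
Just to make the truncation concrete: a 10-character string that fits in
10 bytes of Latin-1 can need more than 10 bytes as UTF-8. Here is a rough
sketch in plain Python (no NumPy or h5py involved; the helper name
fits_in_utf8_field is made up for illustration, and the byte-slicing only
approximates what a fixed-width store would do):

```python
# Hypothetical helper for illustration; not part of NumPy or h5py.
def fits_in_utf8_field(text, width):
    """Return True if `text` fits in `width` bytes once encoded as UTF-8."""
    return len(text.encode("utf-8")) <= width


# "naïve café": 10 characters, all representable in Latin-1.
s = "na\u00efve caf\u00e9"

latin1_bytes = s.encode("latin-1")  # one byte per character
utf8_bytes = s.encode("utf-8")      # the two accented characters take two bytes each

# Approximate a 10-byte fixed-width UTF-8 field by cutting the encoded bytes.
# errors="ignore" drops any partial multi-byte sequence left at the cut.
truncated = utf8_bytes[:10].decode("utf-8", errors="ignore")

print(len(s), len(latin1_bytes), len(utf8_bytes))
print(repr(truncated))
print(fits_in_utf8_field(s, 10))
```

The same 10 characters occupy 10 bytes in Latin-1 but 12 in UTF-8, so the
trailing character does not survive the cut to a 10-byte field.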

The root of the issue is that HDF5 provides a limited set of
fixed-storage-width string types, and a fixed-storage-width NumPy type of
the same size using Latin-1 can't map to any of them without losing data.
For example, if "a10" is a hypothetical 10-byte-wide NumPy dtype using
Latin-1, reading/writing to an "a10" HDF5 dataset backed with 10-byte UTF-8
storage would risk truncation, even if the advertised widths are the same:
any Latin-1 character above 0x7F takes two bytes in UTF-8. There is
unfortunately nothing we can do in the h5py code base to paper over this;
it's a limitation of the format.

> This is where I wonder about HDF's "ascii" type -- is it really ascii? Or is
> it that old standby
> one-byte-per-character-and-if-it's-ascii-we-all-know-what-it-means-but-if-it's-not-we'll-still-pass-it-around
> type? i.e. the old char* ?
>
> In which case, you can just push a latin-1 type into and out of your HDF
> ascii arrays and everything will work just fine. Unless someone stores
> something other than latin-1 or ascii in it -- but even then, the bytes
> would still be preserved.

The encoding is explicitly ASCII (H5T_CSET_ASCII, in HDF5 lingo).
Anecdotally, I've heard that people store other encodings in it, but (1)
I'm not eager to make things worse by mis-labelling data, and (2) the HDF
Group has indicated that they may start checking the encoding at conversion
time. (1) is particularly important, as a major focus of h5py is
compatibility with the rest of the HDF5 ecosystem.

Again, I certainly wouldn't argue that these considerations by themselves
are enough of a reason for NumPy to use ASCII or UTF-8. Just that, from
this particular HDF5 perspective, they provide maximum compatibility and
minimize the chances of accidental data loss.

Andrew
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion