Hi Chris,

> Again, they shouldn't do that, they should be pushing a 10-character string
> into something -- and utf-8 is going to (possibly) truncate that. That's
> an HDF/utf-8 limitation that people are going to have to deal with. I think
> you're suggesting that numpy follow the HDF model, so that the numpy-HDF
> transition can be clean and easy. However, I think that utf-8 is an
> inappropriate model for numpy, and that the mess of bytes to utf-8 is
> pyHDF's problem, not numpy's.

The root of the issue is that HDF5 provides a limited set of
fixed-storage-width string types, and a fixed-storage-width NumPy type
of the same size using Latin-1 can't map to any of them without
risking data loss.  For example, if "a10" is a hypothetical
10-byte-wide NumPy dtype using Latin-1, reading/writing to an HDF5
dataset backed with 10-byte UTF-8 storage would risk truncation, even
though the advertised widths are the same: every non-ASCII Latin-1
character takes two bytes in UTF-8.
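
To make the width mismatch concrete, here is a minimal sketch in plain
Python.  It does not use h5py or NumPy; the helper fit_utf8, the WIDTH
constant and the sample string are assumptions chosen only for
illustration, and the truncation it performs is a stand-in for what a
fixed 10-byte UTF-8 field would force on the data.

```python
WIDTH = 10  # hypothetical fixed storage width, in bytes


def fit_utf8(text, width):
    """Encode text as UTF-8 and cut it to at most `width` bytes."""
    raw = text.encode("utf-8")[:width]
    # The cut may land inside a multi-byte character; drop any partial tail.
    return raw.decode("utf-8", errors="ignore")


# "naive cafe" with two accented characters: 10 characters in total.
value = "na\u00efve caf\u00e9"

latin1_bytes = value.encode("latin-1")  # one byte per character
utf8_bytes = value.encode("utf-8")      # accented characters take two bytes

# What survives a round trip through a 10-byte UTF-8 field.
stored = fit_utf8(value, WIDTH)

print(len(value), len(latin1_bytes), len(utf8_bytes))
print(stored == value)
```

The string fills a 10-byte Latin-1 field exactly, but its UTF-8 form
is longer than 10 bytes, so the round trip loses the final character.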

There is unfortunately nothing we can do in the h5py code base to
paper over this... it's a limitation of the format.

> This is where I wonder about HDF's "ascii" type -- is it really ascii? Or is
> it that old standby
> one-byte-per-character-and-if-it's-ascii-we-all-know-what-it-means-but-if-it's-not-we'll-still-pass-it-around
> type? i.e the old char* ?
>
> In which case, you can just push a latin-1 type into and out of your HDF
> ascii arrays and everything will work just fine. Unless someone stores
> something other than latin-1 or ascii in it -- but even then, the bytes
> would still be preserved.

The encoding is explicitly ASCII (H5T_ASCII, in HDF5 lingo).
Anecdotally, I've heard people store other encodings in it, but (1)
I'm not eager to make things worse by mis-labelling data, and (2) the
HDF Group has indicated that it may start checking the encoding at
conversion time.  (1) is particularly important, as a
major focus of h5py is compatibility with the rest of the HDF5
ecosystem.
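
As a rough illustration of why Latin-1 data would be mis-labelled in a
field declared as ASCII, here is a small Python sketch.  The function
is_strict_ascii is hypothetical; it only shows the kind of check a
strict reader could apply, and is not a description of what HDF5 or
h5py actually does.

```python
def is_strict_ascii(raw: bytes) -> bool:
    """Return True if every byte is in the 7-bit ASCII range."""
    return all(byte < 0x80 for byte in raw)


# Both values fit a one-byte-per-character field when encoded as Latin-1.
plain = "temperature".encode("latin-1")
accented = "temp\u00e9rature".encode("latin-1")  # contains byte 0xE9

print(is_strict_ascii(plain))
print(is_strict_ascii(accented))
```

The bytes of the accented value are preserved if nobody checks, but
any reader that takes the ASCII label literally would be entitled to
reject them.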

Again, I certainly wouldn't argue that these considerations by
themselves are enough of a reason for NumPy to use ASCII or UTF-8.
My point is just that, from this particular HDF5 perspective, they
provide maximum compatibility and minimize the chances of accidental
data loss.

Andrew
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion