Hi Chris,

> What it would do is push the problem from the HDF5<->numpy interface to the
> python<->numpy interface.
>
> I'm not sure that's a good trade off.

Maybe I'm being too paranoid about the truncation issue.  We already
perform truncation when going from e.g. vlen to fixed-width strings in
h5py... it's just the truncation behavior for same-width data that
throws me.

Here's a strawman for how a Latin-1 "a" type might be handled in h5py:

1. Creation from existing "a" data: Use vlen strings.  Doesn't
preserve the dtype, but maybe that's not so important.
2. Writing from "a" data to fixed-width ASCII: Copy, and replace
bytes>127 with "?" (or don't)
3. Writing from "a" data to fixed-width UTF-8: Transcode and truncate
(being careful not to end in the middle of a multibyte character)
4. Reading from fixed-width ASCII to "a": Straight copy, no inspection
5. Reading from fixed-width UTF-8 to "a": Copy, and replace
non-Latin-1 chars with "?"

(The above example uses replacement rather than raising an exception,
because an exception in the HDF5 conversion callback will leave the
write/read half-completed).
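
To make the strawman a bit more concrete, here is a rough pure-Python
sketch of cases 2, 3 and 5.  The function names and the NUL padding of
short values are assumptions made for illustration only; this is not
h5py's actual conversion code, which would live in an HDF5 conversion
callback and operate on raw buffers.

```python
def latin1_to_fixed_ascii(data: bytes, width: int) -> bytes:
    """Case 2 (hypothetical helper): copy, replacing bytes > 127 with '?'."""
    out = bytes(b if b < 128 else ord("?") for b in data[:width])
    # Assumption: short values are NUL-padded to the fixed width.
    return out.ljust(width, b"\x00")


def latin1_to_fixed_utf8(data: bytes, width: int) -> bytes:
    """Case 3 (hypothetical helper): transcode, then truncate on a
    character boundary."""
    encoded = data.decode("latin-1").encode("utf-8")
    if len(encoded) > width:
        # Cutting at `width` may split a 2-byte sequence; decoding with
        # errors="ignore" drops the dangling lead byte.
        encoded = encoded[:width].decode("utf-8", errors="ignore").encode("utf-8")
    return encoded.ljust(width, b"\x00")


def fixed_utf8_to_latin1(data: bytes, width: int) -> bytes:
    """Case 5 (hypothetical helper): copy, replacing non-Latin-1 chars
    with '?'."""
    text = data.rstrip(b"\x00").decode("utf-8", errors="replace")
    # errors="replace" on encode substitutes '?' for unencodable characters.
    out = text.encode("latin-1", errors="replace")
    return out[:width].ljust(width, b"\x00")
```

None of these raise on bad input, which is the point of using
replacement: a conversion routine written this way never aborts
partway through a read or write.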

In any case, I can say that the lack of a text 'S' type in NumPy has
been a significant pain point for h5py users on Python 3 over the
years.  Whatever specific encoding ends up being used, such a type can
only improve the situation, and I'm firmly in favor of it.

Andrew
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion