On Fri, Jan 17, 2014 at 5:18 AM, Freddie Witherden <fred...@witherden.org> wrote:

> In terms of HDF5 it is interesting to look at how h5py -- which has to
> go between NumPy types and HDF5 conventions -- handles the problem as
> described here:
>
> http://www.h5py.org/docs/topics/strings.html

from that:

"""All strings in HDF5 hold encoded text.

You can’t store arbitrary binary data in HDF5 strings.
"""

This is actually the same as a py3 string (though the mechanism may be
completely different), and it is the problem with numpy's 'S': is it text
or bytes? Given the name and history, it should be text, but apparently
people have been using it for bytes, so we have to keep that meaning/use
case. But I suggest that, like Python 3, we officially declare that you
should not consider it text, and not do any implicit conversions.

Which means we could use a one-byte-per-character text dtype.

"""At the high-level interface, h5py exposes three kinds of strings. Each
maps to a specific type within Python (but see str_py3 below):

    Fixed-length ASCII (NumPy S type)
    ....
"""

This is wrong, or misguided, or maybe only a little confusing -- 'S' is
not an ASCII string (even though I wish it were...). But clearly the HDF
folks think we need one!

"""
Fixed-length ASCII

These are created when you use numpy.string_:

    >>> dset.attrs["name"] = numpy.string_("Hello")

or the S dtype:

    >>> dset = f.create_dataset("string_ds", (100,), dtype="S10")
"""

Pardon my py3 ignorance -- is numpy.string_ the same as 'S' in py3? From
another post, I thought you'd need to use numpy.bytes_ (which is the same
on py2).

"""Variable-length ASCII

These are created when you assign a byte string to an attribute:

    >>> dset.attrs["attr"] = b"Hello"

or when you create a dataset with an explicit “bytes” vlen type:

    >>> dt = h5py.special_dtype(vlen=bytes)
    >>> dset = f.create_dataset("name", (100,), dtype=dt)

Note that they’re not fully identical to Python byte strings.
"""

This implies that HDF would be well served by an ASCII text type.

"""
What about NumPy’s U type?

NumPy also has a Unicode type, a UTF-32 fixed-width format (4-byte
characters). HDF5 has no support for wide characters. Rather than trying
to hack around this and “pretend” to support it, h5py will raise an error
when attempting to create datasets or attributes of this type.
"""

Interesting, though I think irrelevant to this conversation, but it would
be nice if h5py would encode/decode to/from UTF-8 for these.

-Chris

> which IMHO got it about right.
>
> Regards, Freddie.
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
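
A minimal sketch of the text-versus-bytes points discussed above, in plain Python 3 rather than NumPy or h5py. The helpers encode_fixed and decode_fixed are hypothetical names used only for illustration, and latin-1 is just an assumed choice of one-byte-per-character encoding; nothing here is how either library actually works internally.

```python
# Sketch only: plain Python 3 standing in for a fixed-width text field.
# encode_fixed / decode_fixed are hypothetical helpers, not NumPy or h5py API.

def encode_fixed(text, width, encoding="latin-1"):
    """Explicitly encode text into exactly `width` bytes, NUL-padded."""
    raw = text.encode(encoding)  # raises UnicodeEncodeError if unencodable
    if len(raw) > width:
        raise ValueError("text does not fit in %d bytes" % width)
    return raw.ljust(width, b"\x00")


def decode_fixed(raw, encoding="latin-1"):
    """Strip the NUL padding and explicitly decode back to text."""
    return raw.rstrip(b"\x00").decode(encoding)


# Text goes in and out only through explicit conversions.
field = encode_fixed("Hello", 10)
roundtrip = decode_fixed(field)

# Arbitrary binary data is not ASCII text: decoding it fails loudly.
try:
    b"\xff\xfe".decode("ascii")
    binary_is_ascii = True
except UnicodeDecodeError:
    binary_is_ascii = False

# Fixed-width UTF-32 costs 4 bytes per character; UTF-8 is variable-width.
sample = "h\u00e9llo"  # five characters, one of them non-ASCII
utf32_size = len(sample.encode("utf-32-le"))
utf8_size = len(sample.encode("utf-8"))
```

The design choice being illustrated is that the byte container never guesses an encoding: the caller names one on the way in and on the way out, which is the "no implicit conversions" behaviour suggested for 'S'.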