On Mon, Jul 14, 2014 at 10:39 AM, Andrew Collette <andrew.colle...@gmail.com> wrote:
> For storing data in HDF5 (PyTables or h5py), it would be somewhat
> cleaner if either ASCII or UTF-8 are used, as these are the only two
> charsets officially supported by the library.

good argument for ASCII, but utf-8 is a bad idea, as there is no 1:1
correspondence between length of string in bytes and length in characters --
as numpy needs to pre-allocate a defined number of bytes for a dtype, there
is a disconnect between the user and numpy as to how long a string is being
stored...

this isn't a problem for immutable strings, and less of a problem for HDF,
as you can determine how many bytes you need before you write the file
(or does HDF support var-length elements?)

> Latin-1 would require a
> custom read/write converter, which isn't the end of the world

"custom"? it would be an encoding operation -- which you'd need to go from
utf-8 to/from unicode anyway. So you would lose the ability to have a nice
1:1 binary representation map between numpy and HDF...

good argument for ASCII, I guess. Or for HDF to use latin-1 ;-)

Does HDF enforce ascii-only? what does it do with the > 127 values?

> would be tricky to do in a correct way, and likely somewhat slow.
> We'd also run into truncation issues since certain latin-1 chars
> become multibyte sequences in UTF8.

that's the whole issue with UTF-8 -- it needs to be addressed somewhere,
and the numpy-HDF interface seems like a smarter place to put it than the
numpy-user interface!

> I assume 'a' strings would still be null-padded?

yup.

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
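A minimal sketch of the byte-count vs. character-count disconnect described
above, assuming a fixed-width 'S10' dtype and a made-up example string; the
h5py variable-length-string call mentioned at the end is from memory and may
differ by version:

    # -*- coding: utf-8 -*-
    import numpy as np

    s = u"naïve café"             # 10 characters
    utf8 = s.encode("utf-8")      # 12 bytes: ï and é each take 2 bytes
    latin1 = s.encode("latin-1")  # 10 bytes: always 1 byte per character

    a = np.zeros(1, dtype="S10")  # numpy pre-allocates exactly 10 bytes
    a[0] = latin1                 # fits exactly; recover with .decode("latin-1")
    a[0] = utf8                   # silently truncated to 10 bytes: the trailing
                                  # é is dropped, and in general the cut can land
                                  # mid-way through a multi-byte sequence,
                                  # leaving bytes that won't decode as UTF-8

    # On the var-length question: h5py does expose HDF5's variable-length
    # string type, e.g. h5py.special_dtype(vlen=...), which avoids fixing a
    # byte count up front -- at the cost of the 1:1 fixed-width mapping.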