On Mon, Jul 14, 2014 at 10:39 AM, Andrew Collette <andrew.colle...@gmail.com
> wrote:


> For storing data in HDF5 (PyTables or h5py), it would be somewhat
> cleaner if either ASCII or UTF-8 are used, as these are the only two
> charsets officially supported by the library.


Good argument for ASCII, but utf-8 is a bad idea, as there is no 1:1
correspondence between the length of a string in bytes and its length in
characters. Since numpy needs to pre-allocate a defined number of bytes for
a dtype, there is a disconnect between the user and numpy as to how long a
string is being stored. This isn't a problem for immutable strings, and it
is less of a problem for HDF, as you can determine how many bytes you need
before you write the file (or does HDF support var-length elements?)
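
To make the byte/character mismatch concrete, here is a minimal sketch in
plain Python. The sample string is just an illustration I picked; it is not
from any real data set.

```python
# Illustrative string with two non-ASCII characters (i-diaeresis, e-acute).
text = "na\u00efve caf\u00e9"

n_chars = len(text)                      # length in characters
n_utf8 = len(text.encode("utf-8"))       # length in bytes as UTF-8
n_latin1 = len(text.encode("latin-1"))   # length in bytes as latin-1

# UTF-8 needs two bytes for each of the two accented characters,
# latin-1 needs exactly one byte per character.
print(n_chars, n_utf8, n_latin1)
```

So a user who thinks "10 characters" cannot know the utf-8 byte count
without encoding first, while with latin-1 the two numbers always agree.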


>  Latin-1 would require a
> custom read/write converter, which isn't the end of the world


"custom"? it would be an encoding operation -- which you'd need to go from
utf-8 to unicode and back anyway. So you would lose the ability to have a
nice 1:1 binary representation map between numpy and HDF... good argument
for ASCII, I guess. Or for HDF to use latin-1 ;-)
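
As a quick illustration of why latin-1 gives a 1:1 map: every byte value
0-255 decodes to exactly one character whose code point equals the byte
value. A small sketch using only the standard codecs:

```python
# All 256 possible byte values, decoded as latin-1.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# One character per byte, and each code point equals its byte value.
same_length = len(decoded) == len(all_bytes)
same_values = all(ord(ch) == b for ch, b in zip(decoded, all_bytes))

# The round trip back to bytes is lossless.
round_trip = decoded.encode("latin-1") == all_bytes
```

The conversion to utf-8 for HDF would then be a re-encode of that decoded
text, which is where the byte counts stop matching.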

Does HDF enforce ascii-only? what does it do with the > 127 values?


> would be tricky to do in a correct way, and likely somewhat slow.
> We'd also run into truncation issues since certain latin-1 chars
> become multibyte sequences in UTF8.
>

that's the whole issue with UTF-8 -- it needs to be addressed somewhere,
and the numpy-HDF interface seems like a smarter place to put it than the
numpy-user interface!
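
Here is a rough sketch of the truncation problem, assuming a fixed-width
numpy bytes dtype ('S4') that was sized by character count rather than by
utf-8 byte count. The string and the width are made up for the example.

```python
import numpy as np

# "cafe" with e-acute: 4 characters, but 5 bytes in UTF-8.
encoded = "caf\u00e9".encode("utf-8")

# A 4-byte field silently keeps only the first 4 bytes,
# cutting the two-byte sequence for the last character in half.
arr = np.array([encoded], dtype="S4")
stored = arr[0]

try:
    stored.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    # The stored bytes end in a dangling lead byte, so they are
    # no longer valid UTF-8.
    decodable = False
```

Whoever does the latin-1 to utf-8 conversion has to size the field in
bytes after encoding, or deal with this case explicitly.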

> I assume 'a' strings would still be null-padded?


yup.
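
For reference, a small sketch of what null-padding looks like in practice.
It uses the 'S' spelling of the bytes dtype, which to my knowledge is the
same type the 'a' code refers to; the width of 6 is arbitrary.

```python
import numpy as np

# A 3-byte value stored in a 6-byte fixed-width field.
arr = np.array([b"abc"], dtype="S6")

raw = arr.tobytes()   # the underlying buffer, padding included
value = arr[0]        # what you get back when indexing

# The buffer is padded out to the item size with null bytes,
# while indexing strips the trailing nulls again.
print(arr.dtype.itemsize, raw, value)
```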



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
