Hi Chuck,

> This note proposes to adapt the currently existing 'a'
> type letter, currently aliased to 'S', as a new fixed encoding dtype. Python
> 3.3 introduced two one byte internal representations for unicode strings,
> ascii and latin1. Ascii has the advantage that it is a subset of UTF-8,
> whereas latin1 has a few more symbols. Another possibility is to just make
> it an UTF-8 encoding, but I think this would involve more overhead as Python
> would need to determine the maximum character size.
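As a quick illustration of the three encodings mentioned in the quoted note, here is a minimal sketch using only the standard `str.encode` method. The sample strings `plain` and `accented` are my own assumptions, chosen for illustration, and are not taken from the proposal.

```python
# Sample strings are illustrative assumptions, not taken from the proposal.
plain = "numpy"
accented = "caf\u00e9"  # "café": the last character is U+00E9

# Pure ASCII text encodes to the same bytes in all three encodings,
# because ASCII is a subset of both Latin-1 and UTF-8.
ascii_bytes = plain.encode("ascii")
same_everywhere = ascii_bytes == plain.encode("latin-1") == plain.encode("utf-8")

# Latin-1 stores every code point up to U+00FF in a single byte,
# while UTF-8 needs two bytes for code points U+0080 through U+00FF.
latin1_len = len(accented.encode("latin-1"))
utf8_len = len(accented.encode("utf-8"))

# ASCII cannot represent the accented character at all.
try:
    accented.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(same_everywhere, latin1_len, utf8_len, ascii_ok)
```

The difference between `latin1_len` and `utf8_len` is the per-character size variation that a UTF-8 based fixed-width type would have to account for.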
For storing data in HDF5 (PyTables or h5py), it would be somewhat cleaner if either ASCII or UTF-8 were used, as these are the only two character sets officially supported by the library. Latin-1 would require a custom read/write converter, which isn't the end of the world but would be tricky to do correctly, and would likely be somewhat slow. We'd also run into truncation issues, since Latin-1 characters above 0x7F become two-byte sequences in UTF-8.

I assume 'a' strings would still be null-padded?

Andrew
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
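To make the truncation concern concrete, here is a rough sketch of writing a string into a fixed-width, null-padded byte field. The helpers `pack_fixed` and `unpack_fixed` are hypothetical names invented for this example; they are not part of NumPy, h5py, or PyTables, and real converters may well behave differently.

```python
def pack_fixed(text, width, encoding="utf-8"):
    """Encode text into exactly `width` bytes, null-padded (hypothetical helper)."""
    raw = text.encode(encoding)[:width]
    if encoding == "utf-8":
        # A byte-level cut can leave a partial multibyte sequence at the end;
        # decoding with errors="ignore" drops it so the field stays valid UTF-8.
        raw = raw.decode("utf-8", errors="ignore").encode("utf-8")
    return raw.ljust(width, b"\x00")


def unpack_fixed(field, encoding="utf-8"):
    """Strip trailing null padding and decode (hypothetical helper)."""
    return field.rstrip(b"\x00").decode(encoding)


text = "caf\u00e9"  # four characters: 4 bytes in Latin-1, 5 bytes in UTF-8

latin1_field = pack_fixed(text, 4, "latin-1")  # fits exactly
utf8_field = pack_fixed(text, 4, "utf-8")      # last character does not fit

print(latin1_field, utf8_field, unpack_fixed(utf8_field))
```

Under these assumptions, a 4-byte Latin-1 field holds the whole string, but the same 4 bytes in UTF-8 lose the final character, so a converter would need either a wider target field or an explicit truncation policy.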