On Thu, 2009-11-26 at 17:37 -0700, Charles R Harris wrote:
[clip]
> I'm not clear on your recommendation here, is it that we should use
> bytes, with unicode converted to UTF8?
The point is that I don't think we can just decide to use Unicode or Bytes in all places where PyString was used earlier. Which one it should be depends on the use. Users will expect that e.g. array([1,2,3], dtype='f4') still works, and that they don't have to write array([1,2,3], dtype=b'f4').

To summarize the use cases I've run across so far:

1) For the 'S' dtype, I believe we use Bytes for the raw data and the
   interface. Maybe we want to introduce a separate "bytes" dtype that's
   an alias for 'S'?

2) Field names:

       a = array([], dtype=[('a', int)])
       a = array([], dtype=[(b'a', int)])

   This is somewhat of an internal issue. We need to decide whether we
   internally coerce input to Unicode or Bytes, or whether we allow both
   Unicode and Bytes (but preserving previous semantics in this case
   requires extra work, due to semantic changes in PyDict).

   Currently, there is some code in NumPy to allow Unicode field names,
   but it has not been implemented coherently in all places, so e.g.
   direct creation of dtypes with Unicode field names fails. This also
   has implications for field titles, since those are stored in the same
   fields dict.

3) Format strings:

       a = array([], dtype=b'i4')

   I don't think it makes sense to handle format strings as Unicode
   internally -- they should always be coerced to bytes. This will make
   things easier at many points, since it will be enough to call
   PyBytes_AS_STRING(str) to get the char* pointer, rather than having
   to encode to UTF-8 first. The same goes for all other similar uses of
   strings, e.g. protocol descriptors. User input should just be coerced
   to ASCII on input, I believe.

   The problem here is that preserving repr() in this case requires some
   extra work. But maybe that has to be done.

> Will that support arrays that have been pickled and such?

Are the pickles backward compatible between Python 2 and 3 at all?

I think using Bytes for format strings will be backward-compatible. Field names are then a bit more difficult.
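To illustrate why the field-name question above is not just cosmetic: the "semantic changes in PyDict" refer to the fact that on Python 3, bytes and str hash and compare as distinct keys, so a fields dict keyed by one cannot be looked up with the other. A minimal pure-Python sketch:

```python
# On Python 3, b'a' and 'a' are distinct dict keys, so a fields dict
# populated with bytes keys is invisible to str lookups (and vice versa).
# On Python 2, 'a' == b'a', so either spelling found the same entry.
fields = {b'a': (int, 0)}

assert b'a' in fields
assert 'a' not in fields  # would have been True on Python 2
```

This is why input field names have to be coerced to a single type internally if the old "either spelling works" behaviour is to be preserved.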
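The proposal in point 3) -- coerce user-supplied format strings to ASCII bytes on input -- can be sketched in pure Python as follows. The helper name `coerce_descr` is hypothetical, for illustration only; it is not NumPy API:

```python
def coerce_descr(descr):
    """Hypothetical sketch: coerce a user-supplied format string to
    ASCII bytes, mirroring the proposal to keep format strings as
    bytes internally (so C code can use PyBytes_AS_STRING directly)."""
    if isinstance(descr, str):
        return descr.encode('ascii')  # non-ASCII input raises here
    if isinstance(descr, bytes):
        return descr
    raise TypeError("data type not understood")

assert coerce_descr('i4') == b'i4'   # str input accepted, as users expect
assert coerce_descr(b'f8') == b'f8'  # bytes input passed through
```

With this approach, only the input boundary deals with Unicode; everything downstream sees a single representation.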
Actually, we'll probably just have to coerce them to either Bytes or Unicode internally, since we'll need to do that on unpickling anyway if we want to be backward-compatible.

> Or will we just have a minimum of code to fix up?

I think we will in any case need to replace all uses of PyString in NumPy by PyBytes or PyUnicode, depending on context, and #define PyString PyBytes for Python 2. This seems to be the easiest way to make sure we have fixed all the points that need fixing. Currently, 193 of 800 numpy.core tests fail, and this seems largely due to Bytes vs. Unicode issues.

> And could you expand on the changes that repr() might undergo?

The main thing is that

    dtype('i4')
    dtype([('a', 'i4')])

may become

    dtype(b'i4')
    dtype([(b'a', b'i4')])

Of course, we can write an #ifdef'd separate repr formatting routine for Py3, but this is a bit of extra work.

> Mind, I think using bytes sounds best, but I haven't looked into the
> whole strings part of the transition and don't have an informed
> opinion on the matter.

***

By the way, should I commit this stuff (after factoring the commits into logical chunks) to SVN? It does not break anything for Python 2, at least as far as the test suite is concerned.

Pauli

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion