> The point is that I don't think we can just decide to use Unicode or
> Bytes in all places where PyString was used earlier.
Agreed. I think it's helpful to remember the origins of all this. IMHO, there
are two distinct types of data that Python 2 strings support:

1) text: this is the traditional "string".

2) bytes: raw bytes -- they could represent anything.

This, of course, is what the py3k string and bytes types are all about.
However, when Python started, it just so happened that text was represented by
an array of unsigned single-byte integers, so there really was no point in
having a "bytes" type, as a string would work just as well.

Enter unicode: now we have multiple ways of representing text internally, but
we want a single interface to that -- one that looks and acts like a sequence
of characters to users' code. The result is that the unicode type was
introduced.

In a way, unicode strings are a bit like arrays: they have an encoding
associated with them (like a dtype in numpy). You can represent a given bit of
text in multiple different arrangements of bytes, but they are all supposed to
mean the same thing and, if you know the encoding, you can convert between
them. This is kind of like how one can represent 5 in any of many dtypes:
uint8, int16, int32, float32, float64, etc. Not every value representable in
one dtype can be converted to all other dtypes, but many can. Just like
encodings.

Anyway, all this brings me to think about the use of strings in numpy this
way: if it is meant to be a human-readable piece of text, it should be a
unicode object. If not, then it is bytes. So "fromstring" and the like should,
of course, work with bytes (though maybe buffers, really...).

> Which one it will be should depend on the use. Users will expect that
> e.g. array([1,2,3], dtype='f4') still works, and they don't have to do
> e.g. array([1,2,3], dtype=b'f4').

Personally, I try to use np.float32 instead anyway, but I digress. In this
case, the "type code" is supposed to be a human-readable bit of text -- it
should be a unicode object (convertible to ASCII for interfacing with C...).
If we used b'f4', it would confuse things, as it couldn't be printed right.
Also: would the actual bytes involved potentially change depending on what
encoding was used for the literal? I.e. if the code was written in UTF-16,
would that byte string be 4 bytes long?

> To summarize the use cases I've run across so far:
>
> 1) For 'S' dtype, I believe we use Bytes for the raw data and the
> interface.

I don't think so here. 'S' is usually used to store human-readable strings;
I'd certainly expect to be able to do:

  s_array = np.array(['this', 'that'], dtype='S10')

And I'd expect it to work with non-literals that were unicode strings, i.e.
human-readable text. In fact, it's pretty rare that I'd ever want bytes here.
So I'd see 'S' mapped to 'U' here.

Francesc Alted wrote:
> the next should still work:
>
> In [2]: s = np.array(['asa'], dtype="S10")
>
> In [3]: s[0]
> Out[3]: 'asa'  # will become b'asa' in Python 3

I don't like that -- I put in a string, and get a bytes object back?

> In [4]: s.dtype.itemsize
> Out[4]: 10  # still 1 byte per element

But what if the strings passed in aren't representable in one byte per
character? Do we define 'S' as supporting only ANSI strings? In what encoding?

Pauli Virtanen wrote:
> 'U' is same as Python 3 unicode and probably in same internal
> representation (need to check). Neither is associated with encoding info.

Isn't it? I thought the encoding was always the same internally, so it is
known?
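To make the difference concrete, here's a small sketch of how the two dtypes
could behave on a Python 3 build of NumPy that keeps 'S' as a bytes dtype and
'U' as the unicode dtype (the non-ASCII example and the exact printed output
are illustrative, not checked against any particular version):

import numpy as np

# 'S' stores raw bytes: str input gets ASCII-encoded on the way in,
# and bytes scalars come back out.
s = np.array(['this', 'that'], dtype='S10')
print(s[0])               # prints: b'this'  (bytes come back, per Francesc's example)
print(s.dtype.itemsize)   # prints: 10       (one byte per element)

# Non-ASCII text can't be stored in an 'S' array at all:
try:
    np.array(['caf\u00e9'], dtype='S10')
except UnicodeEncodeError as exc:
    print(exc)            # 'ascii' codec can't encode character ...

# 'U' round-trips text as str, at the cost of 4 bytes per character:
u = np.array(['this', 'that'], dtype='U10')
print(u[0])               # prints: this
print(u.dtype.itemsize)   # prints: 40       (4 bytes per character internally)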
Francesc Alted wrote:
> That could be a good idea because that would ensure compatibility with
> existing NumPy scripts (i.e. old 'string' dtypes are mapped to 'bytes', as
> it should).

What do you mean by compatible? It would mean a lot of user code would have to
change with the 2->3 transition.

> The only thing that I don't like is that 'S' seems to be the initial
> letter for 'string', which is actually 'unicode' in Python 3 :-/
> But, for the sake of compatibility, we can probably live with that.

I suppose we could at least deprecate it.

>> Also, what will a bytes dtype mean within a py2 program context? Does
>> it matter if the bytes dtype just fails somehow if used in a py2
>> program?

Well, it should work in 2.6 anyway.

> Maybe we want to introduce a separate "bytes" dtype that's an alias
> for 'S'?

What do we need "bytes" for? Does it support anything that np.uint8 doesn't?

> 2) The field names:
>
> a = array([], dtype=[('a', int)])
> a = array([], dtype=[(b'a', int)])
>
> This is somewhat of an internal issue. We need to decide whether we
> internally coerce input to Unicode or Bytes.

Unicode is clear to me here -- it really should match what Python does for
variable names, and that is unicode in py3k, no?

> 3) Format strings
>
> a = array([], dtype=b'i4')
>
> I don't think it makes sense to handle format strings in Unicode
> internally -- they should always be coerced to bytes.

This should be fine -- we control what is a valid format string, and thus they
can always be ASCII-safe.

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
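P.S. For completeness, a similar sketch of the field-name and format-string
cases, again assuming a Python 3 NumPy that coerces field names to unicode and
accepts str format strings (the field names 'count' and 'value' are just
made-up illustrations):

import numpy as np

# Field names: str ("unicode") names are the natural spelling in Python 3,
# matching how Python itself treats identifiers.
a = np.array([(1, 2.0)], dtype=[('count', np.int32), ('value', np.float64)])
print(a.dtype.names)      # prints: ('count', 'value')

# Format strings: the type codes are ASCII-safe, so the plain str spelling
# keeps working and nobody is forced to write b'f4'.
b = np.array([1, 2, 3], dtype='f4')
print(b.dtype)            # prints: float32
print(b.dtype == np.dtype(np.float32))   # prints: True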