On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin <oscar.j.benja...@gmail.com> wrote:

> On Jan 20, 2014 8:35 PM, "Charles R Harris" <charlesr.har...@gmail.com>
> wrote:
> >
> > I think we may want something like PEP 393. The S datatype may be the
> > wrong place to look, we might want a modification of U instead so as to
> > transparently get the benefit of python strings.
>
> The approach taken in PEP 393 (the FSR) makes more sense for str than it
> does for numpy arrays for two reasons: str is immutable and opaque.
>
> Since str is immutable the maximum code point in the string can be
> determined once when the string is created before anything else can get a
> pointer to the string buffer.
>
> Since it is opaque no one can rightly expect it to expose a particular
> binary format so it is free to choose without compromising any expected
> semantics.
>
> If someone can call buffer on an array then the FSR is a semantic change.
>
> If a numpy 'U' array used the FSR and consisted only of ASCII characters
> then it would have a one byte per char buffer. What then happens if you put
> a higher code point in? The buffer needs to be resized and the data copied
> over. But then what happens to any buffer objects or array views? They
> would be pointing at the old buffer from before the resize. Subsequent
> modifications to the resized array would not show up in other views and
> vice versa.
>
> I don't think that this can be done transparently since users of a numpy
> array need to know about the binary representation. That's why I suggest a
> dtype that has an encoding. Only in that way can it consistently have both
> a binary and a text interface.
>

I didn't say we should change the S type, but that we should have
something, say 's', that appeared to python as a string. I think if we want
transparent string interoperability with python together with a compressed
representation, and I think we need both, we are going to have to deal with
the difficulties of utf-8. That means raising errors if the string doesn't
fit in the allotted size, etc.
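For illustration only, the "raise an error if the string doesn't fit" behaviour
could be sketched on top of the existing fixed-width 'S' dtype roughly as
follows. The helpers store_utf8 and load_utf8 are hypothetical names made up
for this sketch, not an existing or proposed NumPy API, and the choice of
ValueError is an assumption. The check is done explicitly because a plain
assignment into an 'S' array silently truncates bytes that are too long.

```python
import numpy as np


def store_utf8(arr, index, text):
    """Hypothetical helper: store `text` as UTF-8 in a fixed-width 'S' slot.

    Raises ValueError instead of letting NumPy silently truncate the bytes,
    which could otherwise cut a multi-byte character in half.
    """
    encoded = text.encode("utf-8")
    if len(encoded) > arr.dtype.itemsize:
        raise ValueError(
            "%r needs %d bytes as UTF-8, but the slot holds only %d"
            % (text, len(encoded), arr.dtype.itemsize)
        )
    arr[index] = encoded


def load_utf8(arr, index):
    """Hypothetical helper: read a slot back as a Python str."""
    return arr[index].decode("utf-8")


# Two slots of 5 bytes each, one byte per ASCII character.
names = np.zeros(2, dtype="S5")

store_utf8(names, 0, "hello")  # 5 characters, 5 bytes: fits

try:
    # 5 characters, but U+00E9 takes 2 bytes in UTF-8, so 6 bytes in total.
    store_utf8(names, 1, "h\u00e9llo")
    overflowed = False
except ValueError:
    overflowed = True
```

Note that the slot size is counted in bytes, not characters, so the number of
characters that fit depends on the text; that is one of the difficulties of
utf-8 mentioned above.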

Mind, this is a workaround for the mass of ascii data that is already out
there, not a substitute for 'U'.

Chuck
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion