On Sat, Jul 12, 2014 at 10:17 AM, Charles R Harris <charlesr.har...@gmail.com> wrote:
> As previous posts have pointed out, Numpy's `S` type is currently treated
> as a byte string, which leads to more complicated code in python3.

Also, a byte string in py3 is not, in fact, the same as the py2 string type. So we have a problem -- if we want 'S' to mean what it essentially does in py2, what do we map it to in pure-python land?

I propose we embrace the py3 model as fully as possible: there is text data, and there is binary data. In py3, that is 'str' and 'bytes', so numpy should have dtypes to match these. We're a bit stuck, however, because 'S' mapped to the py2 string type, which no longer exists in py3. Sorry, I'm not running py3 to see what 'S' does now, but I know it's a bit broken, and it may be too late to change it.

But it is certainly a common case in the scientific world to have 1-byte-per-character string data and to care about storage size, so a 1-byte-per-character text dtype may be a good idea.

As for a bytes dtype -- do we need it, or are we fine with simply using uint8 arrays? (Or even, in the most common case, converting directly to the type that is actually stored in those bytes...)

> especially for ascii strings. This note proposes to adapt the currently
> existing 'a' type letter, currently aliased to 'S', as a new fixed encoding
> dtype.

+1

> Python 3.3 introduced two one byte internal representations for unicode
> strings, ascii and latin1. Ascii has the advantage that it is a subset of
> UTF-8, whereas latin1 has a few more symbols.

+1 for latin-1 -- those extra symbols are handy. Also, at least with Python's stdlib codec, you can round-trip any binary data through latin-1 -- kind of making it act like a bytes object (a quick demonstration is in the P.S. below).

> Another possibility is to just make it an UTF-8 encoding, but I think this
> would involve more overhead as Python would need to determine the maximum
> character size.

Yeah -- that is (a) overhead, and (b) it breaks the numpy fixed-size dtype model. And it's trickier for numpy arrays, 'cause they are mutable -- python strings can do OK, as they don't need to accommodate potentially changing string sizes.

On Sat, Jul 12, 2014 at 5:02 PM, Nathaniel Smith <n...@pobox.com> wrote:

> I feel like for most purposes, what we *really* want is a variable length
> string dtype (i.e., where each element can be a different length).

Well, that is fundamentally different from the usual numpy data model -- it would require that the array store pointers and dereference them on use. Is there anywhere else in numpy (other than the object dtype) that does that? And if we did add it, would it end up having any advantage over putting strings in an object array (see the sketch in the P.P.S. below)? Or, for that matter, over using a list of strings?

> Pandas pays quite some price in overhead to fake this right now. Adding
> such a thing will cause some problems regarding compatibility (what to do
> with array(["foo"])) and education, but I think it's worth it in the long
> run.

I.e., do you use the fixed-length type or the variable-length type? I'm not sure it's too killer to have a default and let the user set a dtype if they want something else.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
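P.S. A minimal sketch of the latin-1 round trip mentioned above, using only the stdlib codec (the variable name is just illustrative):

    # latin-1 maps byte values 0-255 one-to-one onto code points 0-255,
    # so a decode/encode round trip is lossless for arbitrary binary data
    data = bytes(range(256))
    assert data.decode('latin-1').encode('latin-1') == data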
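P.P.S. And a minimal sketch of the two existing options discussed above -- what 'S' gives you under py3, and an object array standing in for a variable-length string dtype (array contents are just illustrative):

    import numpy as np

    # 'S' is a fixed-width *bytes* dtype under py3: indexing returns
    # bytes, so an explicit decode is needed to get text back
    a = np.array(['hello', 'numpy'], dtype='S5')
    text = a[0].decode('ascii')          # 'hello'

    # an object array stores pointers to Python objects, so elements can
    # be str instances of any length -- at the usual per-object cost
    b = np.array(['short', 'a much longer string'], dtype=object)
    b[1] += ' that can grow'             # elements are ordinary py3 strs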