On 01/03/2011 06:35 PM, Robert Bradshaw wrote:
> On Mon, Jan 3, 2011 at 4:01 AM, Lisandro Dalcin <[email protected]> wrote:
>
>> On 3 January 2011 04:41, Stefan Behnel <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I've been working on a fix for ticket #602, negative indexing for inferred
>>> char*.
>>>
>>> http://trac.cython.org/cython_trac/ticket/602
>>>
>>> Currently, when you write this:
>>>
>>>     s = b'abc'
>>>
>>> s is inferred as char*. This has several drawbacks. For one, we lose the
>>> length information, so "len(s)" becomes O(n) instead of O(1). Negative
>>> indexing fails completely because it will use pointer arithmetic, thus
>>> leaving the allocated memory area of the string. Also, code like the
>>> following is extremely inefficient because it requires multiple conversions
>>> from a char* of unknown length to a Python bytes object:
>>>
>>>     s = b'abc'
>>>     a = s1 + s
>>>     b = s2 + s
>>>
>>> I came to the conclusion that the right fix is to stop letting byte string
>>> literals start off as char*. This immediately fixes these issues and
>>> improves Python compatibility while still allowing automatic coercion, but
>>> it also comes with its own drawbacks.
>>>
>>> In nogil blocks, you will have to explicitly declare a variable as char*
>>> when assigning a byte string literal to it, otherwise you'd get a compile
>>> time error for a Python object assignment. I think this is a minor issue as
>>> most users would declare their variables anyway when using nogil blocks.
>>> Given that there isn't much you can do with a Python string inside of a
>>> nogil block, we could also honour nogil blocks during type inference and
>>> automatically infer char* for literals here. I don't think it would hurt
>>> anyone to do that.
>>>
>>> The second drawback is that it impacts type inference for char loops.
>>> Previously, you could write
>>>
>>>     s = b'abc'
>>>     for c in s:
>>>         print c
>>>
>>> and Cython would infer 'char' for c and print integer byte values. When s
>>> is inferred as 'bytes', c will be inferred as 'Python object' because
>>> Python 2 returns 1-byte strings and Python 3 returns integers on iteration.
>>> Thus the loop will run entirely in Python code and return different things
>>> in Py2 and Py3.
>>>
>>> I do not expect that this is a major issue either. Iteration over literals
>>> should be rare, after all, and if the byte string is constructed in any
>>> way, the type either becomes a bytes object through Python operations (like
>>> concatenation) or is explicitly provided, e.g. as a return type of a
>>> function call. But it is a clear behavioural change for the type inference
>>> in an area where Cython's (and Python's) semantics are tricky anyway.
>>>
>>> Personally, I think that the advantages outweigh the disadvantages here.
>>> Most common use cases won't notice the change because coercion will not be
>>> impacted, and most existing code (IMHO) either uses explicit typing or
>>> expects a Python bytes object anyway. So my preferred change would be to
>>> make byte string literals 'bytes' by default, except in nogil blocks.
>>>
>>> Opinions?
>>>
>>>
>> +1
>>
>>
> +1 I might say it should even be required in nogil blocks for consistency.
>
+1 to not making nogil blocks a special case. The disadvantage of another
special case to remember outweighs the advantage of syntactic brevity, IMO.

Dag Sverre
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
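
For illustration only, here is a minimal plain-Python 3 sketch of the behaviour an untyped byte string literal would have if it were inferred as 'bytes', as proposed in the quoted message. It is not Cython output and does not show how Cython compiles the code. The values bound to s1 and s2 are hypothetical stand-ins, since the thread leaves those names undefined.

```python
# Plain Python 3 semantics for a bytes object; under the proposal in this
# thread, an untyped Cython literal would be expected to behave the same way.
s = b'abc'

# The object stores its own length, so len() does not scan for a terminator.
n = len(s)

# Negative indexing is well defined because the length is known.
last = s[-1]  # integer byte value in Python 3

# Iteration yields integers in Python 3 (Python 2 yields 1-byte strings),
# which is the Py2/Py3 difference mentioned in the thread.
values = [c for c in s]

# Hypothetical operands standing in for s1 and s2 from the quoted snippet.
s1 = b'x'
s2 = b'y'

# s is already a bytes object, so it is reused directly in each concatenation.
a = s1 + s
b = s2 + s
```

Under Python 3 this gives n == 3, last == 99, values == [97, 98, 99], a == b'xabc' and b == b'yabc'. Code that needs a C string, for instance inside a nogil block, would declare the variable as char* explicitly.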
