Re: [Cython] Fixing #602 - type inference for byte string literals

Dag Sverre Seljebotn Mon, 03 Jan 2011 10:10:20 -0800

On 01/03/2011 06:35 PM, Robert Bradshaw wrote:
> On Mon, Jan 3, 2011 at 4:01 AM, Lisandro Dalcin <[email protected]> wrote:
>
>> On 3 January 2011 04:41, Stefan Behnel <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I've been working on a fix for ticket #602, negative indexing for inferred
>>> char*.
>>>
>>> http://trac.cython.org/cython_trac/ticket/602
>>>
>>> Currently, when you write this:
>>>
>>>      s = b'abc'
>>>
>>> s is inferred as char*. This has several drawbacks. For one, we lose the
>>> length information, so "len(s)" becomes O(n) instead of O(1). Negative
>>> indexing fails completely because it will use pointer arithmetic, thus
>>> leaving the allocated memory area of the string. Also, code like the
>>> following is extremely inefficient because it requires multiple conversions
>>> from a char* of unknown length to a Python bytes object:
>>>
>>>      s = b'abc'
>>>      a = s1 + s
>>>      b = s2 + s
>>>
>>> I came to the conclusion that the right fix is to stop letting byte string
>>> literals start off as char*. This immediately fixes these issues and
>>> improves Python compatibility while still allowing automatic coercion, but
>>> it also comes with its own drawbacks.
>>>
>>> In nogil blocks, you will have to explicitly declare a variable as char*
>>> when assigning a byte string literal to it, otherwise you'd get a compile
>>> time error for a Python object assignment. I think this is a minor issue as
>>> most users would declare their variables anyway when using nogil blocks.
>>> Given that there isn't much you can do with a Python string inside of a
>>> nogil block, we could also honour nogil blocks during type inference and
>>> automatically infer char* for literals here. I don't think it would hurt
>>> anyone to do that.
>>>
>>> The second drawback is that it impacts type inference for char loops.
>>> Previously, you could write
>>>
>>>      s = b'abc'
>>>      for c in s:
>>>          print c
>>>
>>> and Cython would infer 'char' for c and print integer byte values. When s
>>> is inferred as 'bytes', c will be inferred as 'Python object' because
>>> Python 2 returns 1-byte strings and Python 3 returns integers on iteration.
>>> Thus the loop will run entirely in Python code and return different things
>>> in Py2 and Py3.
>>>
>>> I do not expect that this is a major issue either. Iteration over literals
>>> should be rare, after all, and if the byte string is constructed in any
>>> way, the type either becomes a bytes object through Python operations (like
>>> concatenation) or is explicitly provided, e.g. as a return type of a
>>> function call. But it is a clear behavioural change for the type inference
>>> in an area where Cython's (and Python's) semantics are tricky anyway.
>>>
>>> Personally, I think that the advantages outweigh the disadvantages here.
>>> Most common use cases won't notice the change because coercion will not be
>>> impacted, and most existing code (IMHO) either uses explicit typing or
>>> expects a Python bytes object anyway. So my preferred change would be to
>>> make byte string literals 'bytes' by default, except in nogil blocks.
>>>
>>> Opinions?
>>>
>>>        
>> +1
>>
>>      
> +1. I might say it should even be required in nogil blocks for consistency.
>


+1 to not making nogil blocks a special case. IMO the disadvantage of
having another special case to remember outweighs the advantage of
syntactic brevity.
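
For concreteness, here is a rough plain-Python 3 sketch of the 'bytes'
behaviour the proposal relies on. It only shows Python object semantics
and does not exercise Cython's type inference; the values given to s1
and s2 are placeholders for the names in Stefan's example.

```python
# Plain Python 3 sketch of the 'bytes' semantics discussed in the thread.
# Illustrative only: this is not Cython and says nothing about inference.
s = b'abc'

length = len(s)          # the bytes object carries its own length
last = s[-1]             # negative indexing stays inside the object
items = [c for c in s]   # Python 3 iteration yields integers

# s1 and s2 are placeholder values (assumption, not from the thread).
s1, s2 = b'x', b'y'
a = s1 + s               # s is already a bytes object, so no
b = s2 + s               # char* -> bytes conversion is involved
```

Under Python 2 the same loop would yield 1-byte strings instead of
integers, which is the Py2/Py3 difference Stefan mentions.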

Dag Sverre
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev