Re: [Cython] Py_UNICODE* string support

Nikita Nemkin Sun, 03 Mar 2013 05:46:34 -0800

On Sun, 03 Mar 2013 15:32:36 +0600, Stefan Behnel <stefan...@behnel.de> wrote:

1) I would like to get rid of UnicodeConst. A Py_UNICODE* is not different
from any other C array, except that it can coerce to and from Unicode
strings. So the representation of a literal should be a (properly
reference counted) Python Unicode object, and users would be allowed to
cast them to <Py_UNICODE*>, just as we support it for <char*> and bytes.


I understand the idea. Since Python unicode literals are implicitly
coercible to Py_UNICODE*, there appears to be no need for C-level
Py_UNICODE[] literals. Indeed, client code will look exactly (!) the same
whether they are supported or not.

Except when it comes to nogil. (For example, native callbacks are almost
guaranteed to be nogil.) Hiding Python operations in what appears to be
pure C-level code will break users' assumptions.
This is the #1 reason why I went for C-level literals. The #2 reason is
efficiency on Py3.3: C-level literals don't need conversions and don't call
any conversion APIs.

2) non-BMP literals should be supported by representing them as normal
Unicode strings and creating the Py_UNICODE representation at need (i.e.
explicitly through a cast, at runtime). Py_UNICODE[] literals are simply
not portable.

Py_UNICODE[] literals can be made fully portable if non-BMP ones are
wrapped like this:

   #ifdef Py_UNICODE_WIDE
   static const Py_UNICODE k_xxx[] = { <UTF-32 array without surrogates>, 0 };
   #else
   static const Py_UNICODE k_xxx[] = { <UTF-16 array with surrogates>, 0 };
   #endif

Literals containing only BMP chars are already portable and don't need
this wrapping.
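
To illustrate what the UTF-16 branch has to contain, here is a minimal
sketch of how a single code point maps to UTF-16 code units. The name
demo_encode_utf16 is hypothetical and used only for this illustration; it
is not part of Cython or of the pull request.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
 * Writes 1 or 2 code units into out and returns how many were written.
 * Returns 0 for values that are not valid scalar values. */
static size_t demo_encode_utf16(uint32_t cp, uint16_t out[2])
{
    /* Reject out-of-range values and lone surrogate code points. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    /* BMP characters are stored as-is, so they look the same in
     * narrow and wide builds. */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Non-BMP characters are split into a high and a low surrogate. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return 2;
}
```

For example, U+1F600 would be emitted as the single value 0x1F600 in the
wide branch and as the pair 0xD83D, 0xDE00 in the narrow branch.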

3) __Pyx_Py_UNICODE_strlen() is ok, but only for the special case that
all we have is a Py_UNICODE*. As long as we are dealing with Unicode string
objects, that won't be needed, so len() should be constant time in the
normal case instead of linear time.


len(Py_UNICODE*) simply mirrors len(char*). Its purpose is to provide a
platform-independent Py_UNICODE_strlen (which is Py3 only and deprecated
in 3.3).
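
As a rough sketch of what such a helper amounts to (not the actual
__Pyx_Py_UNICODE_strlen code), a linear scan for the zero terminator is
enough. The demo_unicode typedef is an assumed stand-in for Py_UNICODE so
that the snippet compiles without Python.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for Py_UNICODE (16-bit as in a narrow build). */
typedef uint16_t demo_unicode;

/* Hypothetical helper: count code units before the terminating 0.
 * This is linear in the length of the string, just like strlen(). */
static size_t demo_unicode_strlen(const demo_unicode *u)
{
    const demo_unicode *p = u;
    while (*p != 0)
        p++;
    return (size_t)(p - u);
}
```

Note that the result counts code units, not characters: a non-BMP
character stored as a surrogate pair contributes 2 to the length.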

So, the basic idea would be to use Unicode strings and their (optional)
internal representation as Py_UNICODE[] instead of making Py_UNICODE[] a
first class data type. And then go from there and optimise certain things
to use the unpacked array directly, so that users won't need to put
explicit C-API calls into their code.


Please reconsider your decision wrt C-level literals.
I believe that nogil code and a bit of efficiency (on 3.3) justify their
existence. (char* does have C-level literals, and Py_UNICODE* is in
the same basket when it comes to Windows code.)
The code to support them is also small and well-contained.

I've updated my pull request to fully support non-BMP Py_UNICODE[] literals.

If you are still not convinced, so be it, I'll drop C-level literal support.



Best regards,
Nikita Nemkin

PS. I made a false claim in the previous mail. (Some of) Python's wchar_t
APIs do exist in Py2. But they won't manage the memory automatically anyway.
_______________________________________________
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel
