On 23.08.2011 11:46, Xavier Morel wrote:
> On 2011-08-23, at 10:55 , Martin v. Löwis wrote:
>>> - “The UTF-8 decoding fast path for ASCII only characters was removed
>>>  and replaced with a memcpy if the entire string is ASCII.” 
>>>  The fast path would still be useful for mostly-ASCII strings, which
>>>  are extremely common (unless UTF-8 has become a no-op?).
>>
>> Is it really extremely common to have strings that are mostly-ASCII but
>> not completely ASCII? I would agree that pure ASCII strings are
>> extremely common.
> Mostly ascii is pretty common for western-european languages (French, for
> instance, is probably 90 to 95% ascii). It's also a risk in english, when
> the writer "correctly" spells foreign words (résumé and the like).

I know - I still question whether it is "extremely common" (so much as
to justify a special case). That is: for what application, with what
dataset, would you gain what speedup, at the cost of how many extra
lines of code, and a potential slow-down for other datasets?

For the record, the optimization in question is the one where it masks
a long word with 0x80808080L, to see whether it is completely
ASCII, and then copies four characters in an unrolled fashion. It stops
doing so when it sees a non-ASCII character, and returns to that mode
when it gets to the next aligned memory address that stores only ASCII
characters.
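
The fast path described above can be sketched roughly as follows. This is
a simplified illustration, not the actual CPython code: the helper names
(`word_is_ascii`, `ascii_prefix_len`) are invented here, and a portable
unaligned load via memcpy stands in for the aligned-word handling of the
real decoder. The key trick is that 0x80 is set in any byte outside
ASCII, so masking a whole machine word with 0x8080808080808080 is zero
iff every byte in the word is ASCII.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the mask-based fast path: 0x80 is set in any
   non-ASCII byte, so (word & ASCII_MASK) == 0 means all eight bytes
   are ASCII and can be copied without per-byte decoding. */
#define ASCII_MASK 0x8080808080808080ULL

static int word_is_ascii(const unsigned char *p)
{
    uint64_t w;
    memcpy(&w, p, sizeof w);   /* portable unaligned load */
    return (w & ASCII_MASK) == 0;
}

/* Advance a word at a time while whole words are ASCII, then fall back
   to byte-at-a-time on the first word containing a non-ASCII byte. */
static size_t ascii_prefix_len(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i + sizeof(uint64_t) <= n && word_is_ascii(s + i))
        i += sizeof(uint64_t);
    while (i < n && s[i] < 0x80)
        i++;
    return i;
}
```

On a string such as "résumé" this fast path gives up after one byte,
which is exactly why its benefit for mostly-ASCII (rather than
pure-ASCII) data is the point in question.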

In the PEP 393 approach, if the string has a two-byte representation,
each character needs to be widened to two bytes, and likewise for four
bytes. So three separate copies of the unrolled loop would be needed,
one for each target size.
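
For illustration, here is what one of those three widening variants might
look like, for the two-byte target. This is a sketch under the PEP 393
layout assumptions, not CPython source; the function name is invented,
and `Py_UCS2` is simply typedef'd here rather than taken from the real
headers.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t Py_UCS2;   /* assumption: the 2-byte representation */

/* Widen a run of known-ASCII bytes into 2-byte code units, unrolled by
   four to mirror the original fast path.  The 1-byte case degenerates
   to memcpy; a third variant would target 4-byte units. */
static void widen_ascii_to_ucs2(const unsigned char *src, Py_UCS2 *dst,
                                size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    for (; i < n; i++)       /* remaining 0-3 bytes */
        dst[i] = src[i];
}
```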

Regards,
Martin

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev