On 8/23/2011 6:20 AM, "Martin v. Löwis" wrote:
> On 23.08.2011 11:46, Xavier Morel wrote:
>> Mostly-ASCII is pretty common for Western European languages (French, for
>> instance, is probably 90 to 95% ASCII). It's also a risk in English, when
>> the writer "correctly" spells foreign words (résumé and the like).
> I know - I still question whether it is "extremely common" (so much as
> to justify a special case). I.e., on what application, with what dataset,
> would you gain what speedup, at the expense of how many extra lines of
> code and what potential slow-down for other datasets?
[snip]
> In the PEP 393 approach, if the string has a two-byte representation,
> each character needs to be widened to two bytes, and likewise for four
> bytes. So three separate copies of the unrolled loop would be needed,
> one for each target size.
I fully support the declared purpose of the PEP, which I understand to
be to have a full, correct Unicode implementation in all new Python
releases without paying unnecessary space (and consequent time)
penalties. I think the erroneous length, iteration, indexing, and
slicing for strings with non-BMP chars in narrow builds need to be
fixed in future versions. I think we should at least consider
alternatives to the PEP 393 solution of doubling or quadrupling the
space when it is needed for even one char.
In utf16.py, attached to http://bugs.python.org/issue12729, I propose
for consideration a prototype of a different solution to the
'mostly BMP chars, few non-BMP chars' case. Rather than expand every
character from 2 bytes to 4, attach an array cpdex of character (i.e. code
point, not code unit) indexes. Then for indexing and slicing, the
correction is simple, simpler than I first expected:
code-unit-index = char-index + bisect.bisect_left(cpdex, char-index)
where code-unit-index is the adjusted index into the full underlying
double-byte array. This adds a time penalty of log2(len(cpdex)), but
avoids most of the space penalty and the consequent time penalty of
moving more bytes around and increasing cache misses.
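To make the arithmetic concrete, here is a minimal pure-Python sketch of
that correction (the helper names are my own, not the actual classes in
utf16.py; it assumes a build where iterating a str yields whole code points):

import bisect

def build_cpdex(s):
    # Character (code point) indexes of the non-BMP chars in s.
    return [i for i, ch in enumerate(s) if ord(ch) > 0xFFFF]

def code_unit_index(cpdex, char_index):
    # bisect_left counts the non-BMP chars strictly before char_index;
    # each contributes one extra UTF-16 code unit (the second half of
    # its surrogate pair).
    return char_index + bisect.bisect_left(cpdex, char_index)

s = "ab\U00010000cd\U00010001e"        # mostly BMP, two non-BMP chars
units = s.encode("utf-16-le")          # the underlying double-byte array
cpdex = build_cpdex(s)                 # [2, 5]
for i, ch in enumerate(s):
    j = code_unit_index(cpdex, i)
    # The code unit at position j starts the character at index i.
    assert units[2*j:2*j + 2] == ch.encode("utf-16-le")[:2]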
I believe the same idea would work for UTF-8 and the mostly-ASCII case.
The main difference is that non-ASCII chars have various byte sizes
rather than the one extra double-byte unit of non-BMP chars in UCS-2
builds. So the offset correction would not simply be the bisect_left
return but would require another lookup:
byte-index = char-index + offsets[bisect_left(cpdex, char-index)]
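Again only as an illustrative sketch (the helper names and the offsets
layout are my own guess at one workable scheme, not code from the issue):
offsets holds the cumulative extra bytes contributed by the non-ASCII
chars, so the bisect result selects the right cumulative correction.

import bisect
from itertools import accumulate

def build_utf8_index(s):
    # cpdex: character indexes of the non-ASCII chars in s.
    # offsets[k]: total extra bytes contributed by the first k of them.
    cpdex = [i for i, ch in enumerate(s) if ord(ch) > 0x7F]
    extras = (len(s[i].encode("utf-8")) - 1 for i in cpdex)
    return cpdex, [0] + list(accumulate(extras))

def byte_index(cpdex, offsets, char_index):
    # Non-ASCII chars before char_index each add a variable number of
    # bytes, so the bisect result is used to look up the cumulative
    # correction rather than being added directly.
    return char_index + offsets[bisect.bisect_left(cpdex, char_index)]

s = "résumé"                           # mostly ASCII, two 2-byte chars
data = s.encode("utf-8")               # the underlying byte array
cpdex, offsets = build_utf8_index(s)   # [1, 5], [0, 1, 2]
for i, ch in enumerate(s):
    b = byte_index(cpdex, offsets, i)
    assert data[b:b + len(ch.encode("utf-8"))].decode("utf-8") == ch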
If possible, I would make the with-index-array versions separate
subtypes, as in utf16.py. I believe either index-array implementation
might also benefit from a subtype for strings of a single multi-unit
char: a lone non-ASCII or non-BMP char does not need an auxiliary [0]
array and a pointless lookup therein, but it does need its length fixed
at 1 instead of the number of base array units.
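A toy illustration of that single-char special case (my own class, not
one of the utf16.py subtypes) might look like:

class SingleNonBMPChar:
    # Holds the two UTF-16 code units (a surrogate pair) of one non-BMP
    # char. No cpdex array and no bisect lookup are needed; the only
    # correction is reporting a length of 1 rather than the two
    # underlying code units.
    def __init__(self, units):
        assert len(units) == 2          # two 16-bit units: lead and trail surrogate
        self._units = units

    def __len__(self):
        return 1

    def __getitem__(self, i):
        if i in (0, -1):
            return self                 # the sole character is the string itself
        raise IndexError("string index out of range")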
--
Terry Jan Reedy