Walter Dörwald wrote:
> I wonder if we should switch back to a simple readline() implementation
> for those codecs that don't require the current implementation
> (basically every charmap codec).
That would be my preference as well. The 2.4 .readline() approach is really only needed for codecs that have to deal with encodings that: a) use multi-byte formats, b) support more line-end formats than just CR, CRLF or LF, or c) are stateful. This can easily be had by using a mix-in class for the codecs which do need the buffered .readline() approach.

> AFAIK source files are opened in universal newline mode, so at least
> we'd get proper treatment of "\n", "\r" and "\r\n" line ends, but we'd
> lose u"\x1c", u"\x1d", u"\x1e", u"\x85", u"\u2028" and u"\u2029"
> (which are line terminators according to unicode.splitlines()).

While the Unicode standard defines these characters as line-end code points, that definition does not necessarily apply to data converted from a given encoding to Unicode, so this is not a big loss. E.g. in ASCII or Latin-1, the FILE, GROUP and RECORD SEPARATOR and NEXT LINE characters (0x1c, 0x1d, 0x1e, 0x85) are not interpreted as line-end characters.

Furthermore, we had no reports from Python 1.6 and 2.0 - 2.3 of line endings not being detected properly, even though all of those versions relied on the stream's .readline() method to get the next line. The only bug reports we had were for UTF-16, which falls into category a) above and did not support .readline() until Python 2.4.

A note on the performance of _PyUnicode_IsLinebreak(): in Python 2.0, Fredrik changed this function to use a two-step lookup (reducing the size of the lookup tables considerably). I think it's worthwhile reconsidering this approach for character-type queries that do not involve a huge number of code points.
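For the record, the two-step lookup can be sketched in a few lines of Python (a hypothetical illustration of the idea only -- CPython generates its real C tables with a script, and the names below are made up): the high bits of the code point index a first table, whose entry selects a 128-entry block in a second table indexed by the low bits; identical blocks are stored only once, which is what shrinks the tables.

```python
# Two-step lookup table for the eight line-break code points
# (illustrative sketch; not CPython's actual table generator).
LINEBREAKS = {0x0A, 0x0D, 0x1C, 0x1D, 0x1E, 0x85, 0x2028, 0x2029}

SHIFT = 7                  # 128 code points per block
BLOCK = 1 << SHIFT
LIMIT = 0x2029 + 1         # highest line-break code point + 1

index1 = []                # maps block number -> shared block id
index2 = []                # flattened, de-duplicated blocks
seen = {}
for start in range(0, LIMIT, BLOCK):
    block = tuple(int(start + i in LINEBREAKS) for i in range(BLOCK))
    if block not in seen:  # share identical blocks (most are all-zero)
        seen[block] = len(seen)
        index2.extend(block)
    index1.append(seen[block])

def is_linebreak(ch):
    cp = ord(ch)
    if cp >= LIMIT:
        return False
    return bool(index2[index1[cp >> SHIFT] * BLOCK + (cp & (BLOCK - 1))])
```

With these eight code points, only four distinct blocks occur, so index1 has 65 entries and index2 512, versus a flat 8234-entry table -- which is why the two-step form saves so much space for sparse properties.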
In Python 1.6 the function looked like this (and was inlined by the compiler using its own fast lookup table):

int _PyUnicode_IsLinebreak(register const Py_UNICODE ch)
{
    switch (ch) {
    case 0x000A: /* LINE FEED */
    case 0x000D: /* CARRIAGE RETURN */
    case 0x001C: /* FILE SEPARATOR */
    case 0x001D: /* GROUP SEPARATOR */
    case 0x001E: /* RECORD SEPARATOR */
    case 0x0085: /* NEXT LINE */
    case 0x2028: /* LINE SEPARATOR */
    case 0x2029: /* PARAGRAPH SEPARATOR */
        return 1;
    default:
        return 0;
    }
}

Another candidate to convert back is:

int _PyUnicode_IsWhitespace(register const Py_UNICODE ch)
{
    switch (ch) {
    case 0x0009: /* HORIZONTAL TABULATION */
    case 0x000A: /* LINE FEED */
    case 0x000B: /* VERTICAL TABULATION */
    case 0x000C: /* FORM FEED */
    case 0x000D: /* CARRIAGE RETURN */
    case 0x001C: /* FILE SEPARATOR */
    case 0x001D: /* GROUP SEPARATOR */
    case 0x001E: /* RECORD SEPARATOR */
    case 0x001F: /* UNIT SEPARATOR */
    case 0x0020: /* SPACE */
    case 0x0085: /* NEXT LINE */
    case 0x00A0: /* NO-BREAK SPACE */
    case 0x1680: /* OGHAM SPACE MARK */
    case 0x2000: /* EN QUAD */
    case 0x2001: /* EM QUAD */
    case 0x2002: /* EN SPACE */
    case 0x2003: /* EM SPACE */
    case 0x2004: /* THREE-PER-EM SPACE */
    case 0x2005: /* FOUR-PER-EM SPACE */
    case 0x2006: /* SIX-PER-EM SPACE */
    case 0x2007: /* FIGURE SPACE */
    case 0x2008: /* PUNCTUATION SPACE */
    case 0x2009: /* THIN SPACE */
    case 0x200A: /* HAIR SPACE */
    case 0x200B: /* ZERO WIDTH SPACE */
    case 0x2028: /* LINE SEPARATOR */
    case 0x2029: /* PARAGRAPH SEPARATOR */
    case 0x202F: /* NARROW NO-BREAK SPACE */
    case 0x3000: /* IDEOGRAPHIC SPACE */
        return 1;
    default:
        return 0;
    }
}

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Aug 23 2005)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...
http://python.egenix.com/
________________________________________________________________________
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev