Walter Dörwald wrote:
> I wonder if we should switch back to a simple readline() implementation
> for those codecs that don't require the current implementation
> (basically every charmap codec).
That would be my preference as well. The 2.4 .readline() approach is really only needed for codecs that have to deal with encodings that: a) use multi-byte formats, b) support more line-end formats than just CR, CRLF or LF, or c) are stateful. This can easily be had by using a mix-in class for the codecs which do need the buffered .readline() approach.

> AFAIK source files are opened in universal newline mode, so at least
> we'd get proper treatment of "\n", "\r" and "\r\n" line ends, but we'd
> lose u"\x1c", u"\x1d", u"\x1e", u"\x85", u"\u2028" and u"\u2029"
> (which are line terminators according to unicode.splitlines()).

While the Unicode standard defines these characters as line-end code points, that definition does not necessarily apply to data converted from a given encoding to Unicode, so this is not a big loss. E.g. in ASCII or Latin-1, the FILE, GROUP and RECORD SEPARATOR and NEXT LINE characters (0x1c, 0x1d, 0x1e, 0x85) are not interpreted as line-end characters.

Furthermore, we had no reports from Python 1.6 and 2.0 - 2.3 of line endings not being detected properly, even though all of those versions relied on the stream's .readline() method to get the next line. The only bug reports we had were for UTF-16, which falls into category a) above and did not support .readline() until Python 2.4.

A note on the performance of _PyUnicode_IsLinebreak(): in Python 2.0, Fredrik changed this function to use a two-step lookup (reducing the size of the lookup tables considerably). I think it's worthwhile reconsidering this approach for character-type queries that do not involve a huge number of code points.
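For the record, the two-step lookup can be sketched in a few lines of Python (a hypothetical illustration of the idea only -- CPython generates its real C tables with a script, and the names below are made up): the high bits of the code point index a first table, whose entry selects a 128-entry block in a second table indexed by the low bits; identical blocks are stored only once, which is what shrinks the tables.

```python
# Two-step lookup table for the eight line-break code points
# (illustrative sketch; not CPython's actual table generator).
LINEBREAKS = {0x0A, 0x0D, 0x1C, 0x1D, 0x1E, 0x85, 0x2028, 0x2029}

SHIFT = 7                  # 128 code points per block
BLOCK = 1 << SHIFT
LIMIT = 0x2029 + 1         # highest line-break code point + 1

index1 = []                # maps block number -> shared block id
index2 = []                # flattened, de-duplicated blocks
seen = {}
for start in range(0, LIMIT, BLOCK):
    block = tuple(int(start + i in LINEBREAKS) for i in range(BLOCK))
    if block not in seen:  # share identical blocks (most are all-zero)
        seen[block] = len(seen)
        index2.extend(block)
    index1.append(seen[block])

def is_linebreak(ch):
    cp = ord(ch)
    if cp >= LIMIT:
        return False
    return bool(index2[index1[cp >> SHIFT] * BLOCK + (cp & (BLOCK - 1))])
```

With these eight code points, only four distinct blocks occur, so index1 has 65 entries and index2 512, versus a flat 8234-entry table -- which is why the two-step form saves so much space for sparse properties.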
In Python 1.6 the function looked like this (and was inlined by the compiler using its own fast lookup table):

int _PyUnicode_IsLinebreak(register const Py_UNICODE ch)
{
    switch (ch) {
    case 0x000A: /* LINE FEED */
    case 0x000D: /* CARRIAGE RETURN */
    case 0x001C: /* FILE SEPARATOR */
    case 0x001D: /* GROUP SEPARATOR */
    case 0x001E: /* RECORD SEPARATOR */
    case 0x0085: /* NEXT LINE */
    case 0x2028: /* LINE SEPARATOR */
    case 0x2029: /* PARAGRAPH SEPARATOR */
        return 1;
    default:
        return 0;
    }
}

Another candidate to convert back is:

int _PyUnicode_IsWhitespace(register const Py_UNICODE ch)
{
    switch (ch) {
    case 0x0009: /* HORIZONTAL TABULATION */
    case 0x000A: /* LINE FEED */
    case 0x000B: /* VERTICAL TABULATION */
    case 0x000C: /* FORM FEED */
    case 0x000D: /* CARRIAGE RETURN */
    case 0x001C: /* FILE SEPARATOR */
    case 0x001D: /* GROUP SEPARATOR */
    case 0x001E: /* RECORD SEPARATOR */
    case 0x001F: /* UNIT SEPARATOR */
    case 0x0020: /* SPACE */
    case 0x0085: /* NEXT LINE */
    case 0x00A0: /* NO-BREAK SPACE */
    case 0x1680: /* OGHAM SPACE MARK */
    case 0x2000: /* EN QUAD */
    case 0x2001: /* EM QUAD */
    case 0x2002: /* EN SPACE */
    case 0x2003: /* EM SPACE */
    case 0x2004: /* THREE-PER-EM SPACE */
    case 0x2005: /* FOUR-PER-EM SPACE */
    case 0x2006: /* SIX-PER-EM SPACE */
    case 0x2007: /* FIGURE SPACE */
    case 0x2008: /* PUNCTUATION SPACE */
    case 0x2009: /* THIN SPACE */
    case 0x200A: /* HAIR SPACE */
    case 0x200B: /* ZERO WIDTH SPACE */
    case 0x2028: /* LINE SEPARATOR */
    case 0x2029: /* PARAGRAPH SEPARATOR */
    case 0x202F: /* NARROW NO-BREAK SPACE */
    case 0x3000: /* IDEOGRAPHIC SPACE */
        return 1;
    default:
        return 0;
    }
}

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Aug 23 2005)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...
http://python.egenix.com/
________________________________________________________________________
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev