Seriously, can this discussion move somewhere else? It does not belong on python-dev.
Thank you

Antoine.


On Wed, 17 Sep 2014 18:56:02 +1000
Steven D'Aprano <st...@pearwood.info> wrote:
> On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote:
>
> > Guido's mantra is something like "Python's str doesn't contain
> > characters or even code points[1], it contains code units."
>
> But is that true? If it were true, I would expect to be able to make
> Python text strings containing code units that aren't code points, e.g.
> something like "\U12340000" or chr(0x12340000) should work, but neither
> do. As far as I can tell, there is no way to build a string containing
> items which aren't code points.
>
> I don't think it is useful to say that strings *contain* code units,
> more that they *are made up from* code units. Code units are the
> implementation: 16-bit code units in narrow builds, 32-bit code units
> in wide builds, and either 8-, 16- or 32-bit code units in Python 3.3 and
> beyond. (I don't know of any Python implementation which uses UTF-8
> internally, but if there was one, it would use 8-bit code units.)
>
> It isn't very useful to say that in Python 3.3 the string "A" *contains*
> the 8-bit code unit 0x41. That's conflating two different levels of
> explanation (the high-level interface and the underlying implementation)
> and potentially leads to user confusion like
>
>     # 8-bit code units are bytes, right?
>     assert b'\x41' in "A"
>
> which is Not Even Wrong.
> http://rationalwiki.org/wiki/Not_even_wrong
>
> I think it is correct to say that Python strings are sequences of
> Unicode code points U+0000 through U+10FFFF. There are no other
> restrictions, e.g. strings can contain surrogates, noncharacters, or
> nonsensical combinations of code points such as a U+0300 COMBINING GRAVE
> ACCENT combined with U+000A (newline).
>
>
> > Implying
> > that dealing with characters (or the grapheme globs that occasionally
> > raise their ugly heads here) is an issue for higher-level facilities
> > than str to deal with.
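
The claims in the quoted text above can be checked interactively. What follows is a minimal sketch, assuming CPython 3.3 or later; the helper name `can_build` and the flag names are hypothetical and exist only for this illustration.

```python
import sys


def can_build(code_point):
    """Return True if chr() accepts the integer (hypothetical helper)."""
    try:
        chr(code_point)
    except (ValueError, OverflowError):
        return False
    return True


# The largest item a str can hold is the code point U+10FFFF.
assert sys.maxunicode == 0x10FFFF
assert can_build(0x10FFFF)

# Integers outside the Unicode range are rejected, so there is no way to
# build a str containing an item that is not a code point.
assert not can_build(0x110000)
assert not can_build(0x12340000)

# The literal form is rejected too, at compile time.
try:
    compile(r'"\U12340000"', "<demo>", "eval")
    literal_ok = True
except SyntaxError:
    literal_ok = False
assert not literal_ok

# Lone surrogates, noncharacters, and odd combinations are all accepted.
lone_surrogate = "\ud800"
noncharacter = "\uffff"
odd_pair = "\n\u0300"  # newline followed by COMBINING GRAVE ACCENT
assert len(lone_surrogate) == 1
assert len(noncharacter) == 1
assert len(odd_pair) == 2

# Items of a str are code points, not bytes: mixing the two is a TypeError
# rather than a False result.
try:
    b"\x41" in "A"
    mixed_ok = True
except TypeError:
    mixed_ok = False
assert not mixed_ok

# The code point and its UTF-8 code unit share the value 0x41, but only
# the encoded bytes object contains the byte.
assert ord("A") == 0x41
assert b"\x41" in "A".encode("utf-8")
```

The sketch only exercises the str interface; it says nothing about which internal code unit width a given build picks.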
>
> Agreed that Python doesn't offer a string type based on graphemes, and
> that such a facility belongs as a high-level library, not a built-in
> type.
>
> Also agreed that talking about characters is sloppy. Nevertheless, for
> English speakers at least, "code point = character" isn't too awful a
> first approximation.
>
>
> > The point being that
> >
> > > Basically, we are pretending that the each smuggled byte is single
> > > character
> >
> > is something of a misstatement (good enough for present purpose of
> > discussing email, but not good enough for the general case of
> > understanding how this is supposed to work when porting the construct
> > to other Python implementations), while
> >
> > > for string parsing purposes...but they don't match any of our
> > > parsing constants.
> >
> > is precisely Pythonically correct. You might want to add "because all
> > parsing constants contain only valid characters by construction."
>
> I don't understand what you are trying to say here.
>
>
> > > [*] I worried a lot that this was re-introducing the bytes/string
> > > problem from python2.
> >
> > It isn't, because the bytes/str problem was that given a str object
> > out of context you could not tell whether it was a binary blob or
> > text, and if text, you couldn't tell if it was external encoded text
> > or internal abstract text.
> >
> > That is not true here because the representations of characters vs.
> > smuggled bytes in str are disjoint sets.
>
> Nor am I sure what you are trying to say here either.
>
>
> > Footnotes:
> > [1] In Unicode terminology, a code unit is the smallest computer
> > object that can represent a character (this is uniquely and sanely
> > defined for all real Unicode transformation formats aka UTFs). A code
> > point is an integer 0 - (17*256*256-1) that can represent a character,
> > but many code points such as surrogates and 0xFFFF are defined to be
> > non-characters.
>
> Actually not quite. "Noncharacter" is concretely defined in Unicode, and
> there are only 66 of them, many fewer than the surrogate code points
> alone. Surrogates are reserved, not noncharacters.
>
> http://www.unicode.org/glossary/#surrogate_code_point
> http://www.unicode.org/faq/private_use.html#nonchar1
>
> It is wrong to talk about "surrogate characters", but perhaps you mean
> to say that surrogates (by which I understand you to mean surrogate code
> points) are "not human-meaningful characters", which is not the same
> thing as a Unicode noncharacter.
>
>
> > Characters are those code points that may be assigned
> > an interpretation as a character, including undefined characters
> > (private space and reserved).
>
> So characters are code points which are characters, including undefined
> characters? :-)
>
> http://www.unicode.org/glossary/#character