Toshio Kuratomi writes:

 > One comment here -- you can also have uri's that aren't decodable
 > into their true textual meaning using a single encoding.
 >
 > Apache will happily serve out uris that have utf-8, shift-jis, and
 > euc-jp components inside of their path but the textual
 > representation that was intended will be garbled (or be represented
 > by escaped byte sequences). For that matter, apache will serve
 > requests that have no true textual representation as it is working
 > on the byte level rather than the character level.

Sure. I've never seen that combination, but I have seen Shift JIS and
KOI8-R in the same path.

But in that case, just using 'latin-1' as the encoding allows you to
use the (unicode) string operations internally, and then spew your
mess out into the world for someone else to clean up, just as using
bytes would.

 > So a complete solution really should allow the programmer to pass
 > in uris as bytes when the programmer knows that they need it.

Other than passing bytes into a constructor, I would argue that if a
complete solution requires, eg, an interface that allows
urljoin(base,subdir) where the types of base and subdir are not
required to match, then it doesn't belong in the stdlib. For stdlib
usage, that's premature optimization IMO.

The RFC says that URIs are text, and therefore they can (and IMO
should) be operated on as text in the stdlib.

It's not just a matter of manipulating the URIs themselves, where
working directly on bytes will work just as well and with the same
string operations (as long as everything is bytes). It's also a
question of API complexity (eg, Barry's bugaboo of proliferation of
encoding= parameters) and of debugging (if URIs are internally str,
then they will display sanely in tracebacks and the interpreter).

The cases where URIs can't be sanely treated as text are garbage
input, and the stdlib should not try to provide a solution. Just
passing in bytes and getting out bytes is GIGO. Trying to do "some"
error-checking is going to be insufficient much of the time and overly
strict most of the rest of the time. The programmer in the trenches
is going to need to decide what to allow and what not; I don't think
there are general answers because we know that allowing random URLs on
the web leads to various kinds of problems. Some sites will need to
address some of them.

Note also that the "complete solution" argument cuts both ways. Eg, a
"complete" solution should implement UTS 39 "confusables
detection"[1] and IDNA[2].
Good luck doing that with bytes!

If you *need* bytes (rather than simply trying to avoid conversion
overhead), you're in a hazmat handling situation. Passing bytes into
stdlib APIs here is the equivalent of carrying around kilograms of
fissionables in an open bucket. While the Tokaimura comparison is
hyperbole, it can't be denied that use of bytes here shortcuts a lot
of processing strongly suggested by the RFCs, and prevents use of
various programming conveniences (such as reasonable display of URI
values in debugging). Does the efficiency really justify including
that in the stdlib?

I dunno, I'm not a web programmer in the trenches. But I take my cue
from MvL and MAL who don't seem real enthusiastic about this. And as
Martin says, there is as yet no evidence offered that the overhead of
conversion is a general problem.

Footnotes:
[1] http://www.unicode.org/reports/tr39/
[2] http://www.rfc-editor.org/rfc/rfc3490.txt
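
For concreteness, here is a minimal sketch of the 'latin-1' trick
mentioned above. It is not a proposed stdlib API: the helper name
join_path and the sample path components are hypothetical, made up
for illustration. The point is only that latin-1 maps every byte to
exactly one code point, so a mixed-encoding path survives a round
trip through str unchanged.

```python
def join_path(base: bytes, subdir: bytes) -> bytes:
    """Join two raw path fragments, using str operations internally.

    Hypothetical helper for illustration only.
    """
    # latin-1 decoding cannot fail: byte N becomes code point U+00NN.
    base_text = base.decode("latin-1")
    subdir_text = subdir.decode("latin-1")
    joined_text = base_text.rstrip("/") + "/" + subdir_text.lstrip("/")
    # Encoding back with latin-1 restores the original bytes exactly.
    return joined_text.encode("latin-1")


# Made-up path components in two different legacy encodings.
sjis_part = "日本語".encode("shift_jis")
koi8_part = "документ".encode("koi8_r")
base = b"/" + sjis_part + b"/"

joined = join_path(base, koi8_part)

# The result has no single "true" encoding; UTF-8 decoding fails.
try:
    joined.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```

The str seen internally is mojibake, of course; this only shows that
the bytes pass through undamaged, which is all the bytes-in/bytes-out
approach offers anyway.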