New submission from monk.e.boy <[EMAIL PROTECTED]>:

Hi,

The way urljoin works is a bit funky: equivalent paths do not get cleaned in a consistent way.

import urlparse
import posixpath

print urlparse.urljoin('http://www.example.com', '///')
print urlparse.urljoin('http://www.example.com/', '///')
print urlparse.urljoin('http://www.example.com///', '///')
print urlparse.urljoin('http://www.example.com///', '//')
print urlparse.urljoin('http://www.example.com///', '/')
print urlparse.urljoin('http://www.example.com///', '')
print
# the above should reduce down to:
print posixpath.normpath('///')
print
print urlparse.urljoin('http://www.example.com///', '.')
print urlparse.urljoin('http://www.example.com///', '/.')
print urlparse.urljoin('http://www.example.com///', './')
print urlparse.urljoin('http://www.example.com///', '/.')
print
print posixpath.normpath('/.')
print
print urlparse.urljoin('http://www.example.com///', '..')
print urlparse.urljoin('http://www.example.com', '/a/../a/')
print urlparse.urljoin('http://www.example.com', '../')
print urlparse.urljoin('http://www.example.com', 'a/../a/')
print urlparse.urljoin('http://www.example.com', 'a/../a/./')
print urlparse.urljoin('http://www.example.com/a/../a/', '../a/./../a/')
print urlparse.urljoin('http://www.example.com/a/../a/', '/../a/./../a/')

The results of the above code are:

http://www.example.com/
http://www.example.com/
http://www.example.com/
http://www.example.com///
http://www.example.com/
http://www.example.com///

/

http://www.example.com///
http://www.example.com/.
http://www.example.com///
http://www.example.com/.

/

http://www.example.com
http://www.example.com/.
http://www.example.com
http://www.example.com/.
http://www.example.com//
http://www.example.com/a/../a/
http://www.example.com/../
http://www.example.com/a/
http://www.example.com/a/
http://www.example.com/a/
http://www.example.com/../a/./../a/

Sometimes the path is cleaned, sometimes it is not. When it is cleaned, the cleaning process is not perfect.
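
To make the "should reduce down to" comparison concrete, here is a minimal sketch of normalizing only the path component of a URL with posixpath.normpath. It is not part of urlparse: the name clean_url_path is hypothetical, the code uses the Python 3 module name urllib.parse rather than the urlparse module shown above, and keeping a trailing slash is an assumption about what a cleaner might want.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def clean_url_path(url):
    """Hypothetical helper: normalize only the path component of a URL."""
    parts = urlsplit(url)
    path = parts.path
    if path:
        # normpath drops a trailing slash, so remember whether there was one.
        keep_slash = path.endswith('/')
        path = posixpath.normpath(path)
        # normpath preserves exactly two leading slashes (a POSIX rule);
        # collapse them so '//' reduces to '/' like '///' does.
        if path.startswith('//'):
            path = '/' + path.lstrip('/')
        # Assumption: a trailing slash is significant and should be kept.
        if keep_slash and not path.endswith('/'):
            path += '/'
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


if __name__ == '__main__':
    for url in ('http://www.example.com///',
                'http://www.example.com/a/../a/',
                'http://www.example.com/../a/./../a/'):
        print(clean_url_path(url))
```

With this sketch every input in the loop reduces to a single consistent form, and an empty path is left untouched, so whether http://www.example.com should gain or lose a trailing slash stays a separate decision.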

The bit of code that is causing problems is commented:

# XXX The stuff below is bogus in various ways...

If I may be so bold, I would like to see this URL cleaning code stripped from urljoin.

A new method/function could be added that cleans a URL. It could have a 'mimic browser' option, because a browser *will* follow URLs like: http://example.com/../../../ (see this non-bug http://bugs.python.org/issue2583 )

The URL cleaner could use some of the code from "posixpath". Shorter URLs would be preferred over longer (e.g: http://example.com preferred to http://example.com/ )

Thanks,

monk.e.boy

----------
messages: 75154
nosy: monk.e.boy
severity: normal
status: open
title: urlparse normalize URL path
versions: Python 2.6

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue4191>
_______________________________________
_______________________________________________
Python-bugs-list mailing list