New submission from Daniele Sluijters:
Python 2's urlparse.urlparse() and Python 3's urllib.parse.urlparse() accept
URIs/URLs with underscores in the host/domain/subdomain. I believe this
behaviour to be incorrect.
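A minimal reproduction on Python 3 (the URL and the host name my_host.example.com are made up for illustration):

```python
from urllib.parse import urlparse

# Hypothetical URL with an underscore in the first domain label.
parts = urlparse("http://my_host.example.com/index.html")

# urlparse() raises no error and reports the host unchanged.
print(parts.hostname)
```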
A distinction needs to be made between DNS names and Uniform Resource Locators
and Identifiers. urlparse is supposed to deal with the latter (correct me if
I'm wrong).
According to RFC 2181, section 11, on the syntax of DNS names, the use of the
underscore is allowed, and it is in use around the internet, especially in TXT
and SRV records.
However, RFC 1738 on Uniform Resource Locators, section 3.1 (and its updates),
always defines the 'hostname' part of the URL as:
    Such a name consists of a sequence of domain labels separated by ".",
    each domain label starting and ending with an alphanumeric character
    and possibly also containing "-" characters.
On top of that, RFC 2396 on URIs, section 3.2.2, states:
    Hostnames take the form described in Section 3 of [RFC1034] and
    Section 2.1 of [RFC1123]: a sequence of domain labels separated by
    ".", each domain label starting and ending with an alphanumeric
    character and possibly also containing "-" characters.
The underscore is never mentioned as a valid character, neither in these
sections nor in any of the references in the RFCs, as far as I've been able to
see.
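To make the rule quoted above concrete, here is a rough sketch of a check on top of urlparse(). This is not a proposed patch: the helper name has_rfc_hostname is hypothetical, it only covers ASCII host names as described in the quoted text, and it does not treat IP literals or label length limits specially.

```python
import re
from urllib.parse import urlparse

# One domain label as described in the quoted RFC text: it starts and ends
# with an ASCII alphanumeric character and may contain "-" in between.
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def has_rfc_hostname(url):
    """Hypothetical helper: True if the URL's host is a sequence of
    valid domain labels separated by ".". Only the host is inspected."""
    host = urlparse(url).hostname
    if not host:
        # No network location at all, e.g. a relative reference.
        return False
    return all(_LABEL.fullmatch(label) for label in host.split("."))
```

With this sketch, has_rfc_hostname("http://my_host.example.com/") is False because of the underscore in the first label, while a URL whose underscore appears only in the path still passes, since the path is never examined.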
Language implementations vary:
* Ruby URI.parse does not allow for underscores in domain labels.
* Perl URI and URI::URL allow for underscores.
* java.net.URI treats the underscore as an illegal character in the domain
part.
* org.apache.http.HttpHost since 4.2.3 treats the underscore as an illegal
character in the domain part.
HTTP servers:
* Apache: Seems to tolerate underscores but there's been a whole discussion
about this on the mailing lists.
* nginx: Matches a server_name of '_' to 'any invalid domain name'. It seems
to accept server_names with underscores in them but the behaviour is currently
unknown to me.
Browsers:
* IE has been unable to write cookies since IE 5.5 if the host or subdomain
part includes an underscore.
* Just about every other browser is fine with it.
Please note that I'm only talking about the host/domain/subdomain part of URIs
and URLs. Something like http://en.wikipedia.org/wiki/12-hour_clock is
perfectly valid and should parse.
--
components: Library (Lib)
messages: 201730
nosy: daenney, orsenthil
priority: normal
severity: normal
status: open
title: urlparse accepts invalid hostnames
type: behavior
versions: Python 2.6, Python 2.7, Python 3.1, Python 3.2, Python 3.3, Python
3.4, Python 3.5
___
Python tracker
<http://bugs.python.org/issue19451>
___