[issue19451] urlparse accepts invalid hostnames

2013-10-30 Thread Daniele Sluijters


New submission from Daniele Sluijters:

Python 2's urlparse.urlparse() and Python 3's urllib.parse.urlparse() accept 
URIs/URLs with underscores in the host/domain/subdomain. I believe this 
behaviour to be incorrect.
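
A minimal illustration of the behaviour, shown for Python 3 only. The host
name my_host.example.com is made up for the example:

```python
from urllib.parse import urlparse

# The underscore in the host part is accepted without complaint.
result = urlparse("http://my_host.example.com/index.html")
print(result.hostname)  # my_host.example.com
```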

A distinction needs to be made between DNS names and Uniform Resource Locators 
and Identifiers. urlparse is supposed to deal with the latter (correct me if 
I'm wrong).

According to RFC 2181 section 11, on the syntax of DNS names, the underscore 
is allowed, and it is in use around the internet, especially in TXT and SRV 
records.

However, RFC 1738 on Uniform Resource Locators, section 3.1 (and its updates), 
always defines the 'hostname' part of the URL as follows:
Such a name consists of a sequence of domain labels separated by ".",
each domain label starting and ending with an alphanumeric character
and possibly also containing "-" characters.

On top of that, RFC 2396 on URIs, section 3.2.2, states:
Hostnames take the form described in Section 3 of [RFC1034] and
Section 2.1 of [RFC1123]: a sequence of domain labels separated by
".", each domain label starting and ending with an alphanumeric
character and possibly also containing "-" characters.  

The underscore is never mentioned as a valid character, neither there nor in 
any of the references in the RFCs, as far as I've been able to see.
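
The label rule quoted above can be sketched as a check on a host name. This is
only an illustration under my own assumptions: LABEL_RE and
looks_like_rfc_hostname are hypothetical names, not part of urlparse, and the
sketch covers just the quoted sentence (no length limits, no IP literals, no
trailing dot):

```python
import re

# One domain label as described in the quoted RFC text: it starts and ends
# with an alphanumeric character and may contain "-" characters in between.
LABEL_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")

def looks_like_rfc_hostname(host):
    """Return True if every dot-separated label matches LABEL_RE."""
    return all(LABEL_RE.fullmatch(label) for label in host.split("."))

print(looks_like_rfc_hostname("en.wikipedia.org"))     # True
print(looks_like_rfc_hostname("my_host.example.com"))  # False
print(looks_like_rfc_hostname("-bad-.example.com"))    # False
```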

Language implementations vary:
 * Ruby's URI.parse does not allow underscores in domain labels.
 * Perl's URI and URI::URL allow underscores.
 * java.net.URI treats the underscore as an illegal character in the domain 
part.
 * org.apache.http.HttpHost, since 4.2.3, treats the underscore as an illegal 
character in the domain part.

HTTP daemons:
 * Apache: seems to tolerate underscores, but there has been a whole 
discussion about this on the mailing lists.
 * nginx: matches a server_name of '_' to 'any invalid domain name'. It seems 
to accept server_names with underscores in them, but I don't currently know 
how it behaves with them.

Browsers:
 * IE (since IE 5.5) cannot write cookies if the host or subdomain part 
includes an underscore.
 * Just about every other browser is fine with it.

Please note that I'm only talking about the host/domain/subdomain part of URIs 
and URLs. Something like http://en.wikipedia.org/wiki/12-hour_clock is 
perfectly valid and should parse.
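
To make that concrete (Python 3 shown): urlparse already separates the host
from the path, so any stricter check would only need to look at the hostname
attribute and could leave the path untouched:

```python
from urllib.parse import urlparse

parts = urlparse("http://en.wikipedia.org/wiki/12-hour_clock")
print(parts.hostname)  # en.wikipedia.org
print(parts.path)      # /wiki/12-hour_clock

# The underscore lives in the path only, not in the host part.
print("_" in parts.hostname)  # False
print("_" in parts.path)      # True
```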

--
components: Library (Lib)
messages: 201730
nosy: daenney, orsenthil
priority: normal
severity: normal
status: open
title: urlparse accepts invalid hostnames
type: behavior
versions: Python 2.6, Python 2.7, Python 3.1, Python 3.2, Python 3.3, Python 3.4, Python 3.5

___
Python tracker 
<http://bugs.python.org/issue19451>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue19451] urlparse accepts invalid hostnames

2013-10-30 Thread Daniele Sluijters


Daniele Sluijters added the comment:

The link you mention only deals with the DNS side of things. This issue is 
specifically not about that; it's about the URI/URL side of things, which is a 
very important distinction in this case.

I'm also not entirely sure I agree with the sentiment of "it's a mess anyway, 
so let's ignore the RFC". There's an RFC for a reason, and if more 
implementations started to behave accordingly the mess would clear itself up 
instead of becoming even more of a nightmare.

I can agree with the practical over strict approach though.

--

