Req #52923 [Opn]: parse_url corrupts some UTF-8 strings

cataphract Sun, 26 Sep 2010 00:46:53 -0700

Edit report at http://bugs.php.net/bug.php?id=52923&edit=1


 ID:                 52923
 Updated by:         cataphr...@php.net
 Reported by:        masteram at gmail dot com
 Summary:            parse_url corrupts some UTF-8 strings
 Status:             Open
 Type:               Feature/Change Request
 Package:            *URL Functions
 Operating System:   MS Windows XP
 PHP Version:        5.3.3
 Block user comment: N

 New Comment:

The problem is that nothing guarantees that a percent-encoded URL should
be interpreted as containing UTF-8 data, or that an (invalid) URL
containing non-encoded characters outside the set RFC 3986 allows (for
example, raw non-ASCII characters) should be converted to UTF-8 before
being percent-encoded.
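
To make the ambiguity concrete, here is a minimal sketch (in Python rather than PHP, purely for illustration; the sample octets are an assumption and are not taken from the report). The same percent-encoded octets decode to different text depending on which charset the reader assumes:

```python
from urllib.parse import unquote_to_bytes

# Hypothetical percent-encoded path segment: four octets, no charset label.
encoded = "%D7%A4%D7%A8"

# Percent-decoding only yields raw bytes; it says nothing about their meaning.
raw = unquote_to_bytes(encoded)

# Interpreted as UTF-8, the bytes are two Hebrew letters (U+05E4, U+05E8).
as_utf8 = raw.decode("utf-8")

# Interpreted as ISO-8859-1, the very same bytes are four Latin-1 characters.
as_latin1 = raw.decode("latin-1")

print(len(raw), repr(as_utf8), repr(as_latin1))
```

Both decodings succeed, so a parser cannot tell from the URL alone which one the author intended.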

In fact, while most browsers use UTF-8 to build URLs entered in the
address bar, for links (anchors) in HTML pages they prefer the encoding
of the page itself, provided it is also an ASCII superset.

That said, the corruption you describe seems uncalled for. In fact, I am
unable to reproduce it. This is the value of $url I get in the end:

string(32) "/he/פרויקטים/ByYear.html"


Previous Comments:
------------------------------------------------------------------------
[2010-09-25 16:22:19] masteram at gmail dot com

I tend to agree with Pajoye.

Although RFC 3986 calls for the use of percent-encoding for URLs, I
believe that it also mentions the IDN format (and the way things look
today, there is a host of websites that use UTF-8 encoding, which
benefits the readability of internationalized URLs).

I admit not being an expert in URL encoding, but it seems to me that
corrupting a string, even if it does not meet the current standards, is
a bad habit.

In addition, UTF-8 encoded URLs seem to be quite common in reality. Take
the international versions of Wikipedia as an example.

If I'm wrong about that, I would be more than happy to know it.

I am not sure that the encode-analyze-merge-decode procedure is really
the best choice. Perhaps the streamlined alternative should be
considered. It sure wouldn't hurt.

I, for one, am currently using 'ASCII-only' URLs.

------------------------------------------------------------------------
[2010-09-25 14:34:34] paj...@php.net

It is not a bogus request. The idea would also be to get the decoded (to
UTF-8) URL elements as the result. It is also a good complement to IDN
support.

------------------------------------------------------------------------
[2010-09-25 14:19:40] cataphr...@php.net

I'd say this request/bug is bogus because such a URL is not valid
according to RFC 3986. He should first percent-encode all the characters
that are not allowed in a URL (perhaps after doing some Unicode
normalization) and only then parse the URL.
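
A rough sketch of that encode-first approach, written in Python for illustration only (it is not PHP's parse_url). The helper name `encode_then_parse`, the `example.org` URL, and the choice of UTF-8 as the default charset are all assumptions made for the example:

```python
from urllib.parse import quote, urlsplit

# URL delimiters plus "%" so that existing percent-escapes are left untouched.
# Unreserved characters (letters, digits, "-", ".", "_", "~") are never
# escaped by quote(), so they need not be listed here.
URL_DELIMITERS = ":/?#[]@!$&'()*+,;=%"

def encode_then_parse(url, encoding="utf-8"):
    """Percent-encode disallowed characters, then split the ASCII-only URL.

    The charset is a caller-supplied assumption; the URL itself does not
    say which encoding its raw characters should be converted to.
    """
    ascii_url = quote(url, safe=URL_DELIMITERS, encoding=encoding)
    return urlsplit(ascii_url)

# Hypothetical input with two raw Hebrew letters in the path.
parts = encode_then_parse("http://example.org/he/\u05e4\u05e8/ByYear.html")
print(parts.path)
```

This sketch only addresses the path and query. A non-ASCII host name would need IDN (Punycode) conversion instead of percent-encoding, and no Unicode normalization is attempted here.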

------------------------------------------------------------------------
[2010-09-25 12:15:15] paj...@php.net

What about a parse_url_utf8, like what we have for IDN? It could be
easy to implement using either native OS APIs (when available) or
external libraries (there are a couple of good ones out there).

------------------------------------------------------------------------
[2010-09-25 11:42:29] ras...@php.net

Reclassifying as a feature request.  parse_url has never been
multibyte-aware.

------------------------------------------------------------------------


The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at

    http://bugs.php.net/bug.php?id=52923


