On Tue, Jan 23, 2007 at 05:17:14PM +0000, Martin Michlmayr <[EMAIL PROTECTED]> was heard to say:
> Package: urlscan
> Version: 0.5.5
> Severity: normal
>
> I don't know where this is going but... urlscan doesn't handle URLs
> that contain UTF-8 characters. I just received an email which
> contains the following URL which urlscan fails to parse correctly:
> http://www.pantherhouse.com/newshelton/my-wife-thinks-i’m-a-swan/
> When I paste it into Firefox directly, it successfully opens
> http://www.pantherhouse.com/newshelton/my-wife-thinks-i%E2%80%99m-a-swan/
>
> I'm not quite sure how to handle this since it essentially means that
> virtually any character can appear in an URL. Maybe you have a good
> idea.
Ick. Is that even a legal URL?

One idea would be to use an exclusive, rather than inclusive, regexp to
find URLs: consider a URL to be anything starting with "http" and
terminated by whitespace or a few special characters (say, ["',.?>]).
This seems more failure-prone, though, since I can't possibly predict
every convention people use to terminate URLs (e.g., what about » or ¿?
I'm sure there are more I don't know about).

Another option would be to add a command to "lengthen" a match, telling
urlscan to extend the currently selected match by the immediately
following character (or maybe by the next character plus everything
after it that looks like part of a URL). This might be the best
solution, since URLs like that one seem to be an oddity, and urlscan
will inevitably make some errors in other situations anyway.

Daniel
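P.S. To make the first idea concrete, here is a minimal sketch in
Python (the language urlscan is written in). The pattern and the set of
trailing characters to strip are my guesses for illustration, not the
package's actual code:

    import re

    # Treat a URL as "http(s)" up to the first character that is very
    # unlikely to belong to it (whitespace, quotes, angle brackets),
    # then strip trailing punctuation that usually ends a sentence.
    URL_RE = re.compile(r'https?://[^\s"\'<>]+')
    TRAILING = '.,?!)»'   # includes the » case mentioned above

    def extract_urls(text):
        return [m.group().rstrip(TRAILING) for m in URL_RE.finditer(text)]

    # The UTF-8 apostrophe survives, since only known terminators
    # are excluded from the match:
    print(extract_urls("see http://www.pantherhouse.com/newshelton/"
                       "my-wife-thinks-i’m-a-swan/ please"))

And the "lengthen" command might look something like this internally
(again just a sketch; the function name and exact behaviour are
hypothetical):

    def lengthen_match(line, start, end):
        # Grow the current selection by one character, then keep
        # consuming anything further that still looks URL-ish
        # (here crudely approximated as "not whitespace").
        if end < len(line):
            end += 1
            while end < len(line) and not line[end].isspace():
                end += 1
        return line[start:end]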