On Tue, Jan 23, 2007 at 05:17:14PM +0000, Martin Michlmayr <[EMAIL PROTECTED]> was heard to say:
> Package: urlscan
> Version: 0.5.5
> Severity: normal
>
> I don't know where this is going but... urlscan doesn't handle URLs
> that contain UTF-8 characters. I just received an email which
> contains the following URL which urlscan fails to parse correctly:
> http://www.pantherhouse.com/newshelton/my-wife-thinks-i’m-a-swan/
> When I paste it into Firefox directly, it successfully opens
> http://www.pantherhouse.com/newshelton/my-wife-thinks-i%E2%80%99m-a-swan/
>
> I'm not quite sure how to handle this since it essentially means that
> virtually any character can appear in an URL. Maybe you have a good
> idea.
Ick. Is that even a legal URL?

One idea would be to use an exclusive, rather than inclusive, regexp to
find URLs: consider a URL to be anything starting with "http" and
terminated by whitespace or a few special characters (say, ["',.?>]).
This seems more failure-prone, though, since I can't possibly predict
every convention people use to terminate URLs (e.g., what about » or ¿?
I'm sure there are more I don't know about).

Another option would be to add a command to "lengthen" a match, telling
urlscan to extend the currently selected match by the immediately
following character (or maybe by the next character plus everything
after it that looks like part of a URL). This might be the best
solution, since URLs like that one seem to be an oddity, and urlscan
will inevitably make some errors in other situations anyway.

Daniel
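P.S. To make the first idea concrete, here is a minimal sketch in
Python (the language urlscan is written in). The pattern and the set of
trailing characters to strip are my guesses for illustration, not the
package's actual code:

    import re

    # Treat a URL as "http(s)" up to the first character that is very
    # unlikely to belong to it (whitespace, quotes, angle brackets),
    # then strip trailing punctuation that usually ends a sentence.
    URL_RE = re.compile(r'https?://[^\s"\'<>]+')
    TRAILING = '.,?!)»'   # includes the » case mentioned above

    def extract_urls(text):
        return [m.group().rstrip(TRAILING) for m in URL_RE.finditer(text)]

    # The UTF-8 apostrophe survives, since only known terminators
    # are excluded from the match:
    print(extract_urls("see http://www.pantherhouse.com/newshelton/"
                       "my-wife-thinks-i’m-a-swan/ please"))

And the "lengthen" command might look something like this internally
(again just a sketch; the function name and exact behaviour are
hypothetical):

    def lengthen_match(line, start, end):
        # Grow the current selection by one character, then keep
        # consuming anything further that still looks URL-ish
        # (here crudely approximated as "not whitespace").
        if end < len(line):
            end += 1
            while end < len(line) and not line[end].isspace():
                end += 1
        return line[start:end]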