character sets in a regexp

Christer Palm Fri, 11 Jan 2013 14:02:23 -0800

Hi!

I have a Perl script that parses RSS streams from different news sources, and 
I am experiencing problems with national characters in a regexp used for 
matching a keyword list against the RSS data.


Everything works fine with a simple regexp for plain English, i.e. words 
containing only the letters A-Z, a-z and the digits 0-9.

if ( $description =~ m/\b$key/i ) {….}
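
A minimal sketch of that matching step, for illustration only. The keyword 
list, the sample description and the helper name matching_keywords are made 
up here, and \Q...\E is added on the assumption that keywords should be 
matched literally rather than as patterns:

```perl
use strict;
use warnings;

# Hypothetical sample data, plain ASCII only.
my @keywords    = ('election', 'storm');
my $description = 'Storm warning issued for the west coast';

# Return the keywords that begin a word somewhere in the text.
# \Q...\E quotes regexp metacharacters in the keyword (an assumption,
# the original one-liner interpolates $key directly).
sub matching_keywords {
    my ($text, @keys) = @_;
    return grep { $text =~ m/\b\Q$_\E/i } @keys;
}

my @hits = matching_keywords($description, @keywords);
print "matched: @hits\n";    # prints "matched: storm"
```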

Keywords or RSS data with national characters don’t work at all. I’m not really 
surprised; this was expected, as the character sets used in the different RSS 
streams are outside my control.

I have the "use utf8;" pragma enabled, but I’m not really sure if it is 
needed. I can’t see any difference whether it is used or not.

If I convert all the national characters used in the keyword list to HTML 
entities ("&aring" and so on), and use a character parser to change every 
occurrence of such characters in the RSS data (octal, or Unicode given as 
decimal or hex) to the same HTML entities, everything works fine, but it takes 
time that I want to avoid.

Do you have suggestions on this character issue? Is it possible to determine 
the character set of a text efficiently? Are there other ways to solve the 
problem?

/Christer