Hi!
I have a Perl script that parses RSS streams from different news sources, and I
am having problems with national characters in the regexp I use to match a
keyword list against the RSS data.
Everything works fine with a simple regexp for plain English, i.e. words
containing only the characters A-Z, a-z and 0-9:
if ( $description =~ m/\b$key/i ) {….}
Keywords or RSS data with national characters don't work at all. I'm not really
surprised; this was expected, since the character sets used in the different RSS
streams are outside my control.
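
To make the symptom concrete, here is a reduced sketch rather than my real
script. The helper name matches_keyword, the keywords and the sample text are
all made up for illustration, and I am assuming the feed text reaches the regexp
as undecoded UTF-8 bytes, which may not be true for every source.

```perl
use strict;
use warnings;
use utf8;    # the keyword literals below are written in UTF-8

# Hypothetical helper: returns 1 if $key starts a word in $description,
# using the same regexp as in my script.
sub matches_keyword {
    my ($description, $key) = @_;
    return $description =~ m/\b$key/i ? 1 : 0;
}

# Made-up sample data. The second description is built from raw bytes to
# mimic feed content that was never decoded (assumed here to be UTF-8).
my $plain    = "Breaking news from Stockholm";
my $national = "Fresh \xc3\xa5l on the menu";    # bytes for "Fresh ål on the menu"

print matches_keyword($plain,    'news') ? "match\n" : "no match\n";
print matches_keyword($national, 'ål')   ? "match\n" : "no match\n";
```

Under those assumptions the first line printed is "match" and the second is
"no match", which is the behaviour I see with my real keyword list.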
I have the "use utf8;" pragma enabled, but I'm not really sure whether it is
needed. I can't see any difference with or without it.
If I convert all the national characters in the keyword list to HTML entities
("&aring;" for å, and so on), and run the RSS data through a character parser
that changes every occurrence of octal and Unicode character codes (i.e. decimal
and hex) to the same HTML form, everything works fine, but it takes time that I
want to avoid.
Do you have any suggestions on this character issue? Is it possible to determine
the character set of a text efficiently? Are there other ways to solve the
problem?
/Christer