On Sun, Jan 23, 2005 at 08:52:04PM +0000, Ed Avis <[EMAIL PROTECTED]> wrote:
> On Tue, 11 Jan 2005, Marc wrote:
>
> [UTF-8 problems]
>
> >I have looked further into this. It seems that tv_grab_de_tvtoday
> >already spits out double-encoded data in some cases, and every
> >filter (such as tv_remove_* or tv_imdb) just makes the problem worse.
>
> If you find an example where tv_grab_de_tvtoday puts out
> double-encoded data, please install the DB_File perl module and then
> you can say

I will try to, but it happens with tv_cat and tv_imdb, too, so it's
unlikely to be a tv_grab_de_tvtoday problem. I'll try to come up with a
simple example or a dump (hopefully soon, but I am quite busy right now).

> >does its own encoding and then lets perl encode again, or sth.
> >similar.
>
> Until now xmltv has essentially ignored the question of encoding. I
> need to sit down and read

If all you want is to read and write UTF-8 encoded XML, then just do:

   binmode FILEHANDLE, ":utf8";

before reading and writing data. That way the data is read and written
correctly. If you use XML::Parser etc. it shouldn't matter in any case
(XML::Parser correctly returns text, not some encoding).

> <http://www.ahinea.com/en/tech/perl-unicode-struggle.html>, and

I haven't read that, but the most basic thing you should internalise is
that perl handles two things in string scalars:

- TEXT
- BINARY OCTETS

I am shouting because this is important. For the perl user, it doesn't
matter, and it isn't visible, whether perl stores text as UTF-8 internally
or in another form. If you depend on knowing, you have a bug.

Instead, you only have two forms: octets (bytes), which might make up some
UTF-8 encoded text or a JPEG file, and text, which consists of characters
(with no notion of any encoding, just abstract characters).

Problems arise when XS perl modules incorrectly return text data as
octets, as old versions of XML::Parser did. As long as these modules
behave, you shouldn't care; just treat text as text, as before.

It only matters when you do I/O (to a file, to a terminal etc.). In that
case, use binmode (man perlio) and perl will handle these cases correctly:
as long as you treat text as text, perl will output the correct bytes, no
matter whether it is stored in latin1 or UTF-8 internally.
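
A minimal sketch of the text/octets distinction and of how double encoding
comes about, assuming perl 5.8 or later with the core Encode module. The
sample string and all variable names are made up for illustration and are
not taken from xmltv:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as they would arrive from a file or socket: "Mueller" spelled
# with u-umlaut, encoded as UTF-8 (7 bytes).
my $octets = "M\xc3\xbcller";

# Decode once at the input boundary to get text (abstract characters).
my $text = decode('UTF-8', $octets);

# Encode once at the output boundary to get octets again.
my $out = encode('UTF-8', $text);

# The bug discussed in this thread: encoding octets that are already
# encoded. Each byte above 0x7f is treated as a character and expanded.
my $double = encode('UTF-8', $out);

# The binmode route: let the I/O layer do the encoding. An in-memory
# filehandle stands in for a real file here.
my $buf = '';
open my $fh, '>', \$buf or die "cannot open in-memory file: $!";
binmode $fh, ':utf8';
print {$fh} $text;
close $fh;

printf "text:   %d characters\n", length $text;
printf "octets: %d bytes\n",      length $out;
printf "double: %d bytes\n",      length $double;
printf "layer:  %d bytes\n",      length $buf;
```

The text has 6 characters, the correctly encoded form has 7 bytes, and the
double-encoded form has 9. Printing the text through the ":utf8" layer
gives the same 7 bytes as the explicit encode, which is why no manual
encoding step should be needed when binmode is used.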

-- 
                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __      [EMAIL PROTECTED]
      --==---/ / _ \/ // /\ \/ /      http://schmorp.de/
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE