Bug#289611: [Xmltv-devel] Re: Bug#289611: xmltv: tv_imdb garbles characters

pcg Sun, 23 Jan 2005 21:53:02 -0800

On Sun, Jan 23, 2005 at 08:52:04PM +0000, Ed Avis <[EMAIL PROTECTED]> wrote:
> On Tue, 11 Jan 2005, Marc wrote:
> 
> [UTF-8 problems]
> 
> >I have looked further into this. It seems that tv_grab_de_tvtoday
> >already spits out double-encoded data in some cases, and every
> >filter (such as tv_remove_* or tv_imdb) just makes the problem worse.
> 
> If you find an example where tv_grab_de_tvtoday puts out
> double-encoded data, please install the DB_File perl module and then
> you can say


I will try to, but it happens with tv_cat and tv_imdb, too, so it's unlikely
to be a tv_grab_de_tvtoday problem. I'll try to come up with a simple
example or a dump (hopefully soon, but I am quite busy right now).

> >does its own encoding and then lets perl encode again, or something
> >similar.
> 
> Until now xmltv has essentially ignored the question of encoding.  I
> need to sit down and read

If all you want is to read and write UTF-8 encoded XML, then just do:

   binmode FILEHANDLE, ":utf8";

before reading or writing data. Perl will then decode the data correctly on
input and encode it on output. If you use XML::Parser etc. it shouldn't
matter in any case (XML::Parser correctly returns text, not octets in some
encoding).
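
As a rough sketch of what that looks like in practice (this is not taken
from any xmltv code; the sample string and the temporary file are made up
for illustration), a round trip through a file could go like this:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample text; "\x{fc}" is u-umlaut as an abstract character.
my $title = "M\x{fc}nchen";

# Write: the :utf8 layer turns characters into UTF-8 octets on output.
my ($out, $path) = tempfile();
binmode $out, ":utf8";
print {$out} $title, "\n";
close($out) or die "close failed: $!";

# Read: the same layer turns UTF-8 octets back into characters on input.
open(my $in, "<", $path) or die "open failed: $!";
binmode $in, ":utf8";
my $line = <$in>;
close($in);
chomp $line;
unlink $path;

# $line now holds the same 7 characters that were written.
printf "read %d characters\n", length $line;
```

The program never touches the encoding itself; it only tells perl, once per
filehandle, what the outside world expects.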

> <http://www.ahinea.com/en/tech/perl-unicode-struggle.html>, and

Haven't read that, but the most basic thing you should internalise is that
perl handles two things in string scalars:

 - TEXT
 - BINARY OCTETS

I am shouting because this is important. For the perl user, it doesn't
matter and it isn't visible whether perl stores text as utf-8 internally or
in another form. If you depend on knowing, you have a bug.

Instead, you only have two forms: octets (bytes) that might comprise some
utf-8 text, or a jpeg file, and text, which consists of characters (without
the notion of any encoding, just abstract characters).
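
To make the distinction concrete, here is a small sketch using the Encode
module that ships with perl. The sample string is made up; the point is only
that the same word is one thing as text and another thing as octets:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Text: abstract characters, no encoding involved.
my $text = "M\x{fc}nchen";

# Octets: the UTF-8 byte sequence that represents that text.
my $octets = encode("UTF-8", $text);

# Going back from octets to text is an explicit decoding step.
my $back = decode("UTF-8", $octets);

printf "text:   %d characters\n", length $text;    # 7
printf "octets: %d bytes\n",      length $octets;  # 8, u-umlaut takes two bytes
printf "same text after decoding: %s\n", ($back eq $text ? "yes" : "no");
```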

Problems arise when xs perl modules incorrectly return text data as
octets, as old versions of XML::Parser did. As long as these modules
behave, you shouldn't care, just treat text as text as before.
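
I have not tracked down where it happens in xmltv, so take this only as a
sketch of the mechanism: if something hands back UTF-8 octets where text is
expected, and those octets are then encoded again on output, you get
double-encoded data. The sample string is again made up:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "M\x{fc}nchen";

# What a misbehaving module might return: octets instead of text.
my $octets = encode("UTF-8", $text);

# Treating those octets as text and encoding again: each byte is taken
# for a character of its own, so u-umlaut grows from 2 bytes to 4.
my $garbled = encode("UTF-8", $octets);

# The repair is to decode the octets once, so that you hold text again.
my $fixed = decode("UTF-8", $octets);

printf "encoded once:  %d bytes\n", length $octets;   # 8
printf "encoded twice: %d bytes\n", length $garbled;  # 10
printf "decoded back to the original text: %s\n",
    ($fixed eq $text ? "yes" : "no");
```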

It only matters when you do I/O (to file, to terminal etc). In that case,
use binmode (man perlio) and perl will handle cases correctly: as long
as you treat text as text, perl will output the correct bytes, no matter
whether it is stored in latin1 or utf-8 internally.
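
A small sketch to illustrate that last point. utf8::upgrade forces perl to
switch the internal storage of a string without changing its value; the
helper written_bytes is a made-up name that prints to an in-memory
filehandle through the :utf8 layer and returns what was written:

```perl
use strict;
use warnings;

# Hypothetical helper: return the bytes perl emits for $str on a
# filehandle that has the :utf8 layer.
sub written_bytes {
    my ($str) = @_;
    my $buf = '';
    open(my $fh, ">:utf8", \$buf) or die "open failed: $!";
    print {$fh} $str;
    close($fh);
    return $buf;
}

my $plain    = "M\x{fc}nchen";
my $upgraded = $plain;
utf8::upgrade($upgraded);   # different internal storage, same text

printf "strings compare equal: %s\n",
    ($plain eq $upgraded ? "yes" : "no");
printf "same bytes on output:  %s\n",
    (written_bytes($plain) eq written_bytes($upgraded) ? "yes" : "no");
```

Both lines print "yes": the internal form is perl's business, and the layer
on the filehandle decides what bytes go out.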

-- 
                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __      [EMAIL PROTECTED]
      --==---/ / _ \/ // /\ \/ /      http://schmorp.de/
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE


