Re: OT: Perl & UTF-8

dman Sun, 06 Jan 2002 14:21:21 -0600

On Sun, Jan 06, 2002 at 08:06:38PM +0100, Holger Rauch wrote:
| Hi!
| 
| I want to substitute element content by entity references in UTF-8 encoded
| XML files using Perl. My script currently only works with ISO 8859
| encodings. Is there a module that can be used in Perl scripts that
| correctly reads and writes files according to specified encoding? If so,
| what's the name of it and where can I obtain it from?
| 
| Additional info that might be helpful:
| 
| I'm not using a DOM module to retrieve an element's contents, just
| ordinary regexps.


So the regexps you're using are in an 8859-n source file, right?  Can
perl handle UTF-8 source files?  Are you trying to use things like the
posix character class [:alpha:]?  I don't think those will handle all
alphabetic characters in all unicode supported languages (probably
just ascii/english alphabet).

I don't know much about perl, but with cpython you have to decode the
string you read in to get the unicode string in memory, and also
specify your source code string literals as unicode strings.  (cpython
doesn't yet support non-ascii source encodings, but jython does)
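
To make the decode step concrete, here is a minimal sketch in Python.
It assumes a current Python 3, where string literals are already
unicode, so only the explicit decode of the input bytes is needed.  The
function name escape_non_ascii and the sample element are made up for
illustration, and numeric character references stand in for whatever
entity names your DTD actually defines.

```python
import re


def escape_non_ascii(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Decode raw bytes, then replace each non-ASCII character with a
    numeric character reference (hypothetical helper, not a library API)."""
    # Decode first, so the regexp sees characters rather than raw bytes.
    text = raw.decode(encoding)
    # Replace every character outside the ASCII range with &#NNN;.
    escaped = re.sub(
        r"[^\x00-\x7f]",
        lambda m: "&#%d;" % ord(m.group()),
        text,
    )
    # After the substitution the text is pure ASCII.
    return escaped.encode("ascii")


# Illustrative input: "J\u00f6rg" is "Jorg" with an o-umlaut (U+00F6).
sample = "<name>J\u00f6rg</name>".encode("utf-8")
print(escape_non_ascii(sample))  # b'<name>J&#246;rg</name>'
```

The point is the order of operations: decode to a unicode string, run
the regexp on characters, and encode again only when writing out.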

HTH,
-D

-- 

He who belongs to God hears what God says.  The reason you do not hear
is that you do not belong to God.
        John 8:47
