Regex problem with accented characters

Beginner Tue, 27 Mar 2007 00:35:15 -0800

Hi,

I am trying to extract the ISO code and country name from a three-column
table (taken from en.wikipedia.org) and have noticed a problem with
accented characters such as Ô.


Below are my script and a sample of the data I am using. When I run
the script, the line beginning CI (Côte d'Ivoire) produces the string

"CI\tC", whereas I had hoped for "CI\tCôte d'Ivoire".

Does anyone know why \w+ does not match the whole of Côte d'Ivoire, and
how I can get around it in future?

TIA,
Dp.


==== extract.pl ========
#!/usr/bin/perl

use strict;
use warnings;

my $file = 'iso-alpha2.txt';

open(FH,$file) or die "Can't open $file: $!\n";
while (<FH>) {
        chomp;
        next if ($_ !~ /^\w{2}\s+/);
        my ($code,$name) = ($_ =~
                /^(\w{2})\s+(\w+\s\w+\s\w+\s\w+|\w+\s\w+\s\w+|\w+\s\w+|\w+)/);
        print "$code\t$name\n";
}
===============

======== sample data ========
...snip
BY      Belarus         Previously named "Byelorussian S.S.R."
BZ      Belize
CA      Canada
CC      Cocos (Keeling) Islands
CD      Congo, the Democratic Republic of the   Previously named "Zaire" ZR
CF      Central African Republic
CG      Congo
CH      Switzerland     Code taken from "Confoederatio Helvetica", its official Latin name
CI      Côte d'Ivoire
CK      Cook Islands
CL      Chile
CM      Cameroon
===========
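
A minimal sketch of one likely cause, offered as an assumption rather than a confirmed diagnosis: if iso-alpha2.txt is UTF-8 encoded and is read without decoding, the ô arrives as two raw bytes that \w does not treat as word characters, so the match stops after "C". Separately, \w never matches the apostrophe in "d'Ivoire". The helper name extract, the widened character class, and the hard-coded sample line are all hypothetical and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not in the original script): pull out the code and name.
# The character class also admits the apostrophe, comma and parentheses,
# because \w alone never matches the ' in "d'Ivoire".
sub extract {
    my ($line) = @_;
    return $line =~ /^(\w{2})\s+([\w'(),]+(?: [\w'(),]+)*)/;
}

# The line as raw UTF-8 bytes, which is what an undecoded read would return
# (assumption: the data file is UTF-8).
my $bytes = "CI      C\xC3\xB4te d'Ivoire";

# The same line decoded into Perl characters.
my $chars = decode('UTF-8', $bytes);

my (undef, $name_from_bytes) = extract($bytes);
my (undef, $name_from_chars) = extract($chars);

# Encode the output so the accented character prints correctly.
binmode(STDOUT, ':encoding(UTF-8)');
print "undecoded: $name_from_bytes\n";
print "decoded:   $name_from_chars\n";
```

If that assumption holds, opening the file with a decoding layer, for example open(my $fh, '<:encoding(UTF-8)', $file), should give the same effect for every line read. A file in another encoding such as ISO-8859-1 would need that encoding named instead.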
