Hi,
I am trying to extract the ISO code and country name from a three-column
table (taken from en.wikipedia.org) and have noticed a problem with
accented characters such as Ô.
Below is my script and a sample of the data I am using. When I run
the script, the line beginning CI for Côte d'Ivoire returns the string
"CI\tC", whereas I had hoped for "CI\tCôte d'Ivoire".
Does anyone know why \w+ does not match all of Côte d'Ivoire, and how I
can get around it in future?
TIA,
Dp.
==== extract.pl ========
#!/usr/bin/perl
use strict;
use warnings;
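my $file = 'iso-alpha2.txt';
open(my $fh, '<', $file) or die "Can't open $file: $!\n";
while (<$fh>) {
    chomp;
    # Skip lines that do not start with a two-character code.
    next if ($_ !~ /^\w{2}\s+/);
    # Capture the code, then a name of up to four words.
    my ($code,$name) = ($_ =~
        /^(\w{2})\s+(\w+\s\w+\s\w+\s\w+|\w+\s\w+\s\w+|\w+\s\w+|\w+)/);
    print "$code\t$name\n";
}
close($fh);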
===============
======== sample data ========
...snip
BY Belarus Previously named "Byelorussian S.S.R."
BZ Belize
CA Canada
CC Cocos (Keeling) Islands
CD Congo, the Democratic Republic of the Previously named "Zaire"
ZR
CF Central African Republic
CG Congo
CH Switzerland Code taken from "Confoederatio Helvetica", its
official Latin name
CI Côte d'Ivoire
CK Cook Islands
CL Chile
CM Cameroon
===========
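
One possible explanation, sketched here under assumptions rather than confirmed: if iso-alpha2.txt is saved as UTF-8 (an assumption; it could equally be ISO-8859-1), the script reads it as raw bytes, so "ô" arrives as the two bytes 0xC3 0xB4, which \w does not treat as word characters, and the match stops after "C". Decoding the input into characters first lets \w match accented letters. The apostrophe in "d'Ivoire" is not matched by \w either, so the sketch below also allows ' inside a word. The helper name extract_code_and_name is hypothetical and only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

binmode STDOUT, ':encoding(UTF-8)';   # encode characters again on output

# Hypothetical helper: takes one line and returns (code, name), or an
# empty list when the line does not start with a two-character code.
sub extract_code_and_name {
    my ($line) = @_;
    # A name is one to four words; a word may contain word characters
    # (including accented letters once the line is decoded) and apostrophes.
    return $line =~ /^(\w{2})\s+([\w']+(?:\s[\w']+){0,3})/;
}

# The same line as raw UTF-8 bytes (what a plain open() hands back) ...
my $raw = "CI C\xC3\xB4te d'Ivoire";
# ... and decoded into characters (ASSUMPTION: the file is UTF-8).
my $decoded = decode('UTF-8', $raw);

for my $line ($raw, $decoded) {
    my ($code, $name) = extract_code_and_name($line) or next;
    print "$code\t$name\n";
}
```

In the script itself, the decoding could be done once at open time instead of per line, for example with open(my $fh, '<:encoding(UTF-8)', $file), again assuming the file really is UTF-8. Names containing other punctuation, such as "Cocos (Keeling) Islands" or "Congo, the Democratic Republic of the", would still be cut short by this pattern, just as with the original one.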