Hi Group,
I am working with Unicode (UTF-8 encoded) text and am facing a problem with a
regular expression:
s/(\p{HinNumerals})\s+($tokenize_string)+\s+(\p{HinNumerals})/$1$2$3/g;
and my HinNumerals is defined as:
sub HinNumerals {
return <<END;
0966\t096F
END
}
$tokenize_string is a set of punctuation marks joined into one string with
'|' between them for alternation (OR). Here it is:
$tokenize_string =
"\x{0021}|\x{0022}|\x{0023}|\x{0025}|\x{0026}|\x{0027}|\x{0028}|\x{0029}|\x{002A}|\x{002B}|\x{002C}|\x{002D}|\x{002E}|\x{002F}|\x{003A}|\x{003B}|\x{003C}|\x{003D}|\x{003E}|\x{003F}|\x{0040}|\x{005B}|\x{005C}|\x{005D}|\x{005E}|\x{005F}|\x{007B}|\x{007C}|\x{007D}|\x{007E}|\x{0964}|\x{0965}";
Initially $_ contains
इस बिन्दु पर लाभ बहुत ही कम , रु .२ , ००० – ४ , ००० रु प्रति कार हैं ।
and with the substitution I am trying to remove the blank spaces between the
Hindi numerals and the number separators. That is, I want the substring
२ , ००० – ४ , ००० to become २,०००–४,०००. HinNumerals defines the range of the
Hindi numerals, and the regular expression looks for punctuation marks
sandwiched (with spaces on both sides) between Hindi numerals and removes the
spaces. The 'g' modifier should ensure that it is applied in as many places as
possible in $_.
But the result is odd. I get
बिन्दु पर लाभ बहुत ही कम , रु .२,०००–४ , ००० रु प्रति कार हैं ।
The regexp applies correctly to the first two instances, namely २ , ० and
० – ४, but does not work for the last instance, ४ , ०. I am puzzled as to why
this happens.
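For reference, here is a minimal self-contained sketch that should reproduce
the behaviour on just the numeric substring. It makes some assumptions that are
not in my real script: the character class [\x{0966}-\x{096F}] stands in for
\p{HinNumerals}, $tokenize_string is cut down to the two separators that occur
in the sample, and the dash in the input is taken to be the ASCII hyphen
U+002D (the only dash in my list).

```perl
use strict;
use warnings;
use utf8;                                # this file contains literal Devanagari
binmode(STDOUT, ':encoding(UTF-8)');

# Stand-in (assumption) for \p{HinNumerals}: Devanagari digits U+0966..U+096F.
my $digit = qr/[\x{0966}-\x{096F}]/;

# Shortened stand-in (assumption) for the full $tokenize_string:
# only the comma and the hyphen, joined with '|'.
my $tokenize_string = "\x{002C}|\x{002D}";

# Just the numeric part of the sentence; the dash is assumed to be U+002D.
my $text = "२ , ००० - ४ , ०००";

# Same substitution as in the post, applied to a copy of $text.
(my $result = $text) =~ s/($digit)\s+($tokenize_string)+\s+($digit)/$1$2$3/g;

print "before: $text\n";
print "after:  $result\n";
```

With these assumptions the "after" line still has the spaces around the second
comma, which is the same pattern I see with the full sentence.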
Any clue or solution would be appreciated.
Baskaran