Regular expression - Global substitution

Baskaran Sankaran Mon, 13 Mar 2006 17:30:27 -0800

Hi Group,


I am working with Unicode (UTF-8 encoded) text and am having a problem with a 
regular expression:

 

      s/(\p{HinNumerals})\s+($tokenize_string)+\s+(\p{HinNumerals})/$1$2$3/g;

 

My HinNumerals property is defined as:

 

sub HinNumerals {
      return <<END;
0966\t096F
END
}
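
As a side note on the definition above: perlunicode describes user-defined properties as subroutines whose names begin with "In" or "Is". Whether a name such as HinNumerals is accepted has not been verified here. The following is only a sketch of the documented form, with `InHinNumerals` as a hypothetical rename that is not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical rename (an assumption, not the original name):
# perlunicode asks for an "In" or "Is" prefix on user-defined properties.
sub InHinNumerals {
    # One range per line: start and end code points in hex, tab-separated.
    return <<"END";
0966\t096F
END
}

my $digit_two = "\x{0968}";    # DEVANAGARI DIGIT TWO

print "Devanagari digit matches\n"   if $digit_two =~ /\p{InHinNumerals}/;
print "ASCII digit does not match\n" if "2" !~ /\p{InHinNumerals}/;
```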

 

$tokenize_string is a set of punctuation marks joined into a single string with 
‘|’ between them for alternation:

$tokenize_string = 
"\x{0021}|\x{0022}|\x{0023}|\x{0025}|\x{0026}|\x{0027}|\x{0028}|\x{0029}|\x{002A}|\x{002B}|\x{002C}|\x{002D}|\x{002E}|\x{002F}|\x{003A}|\x{003B}|\x{003C}|\x{003D}|\x{003E}|\x{003F}|\x{0040}|\x{005B}|\x{005C}|\x{005D}|\x{005E}|\x{005F}|\x{007B}|\x{007C}|\x{007D}|\x{007E}|\x{0964}|\x{0965}";
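
One thing to watch with a list like this: in a double-quoted string the \x{...} escapes become literal characters before the regex engine sees them, and several of them ( ( ) * + . ? [ \ ] ^ { | } ) are regex metacharacters. A minimal sketch of one way to build the same alternation with each character escaped follows. The names `@code_points` and `$punct` are made up for illustration, and the code points are copied from the list above.

```perl
use strict;
use warnings;

# Code points taken from the $tokenize_string list (ranges collapsed).
my @code_points = (
    0x21 .. 0x23, 0x25 .. 0x2F, 0x3A .. 0x40,
    0x5B .. 0x5F, 0x7B .. 0x7E, 0x0964, 0x0965,
);

# quotemeta backslash-escapes regex metacharacters, so "*" or "|"
# is matched literally instead of acting as an operator.
my $tokenize_string = join '|', map { quotemeta(chr($_)) } @code_points;

# Hypothetical helper pattern: exactly one of the listed characters.
my $punct = qr/(?:$tokenize_string)/;

print "star is punctuation\n"  if '*' =~ /\A$punct\z/;
print "pipe is punctuation\n"  if '|' =~ /\A$punct\z/;
print "letter is not\n"        if 'a' !~ /\A$punct\z/;
```

A character class would also work for single characters; the alternation is kept here only because it mirrors the original string.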

 

Initially $_ contains:

इस बिन्दु पर लाभ बहुत ही कम , रु .२ , ००० – ४ , ००० रु प्रति कार हैं ।

With the substitution I am trying to remove the spaces between the Hindi 
numerals and the number separators, so that the substring २ , ००० – ४ , ००० 
becomes २,०००–४,०००. HinNumerals defines the range of the Hindi numerals. The 
regular expression looks for a punctuation mark that has spaces on both sides 
and sits between two Hindi numerals, and removes those spaces. The ‘g’ modifier 
should apply this at every such place in $_.

 

But the result is not what I expect. I get:

बिन्दु पर लाभ बहुत ही कम , रु .२,०००–४ , ००० रु प्रति कार हैं ।

The regexp is applied correctly to the first two instances, २ , ० and ० – ४, 
but not to the last one, ४ , ०. I am puzzled as to why this happens.
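
A possible explanation, offered as a sketch rather than a confirmed diagnosis: with ‘g’, each match resumes where the previous one ended, so matches cannot overlap. The ४ is consumed as $3 of the ० – ४ match and is then no longer available to act as $1 for ४ , ०. The snippet below reproduces this with simplified stand-ins, all of them assumptions: `$num` is a plain character class in place of \p{HinNumerals}, and `$punct` is a small class in place of $tokenize_string. It includes the en dash U+2013 because the sample text shows one, although the original list only has U+002D. The second substitution checks the trailing numeral with a lookahead so that it is not consumed.

```perl
use strict;
use warnings;
use utf8;                               # the source contains literal Devanagari
binmode(STDOUT, ':encoding(UTF-8)');

# Stand-ins (assumptions, not the original definitions):
my $num   = qr/[\x{0966}-\x{096F}]/;    # Devanagari digits
my $punct = qr/[,\x{2013}\x{002D}]/;    # comma, en dash, hyphen-minus

my $text = "२ , ००० – ४ , ०००";

# Original shape: the trailing numeral is consumed by each match,
# so the next match cannot start on it.
(my $consuming = $text) =~ s/($num)\s+($punct)\s+($num)/$1$2$3/g;

# Variant: the trailing numeral is only looked at, not consumed.
(my $lookahead = $text) =~ s/($num)\s+($punct)\s+(?=$num)/$1$2/g;

print "$consuming\n";    # २,०००–४ , ०००
print "$lookahead\n";    # २,०००–४,०००
```

The lookahead version leaves the search position just before the numeral, so the same numeral can begin the next match.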

 

Any clue or solution would be appreciated.

 

Baskaran
