avinash09 <avinash.i...@gmail.com> wrote:
> regex="^(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),
> (.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)$"

A better solution seems to have been presented, but for the record I would like 
to note that the regexp above is quite an effective performance bomb: For each 
additional group, the evaluation time roughly doubles. Not a problem for 10 
groups, but you have 28.

I made a little test: matching a single sample line took 120 ms/match with 20 
groups, 2 seconds with 24 groups and 30 seconds with 28 groups on my machine. 
Extrapolating that doubling, a single match with 50 groups would take about 
4 years (30 s x 2^22).
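
For anyone who wants to repeat the measurement, here is a rough sketch of such 
a test. The class name, the helper names and the sample line are my own 
inventions for illustration, not the data used above, and the absolute numbers 
will depend on the JVM version and the input, so treat it as a starting point 
only. It stops at 20 groups so that it finishes quickly.

```java
import java.util.Collections;
import java.util.regex.Pattern;

public class GreedyTiming {
    // Builds ^(.*),(.*),...,(.*)$ with the requested number of groups.
    static Pattern greedyPattern(int groups) {
        String body = String.join(",", Collections.nCopies(groups, "(.*)"));
        return Pattern.compile("^" + body + "$");
    }

    // Hypothetical sample line with exactly `fields` comma-separated values.
    static String sampleLine(int fields) {
        return String.join(",", Collections.nCopies(fields, "value"));
    }

    // Times a single full match and returns the elapsed time in milliseconds.
    static double timeMatchMillis(Pattern pattern, String line) {
        long start = System.nanoTime();
        boolean matched = pattern.matcher(line).matches();
        long elapsed = System.nanoTime() - start;
        if (!matched) {
            throw new IllegalStateException("sample line did not match");
        }
        return elapsed / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Raise the upper bound with care: the time may grow very quickly.
        for (int groups = 10; groups <= 20; groups += 2) {
            double ms = timeMatchMillis(greedyPattern(groups), sampleLine(groups));
            System.out.printf("%d groups: %.3f ms%n", groups, ms);
        }
    }
}
```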

The explanation is that quantifiers in Java regexps are greedy by default: 
Every one of your (.*) groups starts by matching to the end of the line, and 
only when the regexp then requires a comma does the matcher backtrack. With 
many such groups, the backtracking steps multiply. The solution is fortunately 
both simple and applicable to many other regexps: Make your matches terminate 
as soon as possible.

In this case, replace the (.*) groups with ([^,]*), which means that each 
group matches everything except commas. The combined regexp then looks like 
this:
regex="^([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),...([^,]*)$"

The match speed for 28 groups with that regexp was about 0.002ms (average over 
1000 matches).
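
If typing out 28 groups by hand is error-prone, the pattern can also be 
generated. The following is only a sketch of that idea in plain Java; the 
class and method names are made up for this example, and the sample line is 
dummy data.

```java
import java.util.Collections;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommaFields {
    // Builds ^([^,]*),([^,]*),...,([^,]*)$ with the requested number of groups.
    static Pattern buildPattern(int groups) {
        String body = String.join(",", Collections.nCopies(groups, "([^,]*)"));
        return Pattern.compile("^" + body + "$");
    }

    // Returns the captured fields, or null if the line does not have
    // exactly as many comma-separated fields as the pattern has groups.
    static String[] fields(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] result = new String[m.groupCount()];
        for (int i = 0; i < result.length; i++) {
            result[i] = m.group(i + 1); // group 0 is the whole match
        }
        return result;
    }

    public static void main(String[] args) {
        Pattern pattern = buildPattern(28);
        String line = String.join(",", Collections.nCopies(28, "value"));
        String[] parsed = fields(pattern, line);
        System.out.println(parsed.length + " fields, first = " + parsed[0]);
    }
}
```

Since no group can consume a comma, each group has only one place where it can 
end, so there is nothing for the matcher to backtrack over.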

- Toke Eskildsen
