On Thu, 30 Oct 2008, Emmanuel Levy wrote:
Dear All,
I have a long string and need to search for regular expressions in
there. However it becomes horribly slow as the string length
increases.
Below is an example: when "i" increases by 5, the time spent increases
by more! (my string is 11,000,000 letters long!)
I also noticed that
- the search time increases dramatically with the number of matches found.
- the perl=T option slows down the search
Any idea to speed this up would be greatly appreciated!
The pattern looks for 4 adjacent elements (x, say) satisfying
x[1] %in% c(3,6,7)
x[2] == 2
x[3] %in% 1-9
x[4] %in c(1,2,9)
This is a much more specialized problem than gregexpr is tooled for.
You can find all such matches (not just the disjoint ones that gregexpr
finds) using something like this:
twomatch <-function(x,y) intersect(x+1,y)
match.list <-
list(
which( vec %in% c(3,6,7) ),
which( vec == 2 ),
which( vec %in% 1:9 ),
which( vec %in% c(1,2,9) ) )
res <- Reduce( twomatch, match.list ) - length(match.list) + 1
If you want to precisely match the gregexpr results, you'll need to filter
out the overlapping matches.
HTH,
Chuck
Best,
Emmanuel
for (i in c(10000, 50000, 100000, 500000)){
+ aa = as.character(sample(1:9, i, replace=T))
+ aa = paste(aa, collapse='')
+ print(i)
+ print(system.time(gregexpr("[367]2[1-9][129]",aa)))
+ }
[1] 10000
user system elapsed
0.004 0.000 0.003
[1] 50000
user system elapsed
0.060 0.000 0.061
[1] 1e+05
user system elapsed
0.240 0.000 0.238
[1] 5e+05
user system elapsed
5.733 0.000 5.732
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:[EMAIL PROTECTED] UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.