> -----Original Message-----
> From: Wacek Kusnierczyk [mailto:waclaw.marcin.kusnierc...@idi.ntnu.no]
> Sent: Sunday, December 14, 2008 5:39 AM
> To: Greg Snow
> Cc: R help
> Subject: Re: [Rd] gregexpr - match overlap mishandled (PR#13391)
>
> Greg Snow wrote:
> > Controlling the pointer is going to be very different from perl since
> the R functions are vectorized rather than focusing on a single string.
> >
> > Here is one approach that will give all the matches and lengths (for
> the original problem at least):
> >
> >
> >> mystr <- paste(rep("1122", 10), collapse="")
> >> n <- nchar(mystr)
> >>
> >> mystr2 <- substr(rep(mystr,n), 1:n, n)
> >>
> >> tmp <- regexpr("^11221122", mystr2)
> >> (tmp + 1:n - 1)[tmp>0]
> >>
> > [1]  1  5  9 13 17 21 25 29 33
> >
> >> attr(tmp,"match.length")[tmp>0]
> >>
> > [1] 8 8 8 8 8 8 8 8 8
> >
> >
>
> while not exactly what i meant, this is an implementation of one of the
> approaches mentioned below,

Yes

> with care taken not to report duplicate
> matches:

The use of regexpr rather than gregexpr, together with the '^' added to the 
beginning of the pattern, was included to prevent duplicate matches.  This works 
for the original example and every case that I can think of.  If there is some 
way for this strategy to produce duplicate matches (without going to extra 
effort to negate the effect of '^'), I would be interested in learning about it.
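To make the duplicate issue concrete, here is a small toy illustration (my own 
example, not from the thread) of how unanchored matching over the suffixes can 
report the same match twice, and how anchoring with '^' prevents it:

```r
mystr <- "baa"
n <- nchar(mystr)
sufs <- substr(rep(mystr, n), 1:n, n)  # suffixes: "baa" "aa" "a"

# Unanchored: the suffix "baa" finds "aa" at its position 2, and the
# suffix "aa" finds the same "aa" at its position 1 -- the same match
# in the original string is reported twice.
un <- regexpr("aa", sufs)
(un + 1:n - 1)[un > 0]   # 2 2

# Anchored: each suffix can only match at its own starting position,
# so every reported position is distinct by construction.
an <- regexpr("^aa", sufs)
(an + 1:n - 1)[an > 0]   # 2
```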


> >> sequentially perform single matches on successive substrings of the
> >> input string (which can give you the same match more than once,
> >> though).
>
> one issue with your solution is that it allocates n substrings at the
> same time, which requires O(n^2) space (with n the length of the
> original string), but it may be faster than a for loop matching one
> substring at a time.

I claimed it was 'one way', not the best way.  In fact I hope that there are 
better ways, but I expect that methods which are better in one respect may be 
worse in others.  If storage is an issue, do it in a loop rather than creating 
all the substrings up front.  Memory and time could also be saved by not 
creating the substrings that are shorter than the matching pattern; I did not 
do this because I am lazy and figure that the amount of time/memory saved in 
this example is probably less than the time and effort needed to type in the 
correction.
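For the record, the loop-based alternative might look something like this (a 
sketch only; the function name overlap_matches is illustrative, not anything 
from the original post):

```r
# Find all (possibly overlapping) matches of an anchored pattern by
# looping over starting positions, one suffix at a time, so that only
# one substring is in memory at once.
overlap_matches <- function(string, pattern) {
  n <- nchar(string)
  starts <- integer(0)
  lens   <- integer(0)
  for (i in seq_len(n)) {
    # match the anchored pattern against the suffix starting at i
    m <- regexpr(paste0("^", pattern), substr(string, i, n))
    if (m > 0) {
      starts <- c(starts, i)
      lens   <- c(lens, attr(m, "match.length"))
    }
  }
  list(start = starts, length = lens)
}

overlap_matches(paste(rep("1122", 10), collapse = ""), "11221122")
# $start:  1 5 9 13 17 21 25 29 33
# $length: 8 8 8  8  8  8  8  8  8
```

This trades the O(n^2) up-front allocation for n sequential substr calls; 
growing the result vectors with c() inside the loop is itself wasteful, but 
matches the spirit of the quick sketch above.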

If memory, speed, or other efficiencies are really that big an issue, then it 
may be better to use a different tool (your perl code, for example), but at that 
point it moves beyond the scope of this list.

I think the biggest issue with my solution is that it currently only works on 
single strings and will not scale easily to finding all the matches in a vector 
of strings.  But without additional information on the real problem, it is hard 
to know which of memory, time, scalability, etc. is the biggest issue (if an 
issue at all) to address.
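One straightforward (if not especially scalable) way to cover a character 
vector is simply to apply the suffix-plus-'^' trick to each element; this is a 
hypothetical sketch (find_all and its output shape are my own, not from the 
post):

```r
# Apply the suffix/anchored-regexpr trick to each string in a vector,
# returning one data frame of match starts and lengths per string.
find_all <- function(strings, pattern) {
  lapply(strings, function(s) {
    n <- nchar(s)
    sufs <- substr(rep(s, n), seq_len(n), n)   # all suffixes of s
    m <- regexpr(paste0("^", pattern), sufs)   # one anchored match per suffix
    data.frame(start  = (m + seq_len(n) - 1)[m > 0],
               length = attr(m, "match.length")[m > 0])
  })
}

find_all(c("baa", "aaaa"), "aa")
# [[1]]: start 2, length 2
# [[2]]: starts 1 2 3, lengths 2 2 2
```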

> vQ



--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
