Greg Snow wrote:
>>
>>>> mystr <- paste(rep("1122", 10), collapse="")
>>>> n <- nchar(mystr)
>>>>
>>>> mystr2 <- substr(rep(mystr,n), 1:n, n)
>>>>
>>>> tmp <- regexpr("^11221122", mystr2)
>>>> (tmp + 1:n - 1)[tmp>0]
>>>>
>>>>
>>> [1] 1 5 9 13 17 21 25 29 33
>>>
>>>
>>>> attr(tmp,"match.length")[tmp>0]
>>>>
>>>>
>>> [1] 8 8 8 8 8 8 8 8 8
>>>
>>>
>>>
>> while not exactly what i meant, this is an implementation of one of the
>> approaches mentioned below,
>>
>
> Yes
>
>
>> ith care taken not to report duplicate
>> matches:
>>
>
> The use of regexpr rather than gregexpr and the '^' added to the beginning of
> the pattern were included to prevent duplicate matches. This works for the
> original example and all the cases that I can think of. If there is some way
> for this strategy to find duplicate matches (without going to extra effort to
> negate the effect of '^') I would be interested in learning about it.
>
>
>
yes, the use of '^' was what i had in mind, and i'm not sure if there's
a better solution.
>>>> sequentially perform single matches on successive substrings of the
>>>> input string (which can give you the same match more than once,
>>>> though).
>>>>
>> one issue with your solution is that it allocates n substrings at the
>> same time, which requires O(n^2) space (with n the length of the
>> original string), but it may be faster than a for loop matching one
>> substring at a time.
>>
>
> I claimed it was 'one way' not the best. In fact I hope that there are
> better ways, but I expect that other methods that are better in one way may
> be worse in others. If the storage is an issue, do it in a loop rather than
> creating all the substrings up front. Memory and time could also be saved by
> not creating the substrings that are shorter than the matching pattern, I did
> not do this because I am lazy and figure that the amount of time/memory saved
> in this example is probably less than the time and effort needed to type in
> the correction.
>
> If memory, speed, or other efficiencies are really that big an issue, then it
> may be better to use a different tool (your perl code for example), but then
> it moves beyond the scope of this list.
>
> I think the biggest issue with my solution is that it currently only works on
> single strings and will not scale easily to finding all the matches in a
> vector of strings. But without additional information on the real problem,
> it is hard to know which of memory, time, scalability, etc. is the biggest
> issue (if an issue at all) to address.
>
>
did not say this was a bad solution. something like the following,
which, as far as i get it, may avoid the O(n^2) space problem, consumes
approx. an order of magnitude more time:
# string - the tested string
# pattern - the pattern, with '^' as needed
n = nchar(string)
sapply(1:n, function(i) regexpr(pattern, substr(string, i, n)))
# filtering as needed
vQ
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.