Gabor,
Thanks for the hint to use charmatch. It is faster than what I was
doing, but I suspect it might still be suboptimal. I have a list of
words, and I want to know which of those exactly or partially match
one word. By the very description of charmatch's arguments, the
second being "table", I suspect, if it's investing any time at all in
creating a structure that it can search more efficiently, it's going
it with "table", not with the first argument "x". I suspect that the
time difference in favor of charmatch is due to the fact that it's
doing natively what I was doing with R script, i.e., comparing each
element of the first argument to a substring of the second argument
with the same length as the element.
Any insights?
Regards,
Richard
Richard R. Liu
Dittingerstr. 33
CH-4053 Basel
Switzerland
Tel.: +41 61 331 10 47
Mobil: +41 79 708 67 66
Email: richard....@pueo-owl.ch
On Apr 29, 2010, at 13:06 , Gabor Grothendieck wrote:
Using charmatch partial matches of 10,000 5 leters words to the same
list can be done in 10 seconds on my machine and 10,000 5 letter words
to 100,000 10 letter words in 1 minute. Is that good enough? Try
this simulation:
# generate N random words each k long
rwords <- function(N, k) {
L <- sample(letters, N*k, replace = TRUE)
apply(matrix(L, k), 2, paste, collapse = "")
}
w1 <- rwords(1e5, 10)
w2 <- rwords(1e4, 5)
system.time(charmatch(w2, w2))
system.time(charmatch(w2, w1))
On Thu, Apr 29, 2010 at 4:05 AM, Richard R. Liu <richard....@pueo-owl.ch
> wrote:
I have an application that a long list of character strings to
determine which
occur at the beginning of a given word. A straight forward R
script using grep
takes a long time to run. Rewriting it to use substr and match
might be an
option, but I have the impression that preparing the list as a trie
and
performing trie searches might lead to dramatic improvements in
performance.
I have searched the CRAN packages and find no packages that support
Compact
Patricia Trees. Does anybody know of such?
Thanks,
Richard
Richard R. Liu
richard....@pueo-owl.ch
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.