Given a vector of reference strings Ref and a vector of test strings Test, I would like to find elements of Test which do not contain elements of Ref as \b-delimited substrings.
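A small illustration of the intended matching, using made-up values rather than the real data. It assumes the Ref entries contain no regex metacharacters, and it uses grep() rather than grepl() so that it also runs on older R versions:

```r
# Made-up toy data, only to illustrate the matching rule.
Ref  <- c("12", "345")
Test <- c("12", "123", "a 345 b", "3456", "x12y")

# One alternation wrapped in word boundaries; fine for a short Ref.
Pat <- paste("\\b(", paste(Ref, collapse = "|"), ")\\b", sep = "")

hit  <- grep(Pat, Test)                    # indices of Test that contain a Ref element
miss <- setdiff(seq_along(Test), hit)      # indices that contain none
Test[miss]                                 # "123" "3456" "x12y"
```

Here "123" and "3456" are kept because "12" and "345" occur in them only as parts of longer tokens, not between word boundaries.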
This can be done straightforwardly for length(Ref) < 6000 or so (R 2.8.1 Windows) by constructing a pattern like \b(a|b|c)\b, but not for larger Refs (see below). The easy workaround for this is to split Ref into smaller subsets and test each subset separately. Is there a better solution, e.g. along the lines of fgrep? My real data have length(Ref) == 60000 or more.

          -s

-----------------------------

Example

    Test <- as.character(floor(runif(2000,1,20000)))   # Real data is short phrases

    testing <- function(n) {
      Ref <- as.character(1:n)                         # Real data is sentences
      Pat <- paste('\\b(',paste(Ref,collapse="|"),')\\b',sep='')
      grep(Pat,Test)
    }

    testing(2000)   # => no problem

However, testing(10000) gives an error message (invalid regular expression) and a warning (memory exhausted), and testing(100000) crashes R (Process R exited abnormally with code 5).

Using grep(...,perl=TRUE) as suggested in the man page also fails with testing(10000), though it gives a more helpful error message (regular expression is too large) without crashing the process.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
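For reference, the subset workaround mentioned above could be sketched roughly as follows. The names grep_chunked and chunk_size are hypothetical, the chunk size of 1000 is an arbitrary choice below the roughly 6000-element limit observed, and the sketch assumes Ref contains no regex metacharacters (true for the numeric example, not necessarily for real sentences):

```r
# Sketch of the chunking workaround: test Ref in pieces small enough
# for the regex compiler, and OR the results together.
grep_chunked <- function(Ref, Test, chunk_size = 1000) {
  hit <- logical(length(Test))             # TRUE once an element has matched
  if (length(Ref) == 0) return(hit)
  for (s in seq(1, length(Ref), by = chunk_size)) {
    todo <- which(!hit)                    # skip elements already matched
    if (length(todo) == 0) break
    chunk <- Ref[s:min(s + chunk_size - 1, length(Ref))]
    Pat <- paste("\\b(", paste(chunk, collapse = "|"), ")\\b", sep = "")
    idx <- grep(Pat, Test[todo])           # positions within Test[todo]
    hit[todo[idx]] <- TRUE
  }
  hit
}

# Usage with the example data from the post
Test <- as.character(floor(runif(2000, 1, 20000)))
Ref  <- as.character(1:10000)
hit  <- grep_chunked(Ref, Test)
Test[!hit]                                 # elements containing no Ref entry
```

This only avoids the pattern-size limit; it still compiles one regular expression per chunk, so it does not answer the question of whether an fgrep-style approach exists.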