Note that John's solution probably includes incorrect partial matches and that mine fails to match "red" in "this is red." (splitting on whitespace leaves the token "red." with its period attached). If you change my proposal to

sapply(strsplit(do.call(paste,zz[,2:3]),"\\W"), function(x)any(x %in% alarm.words))

it should agree with Jeff's. Note, however, that matching is case-sensitive, so capital letters are missed: "Red" would not match "This is red".

Bert Gunter

"Data is not information. Information is not knowledge. And knowledge is certainly not wisdom."
   -- Clifford Stoll

On Fri, Jul 10, 2015 at 10:54 AM, Christopher W Ryan <cr...@binghamton.edu> wrote:
> Indeed, the perils of syndromic surveillance with free text.
>
> > with(dd.2, table(fox))
> fox
> FALSE  TRUE
> 74939  1201
>
> > with(dd.2, table(gunter))
> gunter
> FALSE  TRUE
> 75213   927
>
> > with(dd.2, table(newmiller))
> newmiller
> FALSE  TRUE
> 75028  1112
>
>
> Of course, the simplest thing for me to do would be to add "heroine" to
> the alarm.words. I'm surprised that the US national organization that
> promulgated this list of drug-related terms did not include it. Many
> other common misspellings are included. I will have to contact them.
>
> --Chris
>
> On Fri, Jul 10, 2015 at 1:39 PM, Bert Gunter <bgunter.4...@gmail.com> wrote:
>> Yes. This is one of the fundamental challenges in text searching --
>> defining exactly what text defines a match and what doesn't. So,
>> continuing your example, one might imagine that heroin and heroine
>> might both be matches, but maybe heroines shouldn't be (e.g. if the
>> text contains movie reviews). So what one might want to do is add
>> semantic analysis to searches, à la google, a topic way beyond the
>> simple capabilities discussed, or likely needed, here.
>>
>> Incidentally, Jeff Newmiller's (final) regular expression solution is
>> preferable to mine in all respects, I think.
>>
>> -- Bert
>>
>>
>> Bert Gunter
>>
>> "Data is not information. Information is not knowledge. And knowledge
>> is certainly not wisdom."
>>    -- Clifford Stoll
>>
>>
>> On Fri, Jul 10, 2015 at 10:30 AM, Christopher W Ryan
>> <cr...@binghamton.edu> wrote:
>>> Interesting thoughts about the partial-word matches, and speed. On
>>> another real data set, about 73,000 records and 6 columns to search
>>> through for matches (one column of which contains very long character
>>> strings--several paragraphs each), I ran both John's and Bert's
>>> solutions. John's was noticeably slower, although still quite
>>> tolerable. The number of matches differed, though:
>>>
>>>        oic.2
>>> oic     FALSE  TRUE   Sum
>>>   FALSE 74939     0 74939
>>>   TRUE    274   927  1201
>>>   Sum   75213   927 76140
>>>
>>> where oic is the logical vector generated by John's solution, and
>>> oic.2 is the logical vector generated by Bert's solution. Bert's
>>> solution detected about 77% of the cases detected by John's.
>>>
>>> I'm still exploring why that might be. One possible explanation, for
>>> at least part of the difference, is the issue of partial-word matches.
>>> Substantively, I am searching ambulance run records for words related
>>> to opioid overdose, and I've noticed that the medics often spell
>>> heroin as "heroine". So in this context, I like partial-word
>>> matches--I want to pick up records that (partially) match "heroin"
>>> because it is contained in the word "heroine".
>>>
>>> There may be other things going on too.
>>>
>>> Thanks.
>>>
>>> --Chris
>>>
>>> On Thu, Jul 9, 2015 at 3:24 PM, John Fox <j...@mcmaster.ca> wrote:
>>>> Dear Christopher,
>>>>
>>>> My usual orientation to this kind of one-off problem is that I'm looking
>>>> for a simple correct solution. Computing time is usually much smaller than
>>>> programming time.
>>>>
>>>> That said, Bert Gunter's solution was about 5 times faster in a simple
>>>> check that I ran with microbenchmark, and Jeff Newmiller's solution was
>>>> about 10 times faster.
>>>> Both Bert's and Jeff's (eventual) solutions protect
>>>> against partial (rather than full-word) matches, while mine doesn't
>>>> (though it could easily be modified to do that).
>>>>
>>>> Best,
>>>> John
>>>>
>>>>> -----Original Message-----
>>>>> From: Christopher W Ryan [mailto:cr...@binghamton.edu]
>>>>> Sent: July-09-15 2:49 PM
>>>>> To: Bert Gunter
>>>>> Cc: Jeff Newmiller; R Help; John Fox
>>>>> Subject: Re: [R] detecting any element in a vector of strings, appearing
>>>>> anywhere in any of several character variables in a dataframe
>>>>>
>>>>> Thanks everyone. John's original solution worked great. And with
>>>>> 27,000 records, 65 alarm.words, and 6 columns to search, it takes only
>>>>> about 15 seconds. That is certainly adequate for my needs. But I
>>>>> will try out the other strategies too.
>>>>>
>>>>> And thanks also for lots of new R things to learn--grep, grepl,
>>>>> do.call . . . that's always a bonus!
>>>>>
>>>>> --Chris Ryan
>>>>>
>>>>> On Thu, Jul 9, 2015 at 1:52 PM, Bert Gunter <bgunter.4...@gmail.com>
>>>>> wrote:
>>>>> > Yup, that does it. Let grep figure out what's a word rather than doing
>>>>> > it manually. Forgot about "\b"
>>>>> >
>>>>> > Cheers,
>>>>> > Bert
>>>>> >
>>>>> > On Thu, Jul 9, 2015 at 10:30 AM, Jeff Newmiller
>>>>> > <jdnew...@dcn.davis.ca.us> wrote:
>>>>> >> Just add a word break marker before and after:
>>>>> >>
>>>>> >> zz$v5 <- grepl( paste0( "\\b(", paste0( alarm.words, collapse="|" ), ")\\b" ), do.call( paste, zz[ , 2:3 ] ) )
>>>>> >> ---------------------------------------------------------------------------
>>>>> >> Jeff Newmiller                        The     .....       .....  Go Live...
>>>>> >> DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>>>> >>                                       Live:   OO#.. Dead: OO#..  Playing
>>>>> >> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>>>>> >> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>>>>> >> ---------------------------------------------------------------------------
>>>>> >> Sent from my phone. Please excuse my brevity.
>>>>> >>
>>>>> >> On July 9, 2015 10:12:23 AM PDT, Bert Gunter <bgunter.4...@gmail.com> wrote:
>>>>> >>>Jeff:
>>>>> >>>
>>>>> >>>Well, it would be much better (no loops!) except, I think, for one
>>>>> >>>issue: "red" would match "barred" and I don't think that this is what
>>>>> >>>is wanted: the matches should be on whole "words" not just string
>>>>> >>>patterns.
>>>>> >>>
>>>>> >>>So you would need to fix up the matching pattern to make this work,
>>>>> >>>but it may be a little tricky, as arbitrary whitespace characters,
>>>>> >>>e.g. " " or "\n" etc. could be in the strings to be matched separating
>>>>> >>>the words or ending the "sentence." I'm sure it can be done, but I'll
>>>>> >>>leave it to you or others to figure it out.
>>>>> >>>
>>>>> >>>Of course, if my diagnosis is wrong or silly, please point this out.
>>>>> >>>
>>>>> >>>Cheers,
>>>>> >>>Bert
>>>>> >>>
>>>>> >>>On Thu, Jul 9, 2015 at 9:34 AM, Jeff Newmiller
>>>>> >>><jdnew...@dcn.davis.ca.us> wrote:
>>>>> >>>> I think grep is better suited to this:
>>>>> >>>>
>>>>> >>>> zz$v5 <- grepl( paste0( alarm.words, collapse="|" ), do.call( paste, zz[ , 2:3 ] ) )
>>>>> >>>>
>>>>> >>>> On July 9, 2015 8:51:10 AM PDT, Bert Gunter <bgunter.4...@gmail.com> wrote:
>>>>> >>>>>Here's a way to do it that uses %in% (i.e. match() ) and uses only a
>>>>> >>>>>single, not a double, loop. It should be more efficient.
>>>>> >>>>>
>>>>> >>>>>> sapply(strsplit(do.call(paste,zz[,2:3]),"[[:space:]]+"),
>>>>> >>>>>+ function(x)any(x %in% alarm.words))
>>>>> >>>>>
>>>>> >>>>> [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
>>>>> >>>>>
>>>>> >>>>>The idea is to paste the strings in each row (do.call allows an
>>>>> >>>>>arbitrary number of columns) into a single string and then use
>>>>> >>>>>strsplit to break the string into individual "words" on whitespace.
>>>>> >>>>>Then the matching is vectorized with the any( %in% ... ) call.
>>>>> >>>>>
>>>>> >>>>>Cheers,
>>>>> >>>>>Bert
>>>>> >>>>>
>>>>> >>>>>On Thu, Jul 9, 2015 at 6:05 AM, John Fox <j...@mcmaster.ca> wrote:
>>>>> >>>>>> Dear Chris,
>>>>> >>>>>>
>>>>> >>>>>> If I understand correctly what you want, how about the following?
>>>>> >>>>>>
>>>>> >>>>>>> rows <- apply(zz[, 2:3], 1, function(x) any(sapply(alarm.words, grepl, x=x)))
>>>>> >>>>>>> zz[rows, ]
>>>>> >>>>>>
>>>>> >>>>>>           v1                              v2                v3 v4
>>>>> >>>>>> 3  -1.022329                    green turtle    ronald weasley  2
>>>>> >>>>>> 6   0.336599              waffle the hamster        red sparks  1
>>>>> >>>>>> 9  -1.631874 yellow giraffe with a long neck gandalf the white  1
>>>>> >>>>>> 10  1.130622                      black bear  gandalf the grey  2
>>>>> >>>>>>
>>>>> >>>>>> I hope this helps,
>>>>> >>>>>> John
>>>>> >>>>>>
>>>>> >>>>>> ------------------------------------------------
>>>>> >>>>>> John Fox, Professor
>>>>> >>>>>> McMaster University
>>>>> >>>>>> Hamilton, Ontario, Canada
>>>>> >>>>>> http://socserv.mcmaster.ca/jfox/
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>> On Wed, 08 Jul 2015 22:23:37 -0400
>>>>> >>>>>> "Christopher W. Ryan" <cr...@binghamton.edu> wrote:
>>>>> >>>>>>> Running R 3.1.1 on windows 7
>>>>> >>>>>>>
>>>>> >>>>>>> I want to identify as a case any record in a dataframe that contains any
>>>>> >>>>>>> of several keywords in any of several variables.
>>>>> >>>>>>>
>>>>> >>>>>>> Example:
>>>>> >>>>>>>
>>>>> >>>>>>> # create a dataframe with 4 variables and 10 records
>>>>> >>>>>>> v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
>>>>> >>>>>>>         "big black dog", "waffle the hamster", "benny likes food a lot",
>>>>> >>>>>>>         "hello world", "yellow giraffe with a long neck", "black bear")
>>>>> >>>>>>> v3 <- c("harry potter", "hermione grainger", "ronald weasley",
>>>>> >>>>>>>         "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
>>>>> >>>>>>>         "white dress robes", "gandalf the white", "gandalf the grey")
>>>>> >>>>>>> zz <- data.frame(v1=rnorm(10), v2=v2, v3=v3, v4=rpois(10, lambda=2),
>>>>> >>>>>>>                  stringsAsFactors=FALSE)
>>>>> >>>>>>> str(zz)
>>>>> >>>>>>> zz
>>>>> >>>>>>>
>>>>> >>>>>>> # here are the keywords
>>>>> >>>>>>> alarm.words <- c("red", "green", "turtle", "gandalf")
>>>>> >>>>>>>
>>>>> >>>>>>> # For each row/record, I want to test whether the string in v2 or the
>>>>> >>>>>>> # string in v3 contains any of the strings in alarm.words. And then if so,
>>>>> >>>>>>> # set zz$v5=TRUE for that record.
>>>>> >>>>>>>
>>>>> >>>>>>> # I'm thinking the str_detect function in the stringr package ought to
>>>>> >>>>>>> # be able to help, perhaps with some use of apply over the rows, but I
>>>>> >>>>>>> # obviously misunderstand something about how str_detect works
>>>>> >>>>>>>
>>>>> >>>>>>> library(stringr)
>>>>> >>>>>>>
>>>>> >>>>>>> str_detect(zz[,2:3], alarm.words)    # error: the target of the search
>>>>> >>>>>>>                                      # must be a vector, not multiple
>>>>> >>>>>>>                                      # columns
>>>>> >>>>>>>
>>>>> >>>>>>> str_detect(zz[1:4,2:3], alarm.words) # same error
>>>>> >>>>>>>
>>>>> >>>>>>> str_detect(zz[,2], alarm.words)      # error, length of alarm.words
>>>>> >>>>>>>                                      # is less than the number of
>>>>> >>>>>>>                                      # rows I am using for the
>>>>> >>>>>>>                                      # comparison
>>>>> >>>>>>>
>>>>> >>>>>>> str_detect(zz[1:4,2], alarm.words)   # works as hoped when
>>>>> >>>>>>> length(alarm.words)                  # confining nrows
>>>>> >>>>>>>                                      # to the length of alarm.words
>>>>> >>>>>>>
>>>>> >>>>>>> str_detect(zz, alarm.words)          # obviously not right
>>>>> >>>>>>>
>>>>> >>>>>>> # maybe I need apply() ?
>>>>> >>>>>>> my.f <- function(x){str_detect(x, alarm.words)}
>>>>> >>>>>>>
>>>>> >>>>>>> apply(zz[,2], 1, my.f)     # again, a mismatch in lengths
>>>>> >>>>>>>                            # between alarm.words and that
>>>>> >>>>>>>                            # in which I am searching for
>>>>> >>>>>>>                            # matching strings
>>>>> >>>>>>>
>>>>> >>>>>>> apply(zz, 2, my.f)         # now I'm getting somewhere
>>>>> >>>>>>> apply(zz[1:4,], 2, my.f)   # but still only works with 4
>>>>> >>>>>>>                            # rows of the dataframe
>>>>> >>>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>> # perhaps %in% could do the job?
>>>>> >>>>>>>
>>>>> >>>>>>> Appreciate any advice.
>>>>> >>>>>>>
>>>>> >>>>>>> --Chris Ryan
>>>>> >>>>>>>
>>>>> >>>>>>> ______________________________________________
>>>>> >>>>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>> >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> >>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>> >>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
>>>> ---
>>>> This email has been checked for viruses by Avast antivirus software.
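
For anyone reading this thread later, the following is a minimal sketch that pulls the pieces together; it is not code posted by anyone above. It uses the zz and alarm.words from Chris's original example (v1 and v4 are left out, since they are never searched), Jeff's word-boundary pattern, and, as an addition of mine, ignore.case=TRUE to deal with the capital-letter issue. The names txt, pattern and v5.split are made up here purely for illustration.

```r
# Sample data from the original post (only the two searched columns)
v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
        "big black dog", "waffle the hamster", "benny likes food a lot",
        "hello world", "yellow giraffe with a long neck", "black bear")
v3 <- c("harry potter", "hermione grainger", "ronald weasley",
        "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
        "white dress robes", "gandalf the white", "gandalf the grey")
zz <- data.frame(v2 = v2, v3 = v3, stringsAsFactors = FALSE)
alarm.words <- c("red", "green", "turtle", "gandalf")

# One string per row, pasted together from the searched columns
# (txt is an illustrative name, not from the thread)
txt <- do.call(paste, zz[, c("v2", "v3")])

# Whole-word pattern, as in Jeff's second post: \b(red|green|turtle|gandalf)\b
pattern <- paste0("\\b(", paste0(alarm.words, collapse = "|"), ")\\b")

# Assumption added here: ignore.case = TRUE so that "Red" also counts as a hit
zz$v5 <- grepl(pattern, txt, ignore.case = TRUE)

# Bert's split-and-match variant, lower-casing the text first for the same
# effect and splitting on runs of non-word characters
v5.split <- sapply(strsplit(tolower(txt), "\\W+"),
                   function(x) any(x %in% alarm.words))

which(zz$v5)                   # rows flagged by the regular expression
identical(v5.split, zz$v5)     # do the two approaches agree on this data?
```

On this small example both approaches flag the same rows that John's output shows. If partial-word hits such as "heroine" are wanted, the simplest route is probably the one Chris mentions, adding the variant spelling to alarm.words, rather than dropping the \\b markers and picking up things like "barred".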