Sorry. Typo. The last line should be: ans$Result <- apply(ans,1,function(r)phrasewords[[r[1]]] %allin% tweets[[r[2]]])
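For anyone who wants to run the corrected version end to end, here is a minimal sketch. The `st` and `th` objects are small toy stand-ins assumed for illustration only, not the data posted in this thread, and the last step (a per-tweet `flag` column) is just one possible way to get the flag the OP asked for.

```r
## Toy stand-ins for the thread's data (assumed for illustration only)
st <- data.frame(terms = c("me hurt depressed", "i feel worthless"),
                 stringsAsFactors = FALSE)
th <- data.frame(text = c("I tend to think about my dreams before I sleep.",
                          "i wish i could take credit for this",
                          " i xxxx worthless yxxc ght feel"),
                 stringsAsFactors = FALSE)

## split lower-cased, whitespace-trimmed text into vectors of "words"
getwords <- function(x)
  strsplit(gsub("(^[[:space:]]+)|([[:space:]]+$)", "", tolower(x)), split = " +")

## TRUE iff every element of x is found in table
'%allin%' <- function(x, table) prod(match(x, table, nomatch = 0L)) > 0L

phrasewords <- getwords(st$terms)
tweets <- getwords(th$text)

## one row per (phrase, tweet) pair; Result is filled in below
ans <- expand.grid(phrases = seq_along(phrasewords),
                   tweets = seq_along(tweets), Result = FALSE)

## the corrected line: iterate over the rows of ans itself
ans$Result <- apply(ans, 1, function(r) phrasewords[[r[1]]] %allin% tweets[[r[2]]])

## matching (phrase, tweet) pairs only
ans[ans$Result, ]

## hypothetical extra step: flag tweets matched by at least one phrase
th$flag <- seq_along(tweets) %in% ans$tweets[ans$Result]
```

With real data, `st$terms` and `th$text` from the thread would take the place of the toy columns.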
-- Bert

On Thu, Oct 18, 2018 at 7:04 PM Bert Gunter <bgunter.4...@gmail.com> wrote:

> All (especially Nathan): **Please feel free to ignore this post without
> response.** It just represents a bit of OCD-ness on my part that may or may
> not be of interest to anyone else.
>
> Purpose of this post: To give an alternative considerably simpler and
> considerably faster solution to the problem than those which I offered
> previously. It may or may not be what the OP asked for, but the improvement
> exercise was instructive to me. Notation as previously in this thread.
>
> New solution:
>
> getwords <- function(x)
>   strsplit(gsub("(^[[:space:]]+)|([[:space:]]+$)", "", tolower(x)), split = " +")
> ## split lower-cased text into a vector of "words"
> ## I made this a bit fancier to handle some "corner" cases, but the
> ## previous simpler version may well suffice.
>
> '%allin%' <- function(x, table) prod(match(x, table, nomatch = 0L)) > 0L
> ## a convenience function/operator that improves efficiency.
>
> ## lists of search word vectors as before
> phrasewords <- getwords(st$terms)
> tweets <- getwords(c(th$text, " i xxxx worthless yxxc ght feel"))
> ## the tweets + one additional
>
> ## simpler approach just using indexing for the bookkeeping that nested _apply
> ## loops previously were used for
> ans <- expand.grid(phrases = seq_along(phrasewords), tweets = seq_along(tweets), Result = FALSE)
> ans$Result <- apply(ind,1,function(r)phrasewords[[r[1]]] %allin% tweets[[r[2]]])
>
> ## ans is a data frame in which the first column indexes phrases and the
> ## second tweets.
> ## The ith row of ans$Result == TRUE iff all the words in the phrase
> ## indexed by the ith row of the phrase column are contained in the tweet
> ## indexed by that row's tweet column.
>
> This was way faster than my previous offerings.
>
> Note also that just the matching phrases and tweets can be extracted as
> usual by:
>
> > ans[ans[,3],]
>    phrases tweets Result
> 42       6      7   TRUE
> ## all the words in the 6th search phrase appeared in the 7th tweet.
>
> ** I promise to natter on about this no longer! **
>
> Cheers,
> Bert
>
>
> On Wed, Oct 17, 2018 at 7:50 PM Bert Gunter <bgunter.4...@gmail.com> wrote:
>
>> If you wish to use R, you need to at least understand its basic data
>> structures and functionality. Expecting that mimicry of code in special
>> packages will suffice is, I believe, an illusion. If you haven't already
>> done so, you should go through a basic R tutorial or two (there are many on
>> the web; some recommendations, by no means necessarily "the best", can be
>> found here:
>> https://www.rstudio.com/online-learning/#r-programming).
>>
>> Having said that, I realized that my previous "solution" using regular
>> expressions was more complicated than it needed to be and somewhat foolish
>> (so much for all my "expertise"). A simpler and better approach is simply
>> to break up both the tweet texts and your search phrases into vectors of
>> their "words" (i.e. character strings surrounded by spaces) using
>> strsplit(), and then using R's built-in matching capabilities with %in%.
>> This is quite straightforward, pretty robust (no regex's to wrestle with),
>> and does not require "herculean efforts" to understand. The only wrinkle is
>> some bookkeeping with the "apply" family of functions. These are, as you
>> may know, the functional programming way of handling iteration (loops), but
>> they are what I would consider part of "basic" R functionality and worth
>> spending the time to learn about.
>>
>> Herewith my better, simpler proposal, using your example data as before:
>>
>> getwords <- function(x)strsplit(tolower(x),split = " +")
>> ## split text into a vector of lower-cased "words"
>>
>> phrasewords <- structure(getwords(st$terms), names = st$terms)
>> ## named list of your search word vectors
>>
>> tweets <- getwords(c(th$text, " i xxxx worthless yxxc ght feel"))
>> ## the tweets + one additional that should match the last phrase
>>
>> ans <- lapply(phrasewords, function(x) apply(sapply(tweets,function(y)x %in% y), 2, all))
>> ## a list indexed by the search phrases,
>> ## with each component a vector of logicals with vec[i] == TRUE iff
>> ## the ith tweet contains all the words in the search phrase
>>
>> > ans
>> $`me abused depressed`
>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>
>> $`me hurt depressed`
>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>
>> $`feel hopeless depressed`
>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>
>> $`feel alone depressed`
>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>
>> $`i feel helpless`
>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>
>> $`i feel worthless`
>> [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
>>
>> -- Bert
>>
>> On Wed, Oct 17, 2018 at 9:20 AM Nathan Parsons <nathan.f.pars...@gmail.com> wrote:
>>
>>> I do not have your command of base r, Bert. That is a herculean effort!
>>> Here’s what I spent my night putting together:
>>>
>>> ## Create search terms
>>> ## dput(st)
>>> st <- structure(list(word1 = c("technique", "me", "me", "feel", "feel"
>>> ), word2 = c("olympic", "abused", "hurt", "hopeless", "alone"
>>> ), word3 = c("lifts", "depressed", "depressed", "depressed",
>>> "depressed")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
>>> -5L))
>>>
>>> ## Create tweets
>>> ## dput(th)
>>> th <- structure(list(status_id = c("x1047841705729306624", "x1046966595610927105",
>>> "x1047094786610552832", "x1046988542818308097", "x1046934493553221632",
>>> "x1047227442899775488", "x1048126008941981696", "x1047798782673543173",
>>> "x1048269727582355457", "x1048092408544677890"), created_at = c("2018-10-04T13:31:45Z",
>>> "2018-10-02T03:34:22Z", "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z",
>>> "2018-10-02T01:26:49Z", "2018-10-02T20:50:53Z", "2018-10-05T08:21:28Z",
>>> "2018-10-04T10:41:11Z", "2018-10-05T17:52:33Z", "2018-10-05T06:07:57Z"
>>> ), text = c("technique is everything with olympic lifts ! @ body by john ",
>>> "@subtronics just went back and rewatched ur fblice with ur cdjs and let me tell you man. you are the fucking messiah",
>>> "@ic4rus1 opportunistic means short-game. as in getting drunk now vs. not being hung over tomorrow vs. not fucking up your life ten years later.",
>>> "i tend to think about my dreams before i sleep.", "@michaelavenatti @senatorcollins so if your client was in her 20s attending parties with teenagers doesnt that make her at the least immature as hell or at the worst a pedophile and a person contributing to the delinquency of minors?",
>>> "i wish i could take credit for this", "i woulda never imagined. #lakeshow ",
>>> "@philipbloom @blackmagic_news its ok phil! i feel your pain! ",
>>> "sunday ill have a booth in katy at the real craft wives of katy fest @nolabelbrewco cmon yall!everything is better when you top it with tias!order today we ship to all 50 ",
>>> "dolly is so baddd"), lat = c(43.6835853, 40.284123, 37.7706565,
>>> 40.431389, 31.1688935, 33.9376735, 34.0207895, 44.900818, 29.7926,
>>> 32.364145), lng = c(-70.3284118, -83.078589, -122.4359785, -79.9806895,
>>> -100.0768885, -118.130426, -118.4119065, -89.5694915, -95.8224,
>>> -86.2447285), county_name = c("Cumberland County", "Delaware County",
>>> "San Francisco County", "Allegheny County", "Concho County",
>>> "Los Angeles County", "Los Angeles County", "Marathon County",
>>> "Harris County", "Montgomery County"), fips = c(23005L, 39041L,
>>> 6075L, 42003L, 48095L, 6037L, 6037L, 55073L, 48201L, 1101L),
>>> state_name = c("Maine", "Ohio", "California", "Pennsylvania",
>>> "Texas", "California", "California", "Wisconsin", "Texas",
>>> "Alabama"), state_abb = c("ME", "OH", "CA", "PA", "TX", "CA",
>>> "CA", "WI", "TX", "AL"), urban_level = c("Medium Metro",
>>> "Large Fringe Metro", "Large Central Metro", "Large Central Metro",
>>> "NonCore (Nonmetro)", "Large Central Metro", "Large Central Metro",
>>> "Small Metro", "Large Central Metro", "Medium Metro"), urban_code = c(3L,
>>> 2L, 1L, 1L, 6L, 1L, 1L, 4L, 1L, 3L), population = c(277308L,
>>> 184029L, 830781L, 1160433L, 4160L, 9509611L, 9509611L, 127612L,
>>> 4233913L, 211037L), linenumber = 1:10), row.names = c(NA,
>>> 10L), class = "data.frame")
>>>
>>> ## Clean tweets - basically just remove everything we don’t need from
>>> ## the text including punctuation and urls
>>> th %>%
>>>   mutate(linenumber = row_number(),
>>>          text = str_remove_all(text, "[^\x01-\x7F]"),
>>>          text = str_remove_all(text, "\n"),
>>>          text = str_remove_all(text, ","),
>>>          text = str_remove_all(text, "'"),
>>>          text = str_remove_all(text, "&"),
>>>          text = str_remove_all(text, "<"),
>>>          text = str_remove_all(text, ">"),
>>>          text = str_remove_all(text, "http[s]?://[[:alnum:].\\/]+"),
>>>          text = tolower(text)) -> th
>>>
>>> ## Create search function that looks for each search term in the
>>> ## provided string, evaluates if all three search terms have been found, and
>>> ## returns a logical
>>> srchr <- function(df) {
>>>   str_detect(df, "olympic") -> a
>>>   str_detect(df, "technique") -> b
>>>   str_detect(df, "lifts") -> c
>>>   ifelse(a == TRUE & b == TRUE & c == TRUE, TRUE, FALSE)
>>> }
>>>
>>> ## Evaluate tweets for presence of search term
>>> th %>%
>>>   mutate(flag = map_chr(text, srchr)) -> th_flagged
>>>
>>> As far as I can tell, this works. I have to manually enter each set of
>>> search terms into the function, which is not ideal. Also, this only
>>> generates a True/False for each tweet based on one search term - I end up
>>> with an evaluatory column for each search term that I would then have to
>>> collapse together somehow. I’m sure there’s a more elegant solution.
>>>
>>> --
>>>
>>> Nate Parsons
>>> Pronouns: He, Him, His
>>> Graduate Teaching Assistant
>>> Department of Sociology
>>> Portland State University
>>> Portland, Oregon
>>>
>>> 503-725-9025
>>> 503-725-3957 FAX
>>>
>>> On Oct 16, 2018, 7:20 PM -0700, Bert Gunter <bgunter.4...@gmail.com>, wrote:
>>>
>>> OK, as no one else has offered a solution, I'll take a whack at it.
>>>
>>> Caveats: This is a brute force attempt using R's basic regular
>>> expression engine. It is inelegant and barely tested, so likely to be at
>>> best incomplete and buggy, and at worst, incorrect. But maybe Nathan or
>>> someone else on the list can fix it up. So if (when) it breaks, complain on
>>> the list to give someone (almost certainly not me) the opportunity.
>>>
>>> The basic idea is that the tweets are just character strings and the
>>> search phrases are just character vectors all of whose elements must match
>>> "appropriately" -- i.e. they must match whole words -- in the character
>>> strings.
>>> So my desired output from the code is a list indexed by the search
>>> phrases, each of whose components is a logical vector of length the number
>>> of tweets each of whose elements = TRUE iff all the words in the search
>>> phrase match somewhere in the tweet.
>>>
>>> Here's the code (using the data Nathan provided):
>>>
>>> > words <- sapply(st[[1]],strsplit,split = " +" )
>>> ## convert the phrases to a list of character vectors of the words
>>> ## Result:
>>> > words
>>> $`me abused depressed`
>>> [1] "me"        "abused"    "depressed"
>>>
>>> $`me hurt depressed`
>>> [1] "me"        "hurt"      "depressed"
>>>
>>> $`feel hopeless depressed`
>>> [1] "feel"      "hopeless"  "depressed"
>>>
>>> $`feel alone depressed`
>>> [1] "feel"      "alone"     "depressed"
>>>
>>> $`i feel helpless`
>>> [1] "i"        "feel"     "helpless"
>>>
>>> $`i feel worthless`
>>> [1] "i"         "feel"      "worthless"
>>>
>>> > expand.words <- function(z)lapply(z,function(x)paste0(c("^ *"," "," "),x, c(" "," "," *$")))
>>> ## function to create regexes for words when they are at the beginning,
>>> ## middle, or end of tweets
>>>
>>> > wordregex <- lapply(words,expand.words)
>>> ##Result
>>> ## too lengthy to include
>>> ##
>>> > tweets <- th$text
>>> ##extract the tweets
>>> > findin <- function(x,y)
>>> ## x is a vector of regex patterns
>>> ## y is a character vector
>>> ## value = vector,vec, with length(vec) == length(y) and vec[i] ==
>>> ## TRUE iff any of x matches y[i]
>>> { apply(sapply(x,function(z)grepl(z,y)), 1,any)
>>> }
>>>
>>> ## add a matching "tweet" to the tweet vector:
>>> > tweets <- c(tweets," i xxxx worthless yxxc ght feel")
>>>
>>> > ans <-
>>> lapply(wordregex,function(z)apply(sapply(z,function(x)findin(x,tweets)), 1, all))
>>> ## Result:
>>> > ans
>>> $`me abused depressed`
>>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>>
>>> $`me hurt depressed`
>>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>>
>>> $`feel hopeless depressed`
>>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>>
>>> $`feel alone depressed`
>>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>>
>>> $`i feel helpless`
>>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>>
>>> $`i feel worthless`
>>> [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
>>>
>>> ## None of the tweets match any of the phrases except for the last tweet
>>> ## that I added.
>>>
>>> ## Note: you need to add capabilities to handle upper and lower case.
>>> ## See, e.g. ?casefold
>>>
>>> Cheers,
>>> Bert
>>>
>>> Bert Gunter
>>>
>>> "The trouble with having an open mind is that people keep coming along
>>> and sticking things into it."
>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>
>>>
>>> On Tue, Oct 16, 2018 at 3:03 PM Bert Gunter <bgunter.4...@gmail.com> wrote:
>>>
>>>> The problem wasn't the data tibbles. You posted in html -- which you
>>>> were explicitly warned against -- and that corrupted your text (e.g. some
>>>> quotes became "smart quotes", which cannot be properly cut and pasted into
>>>> R).
>>>>
>>>> Bert
>>>>
>>>>
>>>> On Tue, Oct 16, 2018 at 2:47 PM Nathan Parsons <nathan.f.pars...@gmail.com> wrote:
>>>>
>>>>> Argh! Here are those two example datasets as data frames (not tibbles).
>>>>> Sorry again. This apparently is just not my day.
>>>>>
>>>>> th <- structure(list(status_id = c("x1047841705729306624", "x1046966595610927105",
>>>>> "x1047094786610552832", "x1046988542818308097", "x1046934493553221632",
>>>>> "x1047227442899775488"), created_at = c("2018-10-04T13:31:45Z",
>>>>> "2018-10-02T03:34:22Z", "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z",
>>>>> "2018-10-02T01:26:49Z", "2018-10-02T20:50:53Z"), text = c("Technique is everything with olympic lifts ! @ Body By John https://t.co/UsfR6DafZt ",
>>>>> "@Subtronics just went back and rewatched ur FBlice with ur CDJs and let me tell you man. You are the fucking messiah",
>>>>> "@ic4rus1 Opportunistic means short-game. As in getting drunk now vs. not being hung over tomorrow vs. not fucking up your life ten years later.",
>>>>> "I tend to think about my dreams before I sleep.", "@MichaelAvenatti @SenatorCollins So, if your client was in her 20s, attending parties with teenagers, doesn't that make her at the least immature as hell, or at the worst, a pedophile and a person contributing to the delinquency of minors?",
>>>>> "i wish i could take credit for this"), lat = c(43.6835853, 40.284123,
>>>>> 37.7706565, 40.431389, 31.1688935, 33.9376735), lng = c(-70.3284118,
>>>>> -83.078589, -122.4359785, -79.9806895, -100.0768885, -118.130426
>>>>> ), county_name = c("Cumberland County", "Delaware County", "San Francisco County",
>>>>> "Allegheny County", "Concho County", "Los Angeles County"), fips = c(23005L,
>>>>> 39041L, 6075L, 42003L, 48095L, 6037L), state_name = c("Maine",
>>>>> "Ohio", "California", "Pennsylvania", "Texas", "California"),
>>>>> state_abb = c("ME", "OH", "CA", "PA", "TX", "CA"), urban_level = c("Medium Metro",
>>>>> "Large Fringe Metro", "Large Central Metro", "Large Central Metro",
>>>>> "NonCore (Nonmetro)", "Large Central Metro"), urban_code = c(3L,
>>>>> 2L, 1L, 1L, 6L, 1L), population = c(277308L, 184029L, 830781L,
>>>>> 1160433L, 4160L, 9509611L)), class = "data.frame", row.names = c(NA,
>>>>> -6L))
>>>>>
>>>>> st <- structure(list(terms = c("me abused depressed", "me hurt depressed",
>>>>> "feel hopeless depressed", "feel alone depressed", "i feel helpless",
>>>>> "i feel worthless")), row.names = c(NA, -6L), class = c("tbl_df",
>>>>> "tbl", "data.frame"))
>>>>>
>>>>> On Tue, Oct 16, 2018 at 2:39 PM Nathan Parsons <nathan.f.pars...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> > Thanks all for your patience.
>>>>> > Here’s a second go that is perhaps more
>>>>> > explicative of what it is I am trying to accomplish (and hopefully in plain
>>>>> > text form)...
>>>>> >
>>>>> > I’m using the following packages: tidyverse, purrr, tidytext
>>>>> >
>>>>> > I have a number of tweets in the following form:
>>>>> >
>>>>> > th <- structure(list(status_id = c("x1047841705729306624", "x1046966595610927105",
>>>>> > "x1047094786610552832", "x1046988542818308097", "x1046934493553221632",
>>>>> > "x1047227442899775488"), created_at = c("2018-10-04T13:31:45Z",
>>>>> > "2018-10-02T03:34:22Z", "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z",
>>>>> > "2018-10-02T01:26:49Z", "2018-10-02T20:50:53Z"), text = c("Technique is everything with olympic lifts ! @ Body By John https://t.co/UsfR6DafZt",
>>>>> > "@Subtronics just went back and rewatched ur FBlice with ur CDJs and let me tell you man. You are the fucking messiah",
>>>>> > "@ic4rus1 Opportunistic means short-game. As in getting drunk now vs. not being hung over tomorrow vs. not fucking up your life ten years later.",
>>>>> > "I tend to think about my dreams before I sleep.", "@MichaelAvenatti @SenatorCollins So, if your client was in her 20s, attending parties with teenagers, doesn't that make her at the least immature as hell, or at the worst, a pedophile and a person contributing to the delinquency of minors?",
>>>>> > "i wish i could take credit for this"), lat = c(43.6835853, 40.284123,
>>>>> > 37.7706565, 40.431389, 31.1688935, 33.9376735), lng = c(-70.3284118,
>>>>> > -83.078589, -122.4359785, -79.9806895, -100.0768885, -118.130426
>>>>> > ), county_name = c("Cumberland County", "Delaware County", "San Francisco County",
>>>>> > "Allegheny County", "Concho County", "Los Angeles County"), fips = c(23005L,
>>>>> > 39041L, 6075L, 42003L, 48095L, 6037L), state_name = c("Maine",
>>>>> > "Ohio", "California", "Pennsylvania", "Texas", "California"),
>>>>> > state_abb = c("ME", "OH", "CA", "PA", "TX", "CA"), urban_level = c("Medium Metro",
>>>>> > "Large Fringe Metro", "Large Central Metro", "Large Central Metro",
>>>>> > "NonCore (Nonmetro)", "Large Central Metro"), urban_code = c(3L,
>>>>> > 2L, 1L, 1L, 6L, 1L), population = c(277308L, 184029L, 830781L,
>>>>> > 1160433L, 4160L, 9509611L)), class = c("data.table", "data.frame"
>>>>> > ), row.names = c(NA, -6L), .internal.selfref = )
>>>>> >
>>>>> > I also have a number of search terms in the following form:
>>>>> >
>>>>> > st <- structure(list(terms = c("me abused depressed", "me hurt depressed",
>>>>> > "feel hopeless depressed", "feel alone depressed", "i feel helpless",
>>>>> > "i feel worthless")), row.names = c(NA, -6L), class = c("tbl_df",
>>>>> > "tbl", "data.frame”))
>>>>> >
>>>>> > I am trying to isolate the tweets that
>>>>> > contain all of the words in each of
>>>>> > the search terms, i.e “me” “abused” and “depressed” from the first example
>>>>> > search term, but they do not have to be in order or even next to one
>>>>> > another.
>>>>> >
>>>>> > I am familiar with the dplyr suite of tools and have been attempting to
>>>>> > generate some sort of ‘filter()’ to do this. I am not very familiar with
>>>>> > purrr, but there may be a solution using the map function? I have also
>>>>> > explored the tidytext ‘unnest_tokens’ function which transforms the ’th’
>>>>> > data in the following way:
>>>>> >
>>>>> > > tidytext::unnest_tokens(th, word, text, token = "tweets") -> tt
>>>>> > > head(tt)
>>>>> >               status_id           created_at      lat       lng
>>>>> > 1: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
>>>>> > 2: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
>>>>> > 3: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
>>>>> > 4: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
>>>>> > 5: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
>>>>> > 6: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
>>>>> >          county_name  fips state_name state_abb  urban_level urban_code
>>>>> > 1: Cumberland County 23005      Maine        ME Medium Metro          3
>>>>> > 2: Cumberland County 23005      Maine        ME Medium Metro          3
>>>>> > 3: Cumberland County 23005      Maine        ME Medium Metro          3
>>>>> > 4: Cumberland County 23005      Maine        ME Medium Metro          3
>>>>> > 5: Cumberland County 23005      Maine        ME Medium Metro          3
>>>>> > 6: Cumberland County 23005      Maine        ME Medium Metro          3
>>>>> >    population       word
>>>>> > 1:     277308  technique
>>>>> > 2:     277308         is
>>>>> > 3:     277308 everything
>>>>> > 4:     277308       with
>>>>> > 5:     277308    olympic
>>>>> > 6:     277308      lifts
>>>>> >
>>>>> > but once I have unnested the tokens, I am unable to recombine them back
>>>>> > into tweets.
>>>>> >
>>>>> > Ideally the end result would append a new column to the ‘th’ data that
>>>>> > would flag a tweet that contained all of the search words for any of the
>>>>> > search terms; so the work flow would look like
>>>>> >
>>>>> > 1) look for all search words for one search term in a tweet
>>>>> > 2) if all of the search words in the search term are found, create a flag
>>>>> > (mutate(flag = 1) or some such)
>>>>> > 3) do this for all of the tweets
>>>>> > 4) move on to the next search term and repeat
>>>>> >
>>>>> > Again, my thanks for your patience.
>>>>> >
>>>>> > --
>>>>> >
>>>>> > Nate Parsons
>>>>> > Pronouns: He, Him, His
>>>>> > Graduate Teaching Assistant
>>>>> > Department of Sociology
>>>>> > Portland State University
>>>>> > Portland, Oregon
>>>>> >
>>>>> > 503-725-9025
>>>>> > 503-725-3957 FAX
>>>>> >
>>>>>
>>>>> [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________________________
>>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>

[[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.