Re: [R] Identifying words from a list and code as 0 or 1 and words NOT on the list code as 1

Debbie Hahs-Vaughn Mon, 14 Jun 2021 18:48:00 -0700

This also seems to work beautifully!  I'm all for having multiple approaches, so 
I appreciate the time you took to do this, and I am particularly appreciative of 
the annotation on the script.  That definitely helps clarify what's happening, 
and I am sure it will be helpful to others working on similar tasks as well.  
Thanks again, very much!

________________________________
From: Bert Gunter <bgunter.4...@gmail.com>
Sent: Friday, June 11, 2021 7:10 PM
To: Debbie Hahs-Vaughn <deb...@ucf.edu>; Rui Barradas <ruipbarra...@sapo.pt>
Cc: r-help@R-project.org <r-help@r-project.org>
Subject: Re: [R] Identifying words from a list and code as 0 or 1 and words NOT 
on the list code as 1

First, if Rui's solution works for you, I recommend that you stop reading and 
discard this email. Why bother wasting time with stuff you don't need?!

If it doesn't work or if you would like another approach -- perhaps as a check 
-- then read on.

Warning: I am a dinosaur and just use base R functionality, including regular 
expressions, for these sorts of relatively simple tasks. I also eschew pipes. 
So my code for your example is simply:

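# Whole-word/phrase alternation: \\b marks a word edge, "|" means "or"
matchpat <- paste("\\b",coreWords, "\\b", sep = "",collapse = "|")
out <- gsub(matchpat,"",Utterance)        # strip every core word/phrase
Core <- nchar(out) != nchar(Utterance)    # TRUE if anything was stripped
Fringe <- nchar(gsub(" +","",out)) > 0    # TRUE if non-space text remains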

Note that I have given the results as logical TRUE or FALSE. If you insist on 
1's and 0's, just instead do:
Core <- (nchar(out) != nchar(Utterance)) + 0
Fringe <- sign(nchar(gsub(" +","",out)))

Now for an explanation. My approach was simply to create a regular expression 
(regex) match pattern that would match any of your words or phrases. The 
matchpat assignment does this just by logically "or"ing (with the "|" symbol) 
together all your words and phrases, each of which is surrounded by the edge of 
word symbol, "\\b" (so only whole words or phrases are matched). This is 
standard regex stuff, and I could do it rather handily with R's paste() 
function. One word of caution, though: R's ?regex says:
"Long regular expression patterns may or may not be accepted: the POSIX 
standard only requires up to 256 bytes." So what works for your reprex might 
not work for your full list of coreWords. It is possible to work around this by 
repeatedly applying subsets of your coreWords **provided** you make sure that 
you order these subsets by the number of words in each coreWord phrase. That 
is, bigger phrases must be applied first before applying smaller phrases/words 
to the results. This is not hard to do, but adds complexity, and may not be 
necessary. See below for an explanation.
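
A rough sketch of that workaround, not taken from the thread: the helper name 
strip_core and its chunk_size argument are made up for illustration, and the 
small coreWords vector is sample data only. It sorts the entries so that 
multi-word phrases come first, then applies gsub() one chunk at a time.

```r
# Hypothetical helper (an assumption, not code from the thread): remove core
# words/phrases in chunks so that each regex pattern stays short.
strip_core <- function(utterance, coreWords, chunk_size = 50) {
  # Count the words in each entry and put the longest phrases first.
  n_words <- lengths(strsplit(coreWords, " +"))
  ordered <- coreWords[order(n_words, decreasing = TRUE)]
  # Split the ordered entries into consecutive chunks of at most chunk_size.
  chunks <- split(ordered, ceiling(seq_along(ordered) / chunk_size))
  out <- utterance
  for (chunk in chunks) {
    pat <- paste("\\b", chunk, "\\b", sep = "", collapse = "|")
    out <- gsub(pat, "", out)
  }
  out
}

# Sample data only; chunk_size = 2 forces two passes.
coreWords <- c("night", "go", "good night", "you")
Utterance <- c("good night", "you go", "night light")

out <- strip_core(Utterance, coreWords, chunk_size = 2)
Core <- nchar(out) != nchar(Utterance)
Fringe <- nchar(gsub(" +", "", out)) > 0
```

Setting chunk_size to 1 gives the one-word-at-a-time variant; a suitable chunk 
size for a real list would have to be found by trial.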

What the second line of code does is use the gsub() function to remove all 
matches to matchpat -- which, via the "|" construction -- is anything in your 
coreWords list. So if an utterance contains any matches, what remains after 
gsubbing will be shorter -- fewer characters -- than the original utterance. 
The Core assignment checks for this using the nchar() function and returns 
TRUE or FALSE as appropriate. If all the words in the utterance matched core 
words, you would be left with nothing or a bunch of spaces. The Fringe 
assignment first removes all spaces via gsub() and then returns TRUE if there 
are still some characters left (i.e., non-core words) or FALSE if there's 
nothing (0 characters) left.
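
To make the steps above concrete, here is a small self-contained run. The 
coreWords and Utterance vectors are a made-up subset of the data in this 
thread, chosen only to show each case; the four assignments are the same as in 
the code above.

```r
# Illustrative subset of the thread's data (an assumption, not the full lists).
coreWords <- c("a", "is", "it", "that", "what", "yes", "all done")
Utterance <- c("a baby", "small", "yes", "what is that", "all done")

# One alternation pattern; \\b on both sides restricts matches to whole words.
matchpat <- paste("\\b", coreWords, "\\b", sep = "", collapse = "|")

out <- gsub(matchpat, "", Utterance)        # " baby" "small" "" "  " ""
Core <- nchar(out) != nchar(Utterance)      # TRUE FALSE TRUE TRUE TRUE
Fringe <- nchar(gsub(" +", "", out)) > 0    # TRUE TRUE FALSE FALSE FALSE

data.frame(Utterance, out, Core, Fringe)
```

One thing to keep in mind: "\\b" also treats an apostrophe or a hyphen as a 
word edge, so "that" would match inside "that's". Whether that is wanted 
depends on how such forms should be coded.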

Next, why do you have to start with longer phrases first if you have to do 
this sequentially? Suppose you have the phrase "good night" in your phrase 
list, and also the word "night". If you have to do things sequentially instead 
of as one swell foop, and you applied the gsub() with a bunch including only 
"night" first, then "night" will be removed and "good" will be left. Then when 
the bunch containing "good night" is gsubbed afterwards, it won't see the whole 
phrase any more and "good" will be left in, which is *not* what you said you 
wanted.
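
The "good night" case can be checked directly. This is only a two-step 
illustration with hard-coded patterns; the variable names wrong and right are 
mine.

```r
Utterance <- "good night"

# Single word first: "night" is removed, so the phrase can no longer match.
step1 <- gsub("\\bnight\\b", "", Utterance)        # "good "
wrong <- gsub("\\bgood night\\b", "", step1)       # still "good "

# Longest phrase first: the whole utterance is removed.
step1 <- gsub("\\bgood night\\b", "", Utterance)   # ""
right <- gsub("\\bnight\\b", "", step1)            # ""

fringe_wrong <- nchar(gsub(" +", "", wrong)) > 0   # TRUE: "good" flagged as fringe
fringe_right <- nchar(gsub(" +", "", right)) > 0   # FALSE
```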

Finally, it is of course possible to do these things by sequentially applying 
one word/phrase at a time in a loop (again, longest phrases first for the same 
reason as above), but I believe this might take quite a while with a big list 
of coreWords (and Utterances). The above approach using "|" vectorizes things 
and takes advantage of the power of the regex engine, so I think it will be 
more efficient **if it's accepted.**  But if you run into the problem of 
pattern length limitations, then sequentially, one at a time, might be simpler. 
My judgments of computational efficiency are often wrong anyway.

Note: I think my approach works, but I would appreciate an on-list response if 
I have erred. Also, even if correct, alternative cleverer approaches are always 
welcome.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and 
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Fri, Jun 11, 2021 at 11:54 AM Debbie Hahs-Vaughn <deb...@ucf.edu> wrote:
Thank you for noting this. The utterance has to match the exact phrase (e.g., 
"all done") for it to constitute a match in the utterance.

________________________________
From: Bert Gunter <bgunter.4...@gmail.com>
Sent: Friday, June 11, 2021 2:42 PM
To: Debbie Hahs-Vaughn <deb...@ucf.edu>
Cc: r-help@R-project.org <r-help@r-project.org>
Subject: Re: [R] Identifying words from a list and code as 0 or 1 and words NOT 
on the list code as 1

Note that your specification is ambiguous. "all done" is not a single word -- 
it's a phrase. So what do you want to do if:

1) "all" and/or "done" are also among your core words?
2) "I'm all done" is another of your core phrases?

The existence of phrases in your core list allows such conflicts to arise. Do 
you claim that phrases would be chosen so that this can never happen? -- or 
what is your specification if they can (what constitutes a match and in what 
priority)?

Bert Gunter

"The trouble with having an open mind is that people keep coming along and 
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Fri, Jun 11, 2021 at 10:06 AM Debbie Hahs-Vaughn <deb...@ucf.edu> wrote:
I am working with utterances, statements spoken by children.  From each 
utterance, if one or more words in the statement match a predefined list of 
multiple 'core' words (probably 300 words), then I want to input '1' into 
'Core' (and if none, then input '0' into 'Core').

If there are one or more words in the statement that are NOT core words, then I 
want to input '1' into 'Fringe' (and if there are only core words and nothing 
extra, then input '0' into 'Fringe').  I will not have a list of Fringe words.

Basically, right now I have a child ID and only the utterances.  Here is a 
snippet of my data.

ID      Utterance
1       a baby
2       small
3       yes
4       where's his bed
5       there's his bed
6       where's his pillow
7       what is that on his head
8       hey he has his arm stuck here
9       there there's it
10      now you're gonna go night-night
11      and that's the thing you can turn on
12      yeah where's the music box
13      what is this
14      small
15      there you go baby

The following code runs but isn't doing exactly what I need--which is:  1) the 
ability to detect words from the list and define as core; 2) the ability to 
search the utterance and if there are any words in the utterance that are NOT 
core, to identify those as '1' as I will not have a list of fringe words.

```

library(dplyr)
library(stringr)
library(tidyr)

coreWords <-c("I", "no", "yes", "my", "the", "want", "is", "it", "that", "a", 
"go", "mine", "you", "what", "on", "in", "here", "more", "out", "off", "some", 
"help", "all done", "finished")

str_detect(df,)

dfplus <- df %>%
  mutate(id = row_number()) %>%
  separate_rows(Utterance, sep = ' ') %>%
  mutate(Core = + str_detect(Utterance, str_c(coreWords, collapse = '|')),
         Fringe = + !Core) %>%
  group_by(id) %>%
  mutate(Core = + (sum(Core) > 0),
         Fringe = + (sum(Fringe) > 0)) %>%
  slice(1) %>%
  select(-Utterance) %>%
  left_join(df) %>%
  ungroup() %>%
  select(Utterance, Core, Fringe, ID)

```

The dput() code is:

```
structure(list(Utterance = c("a baby", "small", "yes", "where's his bed",
"there's his bed", "where's his pillow", "what is that on his head",
"hey he has his arm stuck here", "there there's it",
"now you're gonna go night-night",
"and that's the thing you can turn on", "yeah where's the music box",
"what is this", "small", "there you go baby ", "what is this for ",
"a ", "and the go goodnight here ", "and what is this ", " what's that sound ",
"what does she say ", "what she say",
"should I turn the on so Laura doesn't cry ",
"what is this ", "what is that ", "where's clothes ",
" where's the baby's bedroom ",
"that might be in dad's bed+room ", "yes ", "there you go baby ",
"you're welcome "), Core = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), Fringe = c(0L, 0L, 0L, 1L, 1L, 1L,
0L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), ID = 1:31), row.names = c(NA,
-31L), class = c("tbl_df", "tbl", "data.frame"))

```

The first 10 rows of output look like this:

Utterance       Core    Fringe  ID
1       a baby  1       0       1
2       small   1       0       2
3       yes     1       0       3
4       where's his bed 1       1       4
5       there's his bed 1       1       5
6       where's his pillow      1       1       6
7       what is that on his head        1       0       7
8       hey he has his arm stuck here   1       1       8
9       there there's it        1       0       9
10      now you're gonna go night-night 1       1       10

For example, in line 1 of the output, 'a' is a core word so '1' for core is 
correct.  However, 'baby' should be picked up as fringe so there should be '1', 
not '0', for fringe. Lines 7 and 9 also have words that should be identified as 
fringe but are not.

Additionally, it seems like if the utterance has parts of a core word in it, 
it's being counted. For example, 'small' is identified as a core word even 
though it's not (but 'all done' is a core word). 'Where's his bed' is 
identified as core and fringe, although none of the words are core.
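
A minimal check of that symptom, using base R grepl() as a stand-in for 
str_detect() (an assumption made for brevity) and a made-up subset of the core 
words. Without word edges, short core words such as "a" and "is" match inside 
longer words; wrapping each entry in "\\b" prevents this.

```r
# Illustrative subset of the core words (an assumption, not the full list).
coreWords <- c("a", "all done", "is", "it")

loose  <- paste(coreWords, collapse = "|")                   # no word edges
strict <- paste0("\\b", coreWords, "\\b", collapse = "|")    # whole words only

words <- c("small", "his", "baby")
loose_hit  <- grepl(loose, words)    # TRUE TRUE TRUE
strict_hit <- grepl(strict, words)   # FALSE FALSE FALSE
```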

Any suggestions on what is happening and how to correct it are greatly 
appreciated.

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
