If you take Jim's example and use grep() with ordinary and then
negative indexing, you get these results:
> x[grep("her", x$input),]
                input output corpusFreq pvolOT pvolRatioOT
2 donate(her,thebook)      P       48.7  68928   0.1899471
6        give(her,it)      P      100.0 157248   0.4333333
8   give(her,thebook)      P        5.7  65232   0.1797619
9      donate(her,it)      P      100.0 152064   0.4190476
> x[-grep("her", x$input),]
                            input output corpusFreq pvolOT pvolRatioOT
1       give(mysister,theoldbook)      P       47.0  56016   0.1543651
3          give(mysister,thebook)      P       73.4  80136   0.2208333
4     donate(mysister,theoldbook)      P       79.0  57024   0.1571429
5               give(mysister,it)      P      100.0 132408   0.3648810
7             donate(mysister,it)      P      100.0 130720   0.3602293
10   give(mylittlesister,thebook)      P       91.8 112032   0.3087302
11 donate(mylittlesister,thebook)      P       98.4 114048   0.3142857
12       donate(mysister,thebook)      P       94.4  82800   0.2281746
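One caveat with the negative-index form, sketched here on a toy vector rather than the data frame above (the names v, hits and keep are made up for illustration). When the pattern matches nothing, grep() returns integer(0), and negating that still gives integer(0), so the subscript selects nothing rather than everything. A logical test avoids the problem; the sketch assumes a version of R that provides grepl(), as current versions do.

```r
# Toy vector for illustration only (not the data frame above)
v <- c("give(mysister,it)", "donate(mysister,it)")

hits <- grep("her", v)   # no element contains "her", so this is integer(0)
v[-hits]                 # -integer(0) is still integer(0): selects nothing

# A logical subscript keeps every non-matching element, matches or not
keep <- !grepl("her", v)
v[keep]
```

So the negative index is probably fine whenever at least one row matches, as it does here, but a logical subscript seems the safer habit inside a function or loop.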
--
David.
On Mar 20, 2009, at 9:39 PM, jim holtman wrote:
grep and regexpr return different values. regexpr returns a vector of
the same length as the input and this can be used to construct a
logical subscript. grep returns a vector of only the matches, in which
case you can have a length of zero if there are no matches. Makes it
harder to create the subsets. You have to test for zero length and
then do something special.
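To make the difference concrete, here is a small sketch on a toy vector (v, idx, pos, test and the other names are made up for illustration, not taken from the original data). It also suggests why the subset() attempt returned everything: wrapping grep() in length() collapses the result to one number, so the condition is a single TRUE that R recycles across all rows, and comparing it with FALSE gives a single FALSE that drops all of them.

```r
# Toy vector for illustration only
v <- c("give(mysister,it)", "give(her,it)", "donate(her,it)")

idx <- grep("her", v)      # positions of the matching elements
pos <- regexpr("her", v)   # one value per element, -1 where nothing matched

# The test from the original post collapses to a single logical value
test <- as.logical(length(grep("her", v)))
all_back <- v[test]            # a lone TRUE is recycled: every element comes back
none_back <- v[test == FALSE]  # a lone FALSE is recycled: nothing comes back

# An element-wise logical vector picks out only the matches
only_her <- v[pos != -1]
```

The point of the sketch is that a subsetting condition needs one TRUE or FALSE per row, which regexpr() compared with -1 supplies and length(grep()) does not.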
On Fri, Mar 20, 2009 at 9:20 PM, Max Bane <max.b...@gmail.com> wrote:
Thanks, Jim (and Mark, who replied off-list) -- that does the trick. I
had tried using an index expression with grep, but that failed in the
same way as the subset method. It is still rather mysterious why this
works with regexpr but not with grep :)
-Max
On Fri, Mar 20, 2009 at 7:57 PM, jim holtman <jholt...@gmail.com> wrote:
Try using regexpr instead:
> x <- read.table(textConnection("input output corpusFreq pvolOT pvolRatioOT
+ give(mysister,theoldbook) P 47.0 56016 0.1543651
+ donate(her,thebook) P 48.7 68928 0.1899471
+ give(mysister,thebook) P 73.4 80136 0.2208333
+ donate(mysister,theoldbook) P 79.0 57024 0.1571429
+ give(mysister,it) P 100.0 132408 0.3648810
+ give(her,it) P 100.0 157248 0.4333333
+ donate(mysister,it) P 100.0 130720 0.3602293
+ give(her,thebook) P 5.7 65232 0.1797619
+ donate(her,it) P 100.0 152064 0.4190476
+ give(mylittlesister,thebook) P 91.8 112032 0.3087302
+ donate(mylittlesister,thebook) P 98.4 114048 0.3142857
+ donate(mysister,thebook) P 94.4 82800 0.2281746"), header=TRUE)
> # use regexpr
> matched <- regexpr("her", x$input) != -1
> notMatched <- !matched
> x[matched,]
                input output corpusFreq pvolOT pvolRatioOT
2 donate(her,thebook)      P       48.7  68928   0.1899471
6        give(her,it)      P      100.0 157248   0.4333333
8   give(her,thebook)      P        5.7  65232   0.1797619
9      donate(her,it)      P      100.0 152064   0.4190476
> x[notMatched,]
                            input output corpusFreq pvolOT pvolRatioOT
1       give(mysister,theoldbook)      P       47.0  56016   0.1543651
3          give(mysister,thebook)      P       73.4  80136   0.2208333
4     donate(mysister,theoldbook)      P       79.0  57024   0.1571429
5               give(mysister,it)      P      100.0 132408   0.3648810
7             donate(mysister,it)      P      100.0 130720   0.3602293
10   give(mylittlesister,thebook)      P       91.8 112032   0.3087302
11 donate(mylittlesister,thebook)      P       98.4 114048   0.3142857
12       donate(mysister,thebook)      P       94.4  82800   0.2281746
On Fri, Mar 20, 2009 at 8:25 PM, Max Bane <max.b...@gmail.com> wrote:
I have some data that looks like this:
> dataP
                                input output corpusFreq pvolOT pvolRatioOT
1       give(my sister, the old book)      P       47.0  56016   0.1543651
5               donate(her, the book)      P       48.7  68928   0.1899471
9           give(my sister, the book)      P       73.4  80136   0.2208333
13    donate(my sister, the old book)      P       79.0  57024   0.1571429
20                give(my sister, it)      P      100.0 132408   0.3648810
21                      give(her, it)      P      100.0 157248   0.4333333
24              donate(my sister, it)      P      100.0 130720   0.3602293
28                give(her, the book)      P        5.7  65232   0.1797619
31                    donate(her, it)      P      100.0 152064   0.4190476
35   give(my little sister, the book)      P       91.8 112032   0.3087302
39 donate(my little sister, the book)      P       98.4 114048   0.3142857
43        donate(my sister, the book)      P       94.4  82800   0.2281746
I would like to extract the subset of this data in which the value of
the "input" column contains the substring "her". I was thinking I
could use the grep function to test for the presence of this
substring. For instance, if a string does not contain it, then grep
returns a zero-length integer vector:
> grep("her", "give(my sister, it)")
integer(0)
And if the string does contain the substring, grep returns a vector of
the indices where the substring is located:
> grep("her", "give(her, it)")
[1] 1
I can thus test for the presence of the substring by converting the
length of the result of grep into a boolean:
> as.logical(length(grep("her", "give(my sister, it)")))
[1] FALSE
> as.logical(length(grep("her", "give(her, it)")))
[1] TRUE
> as.logical(length(grep("her", "give(her, it)"))) == TRUE
[1] TRUE
> as.logical(length(grep("her", "give(my sister, it)"))) == TRUE
[1] FALSE
I would like to use this test as a criterion for constructing a subset
of my data. Unfortunately, it does not work:
> subset(dataP, as.logical(length(grep("her", input)))==TRUE)
                                input output corpusFreq pvolOT pvolRatioOT
1       give(my sister, the old book)      P       47.0  56016   0.1543651
5               donate(her, the book)      P       48.7  68928   0.1899471
9           give(my sister, the book)      P       73.4  80136   0.2208333
13    donate(my sister, the old book)      P       79.0  57024   0.1571429
20                give(my sister, it)      P      100.0 132408   0.3648810
21                      give(her, it)      P      100.0 157248   0.4333333
24              donate(my sister, it)      P      100.0 130720   0.3602293
28                give(her, the book)      P        5.7  65232   0.1797619
31                    donate(her, it)      P      100.0 152064   0.4190476
35   give(my little sister, the book)      P       91.8 112032   0.3087302
39 donate(my little sister, the book)      P       98.4 114048   0.3142857
43        donate(my sister, the book)      P       94.4  82800   0.2281746
As you can see, I get back the whole data set, rather than just the
subset where the input column contains "her". And if I invert the
test, which I would expect to give the subset *not* containing "her",
I instead get the empty subset, rather mysteriously:
> subset(dataP, as.logical(length(grep("her", input)))==FALSE)
[1] input output corpusFreq pvolOT pvolRatioOT
<0 rows> (or 0-length row.names)
The type of the input column is definitely character. To be double sure:
> subset(dataP, as.logical(length(grep("her", as.character(input))))==TRUE)
does the same thing.
Could somebody with more R experience than I have please explain what
I am doing wrong here? I'll be much obliged.
David Winsemius, MD
Heritage Laboratories
West Hartford, CT