Try using regexpr instead:

> x <- read.table(textConnection("input output corpusFreq pvolOT pvolRatioOT
+ give(mysister,theoldbook) P 47.0 56016 0.1543651
+ donate(her,thebook) P 48.7 68928 0.1899471
+ give(mysister,thebook) P 73.4 80136 0.2208333
+ donate(mysister,theoldbook) P 79.0 57024 0.1571429
+ give(mysister,it) P 100.0 132408 0.3648810
+ give(her,it) P 100.0 157248 0.4333333
+ donate(mysister,it) P 100.0 130720 0.3602293
+ give(her,thebook) P 5.7 65232 0.1797619
+ donate(her,it) P 100.0 152064 0.4190476
+ give(mylittlesister,thebook) P 91.8 112032 0.3087302
+ donate(mylittlesister,thebook) P 98.4 114048 0.3142857
+ donate(mysister,thebook) P 94.4 82800 0.2281746"), header=TRUE)
> # use regexpr
> matched <- regexpr("her", x$input) != -1
> notMatched <- !matched
> x[matched,]
                input output corpusFreq pvolOT pvolRatioOT
2 donate(her,thebook)      P       48.7  68928   0.1899471
6        give(her,it)      P      100.0 157248   0.4333333
8   give(her,thebook)      P        5.7  65232   0.1797619
9      donate(her,it)      P      100.0 152064   0.4190476
> x[notMatched,]
                            input output corpusFreq pvolOT pvolRatioOT
1       give(mysister,theoldbook)      P       47.0  56016   0.1543651
3          give(mysister,thebook)      P       73.4  80136   0.2208333
4     donate(mysister,theoldbook)      P       79.0  57024   0.1571429
5               give(mysister,it)      P      100.0 132408   0.3648810
7             donate(mysister,it)      P      100.0 130720   0.3602293
10   give(mylittlesister,thebook)      P       91.8 112032   0.3087302
11 donate(mylittlesister,thebook)      P       98.4 114048   0.3142857
12       donate(mysister,thebook)      P       94.4  82800   0.2281746
>
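For what it is worth, the reason the original subset() call returns every row is that grep() on a whole column returns the indices of the matching elements, so length() of that result is a single number and as.logical() turns it into a single TRUE, which is then recycled across all rows. A per-row test is needed instead. A minimal sketch, assuming an R version that provides grepl() and using a cut-down, hypothetical copy of dataP (only four rows and two columns from the post; the names collapsed, hasHer, withHer and withoutHer are made up for illustration):

```r
# Cut-down, hypothetical copy of dataP: four rows and two columns taken
# from the post, rebuilt here only so the example is self-contained.
dataP <- data.frame(
  input = c("give(my sister, the old book)", "donate(her, the book)",
            "give(her, it)", "donate(my sister, it)"),
  corpusFreq = c(47.0, 48.7, 100.0, 100.0),
  stringsAsFactors = FALSE
)

# grep() returns the indices of the matching elements, so length() of its
# result is one number for the whole column, not one value per row.
collapsed <- as.logical(length(grep("her", dataP$input)))  # a single logical

# grepl() returns one TRUE/FALSE per element, which is what row selection
# needs. fixed = TRUE treats "her" as a literal string, not a regex.
hasHer <- grepl("her", dataP$input, fixed = TRUE)

withHer    <- dataP[hasHer, ]   # rows whose input contains "her"
withoutHer <- dataP[!hasHer, ]  # the remaining rows
```

The regexpr(...) != -1 test above gives the same kind of per-row logical vector, so either form should work for indexing or inside subset().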

On Fri, Mar 20, 2009 at 8:25 PM, Max Bane <max.b...@gmail.com> wrote:
> I have some data that looks like this:
>
>> dataP
>                                 input output corpusFreq pvolOT pvolRatioOT
> 1       give(my sister, the old book)      P       47.0  56016   0.1543651
> 5               donate(her, the book)      P       48.7  68928   0.1899471
> 9           give(my sister, the book)      P       73.4  80136   0.2208333
> 13    donate(my sister, the old book)      P       79.0  57024   0.1571429
> 20                give(my sister, it)      P      100.0 132408   0.3648810
> 21                      give(her, it)      P      100.0 157248   0.4333333
> 24              donate(my sister, it)      P      100.0 130720   0.3602293
> 28                give(her, the book)      P        5.7  65232   0.1797619
> 31                    donate(her, it)      P      100.0 152064   0.4190476
> 35   give(my little sister, the book)      P       91.8 112032   0.3087302
> 39 donate(my little sister, the book)      P       98.4 114048   0.3142857
> 43        donate(my sister, the book)      P       94.4  82800   0.2281746
>
> I would like to extract the subset of this data in which the value of
> the "input" column contains the substring "her". I was thinking I
> could use the grep function to test for the presence of this
> substring. For instance, if a string does not contain it, then grep
> returns a zero length integer vector:
>
>> grep("her", "give(my sister, it)")
> integer(0)
>
> And if the string does contain the substring, grep returns a vector of
> the indices where the substring is located:
>
>> grep("her", "give(her, it)")
> [1] 1
>
> I can thus test for the presence of the substring by converting the
> length of the result of grep into a boolean:
>
>> as.logical(length(grep("her", "give(my sister, it)")))
> [1] FALSE
>> as.logical(length(grep("her", "give(her, it)")))
> [1] TRUE
>> as.logical(length(grep("her", "give(her, it)"))) == TRUE
> [1] TRUE
>> as.logical(length(grep("her", "give(my sister, it)"))) == TRUE
> [1] FALSE
>
> I would like to use this test as a criterion for constructing a subset
> of my data.
> Unfortunately, it does not work:
>
>> subset(dataP, as.logical(length(grep("her", input)))==TRUE)
>                                 input output corpusFreq pvolOT pvolRatioOT
> 1       give(my sister, the old book)      P       47.0  56016   0.1543651
> 5               donate(her, the book)      P       48.7  68928   0.1899471
> 9           give(my sister, the book)      P       73.4  80136   0.2208333
> 13    donate(my sister, the old book)      P       79.0  57024   0.1571429
> 20                give(my sister, it)      P      100.0 132408   0.3648810
> 21                      give(her, it)      P      100.0 157248   0.4333333
> 24              donate(my sister, it)      P      100.0 130720   0.3602293
> 28                give(her, the book)      P        5.7  65232   0.1797619
> 31                    donate(her, it)      P      100.0 152064   0.4190476
> 35   give(my little sister, the book)      P       91.8 112032   0.3087302
> 39 donate(my little sister, the book)      P       98.4 114048   0.3142857
> 43        donate(my sister, the book)      P       94.4  82800   0.2281746
>
> As you can see, I get back the whole data set, rather than just the
> subset where the input column contains "her". And if I invert the
> test, which I would expect to give the subset *not* containing "her",
> I instead get the empty subset, rather mysteriously:
>
>> subset(dataP, as.logical(length(grep("her", input)))==FALSE)
> [1] input output corpusFreq pvolOT pvolRatioOT
> <0 rows> (or 0-length row.names)
>
> The type of the input column is definitely character. To be double sure:
>
>> subset(dataP, as.logical(length(grep("her", as.character(input))))==TRUE)
>
> does the same thing.
>
> Could somebody with more R experience than I have please explain what
> I am doing wrong here? I'll be much obliged.
>
> --
> Max Bane
> PhD Student, Linguistics
> University of Chicago
> b...@uchicago.edu
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?