I have some data that looks like this:

> dataP
                                input output corpusFreq pvolOT pvolRatioOT
1       give(my sister, the old book)      P       47.0  56016   0.1543651
5               donate(her, the book)      P       48.7  68928   0.1899471
9           give(my sister, the book)      P       73.4  80136   0.2208333
13    donate(my sister, the old book)      P       79.0  57024   0.1571429
20                give(my sister, it)      P      100.0 132408   0.3648810
21                      give(her, it)      P      100.0 157248   0.4333333
24              donate(my sister, it)      P      100.0 130720   0.3602293
28                give(her, the book)      P        5.7  65232   0.1797619
31                    donate(her, it)      P      100.0 152064   0.4190476
35   give(my little sister, the book)      P       91.8 112032   0.3087302
39 donate(my little sister, the book)      P       98.4 114048   0.3142857
43        donate(my sister, the book)      P       94.4  82800   0.2281746

I would like to extract the subset of this data in which the value of the "input" column contains the substring "her". I was thinking I could use the grep function to test for the presence of this substring. For instance, if a string does not contain it, then grep returns a zero-length integer vector:

> grep("her", "give(my sister, it)")
integer(0)

And if the string does contain the substring, grep returns a vector of the indices where the substring is located:

> grep("her", "give(her, it)")
[1] 1

I can thus test for the presence of the substring by converting the length of the result of grep into a boolean:

> as.logical(length(grep("her", "give(my sister, it)")))
[1] FALSE
> as.logical(length(grep("her", "give(her, it)")))
[1] TRUE
> as.logical(length(grep("her", "give(her, it)"))) == TRUE
[1] TRUE
> as.logical(length(grep("her", "give(my sister, it)"))) == TRUE
[1] FALSE

I would like to use this test as a criterion for constructing a subset of my data. Unfortunately, it does not work:

> subset(dataP, as.logical(length(grep("her", input)))==TRUE)
                                input output corpusFreq pvolOT pvolRatioOT
1       give(my sister, the old book)      P       47.0  56016   0.1543651
5               donate(her, the book)      P       48.7  68928   0.1899471
9           give(my sister, the book)      P       73.4  80136   0.2208333
13    donate(my sister, the old book)      P       79.0  57024   0.1571429
20                give(my sister, it)      P      100.0 132408   0.3648810
21                      give(her, it)      P      100.0 157248   0.4333333
24              donate(my sister, it)      P      100.0 130720   0.3602293
28                give(her, the book)      P        5.7  65232   0.1797619
31                    donate(her, it)      P      100.0 152064   0.4190476
35   give(my little sister, the book)      P       91.8 112032   0.3087302
39 donate(my little sister, the book)      P       98.4 114048   0.3142857
43        donate(my sister, the book)      P       94.4  82800   0.2281746

As you can see, I get back the whole data set, rather than just the subset where the input column contains "her".

And if I invert the test, which I would expect to give the subset *not* containing "her", I instead get the empty subset, rather mysteriously:

> subset(dataP, as.logical(length(grep("her", input)))==FALSE)
[1] input       output      corpusFreq  pvolOT      pvolRatioOT
<0 rows> (or 0-length row.names)

The type of the input column is definitely character. To be double sure:

> subset(dataP, as.logical(length(grep("her", as.character(input))))==TRUE)

does the same thing.

Could somebody with more R experience than I have please explain what I am doing wrong here? I'll be much obliged.

-- 
Max Bane
PhD Student, Linguistics
University of Chicago
b...@uchicago.edu
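
A minimal sketch of what may be going on, offered as a reproducible cut-down case rather than a definitive diagnosis. It assumes a hypothetical four-row copy of dataP holding only the input column (the original object is not available, so the data frame below is a reconstruction). The names hits, one_flag, per_row, with_her and without_her are illustrative. The sketch prints nothing; it only shows that grep applied to a whole column returns the positions of the matching elements, so the length-based test collapses to a single TRUE or FALSE for the entire column, whereas grepl yields one logical value per row.

```r
# Hypothetical cut-down copy of dataP: only the input column and four of
# the rows shown in the post (assumption: the column is plain character).
dataP <- data.frame(
  input = c("give(my sister, it)", "give(her, it)",
            "donate(my sister, it)", "donate(her, it)"),
  stringsAsFactors = FALSE
)

# Applied to a whole column, grep() returns the positions of the elements
# that match, not a per-element yes/no answer.
hits <- grep("her", dataP$input)

# length(hits) is one number for the entire column, so the test becomes a
# single logical value, which row indexing then recycles over every row.
one_flag <- as.logical(length(hits))

# grepl() returns one TRUE/FALSE per element of the column.
per_row <- grepl("her", dataP$input)

# A per-row condition can be used directly in subset(), and negated with `!`.
with_her    <- subset(dataP, grepl("her", input))
without_her <- subset(dataP, !grepl("her", input))
```

If this reading is right, the original call kept every row because the single TRUE was recycled, and the inverted call dropped every row because the single FALSE was recycled in the same way.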