Marc Schwartz (via MN) wrote:
> On Mon, 2006-06-05 at 13:45 -0700, Bill Dunlap wrote:
>
>>On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote:
>>
>>>Based upon an offlist communication this morning, I am somewhat confused
>>>(more than I usually am on most Monday mornings...) about the use of
>>>grep() with factors as the 'x' argument.
>>> ...
>>>
>>> > grep("[a-z]", letters)
>>>
>>> [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
>>>[23] 23 24 25 26
>>>
>>> > grep("[a-z]", factor(letters))
>>>
>>>numeric(0)
>>
>>I was recently surprised by this also.  In addition, if
>>R's grep did support factors in this way, what sort of
>>object (factor or character) should it return when value=T?
>>I recently changed Splus's grep to return a character vector in
>>that case.
>>
>>   Splus> grep("[def]", letters[26:1])
>>   [1] 21 22 23
>>   Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1]))
>>   [1] 21 22 23
>>   Splus> grep("[def]", letters[26:1], value=T)
>>   [1] "f" "e" "d"
>>   Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1]), value=T)
>>   [1] "f" "e" "d"
>>   Splus> class(.Last.value)
>>   [1] "character"
>>
>>R does this when grepping an integer vector.
>>   R> grep("1", 0:11, value=T)
>>   [1] "1"  "10" "11"
>>help(grep) says it returns "the matching elements themselves", but
>>doesn't say if "themselves" means before or after the conversion to
>>character.
>
> Bill,
>
> My first inclination for the return value when used on a factor would be
> the indexed factor elements where grep() would otherwise simply return
> the indices. This would also maintain the factor levels from the
> original source factor since "[".factor would normally retain these when
> drop = FALSE.
>
> For example:
>
> # Return the indexed values as would otherwise be done
> # in grep() if the factor to character coercion takes place:
> # Use the same indices 21:23 as above
>
> > factor(letters[26:1], levels = letters[26:1])[21:23]
>
> [1] f e d
> Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a
>
>
> From my read of the C code in do_grep() in character.c (again, if
> correct), when 'value = TRUE', the C code appears to first get the
> indices and then build the returned vector from the indexed values from
> the source vector in a for() loop. So this should not be a problem
> philosophically.
>
> However, given your example of the coercion of integers, perhaps with
> grep() at least, consistent behavior would dictate that return values
> are always character vectors. These could then be coerced manually back
> to a factor, using the original levels, as may be required:
>
> > factor.letters <- factor(letters[26:1], levels=letters[26:1])
> > factor.letters
>
> [1] z y x w v u t s r q p o n m l k j i h g f e d c b a
> Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a
>
> > grep("[def]", as.character(factor.letters))
>
> [1] 21 22 23
>
> > res <- grep("[def]", as.character(factor.letters), value = TRUE)
> > res
>
> [1] "f" "e" "d"
>
> > factor(res, levels = levels(factor.letters))
>
> [1] f e d
> Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a
>
> Which of course is the same result I proposed initially above.
>
> I could be convinced either way. The concern of course being that (given
> the offlist replies I have received today) even experienced users are
> getting bitten by the current behavior versus their intuitive
> expectations, which are at least loosely supported by the documentation.

I'll chime in on-list to say that I have had the same experience with
expecting grep to coerce to text.  Despite the question of return values,
I think of grep (not equivalent to the Unix command, I understand, but it
does have the same name) as operating on "text", not the factor levels
themselves.  Not a big deal, but it does lead to sometimes hard-to-track
bugs if one is not careful to put in as.character all the time.

Sean

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
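
For anyone who wants to avoid sprinkling as.character() through their code, the coercion discussed in this thread could be wrapped once. What follows is only a minimal sketch: grep_labels() is a hypothetical name, not anything in base R, and it simply assumes that matching against the factor's labels is what is wanted. It reuses the factor from the examples above.

```r
# Hypothetical helper (not part of base R): always match against the labels.
grep_labels <- function(pattern, x, value = FALSE, ...) {
  # as.character() on a factor yields its labels, not the integer codes
  grep(pattern, as.character(x), value = value, ...)
}

f <- factor(letters[26:1], levels = letters[26:1])

idx  <- grep_labels("[def]", f)                # positions of matching elements
hits <- grep_labels("[def]", f, value = TRUE)  # matching labels, as character

# Either route gets back to a factor that keeps the original levels:
f[idx]
factor(hits, levels = levels(f))
```

Returning a character vector for value = TRUE follows the "always character" option raised in the thread; callers who want a factor back can index with idx or rebuild with the original levels, as in the last two lines.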