H Roark wrote:
I'm wondering about the behavior of the merge function when using factors as by
variables. I know that when you combine two factors using c() the results can
be odd, as in:
c(factor(1:5),factor(6:10))
which prints: [1] 1 2 3 4 5 1 2 3 4 5
I presume this is because factors are actually stored as integers, with
6,7,8,9,10 stored internally as 1,2,3,4,5.
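A small sketch of that storage, kept independent of how c() treats factors (as far as I know, c() on factors returns a factor with combined levels from R 4.1.0 onward, so the output above reflects older versions). The variable name f is only for illustration:

```r
# A factor keeps integer codes plus a character vector of levels.
f <- factor(6:10)

levels(f)                     # "6" "7" "8" "9" "10"
as.integer(f)                 # 1 2 3 4 5  (the internal codes)

# Going through the labels recovers the original values.
as.integer(as.character(f))   # 6 7 8 9 10
```

The codes only mean something together with the levels, which is why comparing codes across two different factors is unsafe.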
This concerns me somewhat, as I often merge data frames using factors as the by
variables. From what I can tell, the merge function creates matches based on
factor labels (i.e. the result of as.character(factor_var)) and not the
internally stored integers, but I'm wondering if there are particular lurking
problems that I should be aware of? I'm especially curious as to how R
recalculates the levels of the by variables in outer joins where not every
observation is matched, as in:
df1 <- data.frame(a = factor(c("a", "b")), b = 1:2)
df2 <- data.frame(a = factor(c("b", "c")), c = 2:3)
df3 <- merge(df1, df2, by = "a", all = TRUE)

As far as I know, there is no reason to be concerned when using merge
as you do.
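One way to check this yourself, as a rough sketch using the data frames from the question: the label "b" has a different integer code in each factor, yet the inner join still pairs the rows by label. The name inner is just for illustration.

```r
df1 <- data.frame(a = factor(c("a", "b")), b = 1:2)
df2 <- data.frame(a = factor(c("b", "c")), c = 2:3)

# Same label, different internal codes in the two factors.
as.integer(df1$a)[df1$a == "b"]   # 2
as.integer(df2$a)[df2$a == "b"]   # 1

# The join matches on the label "b", not on the codes.
inner <- merge(df1, df2, by = "a")
inner                             # one row: a = "b", b = 2, c = 2
```

If merge matched on the codes instead, "a" in df1 would have been paired with "b" in df2, which is not what happens here.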
The magic in merge() is actually being done by rbind(), and you should
read ?rbind, particularly the section "Data frame methods". You can
also study the code of rbind.data.frame to see what it's actually
doing.
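To see the level handling directly, here is a minimal sketch; x, y and stacked are names made up for this example, and df1/df2 are the data frames from the question:

```r
# rbind on data frames combines the levels of factor columns.
x <- data.frame(a = factor(c("a", "b")))
y <- data.frame(a = factor(c("b", "c")))
stacked <- rbind(x, y)

levels(stacked$a)         # "a" "b" "c"
as.character(stacked$a)   # "a" "b" "b" "c"

# The outer join from the question ends up with the same set of levels.
df1 <- data.frame(a = factor(c("a", "b")), b = 1:2)
df2 <- data.frame(a = factor(c("b", "c")), c = 2:3)
df3 <- merge(df1, df2, by = "a", all = TRUE)

levels(df3$a)             # "a" "b" "c"
df3                       # unmatched rows are filled with NA
```

Checking levels() and as.character() on the result, rather than the integer codes, is the safer habit after any operation that combines factors.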
--Erik
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help