I'm wondering about the behavior of the merge function when using factors as by 
variables. I know that when you combine two factors using c() the results can 
be odd, as in:

c(factor(1:5),factor(6:10))

which prints: [1] 1 2 3 4 5 1 2 3 4 5

I presume this is because factors are stored internally as integer codes that 
index into the levels, so 6,7,8,9,10 are stored as the codes 1,2,3,4,5.
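
A minimal sketch of how that presumption could be checked, using only base R. The variable names (f, codes, labs, orig, combined) are just illustrative. Note that the result of calling c() directly on factors depends on the R version (as far as I know, R 4.1.0 and later return a factor with the combined levels), so the sketch goes through the character labels instead:

```r
# Inspect the integer codes and the levels of a factor.
f <- factor(6:10)

codes <- as.integer(f)   # internal integer codes
labs  <- levels(f)       # character labels that the codes index into
print(codes)
print(labs)

# Recover the original values via the labels, not the codes.
orig <- as.integer(as.character(f))
print(orig)

# Combine two factors through their labels, with explicit levels,
# so the result does not depend on how c() treats factors.
combined <- factor(
  c(as.character(factor(1:5)), as.character(factor(6:10))),
  levels = as.character(1:10)
)
print(combined)
```

Going through as.character() is the conservative choice here, since it works the same way regardless of how c() handles factors.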

This concerns me somewhat, as I often merge data frames using factors as the by 
variables. From what I can tell, the merge function matches on the factor 
labels (i.e. the result of as.character(factor_var)) and not on the internally 
stored integers, but are there any lurking problems that I should be aware of? 
I'm especially curious about how R recalculates the levels of the by variables 
in outer joins where not every observation is matched, as in:

df1 <- data.frame(a = factor(c("a", "b")), b = 1:2)
df2 <- data.frame(a = factor(c("b", "c")), c = 2:3)
df3 <- merge(df1, df2, by = "a", all = TRUE)
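
A rough sketch of how the outer-join result above could be inspected, again with base R only. It prints the levels and codes of the merged key rather than assuming them, and also shows one defensive option: converting the keys to character before merging. The names k1, k2 and df4 are hypothetical helpers for illustration:

```r
df1 <- data.frame(a = factor(c("a", "b")), b = 1:2)
df2 <- data.frame(a = factor(c("b", "c")), c = 2:3)
df3 <- merge(df1, df2, by = "a", all = TRUE)

# Inspect the merged key: its labels, its levels, and its integer codes.
print(df3)
print(levels(df3$a))
print(as.integer(df3$a))

# Check that each row still carries the value it had in its source frame.
print(df3[as.character(df3$a) == "a", ])
print(df3[as.character(df3$a) == "c", ])

# Defensive alternative: merge on character keys so that factor codes
# play no part, then convert back to a factor afterwards if needed.
k1 <- df1
k2 <- df2
k1$a <- as.character(k1$a)
k2$a <- as.character(k2$a)
df4 <- merge(k1, k2, by = "a", all = TRUE)
df4$a <- factor(df4$a)
print(levels(df4$a))
```

Comparing as.character(df3$a) against the b and c columns is a quick way to confirm that rows were paired by label and not by internal code.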

Many thanks!
                                          
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
