I'm wondering about the behavior of the merge function when using factors as by variables. I know that when you combine two factors using c() the results can be odd, as in:
c(factor(1:5),factor(6:10)) which prints: [1] 1 2 3 4 5 1 2 3 4 5 I presume this is because factors are actually stored as integers, with 6,7,8,9,10 stored internally as 1,2,3,4,5. This concerns me somewhat, as I often merge data frames using factors as the by variables. From what I can tell, the merge function creates matches based on factor labels (i.e. the result of as.character(factor_var)) and not the internally stored integers, but I'm wondering if there are particular lurking problems that I should be aware of? I'm especially curious as to how R recalculates the levels of the by variables in outer joins where not every observation is matched, as in: df1<-data.frame(a=factor(c("a","b")),b=1:2) df2<-data.frame(a=factor(c("b","c")),c=2:3) df3<-merge(df1,df2,by="a",all=T) Many thanks! [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.