On Thu, Aug 9, 2018 at 3:57 AM Joris Meys <jorism...@gmail.com> wrote:
>
> I sent this to Iñaki personally by mistake. Thank you for notifying me.
>
> On Wed, Aug 8, 2018 at 7:53 PM Iñaki Úcar <i.uca...@gmail.com> wrote:
>
> >
> > For what it's worth, I always thought about factors as fundamentally
> > characters, but with restrictions: a subspace of all possible strings.
> > And I'd say that a non-negligible number of R users may think about
> > them in a similar way.
> >
>
> That idea has been a common source of bugs and the most important reason
> why I always explain my students that factors are a special kind of
> numeric(integer), not character. Especially people coming from SPSS see
> immediately the link with categorical variables in that way, and understand
> that a factor is a modeling aid rather than an alternative for characters.
> It is a categorical variable and a more readable way of representing a set
> of dummy variables.
>
> I do agree that some of the factor behaviour is confusing at best, but that
> doesn't change the appropriate use and meaning of factors as categorical
> variables.
>
> Even more, I oppose the ideas that :
>
> 1) factors with different levels should be concatenated.
>
> 2) when combining factors, the union of the levels would somehow be a good
> choice.
>
> Factors with different levels are variables with different information, not
> more or less information. If one factor codes low and high and another
> codes low, mid and high, you can't say whether mid in one factor would be
> low or high in the first one. The second has a higher resolution, and
> that's exactly the reason why they should NOT be combined. Different levels
> indicate a different grouping, and hence that data should never be used as
> one set of dummy variables in any model.
>
> Even when combining factors, the union of levels only makes sense to me if
> there's no overlap between levels of both factors. In all other cases, a
> researcher will need to determine whether levels with the same label do
> mean the same thing in both factors, and that's not guaranteed. And when
> we're talking a factor with a higher resolution and a lower resolution, the
> correct thing to do modelwise is to recode one of the factors so they have
> the same resolution and every level the same definition before you merge
> that data.
>
> So imho the combination of two factors with different levels (or even
> levels in a different order) should give an error. Which R currently
> doesn't throw, so I get there's room for improvement.
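
To make the quoted points concrete, here is a minimal base R sketch. The
data and the helper name combine_strict() are hypothetical and do not come
from the thread or from any package; the helper is only one possible
reading of "should give an error", not an existing R or vctrs function.

```r
# Hypothetical example data: two factors with different resolution.
f1 <- factor(c("low", "high", "low"), levels = c("low", "high"))
f2 <- factor(c("low", "mid", "high"), levels = c("low", "mid", "high"))

# A factor is stored as integer codes plus a "levels" attribute.
typeof(f1)
as.integer(f1)
levels(f1)

# The same label can map to different codes in the two factors:
# "high" is code 2 in f1 but code 3 in f2.
as.integer(f1)[2]
as.integer(f2)[3]

# Sketch of a strict combine (hypothetical helper): refuse to combine
# unless both factors have identical levels in identical order.
combine_strict <- function(x, y) {
  stopifnot(is.factor(x), is.factor(y))
  if (!identical(levels(x), levels(y))) {
    stop("factors have different levels; recode before combining")
  }
  factor(c(as.character(x), as.character(y)), levels = levels(x))
}

combine_strict(f1, f1)      # same levels: combines
# combine_strict(f1, f2)    # different levels: signals an error
```

With this kind of check, the analyst has to recode f1 or f2 to a common
set of levels before the data can be merged, which is the strict
behaviour argued for above.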

I 100% agree with you, and this is the behaviour that vctrs used to have
and dplyr currently has (at least in bind_rows()). But pragmatically, my
experience with dplyr is that people find this behaviour confusing and
unhelpful. And when I played out the full expression of this behaviour in
vctrs, I found that it forced me to think about the levels of factors more
than I'd otherwise like to: it made me think like a programmer, not like a
data analyst.

So in an ideal world, yes, I think factors would have stricter behaviour,
but my sense is that imposing this strictness now will be onerous to most
analysts.

Hadley

-- 
http://hadley.nz

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel