Since R is object-oriented, data frame set operations should be the natural
operations for their class. There are, I suppose, two natural ways to define
them: column-wise (variable-wise) and row-wise (observation-wise). The
row-wise one seems more natural and more useful to me.
The current implementation is column-wise, though it is inconsistent in its
return class (the man page defines return modes, but is silent on return
classes):
> class(union(df1,df2))
[1] "list"
> class(intersect(df1,df2))
[1] "data.frame"
> class(setdiff(df1,df2))
[1] "data.frame"
Unlike other cases, I don't think this inconsistency brings any user
convenience (though it may reflect programmer convenience).
The column-wise interpretation makes sense in cases where variables with the
same vector value (ignoring the variable name) can be considered redundant.
I suppose there are cases where that could be useful, though it does seem
hazardous.
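
To make the column-wise reading concrete, here is a small sketch. The data
frame df_cols and its column names are made up for illustration. Treating the
data frame as a list of vectors, two variables count as the same element
whenever their values match, whatever they are called:

```r
# Made-up example data: columns a and b hold the same values
# under different names.
df_cols <- data.frame(a = c(1, 2, 3), b = c(1, 2, 3), c = c(4, 5, 6))

# As a list of column vectors, b is flagged as a repeat of a,
# because the comparison looks only at the values, not the names.
dup_cols <- duplicated(as.list(df_cols))

# Dropping the repeats keeps one representative per distinct column.
distinct_cols <- df_cols[!dup_cols]
```

Here b is dropped only because it happens to equal a, which is the sort of
thing I have in mind when I call this reading hazardous.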
The row-wise interpretation makes sense in cases where observations with the
same values for all variables can be considered redundant. That seems to me
a much more useful interpretation. The union, intersection, and set
difference of two sets of observations would seem to all be highly useful.
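
As a rough sketch of what the row-wise operations could look like, assuming
both data frames have the same columns in the same order: the helper names
row_union, row_intersect, row_setdiff and rows_in are made up here and are not
part of base R, and this is only one possible way to write them.

```r
# Hypothetical helpers, not part of base R. They assume a and b
# have identical column names and compatible column types.

# For each distinct row of a, is it also a row of b?
rows_in <- function(ua, b) {
  ub <- unique(b)
  # Stack b's distinct rows on top of a's; any row in the lower
  # part that is flagged as duplicated must already occur in b.
  dup <- duplicated(rbind(ub, ua))
  dup[nrow(ub) + seq_len(nrow(ua))]
}

row_union <- function(a, b) unique(rbind(a, b))

row_intersect <- function(a, b) {
  ua <- unique(a)
  ua[rows_in(ua, b), , drop = FALSE]
}

row_setdiff <- function(a, b) {
  ua <- unique(a)
  ua[!rows_in(ua, b), , drop = FALSE]
}

# Made-up example data: observations (2, "b") and (3, "c") are shared.
df1 <- data.frame(x = c(1, 2, 2, 3), y = c("a", "b", "b", "c"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(x = c(2, 3, 4), y = c("b", "c", "d"),
                  stringsAsFactors = FALSE)

u <- row_union(df1, df2)      # distinct observations in either
i <- row_intersect(df1, df2)  # distinct observations in both
d <- row_setdiff(df1, df2)    # distinct observations only in df1
```

All three return a data.frame, so the return class stays consistent, and
repeated observations collapse just as repeated scalars do in the vector case.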
-s
On Sat, May 30, 2009 at 10:21 AM, G. Jay Kerns <[email protected]> wrote:
> On Sat, May 30, 2009 at 8:50 AM, Stavros Macrakis <[email protected]>
> wrote:
> > It seems to me that, abstractly, a dataframe is just as
> > straightforwardly a sequence of tuples/observations as a vector is a
> > sequence of scalars. R's convention is that a 1-vector represents a
> > scalar, and similarly, a 1-dataframe can represent a tuple (though it
> > can also be represented as a list). Of course, a dataframe can *also*
> > be interpreted as a list of vectors.
> >
> > Just as a sequence of scalars can be interpreted as a set of scalars
> > by the order- and repetition-ignoring homomorphism, so can a sequence
> > of tuples. It seems to me natural that set operations should follow
> > that interpretation.
> >
> > -s
>
>
> After a good night's sleep, the documentation says clearly that
> setdiff() operates on two vectors (of the same mode), so my message
> would be an example of "garbage in, garbage out".
>
> It would be nice if there were an error thrown, but surely there are
> more mission critical problems than this one.
>
> Thanks anyway.
> Jay
>