On Mar 15, 2015, at 1:06 PM, Jocelyn Ireson-Paine wrote:

> David, and also William Dunlap, thanks for taking the time to reply, with
> examples. Both your answers are very helpful.
>
> William noted that 'reshape2' is not 'R', but a user-contributed package that
> runs in R. I agree, and I'm not confusing one with the other. But what I
> don't like is that somewhere in the interaction between them, generality is
> lost.
>
> I contrast this with a means of aggregating data that I use when programming
> in Lisp, Prolog, and other "functional" languages. This is aggregation by
> "folding" a list of values. The idea is explained at http://wiki.tcl.tk/17983
> , "Fold in functional programming" by "juef", amongst other places. He/she
> gives a common example: take a list of values, such as
>   (1 2 3 4)
> and "fold" the + operation over it. Doing so runs + along the list forming
> intermediate sums and adding the next value to them, until all values have
> been summed.
>
> Here, 'fold' is analogous to dcast, with + being analogous to the function
> dcast takes for its fun.aggregate argument. But the good thing about 'fold'
> is that it does not restrict the type of result that its aggregation function
> can return. The result can be a number, a string, a list, a list of lists, an
> array, or any other type. I'd like dcast to be as general.
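For what it's worth, here is a rough base-R sketch of that fold idea with a non-numeric result. It is only an illustration under my own assumptions: the helper names fold_to_set and includes_all and the list column Days are invented here, and nothing in it uses reshape2 or dcast.

```r
# Sample data from earlier in the thread (ID and Day columns only).
dta <- data.frame(ID  = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
                  Day = c(1, 2, 4, 7, 2, 3, 1, 3, 4, 8))

# A fold whose accumulator is a vector rather than a single number:
# it collects the distinct values seen so far (hypothetical helper).
fold_to_set <- function(x) {
  Reduce(function(acc, v) union(acc, v), x, numeric(0))
}

# One "set" of days per patient. split() groups the days by ID and
# lapply() folds each group, whatever its length.
days_by_id <- lapply(split(dta$Day, dta$ID), fold_to_set)

# Store the sets in a list column (the name 'Days' is an assumption).
flattened <- data.frame(ID = as.numeric(names(days_by_id)))
flattened$Days <- unname(days_by_id)

# TRUE for each element of 'sets' that contains every value in 'required'
# (hypothetical helper).
includes_all <- function(sets, required) {
  vapply(sets, function(s) all(required %in% s), logical(1))
}

# IDs of the patients who were measured on both day 1 and day 4.
flattened[includes_all(flattened$Days, c(1, 4)), "ID"]
```

With the sample data this picks out IDs 1 and 3. The list column works because a data frame column may itself be a list with one element per row, so each row can hold a vector of any length.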
Since `dcast` is part of what I have seen called the "hadleyverse", your feature request should go to Hadley Wickham. Base R has the Reduce function:

> Reduce("+", 1:4)
[1] 10
> Reduce("+", 1:4, accumulate=TRUE)
[1]  1  3  6 10

It's described on the help page ?Reduce, along with other functional programming methods that were modeled after Lisp macros.

--
David.

>
> Jocelyn Ireson-Paine
> 07768 534 091
> http://www.jocelyns-cartoons.uk
> http://www.j-paine.org
>
> On Thu, 12 Mar 2015, David Barron wrote:
>
>> Most of this question is over my head, I'm afraid, but looking at what
>> I think is the crux of your question, couldn't you achieve the results
>> you want in two steps, like this:
>>
>>   dta <- data.frame(ID=c(1,1,1,1,2,2,3,3,3,3),
>>   Day=c(1,2,4,7,2,3,1,3,4,8),Pain=c(10,9,7,2,8,7,10,6,6,2))
>>
>>   l1 <- tapply(dta$Day, dta$ID, function(x) x)
>>
>>   sapply(l1, function(x) all(c(1,4,8) %in% x ))
>>
>> I'm not sure you really need to do it in two steps, but given you said
>> you wanted a flattened data frame with the Days as a vector, this will
>> give it to you. Actually, l1 is a list, but you can turn it in to a
>> data frame if you really want to. In the sapply call I changed the
>> days required to 1, 4 and 8 to show that it does return TRUE if there
>> is a patient that meets the required criterion.
>>
>> David
>>
>> On 12 March 2015 at 07:55, Jocelyn Ireson-Paine <p...@j-paine.org> wrote:
>>> This is a fairly long question. It's about a problem that's easy to specify
>>> in terms of sets, but that I found hard to solve in R by using them, because
>>> of the strange design of R data structures. In explaining it, I'm going to
>>> touch on the reshape2 library, dcast, sets, and the non-orthogonality of R.
>>>
>>> My problem stems from some drug-trial data that I've been analysing for the
>>> Oxford Pain Research Unit. Here's an example. Imagine a data frame
>>> representing patients in a trial of pain-relief drugs. The trial lasts for
>>> ten days.
>>> Each patient's pain is measured once a day, and the values are
>>> recorded in a data frame, one row per patient per day. Like this:
>>>
>>>   ID Day Pain
>>>    1   1   10
>>>    1   2    9
>>>    1   4    7
>>>    1   7    2
>>>    2   2    8
>>>    2   3    7
>>>    3   1   10
>>>    3   3    6
>>>    3   4    6
>>>    3   8    2
>>>
>>> Unfortunately, many patients have measurements missing. Thus, in the example
>>> above, patient 1 was only observed on days 1, 2, 4, and 7, rather than on
>>> the full ten days. But a patient's measurements are only useful to us if
>>> that patient has a certain minimum set of days, so I need to check for
>>> patients who lack those days. Let's assume that these days are numbers 1, 4,
>>> and 9.
>>>
>>> Such a question is trivial to state in terms of sets. Let D(i) denote the
>>> set of days on which patient i was measured: then I want to find out which
>>> patients p, or how many patients p, have a D(p) that contains the set
>>> {1,4,9}.
>>>
>>> The obvious way to solve this is to write a function that tells me whether
>>> one set is a superset of another. Then flatten my data frame so that it
>>> looks like this:
>>>
>>>   ID Days
>>>    1 {1,2,4,7}
>>>    2 {2,3}
>>>    3 {1,3,4,8}
>>>
>>> And finally, filter it by some R translation of
>>>
>>>   flattened[ includes( flattened$Days, {1,4,9} ), ]
>>>
>>> I started with the built-in functions that operate on sets represented as
>>> vectors. These are described in
>>> https://stat.ethz.ch/R-manual/R-devel/library/base/html/sets.html ,
>>> "Set Operations". For example:
>>>
>>>   > union( c(1,2,3), c(2,4,6) )
>>>   [1] 1 2 3 4 6
>>>   > intersect( c(1,2,3), c(2,4,6) )
>>>   [1] 2
>>>
>>> So I first wrote a set-inclusion function:
>>>
>>>   # True if vector a is a superset of vector b.
>>>   #
>>>   includes <- function( a, b )
>>>   {
>>>     return( setequal( union( a, b ), a ) )
>>>   }
>>>
>>> Here are some sample calls:
>>>
>>>   > includes( c(1), c() )
>>>   [1] TRUE
>>>   > includes( c(1), c(1) )
>>>   [1] TRUE
>>>   > includes( c(1), c(1,2) )
>>>   [1] FALSE
>>>   > includes( c(2,1), c(1,2) )
>>>   [1] TRUE
>>>   > includes( c(2,1,3), c(1,2) )
>>>   [1] TRUE
>>>   > includes( c(2,1,3), c(4,1,2) )
>>>   [1] FALSE
>>>
>>> I then made myself a variable holding my sample data frame:
>>>
>>>   df <- data.frame( ID = c( 1, 1, 1, 1, 2, 2, 3, 3, 3, 3 )
>>>                   , Day = c( 1, 2, 4, 7, 2, 3, 1, 3, 4, 8 )
>>>                   )
>>>
>>> And I tried flattening it, using dcast and an aggregator function as
>>> described in (amongst many other places)
>>> http://seananderson.ca/2013/10/19/reshape.html , "An Introduction to
>>> reshape2" by Sean C. Anderson.
>>>
>>> The idea behind this is that (for my data) dcast will call the aggregator
>>> function once per patient ID, passing it all the Day values for the patient.
>>> The aggregator must combine them in some way, and dcast puts its results
>>> into a new column. For example, here's an aggregator that merely sums its
>>> arguments:
>>>
>>>   aggregator_making_sum <- function( ... )
>>>   {
>>>     return( sum( ... ) )
>>>   }
>>>
>>> If I call it, I get this:
>>>
>>>   > dcast( df, ID~. , fun.aggregate=aggregator_making_sum )
>>>   Using Day as value column: use value.var to override.
>>>     ID  .
>>>   1  1 14
>>>   2  2  5
>>>   3  3 16
>>>
>>> And here's an aggregator that converts the argument list to a string:
>>>
>>>   aggregator_making_string <- function( ... )
>>>   {
>>>     return( toString( ... ) )
>>>   }
>>>
>>> Calling it gives this:
>>>
>>>   > dcast( df, ID~. , fun.aggregate=aggregator_making_string )
>>>   Using Day as value column: use value.var to override.
>>>     ID          .
>>>   1  1 1, 2, 4, 7
>>>   2  2       2, 3
>>>   3  3 1, 3, 4, 8
>>>
>>> In both of these, the three dots denote all arguments to the aggregator, as
>>> explained in Burns Statistics's
>>> http://www.burns-stat.com/the-three-dots-construct-in-r/ . My first
>>> aggregator sums them; my second converts them to a string. Both uses of
>>> dcast generate a data frame with a column named "." , which contains the
>>> aggregates. In the second data frame, that may not be so clear: the first
>>> column of numbers is row numbers; the second column of numbers are the IDs;
>>> and the remaining columns form the strings, belonging to "." .
>>>
>>> But what I want is neither a sum nor a string but a set. Specifically, a set
>>> that's compatible with the R set operations I called in my 'includes'
>>> function. Since these sets are vectors, my aggregator should just pack its
>>> arguments into a vector:
>>>
>>>   aggregator_making_set <- function( ... )
>>>   {
>>>     return( c( ... ) )
>>>   }
>>>
>>> But when I tried it, I got an error:
>>>
>>>   > dcast( df, ID~. , fun.aggregate=aggregator_making_set )
>>>   Using Day as value column: use value.var to override.
>>>   Error in vapply(indices, fun, .default) : values must be length 0,
>>>   but FUN(X[[1]]) result is length 4
>>>
>>> It's not an informative error message, because it expects me to know how
>>> dcast is coded. And I'm surprised that values need to be length 0: length 1
>>> would seem more appropriate. But perhaps it's trying to say that 'c' doesn't
>>> work on three-dots argument lists. Let's test that hypothesis:
>>>
>>>   test_c_on_three_dots <- function( ... )
>>>   {
>>>     return( c( ... ) )
>>>   }
>>>
>>>   > test_c_on_three_dots( 1 )
>>>   [1] 1
>>>   > test_c_on_three_dots( 1, 2 )
>>>   [1] 1 2
>>>   > test_c_on_three_dots( 1, 2, 3 )
>>>   [1] 1 2 3
>>>
>>> So 'c' does indeed work on three-dots argument lists. The error must have
>>> been caused by something else.
>>> Let's try making a set and putting it into a
>>> data frame directly:
>>>
>>>   > df <- data.frame( col1=c(1,2), col2=c(3,4) )
>>>   > df
>>>     col1 col2
>>>   1    1    3
>>>   2    2    4
>>>   > set <- union( c(5,6), c(6,7) )
>>>   > set
>>>   [1] 5 6 7
>>>   > df[ 1, ]$col1 <- set
>>>   Error in `$<-.data.frame`(`*tmp*`, "col1", value = c(5, 6, 7)) :
>>>   replacement has 3 rows, data has 1
>>>
>>> So that's the problem. Already in 1968, there was a language named Algol68
>>> which had arrays and, in order to make things easy for its programmers,
>>> allowed you to create arrays of every data type the language provided. You
>>> could have arrays of Booleans, arrays of integers, arrays of records, arrays
>>> of discriminated unions, arrays of procedures, arrays of I/O formats, arrays
>>> of pointers, and arrays of arrays. The idea was "orthogonality" (see for
>>> example http://stackoverflow.com/questions/1527393/what-is-orthogonality ):
>>> that the programmer does not have to think about unexpected interactions
>>> between the concept of array and the concept of the element type, because
>>> there are none. If you have a data type, you can make arrays of that type.
>>> Pop-2 (1970), Snobol4 (1966), and Lisp (1958) were similarly generous. But R
>>> (1993) isn't. It wants to make life hard by forcing me to use different
>>> kinds of container for different kinds of element. And by providing a nice
>>> implementation of sets and then not letting me store them.
>>>
>>> So I thought about the kinds of data that I _can_ store in a data frame and
>>> generate by flattening. Strings!
>>> So I decided to use my
>>> aggregator_making_string function to make a string representation of the set
>>> of days, and to write a set-inclusion function that compared these sets
>>> against sets represented as vectors:
>>>
>>>   includes2 <- function( a_as_string, b )
>>>   {
>>>     a <- as.numeric( unlist( strsplit( a_as_string, split="," ) ) )
>>>     return( setequal( union( a, b ), a ) )
>>>   }
>>>
>>> Here are some example calls:
>>>
>>>   > includes2( '1,2,3', c(1) )
>>>   [1] TRUE
>>>   > includes2( '1,2,3', c(1,2) )
>>>   [1] TRUE
>>>   > includes2( '1,2,3', c(1,2,4) )
>>>   [1] FALSE
>>>   > includes2( '1,2,3', c(3) )
>>>   [1] TRUE
>>>   > includes2( '1,2,3', c(0,3) )
>>>   [1] FALSE
>>>   >
>>>
>>> I then tried using it:
>>>
>>>   df <- data.frame( ID = c( 1, 1, 1, 1, 2, 2, 3, 3, 3, 3 )
>>>                   , Day = c( 1, 2, 4, 7, 2, 3, 1, 3, 4, 8 )
>>>                   )
>>>
>>>   aggregator_making_string <- function( ... )
>>>   {
>>>     return( toString( ... ) )
>>>   }
>>>
>>>   flattened <- dcast( df, ID~. , fun.aggregate=aggregator_making_string )
>>>
>>>   # Which patients have a day 1?
>>>   flattened[ includes2( flattened$. , c(1) ), ]
>>>
>>> Unfortunately, that didn't work. The final statement selected every row of
>>> 'flattened'. I eventually realised that I had to vectorise 'includes2':
>>>
>>>   includes3 <- Vectorize( includes2, "a_as_string" )
>>>
>>> And that did work:
>>>
>>>   > flattened[ includes3( flattened$. , c(1) ), ]
>>>     ID          .
>>>   1  1 1, 2, 4, 7
>>>   3  3 1, 3, 4, 8
>>>   > flattened[ includes3( flattened$. , c(1,2) ), ]
>>>     ID          .
>>>   1  1 1, 2, 4, 7
>>>   > flattened[ includes3( flattened$. , c(1,3) ), ]
>>>     ID          .
>>>   3  3 1, 3, 4, 8
>>>   > flattened[ includes3( flattened$. , c(2) ), ]
>>>     ID          .
>>>   1  1 1, 2, 4, 7
>>>   2  2       2, 3
>>>
>>> The moral of this email tale is that sets are really useful for filtering
>>> data, and dcast ought to be really useful for generating sets, but R refuses
>>> to let me store them in the data frame that dcast generates.
>>> I can fudge it
>>> by representing the sets as strings, but is there a cleaner way to solve the
>>> problem?
>>>
>>> Cheers,
>>>
>>> Jocelyn Ireson-Paine
>>> 07768 534 091
>>> http://www.jocelyns-cartoons.uk
>>> http://www.j-paine.org
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>

David Winsemius
Alameda, CA, USA