Most of this question is over my head, I'm afraid, but looking at what I think is the crux of your question, couldn't you achieve the results you want in two steps, like this:

    dta <- data.frame(ID   = c(1,1,1,1,2,2,3,3,3,3),
                      Day  = c(1,2,4,7,2,3,1,3,4,8),
                      Pain = c(10,9,7,2,8,7,10,6,6,2))
    # One vector of days per patient ID
    l1 <- tapply(dta$Day, dta$ID, function(x) x)
    # TRUE for each patient whose days include all of 1, 4 and 8
    sapply(l1, function(x) all(c(1,4,8) %in% x))

I'm not sure you really need to do it in two steps, but given you said
you wanted a flattened data frame with the Days as a vector, this will
give it to you. Actually, l1 is a list, but you can turn it into a data
frame if you really want to. In the sapply call I changed the days
required to 1, 4 and 8 to show that it does return TRUE if there is a
patient that meets the required criterion.

David

On 12 March 2015 at 07:55, Jocelyn Ireson-Paine <p...@j-paine.org> wrote:
> This is a fairly long question. It's about a problem that's easy to specify
> in terms of sets, but that I found hard to solve in R by using them, because
> of the strange design of R data structures. In explaining it, I'm going to
> touch on the reshape2 library, dcast, sets, and the non-orthogonality of R.
>
> My problem stems from some drug-trial data that I've been analysing for the
> Oxford Pain Research Unit. Here's an example. Imagine a data frame
> representing patients in a trial of pain-relief drugs. The trial lasts for
> ten days. Each patient's pain is measured once a day, and the values are
> recorded in a data frame, one row per patient per day. Like this:
>
>   ID  Day  Pain
>    1    1    10
>    1    2     9
>    1    4     7
>    1    7     2
>    2    2     8
>    2    3     7
>    3    1    10
>    3    3     6
>    3    4     6
>    3    8     2
>
> Unfortunately, many patients have measurements missing. Thus, in the example
> above, patient 1 was only observed on days 1, 2, 4, and 7, rather than on
> the full ten days. But a patient's measurements are only useful to us if
> that patient has a certain minimum set of days, so I need to check for
> patients who lack those days. Let's assume that these days are numbers 1, 4,
> and 9.
>
> Such a question is trivial to state in terms of sets.
> Let D(i) denote the
> set of days on which patient i was measured: then I want to find out which
> patients p, or how many patients p, have a D(p) that contains the set
> {1,4,9}.
>
> The obvious way to solve this is to write a function that tells me whether
> one set is a superset of another. Then flatten my data frame so that it
> looks like this:
>
>   ID  Days
>    1  {1,2,4,7}
>    2  {2,3}
>    3  {1,3,4,8}
>
> And finally, filter it by some R translation of
>
>   flattened[ includes( flattened$Days, {1,4,9} ), ]
>
> I started with the built-in functions that operate on sets represented as
> vectors. These are described in
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/sets.html ,
> "Set Operations". For example:
>
>   > union( c(1,2,3), c(2,4,6) )
>   [1] 1 2 3 4 6
>   > intersect( c(1,2,3), c(2,4,6) )
>   [1] 2
>
> So I first wrote a set-inclusion function:
>
>   # True if vector a is a superset of vector b.
>   #
>   includes <- function( a, b )
>   {
>     return( setequal( union( a, b ), a ) )
>   }
>
> Here are some sample calls:
>
>   > includes( c(1), c() )
>   [1] TRUE
>   > includes( c(1), c(1) )
>   [1] TRUE
>   > includes( c(1), c(1,2) )
>   [1] FALSE
>   > includes( c(2,1), c(1,2) )
>   [1] TRUE
>   > includes( c(2,1,3), c(1,2) )
>   [1] TRUE
>   > includes( c(2,1,3), c(4,1,2) )
>   [1] FALSE
>
> I then made myself a variable holding my sample data frame:
>
>   df <- data.frame( ID = c( 1, 1, 1, 1, 2, 2, 3, 3, 3, 3 )
>                   , Day = c( 1, 2, 4, 7, 2, 3, 1, 3, 4, 8 )
>                   )
>
> And I tried flattening it, using dcast and an aggregator function as
> described in (amongst many other places)
> http://seananderson.ca/2013/10/19/reshape.html , "An Introduction to
> reshape2" by Sean C. Anderson.
>
> The idea behind this is that (for my data) dcast will call the aggregator
> function once per patient ID, passing it all the Day values for the patient.
> The aggregator must combine them in some way, and dcast puts its results
> into a new column.
> For example, here's an aggregator that merely sums its
> arguments:
>
>   aggregator_making_sum <- function( ... )
>   {
>     return( sum( ... ) )
>   }
>
> If I call it, I get this:
>
>   > dcast( df, ID~. , fun.aggregate=aggregator_making_sum )
>   Using Day as value column: use value.var to override.
>     ID  .
>   1  1 14
>   2  2  5
>   3  3 16
>
> And here's an aggregator that converts the argument list to a string:
>
>   aggregator_making_string <- function( ... )
>   {
>     return( toString( ... ) )
>   }
>
> Calling it gives this:
>
>   > dcast( df, ID~. , fun.aggregate=aggregator_making_string )
>   Using Day as value column: use value.var to override.
>     ID          .
>   1  1 1, 2, 4, 7
>   2  2       2, 3
>   3  3 1, 3, 4, 8
>
> In both of these, the three dots denote all arguments to the aggregator, as
> explained in Burns Statistics's
> http://www.burns-stat.com/the-three-dots-construct-in-r/ . My first
> aggregator sums them; my second converts them to a string. Both uses of
> dcast generate a data frame with a column named "." , which contains the
> aggregates. In the second data frame, that may not be so clear: the first
> column of numbers is row numbers; the second column of numbers are the IDs;
> and the remaining columns form the strings, belonging to "." .
>
> But what I want is neither a sum nor a string but a set. Specifically, a set
> that's compatible with the R set operations I called in my 'includes'
> function. Since these sets are vectors, my aggregator should just pack its
> arguments into a vector:
>
>   aggregator_making_set <- function( ... )
>   {
>     return( c( ... ) )
>   }
>
> But when I tried it, I got an error:
>
>   > dcast( df, ID~. , fun.aggregate=aggregator_making_set )
>   Using Day as value column: use value.var to override.
>   Error in vapply(indices, fun, .default) : values must be length 0,
>   but FUN(X[[1]]) result is length 4
>
> It's not an informative error message, because it expects me to know how
> dcast is coded.
> And I'm surprised that values need to be length 0: length 1
> would seem more appropriate. But perhaps it's trying to say that 'c' doesn't
> work on three-dots argument lists. Let's test that hypothesis:
>
>   test_c_on_three_dots <- function( ... )
>   {
>     return( c( ... ) )
>   }
>
>   > test_c_on_three_dots( 1 )
>   [1] 1
>   > test_c_on_three_dots( 1, 2 )
>   [1] 1 2
>   > test_c_on_three_dots( 1, 2, 3 )
>   [1] 1 2 3
>
> So 'c' does indeed work on three-dots argument lists. The error must have
> been caused by something else. Let's try making a set and putting it into a
> data frame directly:
>
>   > df <- data.frame( col1=c(1,2), col2=c(3,4) )
>   > df
>     col1 col2
>   1    1    3
>   2    2    4
>   > set <- union( c(5,6), c(6,7) )
>   > set
>   [1] 5 6 7
>   > df[ 1, ]$col1 <- set
>   Error in `$<-.data.frame`(`*tmp*`, "col1", value = c(5, 6, 7)) :
>     replacement has 3 rows, data has 1
>
> So that's the problem. Already in 1968, there was a language named Algol68
> which had arrays and, in order to make things easy for its programmers,
> allowed you to create arrays of every data type the language provided. You
> could have arrays of Booleans, arrays of integers, arrays of records, arrays
> of discriminated unions, arrays of procedures, arrays of I/O formats, arrays
> of pointers, and arrays of arrays. The idea was "orthogonality" (see for
> example http://stackoverflow.com/questions/1527393/what-is-orthogonality ):
> that the programmer does not have to think about unexpected interactions
> between the concept of array and the concept of the element type, because
> there are none. If you have a data type, you can make arrays of that type.
> Pop-2 (1970), Snobol4 (1966), and Lisp (1958) were similarly generous. But R
> (1993) isn't. It wants to make life hard by forcing me to use different
> kinds of container for different kinds of element. And by providing a nice
> implementation of sets and then not letting me store them.
>
> So I thought about the kinds of data that I _can_ store in a data frame and
> generate by flattening. Strings! So I decided to use my
> aggregator_making_string function to make a string representation of the set
> of days, and to write a set-inclusion function that compared these sets
> against sets represented as vectors:
>
>   includes2 <- function( a_as_string, b )
>   {
>     a <- as.numeric( unlist( strsplit( a_as_string, split="," ) ) )
>     return( setequal( union( a, b ), a ) )
>   }
>
> Here are some example calls:
>
>   > includes2( '1,2,3', c(1) )
>   [1] TRUE
>   > includes2( '1,2,3', c(1,2) )
>   [1] TRUE
>   > includes2( '1,2,3', c(1,2,4) )
>   [1] FALSE
>   > includes2( '1,2,3', c(3) )
>   [1] TRUE
>   > includes2( '1,2,3', c(0,3) )
>   [1] FALSE
>
> I then tried using it:
>
>   df <- data.frame( ID = c( 1, 1, 1, 1, 2, 2, 3, 3, 3, 3 )
>                   , Day = c( 1, 2, 4, 7, 2, 3, 1, 3, 4, 8 )
>                   )
>
>   aggregator_making_string <- function( ... )
>   {
>     return( toString( ... ) )
>   }
>
>   flattened <- dcast( df, ID~. , fun.aggregate=aggregator_making_string )
>
>   # Which patients have a day 1?
>   flattened[ includes2( flattened$. , c(1) ), ]
>
> Unfortunately, that didn't work. The final statement selected every row of
> 'flattened'. I eventually realised that I had to vectorise 'includes2':
>
>   includes3 <- Vectorize( includes2, "a_as_string" )
>
> And that did work:
>
>   > flattened[ includes3( flattened$. , c(1) ), ]
>     ID          .
>   1  1 1, 2, 4, 7
>   3  3 1, 3, 4, 8
>   > flattened[ includes3( flattened$. , c(1,2) ), ]
>     ID          .
>   1  1 1, 2, 4, 7
>   > flattened[ includes3( flattened$. , c(1,3) ), ]
>     ID          .
>   3  3 1, 3, 4, 8
>   > flattened[ includes3( flattened$. , c(2) ), ]
>     ID          .
>   1  1 1, 2, 4, 7
>   2  2       2, 3
>
> The moral of this email tale is that sets are really useful for filtering
> data, and dcast ought to be really useful for generating sets, but R refuses
> to let me store them in the data frame that dcast generates.
> I can fudge it
> by representing the sets as strings, but is there a cleaner way to solve the
> problem?
>
> Cheers,
>
> Jocelyn Ireson-Paine
> 07768 534 091
> http://www.jocelyns-cartoons.uk
> http://www.j-paine.org
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
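
On the closing question of whether there is a cleaner way than strings: a data frame column can itself be a list, so each row can hold a whole vector of days. The following is only a minimal sketch in base R on the sample data from the thread, not something tried on the real trial data. The names days_by_id, required and has_required are made up for illustration, and I() is there to stop data.frame() from trying to spread the list over several columns.

```r
# Sample data from the thread (the Pain column is not needed for this check).
dta <- data.frame(ID  = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
                  Day = c(1, 2, 4, 7, 2, 3, 1, 3, 4, 8))

# One numeric vector of days per patient; split() returns a list named by ID.
days_by_id <- split(dta$Day, dta$ID)

# Store that list as a single "list column": one row per patient,
# with the whole set of days kept as a vector in the Days cell.
flattened <- data.frame(ID   = as.numeric(names(days_by_id)),
                        Days = I(unname(days_by_id)))

# Set inclusion per row: are all required days among the patient's days?
required <- c(1, 4, 8)
has_required <- vapply(flattened$Days,
                       function(d) all(required %in% d),
                       logical(1))

# Keep only the patients that have every required day.
flattened[has_required, ]
```

With days 1, 4 and 8 required, only patient 3 is selected, which agrees with the sapply call at the top of this reply. The set of days for a row can still be pulled out as an ordinary vector with flattened$Days[[1]] and passed to union, intersect and the like.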