Re: [R] How to filter data using sets generated by flattening with dcast, when I can't store those sets in a data frame

David Winsemius Mon, 16 Mar 2015 10:25:08 -0700

On Mar 15, 2015, at 1:06 PM, Jocelyn Ireson-Paine wrote:

> David, and also William Dunlap, thanks for taking the time to reply, with 
> examples. Both your answers are very helpful.
> 
> William noted that 'reshape2' is not 'R', but a user-contributed package that 
> runs in R. I agree, and I'm not confusing one with the other. But what I 
> don't like is that somewhere in the interaction between them, generality is 
> lost.
> 
> I contrast this with a means of aggregating data that I use when programming 
> in Lisp, Prolog, and other "functional" languages. This is aggregation by 
> "folding" a list of values. The idea is explained at http://wiki.tcl.tk/17983 
> , "Fold in functional programming" by "juef", amongst other places. He/she 
> gives a common example: take a list of values, such as
>  (1 2 3 4)
> and "fold" the + operation over it. Doing so runs + along the list forming 
> intermediate sums and adding the next value to them, until all values have 
> been summed.
> 
> Here, 'fold' is analogous to dcast, with + being analogous to the function 
> dcast takes for its fun.aggregate argument. But the good thing about 'fold' 
> is that it does not restrict the type of result that its aggregation function 
> can return. The result can be a number, a string, a list, a list of lists, an 
> array, or any other type. I'd like dcast to be as general.


Since `dcast` is part of what I have seen called the "hadleyverse", your 
feature request should go to Hadley Wickham. Base R has the Reduce function:

> Reduce("+", 1:4)
[1] 10
> Reduce("+", 1:4, accumulate=TRUE)
[1]  1  3  6 10

It's described on the help page ?Reduce, along with other functional 
programming tools (Filter, Map, Find, Position) modeled on Lisp functions.
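
For what it's worth, Reduce does not restrict what the folding function 
returns, and a data frame column can itself be a list, so sets held as 
vectors will fit. What follows is only a minimal sketch in base R (it does 
not use dcast), reusing the sample data from the original post; the names 
`days`, `all_days`, `flattened`, `required` and `keep` are made up for 
illustration, and the required days 1, 4 and 8 are the ones David Barron 
used:

```r
# Sample data from the original post (ID and Day columns only).
dta <- data.frame(ID  = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
                  Day = c(1, 2, 4, 7, 2, 3, 1, 3, 4, 8))

# One vector of days per patient, collected in a list.
days <- split(dta$Day, dta$ID)

# A fold whose result is a vector rather than a single number:
# the union of all patients' days.
all_days <- Reduce(union, days)

# A data frame column can be a list; I() keeps data.frame() from
# trying to expand the list into several columns.
flattened <- data.frame(ID   = as.numeric(names(days)),
                        Days = I(unname(days)))

# Keep patients whose days include every required day.
required <- c(1, 4, 8)
keep <- vapply(flattened$Days,
               function(d) all(required %in% d),
               logical(1))
flattened[keep, ]
```

With the sample data only patient 3 was seen on days 1, 4 and 8, so the 
last line should return that single row.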

-- 
David.
> 
> Jocelyn Ireson-Paine
> 07768 534 091
> http://www.jocelyns-cartoons.uk
> http://www.j-paine.org
> 
> On Thu, 12 Mar 2015, David Barron wrote:
> 
>> Most of this question is over my head, I'm afraid, but looking at what
>> I think is the crux of your question, couldn't you achieve the results
>> you want in two steps, like this:
>> 
>> dta <- data.frame(ID=c(1,1,1,1,2,2,3,3,3,3),
>> Day=c(1,2,4,7,2,3,1,3,4,8),Pain=c(10,9,7,2,8,7,10,6,6,2))
>> 
>> l1 <- tapply(dta$Day, dta$ID, function(x) x)
>> 
>> sapply(l1, function(x) all(c(1,4,8) %in% x ))
>> 
>> I'm not sure you really need to do it in two steps, but given you said
>> you wanted a flattened data frame with the Days as a vector, this will
>> give it to you.  Actually, l1 is a list, but you can turn it into a
>> data frame if you really want to.  In the sapply call I changed the
>> days required to 1, 4 and 8 to show that it does return TRUE if there
>> is a patient that meets the required criterion.
>> 
>> David
>> 
>> On 12 March 2015 at 07:55, Jocelyn Ireson-Paine <p...@j-paine.org> wrote:
>>> This is a fairly long question. It's about a problem that's easy to specify
>>> in terms of sets, but that I found hard to solve in R by using them, because
>>> of the strange design of R data structures. In explaining it, I'm going to
>>> touch on the reshape2 library, dcast, sets, and the non-orthogonality of R.
>>> 
>>> My problem stems from some drug-trial data that I've been analysing for the
>>> Oxford Pain Research Unit. Here's an example. Imagine a data frame
>>> representing patients in a trial of pain-relief drugs. The trial lasts for
>>> ten days. Each patient's pain is measured once a day, and the values are
>>> recorded in a data frame, one row per patient per day. Like this:
>>> 
>>>  ID  Day  Pain
>>>   1    1  10
>>>   1    2   9
>>>   1    4   7
>>>   1    7   2
>>>   2    2   8
>>>   2    3   7
>>>   3    1  10
>>>   3    3   6
>>>   3    4   6
>>>   3    8   2
>>> 
>>> Unfortunately, many patients have measurements missing. Thus, in the example
>>> above, patient 1 was only observed on days 1, 2, 4, and 7, rather than on
>>> the full ten days. But a patient's measurements are only useful to us if
>>> that patient has a certain minimum set of days, so I need to check for
>>> patients who lack those days. Let's assume that these days are numbers 1, 4,
>>> and 9.
>>> 
>>> Such a question is trivial to state in terms of sets. Let D(i) denote the
>>> set of days on which patient i was measured: then I want to find out which
>>> patients p, or how many patients p, have a D(p) that contains the set
>>> {1,4,9}.
>>> 
>>> The obvious way to solve this is to write a function that tells me whether
>>> one set is a superset of another. Then flatten my data frame so that it
>>> looks like this:
>>> 
>>>  ID  Days
>>>   1  {1,2,4,7}
>>>   2  {2,3}
>>>   3  {1,3,4,8}
>>> 
>>> And finally, filter it by some R translation of
>>> 
>>>  flattened[ includes( flattened$Days, {1,4,9} ), ]
>>> 
>>> I started with the built-in functions that operate on sets represented as
>>> vectors. These are described in
>>> https://stat.ethz.ch/R-manual/R-devel/library/base/html/sets.html ,
>>> "Set Operations". For example:
>>> 
>>> > union( c(1,2,3), c(2,4,6) )
>>>  [1] 1 2 3 4 6
>>> > intersect( c(1,2,3), c(2,4,6) )
>>>  [1] 2
>>> 
>>> So I first wrote a set-inclusion function:
>>> 
>>>  # True if vector a is a superset of vector b.
>>>  #
>>>  includes <- function( a, b )
>>>  {
>>>    return( setequal( union( a, b ), a ) )
>>>  }
>>> 
>>> Here are some sample calls:
>>> 
>>> > includes( c(1), c() )
>>>  [1] TRUE
>>> > includes( c(1), c(1) )
>>>  [1] TRUE
>>> > includes( c(1), c(1,2) )
>>>  [1] FALSE
>>> > includes( c(2,1), c(1,2) )
>>>  [1] TRUE
>>> > includes( c(2,1,3), c(1,2) )
>>>  [1] TRUE
>>> > includes( c(2,1,3), c(4,1,2) )
>>>  [1] FALSE
>>> 
>>> I then made myself a variable holding my sample data frame:
>>> 
>>>  df <- data.frame( ID = c( 1, 1, 1, 1, 2, 2, 3, 3, 3, 3 )
>>>                  , Day = c( 1, 2, 4, 7, 2, 3, 1, 3, 4, 8 )
>>>                  )
>>> 
>>> And I tried flattening it, using dcast and an aggregator function as
>>> described in (amongst many other places)
>>> http://seananderson.ca/2013/10/19/reshape.html , "An Introduction to
>>> reshape2" by Sean C. Anderson.
>>> 
>>> The idea behind this is that (for my data) dcast will call the aggregator
>>> function once per patient ID, passing it all the Day values for the patient.
>>> The aggregator must combine them in some way, and dcast puts its results
>>> into a new column. For example, here's an aggregator that merely sums its
>>> arguments:
>>> 
>>>  aggregator_making_sum <- function( ... )
>>>  {
>>>    return( sum( ... ) )
>>>  }
>>> 
>>> If I pass it to dcast, I get this:
>>> 
>>> >  dcast( df, ID~. , fun.aggregate=aggregator_making_sum )
>>>  Using Day as value column: use value.var to override.
>>>    ID  .
>>>  1  1 14
>>>  2  2  5
>>>  3  3 16
>>> 
>>> And here's an aggregator that converts the argument list to a string:
>>> 
>>>  aggregator_making_string <- function( ... )
>>>  {
>>>    return( toString( ... ) )
>>>  }
>>> 
>>> Calling it gives this:
>>> 
>>> >  dcast( df, ID~. , fun.aggregate=aggregator_making_string )
>>>  Using Day as value column: use value.var to override.
>>>    ID          .
>>>  1  1 1, 2, 4, 7
>>>  2  2       2, 3
>>>  3  3 1, 3, 4, 8
>>> 
>>> In both of these, the three dots denote all arguments to the aggregator, as
>>> explained in Burns Statistics's
>>> http://www.burns-stat.com/the-three-dots-construct-in-r/ . My first
>>> aggregator sums them; my second converts them to a string. Both uses of
>>> dcast generate a data frame with a column named "." , which contains the
>>> aggregates. In the second data frame, that may not be so clear: the first
>>> column of numbers holds the row numbers; the second holds the IDs; and the
>>> remaining numbers form the strings belonging to "." .
>>> 
>>> But what I want is neither a sum nor a string but a set. Specifically, a set
>>> that's compatible with the R set operations I called in my 'includes'
>>> function. Since these sets are vectors, my aggregator should just pack its
>>> arguments into a vector:
>>> 
>>>  aggregator_making_set <- function( ... )
>>>  {
>>>    return( c( ... ) )
>>>  }
>>> 
>>> But when I tried it, I got an error:
>>> 
>>> > dcast( df, ID~. , fun.aggregate=aggregator_making_set )
>>>  Using Day as value column: use value.var to override.
>>>  Error in vapply(indices, fun, .default) : values must be length 0,
>>>   but FUN(X[[1]]) result is length 4
>>> 
>>> It's not an informative error message, because it expects me to know how
>>> dcast is coded. And I'm surprised that values need to be length 0: length 1
>>> would seem more appropriate. But perhaps it's trying to say that 'c' doesn't
>>> work on three-dots argument lists. Let's test that hypothesis:
>>> 
>>>  test_c_on_three_dots <- function( ... )
>>>  {
>>>    return( c( ... ) )
>>>  }
>>> 
>>> >   test_c_on_three_dots( 1 )
>>>  [1] 1
>>> >   test_c_on_three_dots( 1, 2 )
>>>  [1] 1 2
>>> >   test_c_on_three_dots( 1, 2, 3 )
>>>  [1] 1 2 3
>>> 
>>> So 'c' does indeed work on three-dots argument lists. The error must have
>>> been caused by something else. Let's try making a set and putting it into a
>>> data frame directly:
>>> 
>>> > df <- data.frame( col1=c(1,2), col2=c(3,4) )
>>> > df
>>>    col1 col2
>>>  1    1    3
>>>  2    2    4
>>> > set <- union( c(5,6), c(6,7) )
>>> > set
>>>  [1] 5 6 7
>>> > df[ 1, ]$col1 <- set
>>>  Error in `$<-.data.frame`(`*tmp*`, "col1", value = c(5, 6, 7)) :
>>>    replacement has 3 rows, data has 1
>>> 
>>> So that's the problem. Already in 1968, there was a language named Algol68
>>> which had arrays and, in order to make things easy for its programmers,
>>> allowed you to create arrays of every data type the language provided. You
>>> could have arrays of Booleans, arrays of integers, arrays of records, arrays
>>> of discriminated unions, arrays of procedures, arrays of I/O formats, arrays
>>> of pointers, and arrays of arrays. The idea was "orthogonality" (see for
>>> example http://stackoverflow.com/questions/1527393/what-is-orthogonality ):
>>> that the programmer does not have to think about unexpected interactions
>>> between the concept of array and the concept of the element type, because
>>> there are none. If you have a data type, you can make arrays of that type.
>>> Pop-2 (1970), Snobol4 (1966), and Lisp (1958) were similarly generous. But R
>>> (1993) isn't. It wants to make life hard by forcing me to use different
>>> kinds of container for different kinds of element. And by providing a nice
>>> implementation of sets and then not letting me store them.
>>> 
>>> So I thought about the kinds of data that I _can_ store in a data frame and
>>> generate by flattening. Strings! So I decided to use my
>>> aggregator_making_string function to make a string representation of the set
>>> of days, and to write a set-inclusion function that compared these sets
>>> against sets represented as vectors:
>>> 
>>>  includes2 <- function( a_as_string, b )
>>>  {
>>>    a <- as.numeric( unlist( strsplit( a_as_string, split="," ) ) )
>>>    return( setequal( union( a, b ), a ) )
>>>  }
>>> 
>>> Here are some example calls:
>>> 
>>> > includes2( '1,2,3', c(1) )
>>>  [1] TRUE
>>> > includes2( '1,2,3', c(1,2) )
>>>  [1] TRUE
>>> > includes2( '1,2,3', c(1,2,4) )
>>>  [1] FALSE
>>> > includes2( '1,2,3', c(3) )
>>>  [1] TRUE
>>> > includes2( '1,2,3', c(0,3) )
>>>  [1] FALSE
>>> >
>>> 
>>> I then tried using it:
>>> 
>>>  df <- data.frame( ID = c( 1, 1, 1, 1, 2, 2, 3, 3, 3, 3 )
>>>                  , Day = c( 1, 2, 4, 7, 2, 3, 1, 3, 4, 8 )
>>>                  )
>>> 
>>>  aggregator_making_string <- function( ... )
>>>  {
>>>    return( toString( ... ) )
>>>  }
>>> 
>>>  flattened <- dcast( df, ID~. , fun.aggregate=aggregator_making_string )
>>> 
>>>  # Which patients have a day 1?
>>>  flattened[ includes2( flattened$. , c(1) ), ]
>>> 
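>>> Unfortunately, that didn't work. The final statement selected every row of
>>> 'flattened', because strsplit followed by unlist pools the days from all
>>> rows into one vector, so includes2 returns a single TRUE that is recycled
>>> across the rows. I eventually realised that I had to vectorise 'includes2':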
>>> 
>>>  includes3 <- Vectorize( includes2, "a_as_string" )
>>> 
>>> And that did work:
>>> 
>>> >   flattened[ includes3( flattened$. , c(1) ), ]
>>>    ID          .
>>>  1  1 1, 2, 4, 7
>>>  3  3 1, 3, 4, 8
>>> >   flattened[ includes3( flattened$. , c(1,2) ), ]
>>>    ID          .
>>>  1  1 1, 2, 4, 7
>>> >   flattened[ includes3( flattened$. , c(1,3) ), ]
>>>    ID          .
>>>  3  3 1, 3, 4, 8
>>> >   flattened[ includes3( flattened$. , c(2) ), ]
>>>    ID          .
>>>  1  1 1, 2, 4, 7
>>>  2  2       2, 3
>>> 
>>> The moral of this email tale is that sets are really useful for filtering
>>> data, and dcast ought to be really useful for generating sets, but R refuses
>>> to let me store them in the data frame that dcast generates. I can fudge it
>>> by representing the sets as strings, but is there a cleaner way to solve the
>>> problem?
>>> 
>>> Cheers,
>>> 
>>> Jocelyn Ireson-Paine
>>> 07768 534 091
>>> http://www.jocelyns-cartoons.uk
>>> http://www.j-paine.org
>>> 
>>> ______________________________________________
>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>> 
> 

David Winsemius
Alameda, CA, USA

