?aggregate

> aggregate(df$score, list(df$var1, df$var2, df$id), mean, na.rm=TRUE)
   Group.1 Group.2 Group.3             x
1        1       1       1  0.1053576980
2        2       1       1  0.1514888520
3        3       1       1  0.1270477403
4        4       1       1 -0.0193129404
5        5       1       1  0.2574346931
6        1       2       1  0.0185013523
7        2       2       1 -0.0886420632
8        3       2       1 -0.1304342272
9        4       2       1 -0.0972963702
10       5       2       1 -0.1463502593

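The call above labels its grouping columns Group.1, Group.2, Group.3 and the result x. As a minimal sketch (not part of the original reply), naming the elements of the grouping list carries var1, var2 and id through to the output, and as.data.frame(as.table(...)) collapses the tapply array from the quoted message into the same one-row-per-cell layout. The small stand-in data frame, the seed, and the names agg, long and mscore are assumptions made for illustration only.

```r
set.seed(1)  # assumption: seed added only so the sketch is reproducible

# Hypothetical small stand-in for df: 2 x 2 design, 3 ids, 4 trials per cell
df <- expand.grid(trial = 1:4, var1 = factor(1:2),
                  var2 = factor(1:2), id = factor(1:3))
df$score <- rnorm(nrow(df))
df$score[c(3, 17)] <- NA  # a couple of missing trials

# Route 1: aggregate() with a named grouping list, so the output
# columns are var1, var2, id rather than Group.1, Group.2, Group.3
agg <- aggregate(df$score,
                 by = list(var1 = df$var1, var2 = df$var2, id = df$id),
                 FUN = mean, na.rm = TRUE)
names(agg)[names(agg) == "x"] <- "mscore"

# Route 2: keep the tapply() call, then turn the n-way array into a
# data frame with one index column per dimension
x <- tapply(df$score,
            list(var1 = df$var1, var2 = df$var2, id = df$id),
            mean, na.rm = TRUE)
long <- as.data.frame(as.table(x), responseName = "mscore")

head(agg)
head(long)
```

One difference to keep in mind: aggregate() drops combinations of the grouping factors that have no rows, whereas the tapply() route keeps them as rows with NA in mscore.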
On Fri, May 2, 2008 at 6:43 PM, Adam D. I. Kramer <[EMAIL PROTECTED]> wrote:
> Dear Colleagues,
>
> Apologies for a long email to ask what I feel may be a very simple
> question; I figure it's better to overspecify my situation.
>
> I was asked a question, recently, by a colleague in my department
> about pre-aggregating variables, i.e., computing the mean of defined subsets
> of a data frame. Naturally, I thought of the 'by' and 'tapply' functions, as
> they have always been the solution for me. However, my colleague had three
> indices, and as such needs to pay attention to the indices of the
> output...this is to say, the "create an array" function of tapply doesn't
> quite work because an array is not quite what we want.
>
> Consider this data set:
>
> df <- data.frame(var1= factor(rep(rep(1:5,25*5),10)),
>                  var2= factor(rep(rep(1:5,each=25*5),10)),
>                  trial= rep(rep(1:25,25),10),
>                  id= factor(rep(1:10,each=5*5*25)),
>                  score= rnorm(n=5*5*25*10) )
>
> ...this is to say, each of 10 ids has scores for 5 different levels of
> var1 and 5 different levels of var2...across 25 trials. Basically, a
> three-way crossed repeated measures design...where tapply does what I want
> for a two-way design, it does not quite suit my purposes for a 3-way or
> n-way for n > 2.
>
> The goal is to predict score from var1 and var2. The straightforward guess
> of what to do would be to simply have the AOV function aggregate across
> trials:
>
> aov(score ~ var1*var2 + Error(id/(var1*var2)), data=df)
>
> (or lm with defined contrasts)
>
> ...however, there are missing data on some trials for some people, which
> makes this design unbalanced (i.e., it introduces a correlation between var1
> and var2). Because my colleague knows (from a theoretical standpoint) that
> he wants to analyze the mean, his ANOVA on the aggregated trial means WOULD
> be balanced, which is to say, the analysis he wants to run would produce
> different output from the above.
>
> So, what he needs is a data frame with four variables instead of five: var1,
> var2, id, and mscore (mean score), which has been averaged across trials.
>
> Clearly (to me, it seems), the way to do this is with tapply:
>
> x <- tapply(df$score, list(df$var1,df$var2,df$id), mean, na.rm=TRUE)
>
> ...which returns a var1*var2 matrix for each ID, when what I want is an
> observation-per-row data frame.
>
> So, my question: How do I end up with what I'm looking for?
>
> My current process involves setting df2 <- data.frame(mscore=c(x), ...)
> where ... is a bunch of factor(rep) columns that would specify the var1 var2
> and id levels. My problem with this approach is that it seems like a hack;
> it is not a general solution because I must use knowledge of the process by
> which x was generated in order to "get it right," and there's a decent
> amount of room for unnoticed error on my part.
>
> I suppose what I'm looking for is either a way to take by or tapply and have
> it return a set of index variable columns based on the list of indices I
> provide to it...or a way to collapse an n-way table into a single data frame
> with index variables. Any suggestions?
>
> Cordially,
>
> Adam D. I. Kramer
> Ph.D. Candidate, Social Psychology
> University of Oregon
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?