on 01/14/2009 02:51 PM Matthew Pettis wrote:
> I have a specific question and a general question.
>
> Specific Question: I want to do an analysis on a data frame by 2 or more
> class variables (i.e., use 2 or more columns in a dataframe to do
> statistical classing). Coming from SAS, I'm used to being able to take a
> data set and have the output of the analysis in a dataset for further
> manipulation. I have a data set with vote totals, with one column being the
> office name being voted on, and the other being the party of the candidate.
> My votes are in the column "vc.n". I did the analysis I want with:
>
> work <- by(sd62[,"vc.n"], sd62[,c("office.nm","party.abbr")], sum)
>
> the str() output of work looks like:
>
>> str(work)
> 'by' int [1:9, 1:11] NA 30 NA NA 0 0 0 NA 33 25678 ...
>  - attr(*, "dimnames")=List of 2
>   ..$ office.nm : chr [1:9] "ATTORNEY GENERAL" "GOVERNOR & LT GOVERNOR"
> "SECRETARY OF STATE" "STATE AUDITOR" ...
>   ..$ party.abbr: chr [1:11] "CP" "DFL" "DFL2" "GP" ...
>  - attr(*, "call")= language by.default(data = sd62[, "vc.n"], INDICES =
> sd62[, c("office.nm", "party.abbr")], FUN = sum)
>
> work is now a list. I'd really like to have work be a data frame with 3
> columns: The rows of the first two columns show the office and party levels
> being considered, and the third being the sum of the votes for that level
> combination. How do I cast this list/output into a data frame? Using
> 'as.data.frame' doesn't work.
>
> General Question: I assume the answer to the specific question is dependent
> on my understanding list objects and accessing their attributes. Can anyone
> point me to a good, thorough treatment of these R topics? Specifically how
> to read and interpret the output of the str() and attributes() functions,
> how to extract the values of the 'by' output object into a data frame, etc.?
>
> Thanks,
> Matt

Matt,

Welcome to R.

The help pages for each function, while they can be intentionally terse, are a good first place to look. Many include links and references to related sources.

"An Introduction to R" is a good general place to start. A more thorough treatment is in the "R Language Definition" manual.

There are also a plethora of contributed documents:

  http://cran.r-project.org/other-docs.html

and books on R and using R within specific domains:

  http://www.r-project.org/doc/bib/R-books.html

There are (at least) three ways to generate summary statistics based upon multi-level groupings: by(), tapply() and aggregate(). The key difference among the three is the class/structure of the result object and, as a consequence, the print (output) method that is used. In the specific case of aggregate(), the summary function you supply must also return a scalar. So, for example, unlike with by() and tapply(), you cannot use summary(), which returns multiple values.

The choice of approach therefore depends, to an extent, on what you intend to do with the result afterwards.

As an example, using the same set of data (warpbreaks) for all three:

> str(warpbreaks)
'data.frame':   54 obs. of  3 variables:
 $ breaks : num  26 30 54 25 70 52 51 26 67 18 ...
 $ wool   : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
 $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...

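As a side note on the scalar point above, here is a minimal sketch (my own illustration, not taken from a help page) of a summary function that returns more than one value per group, using tapply() with range(). The object names groups and rng are made up for this example. I am deliberately not showing aggregate() here, since how it treats a non-scalar function may depend on your R version.

```r
# Grouping factors: one cell per wool/tension combination
groups <- list(wool = warpbreaks$wool, tension = warpbreaks$tension)

# range() returns two values (min and max), so the result cannot be a
# plain numeric matrix; tapply() returns a matrix of mode "list" instead
rng <- tapply(warpbreaks$breaks, groups, range)

dim(rng)          # still one cell per wool/tension combination
typeof(rng)       # the cells are list elements
rng[["A", "L"]]   # min and max of breaks for wool A, tension L
```

Each cell of rng holds both values for its wool/tension combination, which is why the structure differs from the scalar sum() examples that follow.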
# Use by()

> by(warpbreaks[, 1], list(wool = warpbreaks$wool, tension = warpbreaks$tension), sum)
wool: A
tension: L
[1] 401
------------------------------------------------------
wool: B
tension: L
[1] 254
------------------------------------------------------
wool: A
tension: M
[1] 216
------------------------------------------------------
wool: B
tension: M
[1] 259
------------------------------------------------------
wool: A
tension: H
[1] 221
------------------------------------------------------
wool: B
tension: H
[1] 169

Note that because the result of by() is, at its core, a matrix/table, you can also do the following, explicitly using the print method for a table:

> print.table(by(warpbreaks[, 1], list(wool = warpbreaks$wool, tension = warpbreaks$tension), sum))
    tension
wool   L   M   H
   A 401 216 221
   B 254 259 169

which gives you printed output in the same format as tapply() below, without altering the structure of the result itself.

# tapply() directly gives you a tabular output

> tapply(warpbreaks[, 1], list(wool = warpbreaks$wool, tension = warpbreaks$tension), sum)
    tension
wool   L   M   H
   A 401 216 221
   B 254 259 169

Note that the structure of the result from by() and the result from tapply() are quite similar:

> str(by(warpbreaks[, 1], list(wool = warpbreaks$wool, tension = warpbreaks$tension), sum))
 by [1:2, 1:3] 401 254 216 259 221 169
 - attr(*, "dimnames")=List of 2
  ..$ wool   : chr [1:2] "A" "B"
  ..$ tension: chr [1:3] "L" "M" "H"
 - attr(*, "call")= language by.default(data = warpbreaks[, 1], INDICES = list(wool = warpbreaks$wool, tension = warpbreaks$tension), FUN = sum)

> str(tapply(warpbreaks[, 1], list(wool = warpbreaks$wool, tension = warpbreaks$tension), sum))
 num [1:2, 1:3] 401 254 216 259 221 169
 - attr(*, "dimnames")=List of 2
  ..$ wool   : chr [1:2] "A" "B"
  ..$ tension: chr [1:3] "L" "M" "H"

Both are, at their core, a 2 x 3 matrix. The key difference is in the 'class' of the result, which affects subsequent operations, such as the print method used.

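Since both results are essentially a matrix with dimnames, one possible way to get from either of them to a three-column data frame is to go through as.table(). The following is only a sketch on warpbreaks; the names groups, res.by, res.tapply and long, and the column name breaks.sum, are my own choices for illustration.

```r
groups <- list(wool = warpbreaks$wool, tension = warpbreaks$tension)

res.by     <- by(warpbreaks$breaks, groups, sum)
res.tapply <- tapply(warpbreaks$breaks, groups, sum)

# The 'class' is where the two results differ
class(res.by)
class(res.tapply)

# Treat the matrix as a table, then convert it to "long" format:
# one row per wool/tension combination, plus a column for the sum
long <- as.data.frame(as.table(res.tapply), responseName = "breaks.sum")
long
str(long)
```

Combinations with no observations (the NA cells in your vote data) would, I believe, come through as rows with NA in the last column, so you may want to drop those afterwards.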
# aggregate() gives you a data frame, with the summary statistic as the
# 'x' column

> aggregate(warpbreaks[, 1], list(wool = warpbreaks$wool, tension = warpbreaks$tension), sum)
  wool tension   x
1    A       L 401
2    B       L 254
3    A       M 216
4    B       M 259
5    A       H 221
6    B       H 169

> str(aggregate(warpbreaks[, 1], list(wool = warpbreaks$wool, tension = warpbreaks$tension), sum))
'data.frame':   6 obs. of  3 variables:
 $ wool   : Factor w/ 2 levels "A","B": 1 2 1 2 1 2
 $ tension: Factor w/ 3 levels "L","M","H": 1 1 2 2 3 3
 $ x      : num  401 254 216 259 221 169

Thus, bottom line, given your intended application, I would suggest using aggregate() rather than by().

HTH,

Marc Schwartz