Thanks so much everybody, especially to Dennis. I didn't really occur to me that I could put the models into a list. I have used dplyr for simple data transformations and will definitely look into it.
On Thu, Sep 1, 2016 at 8:10 AM, Dennis Murphy <djmu...@gmail.com> wrote: > Hi: > > See inline. > > On Wed, Aug 31, 2016 at 2:58 PM, Kai Mx <govo...@gmail.com> wrote: > > Hi all, > > > > I am having trouble wrapping my head around a probably simple issue: > > > > After using the reshape package, I have a melted dataframe with the > columns > > group (factor), time (int), condition (factor), value(int). > > > > These are experimental data. The data were obtained from different > > treatment groups (group) under different conditions at different time > > points. > > > > I would now like to perform ANOVA, boxplots and calculate means to > compare > > groups for all combinations of conditions and time points with something > > like > > > > fit <- lm(value~group, data=[subset of data with combination of > > condition/timepoint]) > > summary (fit) > > p <- ggplot([subset of data with combination of condition/timepoint], > > aes(x= group, y=value)) + geom_boxplot () > > print (p) > > tapply ([subset of data with combination of condition/timepoint]$value, > > subset of data with combination of condition/timepoint]$group, mean) > > There is a traditional approach to this class of problem and an > evolving 'modern' approach. The traditional approach is to use > lapply() to produce a list of model objects (one per subgroup); the > more recent approach is to take advantage of the pipeline operations > enabled by the magrittr package, e.g.,via the dplyr and tidyr > packages. Related packages are purrr and broom; the former applies > functional programming ideas to map a function recursively across a > list, while the latter focuses on converting information from model > objects into data frames that are more compatible with dplyr and > friends.A package released to CRAN within the past couple of days, > modelr, adds a few bells and whistles (e.g., bootstrap regression, > adding columns of predicted values or residuals to a data frame), but > you don't need it for your immediate purposes. > > Below is a generic approach to solving the types of problems you > described above, which is the best one can do in the absence of a > reproducible example. Therefore, if this doesn't work out of the box, > you'll have to fix your data. (I won't do it for you, sorry.) > > You could do something like what you have in mind in plyr as follows, > where md is a surrogate for the name of your melted data frame: > > library(plyr) > L <- dlply(md, .(condition, time), function(d) lm(value ~ group, data = d)) > > This would produce a list of models, one per condition * time > combination. You could then use do.call() or another plyr function to > extract elements of interest from the list, which generally would > require that you write one or more (anonymous) functions to extract > the information of interest. A similar approach can be used to > generate a list of ggplots. It's cleaner if you put your code into > functions and have it return the output you want, but you have to be > careful about the form of the input and output - for dlply(), you want > a function that takes a data frame as input. If you just want the > plots printed, you could write a function to do that for a single plot > (again with a data frame as input) and then use the d_ply() function > in plyr to print them en masse, but it would generally make more sense > to write them to files, so you'd probably be better off writing a > function that ends with a ggsave() call and call d_ply(). [Note: the _ > is used when your function creates a side effect, such as printing or > saving a plot object - it returns nothing to the R console.] > > As for the numeric summaries, > > ddply(md, .(condition, time, group), function(d) mean(d$value, na.rm = > TRUE)) > > would work. The advantage of plyr (and its successor, dplyr) is that > you can pass arbitrary functions as the third argument as long as the > input is a data frame and the output is a data frame (or something > that can be coerced to a data frame). This is more robust than > tapply(). > > Comment: plyr/reshape2 is a good starter package combination as it > teaches you the value of the split-apply-combine approach to data > analysis, but it can be (very) slow. The dplyr/tidyr package > combination is faster, more computationally efficient version of > plyr::ddply() and reshape2 and is recommended for use, although you > have to learn a somewhat different approach to R programming in the > process. If you're fairly new to R, that shouldn't matter much. > > There has been a lot of work in the last year or two to improve the > flow of programming for tasks such as recursive plotting or model > fitting. The dplyr and tidyr packages are meant to be replacements for > plyr and reshape[2], respectively; both are written by Hadley Wickham. > These packages extensively use operators created by Stephan Bache in > the magrittr package, particularly %>%, the pipeline operator, and ., > the current data object operator. %>% is read as "then": its purpose > is to take a data frame object as input and return the result of an > operation on the data (i.e., a function call) as a data frame. > Repeated application results in a chain of operations that establishes > a programming flow and has the nice side effect of producing more > readable, better organized code. > > There are some beautiful advantages to this approach to programming, > but there are also some challenges, particularly when attempting to > write functions in dplyr. At this point, you don't need to worry about > it. The analogy to plyr::dlply() is dplyr::do(), which applies a > function to each subgroup of data defined by the grouping variables. > Related packages for modeling include broom (highly recommended) and > perhaps modelr, although the latter is probably not relevant for your > particular problem. OTOH, broom is certainly relevant. > > The most important functions in dplyr are "verbs" that apply a > specific action to the input data: > > * group_by() defines grouping variables, applies the "split" aspect > of split-apply-combine > * filter() selects rows either by index or by logical expression > * select() selects variables (columns) either by column number or name > * arrange() sorts rows by value with respect to one or more > grouping variables > * mutate() defines new variables or modifies existing ones > * summarise() applies a one-number summary function to a variable > > These are referred to as "one-table verbs". Functions that are > "two-table" verbs are typically used to produce joins of two tables. > > The broom package is designed to summarize the output from model > objects into data frames. The two primary functions are tidy(), which > returns a summary data frame of the model coefficients, and glance(), > which returns a vector of summary statistics (e.g, r^2, RMSE) > associated with the model. The advantage is that the functions are > designed to work in a dplyr data pipeline and can be used to combine > the results from a list of model fits. > > Here's a template to pull off the model fitting exercise in dplyr, > tidyr and broom. (Note: dplyr has many of the same functions as plyr, > so if you need to load both, load plyr first and then load dplyr.) > > library(dplyr) > library(tidyr) > library(broom) > > ## Return an object that is effectively a list of models > mods <- md %>% > group_by(condition, time) %>% > do(mod = lm(value ~ group, data = .) # mod is an > object of class lm > > # Note: . is a placeholder for the current data > > # Now access tidy summaries of the models using broom - see its > documentation > # for details > mods %>% tidy(mod, conf.int = TRUE) > mods %>% glance(mod) > > > For the ggplots, something like the following would apply: > > md %>% group_by(condition, time) %>% # splits the data into subgroups > ggplot(., aes(x = group, y = value)) + > geom_boxplot() + > ggsave(paste(paste(group, value, sep = "_"), "png", > sep = ".")) > > > The equivalent of tapply() in dplyr is a summarise() function: > > md %>% group_by(condition, time, group) %>% > summarise(mean = mean(value, na.rm = TRUE)) > > Several variations of summarise() are present in dplyr which give it a > great deal of flexibility; e.g., summarise_each() applies the same > function to a set of variables or allows multiple functions to be > applied to one or more variables: > > md %>% group_by(condition, time, group) %>% > summarise_at(vars(value), funs(mean, sd, min, max), na.rm = > TRUE) > > > If you [intend to] use R on a regular basis and commonly encounter the > types of problems you've described here, I'd strongly suggest that you > look carefully into dplyr and friends. If you have the time, I'd also > suggest looking into the data.table package, which is usually a bit > faster than dplyr. It does many of the same types of tasks; recently, > Hadley recently released the dtplyr package which allows data.table > objects to be passed through the pipeline and allows use of data.table > code in the pipeline. The combination of the two is powerful. > > Comment: Bert Gunter's advice about modeling the data en masse is > important, because it allows one to investigate potential interactions > and relationships that you cannot do with data subsets. His cautions > should not be ignored. OTOH, sometimes data is intentionally > segregated and listwise modeling/summarization is appropriate. I don't > know which of these situations applies to you, but you need to > consider the implications of sub-analyses carefully. > > Sorry for the length of this missive, but I hope it is of some help to you. > > Dennis > > > > How can I loop through these combinations and output the data in an > elegant > > way? > > > > Thanks so much! > > > > Best, > > > > Kai > > > > [[alternative HTML version deleted]] > > > > ______________________________________________ > > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide http://www.R-project.org/ > posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.