Not to hijack the thread, but for my edification, what are the advantages/disadvantages of split() + lapply() compared to by()?
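To make the question concrete, here is a minimal sketch of the two approaches side by side, using the data from the dput() output quoted below. The object names v_list, v_by, coef_list and coef_by are just my own labels for illustration.

```r
# Data copied from the dput() output in the quoted message
d <- data.frame(
  species = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
              3L, 3L, 3L, 3L, 5L, 5L, 5L, 5L),
  o2con = c(0.5, 0.6, 0.4, 0.4, 0.5, 0.3, 0.4, 0.5, 0.7, 0.9,
            0.3, 0.7, 0.4, 0.3, 0.3, 0.6, 0.9, 0.2),
  bm = c(5L, 2L, 4L, 2L, 3L, 7L, 8L, 3L, 4L, 2L,
         6L, 2L, 1L, 7L, 2L, 1L, 7L, 5L)
)

# Same model-fitting function as in the quoted message
f <- function(df) lm(o2con ~ bm, data = df)

# Approach 1: split the data frame by species, then fit with lapply()
v_list <- lapply(split(d, d$species), f)

# Approach 2: let by() do the splitting and the applying in one call
v_by <- by(d, d$species, f)

class(v_list)  # "list"
class(v_by)    # "by"

# Coefficients can be pulled out of either result in the same way
coef_list <- do.call(rbind, lapply(v_list, coef))
coef_by   <- do.call(rbind, lapply(v_by, coef))

# Compare the numeric values from the two approaches
all.equal(unname(coef_list), unname(coef_by))
```

As far as I can tell the fitted models are the same either way. The visible difference is that by() returns an object of class "by" with its own print method, whereas split() + lapply() returns a plain named list.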
Josh

On Sun, Jul 18, 2010 at 9:50 PM, Dennis Murphy <djmu...@gmail.com> wrote:
> Hi:
>
> Time to jack up your level of R knowledge, courtesy of the apply family.
>
> The 'R way' to do what you want is to split the data by species into list
> components, run lm() on each component and save the resulting lm objects in
> a list. The next trick is to figure out how to extract what you want, which
> may require a bit more ingenuity in delving into aRcana :)
>
> -----
> Aside:
> To reinforce Joshua's point, variable names with spaces not explicitly
> enclosed in quotes is bad practice, especially when someone who wants to
> help tries to copy and paste your data into his/her R session:
>
> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
>   line 1 did not have 4 elements
>
> R expected four columns of data, but you provided three. In the future, it's
> a good idea to include your data example with dput(), which outputs
>
> dput(d)
> structure(list(species = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
> 2L, 3L, 3L, 3L, 3L, 5L, 5L, 5L, 5L), o2con = c(0.5, 0.6, 0.4,
> 0.4, 0.5, 0.3, 0.4, 0.5, 0.7, 0.9, 0.3, 0.7, 0.4, 0.3, 0.3, 0.6,
> 0.9, 0.2), bm = c(5L, 2L, 4L, 2L, 3L, 7L, 8L, 3L, 4L, 2L, 6L,
> 2L, 1L, 7L, 2L, 1L, 7L, 5L)), .Names = c("species", "o2con",
> "bm"), class = "data.frame", row.names = c(NA, -18L))
>
> This is easily copied and pasted into anyone's R session....but I digress.
> ------
>
> Calling your data frame d, here's how to run the same regression model on
> all species:
>
> # Create a function to perform the modeling, taking a data frame df as input
> f <- function(df) lm(o2con ~ bm, data = df)
>
> # Use lapply() to apply the function to each 'split' of the data, by species:
> v <- lapply(split(d, d$species), f)
>
> # v is a list object, where each component of the list is an lm object,
> # which itself is a list. In other words, it's a list of lists.
> # do.call() is a very useful function that applies a function to
> # components of a list. rbind and cbind are commonly used to slurp
> # together common elements from each component of a list.
>
> # Pulling out the coefficients from each model:
> > do.call(rbind, lapply(v, coef))
>   (Intercept)          bm
> 1   0.5176471 -0.01176471
> 2   0.9253731 -0.07611940
> 3   0.5942308 -0.04230769
> 5   0.3351648  0.04395604
>
> # Extract the r-squared values from each model:
> g <- function(m) summary(m)$r.squared
> > do.call(rbind, lapply(v, g))
>         [,1]
> 1 0.03361345
> 2 0.66932578
> 3 0.43291592
> 5 0.14652015
>
> # But you have to be careful...e.g., since you have unequal sample sizes
> # per species,
> > do.call(cbind, lapply(v, resid))
>             1           2            3          5
> 1  0.04117647 -0.09253731 -0.040384615 -0.1230769
> 2  0.10588235  0.08358209  0.190384615  0.2208791
> 3 -0.07058824 -0.19701493 -0.151923077  0.2571429
> 4 -0.09411765  0.07910448  0.001923077 -0.3549451
> 5  0.01764706  0.12686567 -0.040384615 -0.1230769
> Warning message:
> In function (..., deparse.level = 1) :
>   number of rows of result is not a multiple of vector length (arg 3)
>
> Notice how the first residual is recycled in each of groups 3 and 5. That's
> a potential gotcha.
>
> This gives you a small glimpse into the power that R can deliver in data
> analysis.
>
> HTH,
> Dennis
>
> On Sun, Jul 18, 2010 at 2:29 PM, karmakiller <roisinmoria...@gmail.com> wrote:
>
>> Hi All,
>>
>> I have a large data set with many columns of data. One of these columns is
>> a species identifier and the remainder are variables such as temperature or
>> mass. Currently I am carrying out a single regression on subsets of the
>> data set, e.g. separated data sets with only the data from one species at a
>> time. I have been searching for a thread that will help me to understand
>> how best to repeat this process for each different species identifier in
>> that variable column. I can’t seem to find one that is similar to what I am
>> trying to do. It might be the case that I am not looking for the right
>> thing or that I do not fully understand the process.
>>
>> How do I run a simple loop that produces a regression for each species as
>> identified in the variable species id, which is one column in the large
>> data set that I am using?
>>
>> Simple regression that I wish to repeat
>>
>> data<- read.table("…/STUDY.txt",header=T)
>> names(data)
>> model<- with(data,{lm(o2con~bm)})
>> summary(model)
>>
>> sample data set
>>
>> species id o2con bm
>> 1 0.5 5
>> 1 0.6 2
>> 1 0.4 4
>> 1 0.4 2
>> 1 0.5 3
>> 2 0.3 7
>> 2 0.4 8
>> 2 0.5 3
>> 2 0.7 4
>> 2 0.9 2
>> 3 0.3 6
>> 3 0.7 2
>> 3 0.4 1
>> 3 0.3 7
>> 5 0.3 2
>> 5 0.6 1
>> 5 0.9 7
>> 5 0.2 5
>>
>> I would be very grateful for some help with this. I really like using R and
>> I can usually figure out what I want to do but I have been trying to figure
>> this out for a while now and I am getting nowhere.
>>
>> Thank you.
>>
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/simple-loop-analysing-subsets-tp2293383p2293383.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

--
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
http://www.joshuawiley.com/
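P.S. On the recycling gotcha Dennis mentions in the quoted message: one way around it, as a rough sketch, is to keep the residuals in a list or stack them in long format instead of calling cbind() on vectors of unequal length. The names res_list and res_long are my own and only for illustration.

```r
# Data copied from the dput() output in the quoted message
d <- data.frame(
  species = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
              3L, 3L, 3L, 3L, 5L, 5L, 5L, 5L),
  o2con = c(0.5, 0.6, 0.4, 0.4, 0.5, 0.3, 0.4, 0.5, 0.7, 0.9,
            0.3, 0.7, 0.4, 0.3, 0.3, 0.6, 0.9, 0.2),
  bm = c(5L, 2L, 4L, 2L, 3L, 7L, 8L, 3L, 4L, 2L,
         6L, 2L, 1L, 7L, 2L, 1L, 7L, 5L)
)

# Fit one model per species, as in the quoted message
f <- function(df) lm(o2con ~ bm, data = df)
v <- lapply(split(d, d$species), f)

# Keep the residuals as a list: each element keeps its own length,
# so nothing is recycled
res_list <- lapply(v, resid)
n_per_group <- sapply(res_list, length)

# Stack into a long data frame with one row per observation
res_long <- data.frame(
  species = rep(names(res_list), times = n_per_group),
  resid   = unlist(res_list, use.names = FALSE)
)

nrow(res_long)  # 18, one residual per row of d
```

The long layout has one row per observation, so groups 3 and 5 contribute four residuals each and no padding is needed.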