Dear Jim, Thanks for telling me about gc() - that may help.
Here's some more detail on my function. It implements a methodology for classifying plant communities into functional groups according to the values of species traits, and for comparing the usefulness of alternative classifications by using them to summarise species abundance data and then correlating site differences based on these abundance data with site differences based on environmental variables. The method is described in Pillar et al. (2009), J. Vegetation Science 20: 334-348.

First the function produces a set of classifications of species by applying agnes() from the package "cluster" to all possible combinations of the variables of a species-by-traits dataframe Q. It also prepares a distance matrix dR based on environmental variables for the same sites.

Then a loop takes each ith classification in turn and summarises a raw-data dataframe of species abundances into a shorter dataframe Xi, by grouping clusters of its rows according to the classification. It then calculates a distance matrix dXi based on this summary of abundances, and another distance matrix dQi based on the corresponding variables of Q directly. Finally in the loop, mantel.partial() from the package "vegan" is used to run a partial Mantel test between dXi and dR, conditioned on dQi. The argument "permutations" is set to zero, and only the Mantel statistic is stored.

The loop also contains a forward stepwise selection procedure, so not all classifications are actually used: after all classifications using (e.g.) a single variable have been tried, the variable(s) involved in the best classification(s) are specified for inclusion in the next cycle of the loop.

I wonder how lucid that all was... I began putting together the main parts of the code, but I fear it takes so much explanation (not to mention editing for transparency) that it may not be worth the effort unless someone is really committed to following it through - it's about 130 lines in total.
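The pipeline described above could be sketched roughly as follows. This is only an illustrative reconstruction: the objects Q, X, env, combos and ngroups, and the helper trait_based_site_dist(), are invented stand-ins (the construction of dQi is not specified in the email); agnes() is from "cluster" and mantel.partial() from "vegan".

```r
library(cluster)  # agnes()
library(vegan)    # mantel.partial()

## Hypothetical inputs:
##   Q   - species-by-traits dataframe
##   X   - species-by-sites abundance dataframe (rownames match Q)
##   env - sites-by-variables environmental dataframe

dR <- dist(scale(env))  # site distances from environmental variables

## combos: the trait subsets tried in one cycle of the stepwise search
combos <- list("v1", c("v1", "v2"))  # purely illustrative
stat   <- numeric(length(combos))

for (i in seq_along(combos)) {
  ## classify species on this subset of traits
  cl <- cutree(as.hclust(agnes(Q[, combos[[i]], drop = FALSE])), k = ngroups)

  ## Xi: abundances summed within functional groups (rows of X collapsed)
  Xi  <- rowsum(X, group = cl)
  dXi <- dist(t(Xi))  # site distances from the summarised abundances

  ## dQi: site distances based on the same trait variables
  ## (hypothetical helper; the email does not give this step)
  dQi <- trait_based_site_dist(Q[, combos[[i]], drop = FALSE], X)

  ## partial Mantel test; permutations = 0, keep only the statistic
  stat[i] <- mantel.partial(dXi, dR, dQi, permutations = 0)$statistic
}
```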
I could still do this if it's likely to be worthwhile... However, the stepwise procedure was only fully implemented after I sent my first email. Now that none of the iterative output is stored except the final Mantel statistic and essential records of which classifications were used, the memory demand has decreased. The problem that now presents itself is simply that the function still takes a very long time to run (e.g. 10 hours to go through 7 variables stepwise, with distance matrices of dimension 180 or so).

Two parts of the code that already feel clumsy to me are:

unstack(stack(by(X, f, colSums)))
# to reduce a dataframe X to one with fewer rows by summing within
# sets of rows defined by the factor f

and

V <- list()
for (i in 1:n) V[[i]] <- grep(pattern[i], x)
names(V) <- 1:n
V <- stack(V)
V[, 1]
# to get the indices of multiple matches from x (a vector of variable
# names, some of which may be repeated) for each of the n character
# strings in pattern

The code also involves relating columns of dataframes to each other by character matching - e.g. naming columns with paste()-ed strings of all the variables used to create them, and then strsplit()-ing these names so that columns containing any of the specified variable names can be selected.

I'm grateful for any advice! Thanks, Richard.

________________________________
From: jim holtman <jholt...@gmail.com>
To: Richard Gunton <r.gun...@talk21.com>
Sent: Tuesday, 8 September, 2009 2:08:51 PM
Subject: Re: [R] How to reduce memory demands in a function?

Can you at least post what the function is doing and, better yet, provide commented, minimal, self-contained, reproducible code? You can put in calls to memory.size() to see how large things are growing, delete temporary objects when not needed, make calls to gc(), ..., but it is hard to tell without an example.
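For concreteness, here is a toy run of the two idioms mentioned above (data invented), alongside the more direct base-R equivalents rowsum() and unlist(lapply(..., grep)):

```r
X <- data.frame(a = 1:4, b = 5:8)
f <- factor(c("g1", "g1", "g2", "g2"))

## Group-wise column sums: rowsum() collapses rows of X within the
## levels of f directly (same sums as unstack(stack(by(X, f, colSums))),
## but with groups as rows rather than columns)
rowsum(X, f)
##    a  b
## g1 3 11
## g2 7 15

## Indices of all matches of several patterns in a vector of names
x <- c("alpha", "beta", "alpha", "gamma")
pattern <- c("alpha", "gamma")
unlist(lapply(pattern, grep, x = x))
## [1] 1 3 4
```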
On Mon, Sep 7, 2009 at 4:16 AM, Richard Gunton <r..gun...@talk21.com> wrote:
> I've written a function that regularly throws the "cannot allocate vector of size X Kb" error, since it contains a loop that creates large numbers of big distance matrices. I'd be very grateful for any simple advice on how to reduce the memory demands of my function.
>
> Besides increasing memory.size to the maximum available, I've tried reducing my "dist" objects to 3 significant figures (not sure if that made any difference), I've tried the distance function daisy() from the package "cluster" instead of dist(), and I've avoided storing unnecessary intermediate objects as far as possible by nesting functions in the same command. I've even tried writing each of my dist() objects to a text file, one line for each, and reading them in again one at a time as and when required, using scan() - and although this seemed to avoid the memory problem, it ran so slowly that it wasn't much use for someone with deadlines to meet...
>
> I don't have formal training in programming, so if there's something handy I should read, do let me know.
>
> Thanks, Richard Gunton.
> Postdoctoral researcher in arable weed ecology, INRA Dijon.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.