Dear Jim,

Thanks for telling me about gc() - that may help.

Here's some more detail on my function.  It implements a methodology for classifying plant communities into functional groups according to the values of species traits, and for comparing the usefulness of alternative classifications: each classification is used to summarise species abundance data, and site-differences based on these summarised abundances are then correlated with site-differences based on environmental variables.  The method is described in Pillar et al. (2009), J. Vegetation Science 20: 334-348.

First the function produces a set of classifications of species by applying agnes() from the package "cluster" to all possible combinations of the variables of a species-by-traits dataframe Q.  It also prepares a distance matrix dR based on environmental variables for the same sites.  Then a loop takes each classification in turn and summarises a raw-data dataframe of species abundances into a shorter dataframe Xi, by pooling clusters of its rows according to the classification.  It then calculates one distance matrix (dXi) based on this summary of abundances, and another (dQi) based directly on the corresponding variables of Q.  Finally in the loop, mantel.partial() from the package "vegan" is used to run a partial Mantel test between dXi and dR, conditioned on dQi.  The argument "permutations" is set to zero, and only the Mantel statistic is stored.
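
In skeleton form it looks something like this - a rough, runnable sketch with made-up data, not my actual code: the object names (Q, X, E, combos) are placeholders, and the dQi step here (distances between abundance-weighted trait means per site) is only a schematic stand-in for how the real code derives it:

    library(cluster)  # agnes(), daisy()
    library(vegan)    # mantel.partial()

    set.seed(1)
    nspp <- 30; nsites <- 10
    Q <- data.frame(trait1 = rnorm(nspp), trait2 = rnorm(nspp))  # species x traits
    X <- matrix(rpois(nspp * nsites, 3), nspp, nsites)           # species x sites abundances
    E <- data.frame(env1 = rnorm(nsites), env2 = rnorm(nsites))  # sites x environment

    dR <- daisy(E)                                # site distances from environment
    combos <- list("trait1", "trait2", c("trait1", "trait2"))    # trait combinations
    mstat <- numeric(length(combos))
    for (i in seq_along(combos)) {
      Qi <- Q[, combos[[i]], drop = FALSE]
      cl <- cutree(as.hclust(agnes(Qi)), k = 4)   # classification of species
      Xi <- rowsum(X, cl)                         # pool abundances within species groups
      dXi <- dist(t(Xi))                          # site distances from pooled abundances
      Ti <- t(X) %*% as.matrix(Qi) / colSums(X)   # abundance-weighted trait means per site
      dQi <- dist(Ti)                             # site distances from the traits directly
      # permutations = 0 skips the permutation test; only the statistic is kept
      mstat[i] <- mantel.partial(dXi, dR, dQi, permutations = 0)$statistic
    }
    mstat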

The loop also contains a forward stepwise selection procedure, so not all classifications are actually tried. After all classifications using (e.g.) a single variable have been assessed, the variable(s) involved in the best classification(s) are specified for inclusion in the next cycle of the loop.  I wonder how lucid that all was... a toy sketch follows.
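
Schematically, the forward step works like this - a self-contained toy in which score() merely stands in for running the partial Mantel test on one combination of variables, and where I keep a single best combination per cycle for simplicity (the real code may carry several):

    vars <- paste0("v", 1:7)
    score <- function(combo) length(combo) - 0.3 * length(combo)^2  # stand-in statistic
    selected <- character(0)
    best_so_far <- -Inf
    repeat {
      remaining <- setdiff(vars, selected)
      if (length(remaining) == 0) break
      combos <- lapply(remaining, function(v) c(selected, v))  # grow each by one variable
      stats <- vapply(combos, score, numeric(1))
      if (max(stats) <= best_so_far) break                     # no improvement: stop
      best_so_far <- max(stats)
      selected <- combos[[which.max(stats)]]                   # keep the best combination
    }
    selected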

I began putting together the main parts of the code, but I fear it takes so 
much explanation (not to mention editing for transparency) that it may not be 
worth the effort unless someone is really committed to following it through - 
it's about 130 lines in total.  I could still do this if it's likely to be 
worthwhile...

However, the stepwise procedure was only fully implemented after I sent my first email.  Now that none of the iterative output is stored except the final Mantel statistics and essential records of which classifications were used, the memory demand has decreased.  The problem that now presents itself is simply that the function still takes a very long time to run (e.g. 10 hours to work through 7 variables stepwise, with distance matrices of dimension 180 or so).

Two parts of the code that already feel clumsy to me are:

    # reduce the dataframe X to one with fewer rows by summing within
    # the sets of rows defined by the factor f
    unstack(stack(by(X, f, colSums)))

and

    # get the indices of all matches in x (a vector of variable names,
    # some of which may be repeated) for each of several character
    # strings in pattern
    V <- list()
    for (i in 1:n) V[[i]] <- grep(pattern[i], x)
    names(V) <- 1:n
    V <- stack(V)
    V[, 1]
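
For what it's worth, I suspect base R offers shorter routes to both - untested against my real data, so treat these as guesses:

    rowsum(X, f)                          # should give the same within-group
                                          # column sums, one row per level of f
    unlist(lapply(pattern, grep, x = x))  # the matched indices, concatenated
                                          # in pattern order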

The code also involves relating columns of dataframes to each other by character-matching - e.g. naming columns with paste()-ed strings of all the variables used to create them, and subsequently strsplit()-ing these names so that I can select the columns that involve any of the specified variable names.
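
To illustrate the scheme with hypothetical names (not the real code):

    vars <- c("height", "sla", "seedmass")
    nm <- paste(vars[c(1, 3)], collapse = "_")     # column named "height_seedmass"
    parts <- strsplit(nm, "_", fixed = TRUE)[[1]]  # recover the variable names
    "seedmass" %in% parts                          # TRUE -> this column involves seedmass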

I'm grateful for any advice!

Thanks,   Richard.


________________________________
From: jim holtman <jholt...@gmail.com>
To: Richard Gunton <r.gun...@talk21.com>
Sent: Tuesday, 8 September, 2009 2:08:51 PM
Subject: Re: [R] How to reduce memory demands in a function?

Can you at least post what the function is doing and, better yet,
provide commented, minimal, self-contained, reproducible code.  You
can put in calls to memory.size() to see how large things are growing,
delete temporary objects when not needed, make calls to gc(), etc.,
but it is hard to tell what will help without an example.


On Mon, Sep 7, 2009 at 4:16 AM, Richard Gunton <r.gun...@talk21.com> wrote:
I've written a function that regularly throws the "cannot allocate vector of size X Kb" error, since it contains a loop that creates large numbers of big distance matrices.  I'd be very grateful for any simple advice on how to reduce the memory demands of my function.  Besides increasing memory.size to the maximum available, I've tried rounding my "dist" objects to 3 significant figures (not sure if that made any difference), I've tried the distance function daisy() from the package "cluster" instead of dist(), and I've avoided storing unnecessary intermediate objects as far as possible by nesting function calls in the same command.  I've even tried writing each of my dist() objects to a text file, one line each, and reading them back in one at a time with scan() as and when required - this seemed to avoid the memory problem, but it ran so slowly that it wasn't much use for someone with deadlines to meet...

I don't have formal training in programming, so if there's something handy I 
should read, do let me know.

Thanks,

Richard Gunton.

Postdoctoral researcher in arable weed ecology, INRA Dijon.


      