Hi Daniele,

One possibility would be to make two runs. In the first run you do not build the matrix; you only calculate (in a loop) the number of rows you will need. Then you allocate the matrix once, at its final size, and fill it in the second run.
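A minimal sketch of the two-run idea, using the example data frames from your message as a stand-in for the real per-ID calculation. The names vars, n_rows, start and end are only illustrative, and I am assuming the row count for each ID can be obtained cheaply before the values are stored (here it is simply nrow() of the example data):

```r
# Known set of variables; a variable missing for an ID stays 0.
vars <- c('A', 'B', 'C', 'D')

# Stand-in for the per-ID results (taken from the original post).
alist <- list(
  data.frame(ID = 1, timestamp = 1:14, val = rnorm(14), A = 1, B = 2, C = 3),
  data.frame(ID = 2, timestamp = 2:15, val = rnorm(14), B = 2, C = 3, D = 4),
  data.frame(ID = 3, timestamp = 3:30, val = rnorm(28), C = 1, D = 2)
)

# First run: only count the rows each ID will contribute.
n_rows <- vapply(alist, nrow, integer(1))

# Allocate the matrix once, at its final size. Filling with 0 instead
# of -1 means no clean-up pass is needed afterwards.
out <- matrix(0, nrow = sum(n_rows), ncol = 3 + length(vars),
              dimnames = list(NULL, c('ID', 'timestamp', 'val', vars)))

# Row range reserved for each ID.
end <- cumsum(n_rows)
start <- end - n_rows + 1L

# Second run: write each block into its reserved rows, matching
# columns by name so absent variables keep their 0.
for (i in seq_along(alist)) {
  block <- as.matrix(alist[[i]])
  rows <- seq.int(start[i], length.out = n_rows[i])
  out[rows, colnames(block)] <- block
}

out <- as.data.frame(out)
```

Because the matrix never has to grow, there is no repeated copying, and since it starts at 0 there are no leftover -1 rows or values to clean. Whether this is fast enough for 15000 IDs I have not measured.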
Regards,
Moshe.

--- On Wed, 24/2/10, Daniele Amberti <daniele.ambe...@ors.it> wrote:

> From: Daniele Amberti <daniele.ambe...@ors.it>
> Subject: [R] Optimise huge data.frame construction
> To: "r-help@r-project.org" <r-help@r-project.org>
> Received: Wednesday, 24 February, 2010, 8:34 PM
>
> I have data for different items (ID) in a database.
> For each ID I have to get:
>
> - the timestamp of the observation (timestamp);
> - a numerical value (val) that will be my response variable in some kind of model;
> - a variable number of variables from a known set (if the value for a specific variable is not present in the DB, it is 0).
>
> To get the above-mentioned values I have to cycle over the IDs, make some calculations and store the results to construct a huge data.frame for subsequent estimations. The number of rows for each ID is random (typically 14 to 200).
>
> My current approach is to construct a matrix like this:
>
> out <- c('A', 'B', 'C', 'D')
> out <- matrix(-1, 5000, 3 + length(out), dimnames = list(1:5000, c('ID', 'timestamp' , 'val', out)))
>
> I access the out matrix by numerical index to substitute values ( out[1:n,1] <- k ).
> When the matrix is full I add 5000 rows and go on.
> Afterwards I remove the rows with ID set to -1 and then replace all other -1 values with 0.
>
> For my application an ID typically has between 14 and 200 observations (mean around 50), but I have 15000 IDs ...
> After profiling I realized that accessing the out matrix this way is too slow.
>
> Do you have any idea on how to speed up this kind of process?
> I think something can be done by creating a data.frame for each ID and binding them at the end. Is it a good idea? How can I implement that? A list of data.frames? And then?
>
> Below is some code that can be useful if someone would like to experiment ...
>
> alist <- vector('list', 3)
> alist[[1]] <- data.frame( ID = 1, timestamp = 1:14, val = rnorm(14), A = 1, B = 2, C = 3 )
> alist[[2]] <- data.frame( ID = 2, timestamp = 2:15, val = rnorm(14), B = 2, C = 3, D = 4 )
> alist[[3]] <- data.frame( ID = 3, timestamp = 3:30, val = rnorm(28), C = 1, D = 2 )
>
> Thanks in advance for your valuable help.
> Daniele
>
> ORS Srl

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.