Re: [R] [FORGED] Splitting data.frame into a list of small data.frames given indices

Ivan Calandra Wed, 29 Jun 2016 06:57:04 -0700

Hi,

I don't really understand why you split every row... This makes it very slow. Try with a more realistic example (with a factor to split).
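As a rough sketch of that comparison (the data size, the 100-level factor `g`, and the variable names are illustrative assumptions, not taken from the thread), the same data frame can be split once per row and once by a factor with far fewer levels:

```r
# Sketch: same data, two different grouping vectors (sizes are illustrative).
set.seed(1)
n <- 2e4
dum <- data.frame(x = runif(n, 1, 1000), y = runif(n, 1, 1000))

# Hypothetical grouping factor with 100 levels, closer to a typical use of split()
g <- factor(sample(100, n, replace = TRUE), levels = 1:100)

# One group per row, as in the benchmark quoted below
t_rows <- system.time(by_row <- split(dum, seq_len(nrow(dum))))

# 100 groups
t_factor <- system.time(by_group <- split(dum, g))

length(by_row)    # number of one-row data frames
length(by_group)  # number of per-level data frames
rbind(per_row = t_rows, per_factor = t_factor)[, "elapsed"]
```

Comparing the two elapsed times on your own machine should indicate how much of the cost comes from the number of groups rather than from the number of rows.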


Ivan

--
Ivan Calandra, PhD
Scientific Mediator
University of Reims Champagne-Ardenne
GEGENAA - EA 3795
CREA - 2 esplanade Roland Garros
51100 Reims, France
+33(0)3 26 77 36 89
ivan.calan...@univ-reims.fr
--
https://www.researchgate.net/profile/Ivan_Calandra
https://publons.com/author/705639/

On 29/06/2016 at 15:21, Witold E Wolski wrote:

Hi,

Here is a complete example which shows that the complexity of split or
by is O(n^2):

nrows <- c(1e3,5e3, 1e4 ,5e4, 1e5 ,2e5)
res<-list()

for(i in nrows){
   dum <- data.frame(x = runif(i,1,1000), y=runif(i,1,1000))
   res[[length(res)+1]]<-(system.time(x<- split(dum, 1:nrow(dum))))
}
res <- do.call("rbind",res)
plot(nrows^2, res[,"elapsed"])

And I can't see a reason why this has to be so slow.
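One way to check the growth rate without reading it off a plot is a log-log fit; the following is only a sketch, with smaller sizes than above so it runs quickly, and with the names `elapsed`, `fit` and `slope` introduced purely for illustration:

```r
# Sizes are kept small so the sketch runs in a few seconds
# (an assumption, not the sizes used in the thread).
nrows <- c(2e3, 4e3, 8e3, 1.6e4)

# Time a one-group-per-row split for each size
elapsed <- vapply(nrows, function(n) {
  dum <- data.frame(x = runif(n, 1, 1000), y = runif(n, 1, 1000))
  system.time(split(dum, seq_len(n)))[["elapsed"]]
}, numeric(1))

# Guard against zero timings before taking logs
elapsed <- pmax(elapsed, 1e-3)

# The slope of the log-log fit estimates k in elapsed ~ n^k
fit <- lm(log(elapsed) ~ log(nrows))
slope <- unname(coef(fit)[2])
slope
```

A slope near 1 would suggest roughly linear growth and a slope near 2 quadratic growth; with so few points and noisy timings the estimate is only indicative.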


cheers

On 29 June 2016 at 12:00, Rolf Turner <r.tur...@auckland.ac.nz> wrote:

On 29/06/16 21:16, Witold E Wolski wrote:

It's the inverse problem to merging a list of data.frames into a large
data.frame, just discussed in the "performance of do.call("rbind")"
thread.

I would like to split a data.frame into a list of data.frames
according to the first column.
This SEEMS to be easily possible with the function base::by. However,
as soon as the data.frame has a few million rows this function CANNOT
BE USED (unless you have PLENTY OF TIME).

For 'by', runtime ~ nrow^2, or formally O(n^2) (see benchmark below).

So basically I am looking for a similar function with better complexity.
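For what it's worth, base::split applied to the same columns and the same grouping column appears to return the same pieces as by() with an identity function. A minimal sketch, assuming a stand-in for 'peaks' (the real object is not shown in the thread, so the column names and sizes here are hypothetical):

```r
# 'peaks' is not shown in the thread; this stand-in is purely hypothetical.
set.seed(42)
peaks <- data.frame(id        = sample(50, 1e4, replace = TRUE),
                    mz        = runif(1e4, 100, 2000),
                    intensity = rexp(1e4))

# by() with an identity function, as in the benchmark
via_by <- by(peaks[, 2:3], INDICES = list(peaks[, 1]),
             FUN = function(x) x, simplify = FALSE)

# split() on the same columns, grouped by the first column
via_split <- split(peaks[, 2:3], peaks[, 1])

# Compare group labels and the per-group data frames
same_names <- identical(names(via_split), dimnames(via_by)[[1]])
same <- vapply(seq_along(via_split),
               function(i) isTRUE(all.equal(via_split[[i]], via_by[[i]])),
               logical(1))
same_names
all(same)
```

Wrapping either call in system.time() on the real data would show whether split() scales better for that particular grouping column.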


> nrows <- c(1e5,1e6,2e6,3e6,5e6)
> timing <- list()
> for(i in nrows){
+   dum <- peaks[1:i,]
+   timing[[length(timing)+1]] <- system.time(x <- by(dum[,2:3],
+     INDICES=list(dum[,1]), FUN=function(x){x}, simplify = FALSE))
+ }
> names(timing) <- nrows
> timing

$`1e+05`
    user  system elapsed
    0.05    0.00    0.05

$`1e+06`
    user  system elapsed
    1.48    2.98    4.46

$`2e+06`
    user  system elapsed
    7.25   11.39   18.65

$`3e+06`
    user  system elapsed
   16.15   25.81   41.99

$`5e+06`
    user  system elapsed
   43.22   74.72  118.09


I'm not sure that I follow what you're doing, and your example is not
reproducible, since we have no idea what "peaks" is, but on a toy example
with 5e6 rows in the data frame I got a timing result of

    user  system elapsed
   0.379   0.025   0.406

when I applied split().  Is this adequately fast? Seems to me that if you
want to split something, split() would be a good place to start.

cheers,

Rolf Turner

--
Technical Editor ANZJS
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276


______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
