[R] Fastest way to calculate quantile in large data.table

Camilo Mora Thu, 05 Feb 2015 11:51:09 -0800

In total I found 8 different ways to calculate quantiles in a very large 
data.table. I share their performance below for future reference. Tests 1, 7 
and 8 were the fastest I found.


Best,

Camilo

library(data.table)
v <- data.table(x=runif(10000), x2=runif(10000),
                x3=runif(10000), x4=runif(10000))

#fastest
Sys.time()->StartTEST1
t(v[, apply(v,1,quantile,probs =c(.1,.9,.5),na.rm=TRUE)] )
Sys.time()->EndTEST1

Sys.time()->StartTEST2
v[, quantile(.SD,probs =c(.1,.9,.5)), by = 1:nrow(v)]
Sys.time()->EndTEST2

Sys.time()->StartTEST3
v[, c("L","H","M"):=quantile(.SD,probs =c(.1,.9,.5)), by = 1:nrow(v)]
Sys.time()->EndTEST3
v
v[, c("L","H","M"):=NULL]

v[,Names:=rownames(v)]
setkey(v,Names)

Sys.time()->StartTEST4
v[, c("L","H","M"):=quantile(.SD,probs =c(.1,.9,.5)), by = Names]
Sys.time()->EndTEST4
v
v[, c("L","H","M"):=NULL]


Sys.time()->StartTEST5
v[,  as.list(quantile(.SD,c(.1,.90,.5),na.rm=TRUE)), by=Names]
Sys.time()->EndTEST5


Sys.time()->StartTEST6
v[,  as.list(quantile(.SD,c(.1,.90,.5),na.rm=TRUE)), by=Names,.SDcols=1:4]
Sys.time()->EndTEST6


Sys.time()->StartTEST7
v[,  as.list(quantile(c(x, x2, x3, x4), c(.1,.90,.5), na.rm=TRUE)), by=Names]
Sys.time()->EndTEST7


# Melting the table and computing the quantiles by group. This is the second
# fastest, which is ironic given that the table has to be melted first.
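# Recent data.table versions export their own melt(); calling it by namespace
# avoids masking by reshape2::melt, which would not return a data.table here.
Sys.time()->StartTEST8
vs<-data.table::melt(v, id.vars="Names")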
vs[,  as.list(quantile(value,c(.1,.90,.5),na.rm=TRUE)), by=Names]
Sys.time()->EndTEST8


EndTEST1-StartTEST1
EndTEST2-StartTEST2
EndTEST3-StartTEST3
EndTEST4-StartTEST4
EndTEST5-StartTEST5
EndTEST6-StartTEST6
EndTEST7-StartTEST7
EndTEST8-StartTEST8
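
The three fastest tests can also be timed with system.time(), which wraps an expression and reports elapsed seconds, so no Start/End variables are needed. The following is only a sketch under assumptions that are not in the original post: the seed, the smaller row count n, the helper names (probs, timings, m1, m7, m8, same) and the explicit data.table::melt call. It also checks that the three approaches return the same quantiles.

```r
library(data.table)

set.seed(1)   # assumed seed, only to make the sketch repeatable
n <- 1000     # assumed size, smaller than the 10000 rows used above
probs <- c(.1, .9, .5)
v <- data.table(x = runif(n), x2 = runif(n), x3 = runif(n), x4 = runif(n))

# Test 1 style: row-wise apply over the four numeric columns
t1 <- system.time(
  r1 <- t(apply(as.matrix(v), 1, quantile, probs = probs, na.rm = TRUE))
)

# Row identifier used as the grouping column in tests 7 and 8
v[, Names := as.character(seq_len(.N))]

# Test 7 style: pass the four columns of each row as one vector
t7 <- system.time(
  r7 <- v[, as.list(quantile(c(x, x2, x3, x4), probs, na.rm = TRUE)), by = Names]
)

# Test 8 style: melt to long format, then summarise by group
t8 <- system.time({
  vs <- data.table::melt(v, id.vars = "Names")
  r8 <- vs[, as.list(quantile(value, probs, na.rm = TRUE)), by = Names]
})

# Elapsed seconds per approach
timings <- c(test1 = t1[["elapsed"]],
             test7 = t7[["elapsed"]],
             test8 = t8[["elapsed"]])
print(timings)

# Strip names and compare the quantile values from the three approaches
m1 <- unname(r1)
m7 <- unname(as.matrix(r7[, -1, with = FALSE]))
m8 <- unname(as.matrix(r8[, -1, with = FALSE]))
same <- isTRUE(all.equal(m1, m7)) && isTRUE(all.equal(m7, m8))
```

Timings from a single run on a small table are noisy, so the ranking reported above may not reproduce exactly; repeating each call several times gives a steadier comparison.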


