Very nice! I am quite impressed at how flexible data.table is.
On Thu, Oct 13, 2011 at 1:05 AM, Matthew Dowle wrote:
> Using Josh's nice example, with data.table's built-in 'by' (optimised
> grouping) yields a 6 times speedup (100 seconds down to 15 on
> my netbook).
>
>> system.time(all.2b <- lapply(si, function(.indx) { coef(lm(y ~
>> + x, data=d[.indx,])) }))
Using Josh's nice example, with data.table's built-in 'by' (optimised
grouping) yields a 6 times speedup (100 seconds down to 15 on
my netbook).
> system.time(all.2b <- lapply(si, function(.indx) { coef(lm(y ~
+ x, data=d[.indx,])) }))
user system elapsed
144.501 0.300 145.525
> system.
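The 'by' timing command is truncated above. A hedged sketch of what a data.table grouped regression of this shape looks like (table, column, and result names here are illustrative, not Matthew's actual code):

```r
library(data.table)

## Toy data: 100 groups of 50 rows (sizes chosen for illustration,
## not the thread's benchmark)
set.seed(1)
d <- data.table(grp = rep(1:100, each = 50),
                x = rnorm(5000), y = rnorm(5000))

## One lm() per group inside data.table's optimised 'by' grouping;
## as.list() spreads the coefficient vector across result columns
all.2c <- d[, as.list(coef(lm(y ~ x))), by = grp]
```

The result has one row per group, with columns grp, (Intercept), and x.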
On Wed, Oct 12, 2011 at 4:56 AM, ivo welch wrote:
> thanks, josh. in my posting example, I did not need anything except
> coefficients. (when this is the case, I usually do not even use
> lm.fit, but I eliminate all missing obs first and then use
> solve(crossprod(cbind(1,x)), crossprod(cbind(1,x), y))
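ivo's shortcut is just the normal equations. A minimal standalone sketch of the idea (my illustration, not his exact code):

```r
set.seed(42)
n <- 1000
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

X <- cbind(1, x)  # design matrix with an explicit intercept column

## Solve X'X b = X'y directly; much cheaper than lm() when only the
## coefficients are wanted and missing obs were removed up front
b <- solve(crossprod(X), crossprod(X, y))
```

The result agrees with coef(lm(y ~ x)) up to numerical error.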
>>>> (and assumes the data.frame doesn't include matrices
>>>> or other data.frames) and relies on split(vector,factor)
>>>> quickly splitting a vector into a list of vectors.
>>>> For a 10^6 row by 10 column data.frame split in 10^5
>>>> groups
>>> Something based on this idea would help your
>>> parallelized by().
>>>
>>> mysplit.data.frame <-
>>> function (x, f, drop = FALSE, ...)
>>> {
>>>     f <- as.factor(f)
>>>     tmp <- lapply(x, function(xi) split(xi, f, drop = drop, ...))
>>>     rn <- split(rownames(x), f, drop = drop, ...)
>>>     tmp <- unlist(unname(tmp), recursive = FALSE)
>>>     tmp <- split(tmp, factor(names(tmp), levels = unique(names(tmp))))
>>>     tmp <- lapply(setNames(seq_along(tmp), names(tmp)), function(i) {
>>>         t <- tmp[[i]]
>>>         names(t) <- names(x)
>>>         attr(t, "row.names") <- rn[[i]]
>>>         class(t) <- "data.frame"
>>>         t
>>>     })
>>>     tmp
>>> }
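The column-wise trick can be exercised in isolation: split each column with the fast split(vector, factor) call, then rebuild one group's data.frame by hand, skipping data.frame()'s overhead (a minimal sketch; the data and variable names are mine):

```r
x <- data.frame(a = 1:6, b = letters[1:6])
f <- factor(c("g1", "g2", "g1", "g2", "g1", "g2"))

## Split every column separately: each split() call only
## partitions a plain vector, which is cheap
cols <- lapply(x, split, f)
rn <- split(rownames(x), f)

## Reassemble group "g1" as a data.frame without calling data.frame()
g1 <- lapply(cols, `[[`, "g1")
attr(g1, "row.names") <- rn[["g1"]]
class(g1) <- "data.frame"
```

Setting the class and row.names attributes directly is the same low-level construction the function above relies on.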
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
> On Behalf Of Jim Holtman
> Sent: Monday, October 10, 2011 7:29 PM
> To: ivo welch
> Cc: r-help
> Subject: Re: [R] SLOW split() function
>
> instead of splitting the entire dataframe, split the indices and then use
> these to access your data:
> try
>
> system.time(s <- split(seq(nrow(d)), d$key))
I tried this:
library(data.table)
N <- 1000
T <- N*10
d <- data.table(gp= rep(1:T, rep(N,T)), val=rnorm(N*T), key = 'gp')
dim(d)
[1] 10000000        2
# On my humble 8Gb system,
> system.time(l <- d[, split(val, gp)])
user system elapsed
  4.15    0.09    4.27
I wouldn't be surprised
instead of splitting the entire dataframe, split the indices and then use these
to access your data: try
system.time(s <- split(seq(nrow(d)), d$key))
this should be faster and less memory intensive. you can then use the indices
to access the subset:
result <- lapply(s, function(.indx){
do
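Jim's code is cut off above. A hedged reconstruction of the full pattern, borrowing the per-group coef(lm(...)) body that appears earlier in the thread (the data and column names here are illustrative):

```r
set.seed(1)
d <- data.frame(key = rep(1:100, each = 50),
                x = rnorm(5000), y = rnorm(5000))

## Split row indices rather than the data.frame itself:
## only one integer vector is partitioned per group
s <- split(seq(nrow(d)), d$key)

## Subset on demand inside the per-group function
result <- lapply(s, function(.indx) {
    coef(lm(y ~ x, data = d[.indx, ]))
})
```

This yields one coefficient vector (intercept and slope) per group while never materialising a list of sub-data.frames.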