Hi Harold,
Generally: you cannot beat data.table, unless you can represent your
data in a matrix (or array or vector). For some specific cases, Hervé's
suggestion might also be competitive.
Your problem is that you did not put any effort into reading at least
part of the very extensive documentation.
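For reference, a minimal sketch of the keyed data.table approach, using the
tmp/idList example from this thread (the setkey() step is an assumption; the
original messages never show it):

library(data.table)

tmp <- data.frame(id = rep(1:2, each = 10), foo = rnorm(20))
idList <- unique(tmp$id)

tmp2 <- as.data.table(tmp)
setkey(tmp2, id)   # sort once by id; later lookups use a binary search

system.time(replicate(500, tmp2[.(idList[1])]))   # keyed subset, no full scan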
On 09/28/2016 02:53 PM, Hervé Pagès wrote:
> Hi,
> I'm surprised nobody suggested split(). Splitting the data.frame
> upfront is faster than repeatedly subsetting it:
> tmp <- data.frame(id = rep(1:2, each = 10), foo = rnorm(20))
> idList <- unique(tmp$id)
> system.time(for (i in idList) tmp[which(tmp$id == i),])
"I'm surprised nobody suggested split(). "
I did.
by() is a data frame oriented version of tapply(), which uses split().
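A minimal illustration with the thread's toy data (a sketch, not code from
the original message):

tmp <- data.frame(id = rep(1:2, each = 10), foo = rnorm(20))

# by() splits tmp on id, then applies the function to each piece
by(tmp, tmp$id, function(d) mean(d$foo))

# the same computation via split() directly
sapply(split(tmp, tmp$id), function(d) mean(d$foo))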
Cheers,
Bert
Bert Gunter
"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
Hi,
I'm surprised nobody suggested split(). Splitting the data.frame
upfront is faster than repeatedly subsetting it:
tmp <- data.frame(id = rep(1:2, each = 10), foo = rnorm(20))
idList <- unique(tmp$id)
system.time(for (i in idList) tmp[which(tmp$id == i),])
# user system elapsed
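A sketch of the split() comparison the message describes (not Hervé's exact
code):

tmpList <- split(tmp, tmp$id)   # pay the grouping cost once
system.time(for (i in idList) tmpList[[as.character(i)]])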
I regularly crunch through this amount of data with tidyverse. You can also
try the data.table package. They are optimized for speed, as long as you
have the memory.
Dominik
On Wed, Sep 28, 2016 at 10:09 AM, Doran, Harold wrote:
> I have an extremely large data frame (~13 million rows) that resembles
> the structure of the object tmp below in the reproducible code.
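A sketch of the tidyverse version of the subset (dplyr's filter(); not code
from the original message):

library(dplyr)

tmp <- data.frame(id = rep(1:2, each = 10), foo = rnorm(20))
filter(tmp, id == 1)   # equivalent of tmp[tmp$id == 1, ]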
On Wed, 28 Sep 2016, "Doran, Harold" writes:
> I have an extremely large data frame (~13 million rows) that resembles
> the structure of the object tmp below in the reproducible code. In my
> real data, the variable, 'id' may or may not be ordered, but I think
> that is irrelevant.
>
> I have a process that requires subsetting the data by id and then
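A sketch reconstructing the reproducible code from the replies in this thread
(the definitions of tmp and idList appear verbatim in Hervé's message):

tmp <- data.frame(id = rep(1:2, each = 10), foo = rnorm(20))
idList <- unique(tmp$id)

# the indexing method the replies benchmark against
system.time(replicate(500, tmp[which(tmp$id == idList[1]), ]))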
compared to the indexing method.
Perhaps I'm using it incorrectly?
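A likely answer, sketched on the thread's tmp2 object (the setkey() step is
an assumption, not in the original messages): subsetting a data.table via
which() or subset() scans the whole id column on every call, forfeiting
data.table's main advantage, whereas a keyed lookup does a binary search:

setkey(tmp2, id)                                  # sort once by id
system.time(replicate(500, tmp2[.(idList[1])]))   # binary search per call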
-----Original Message-----
From: Constantin Weiser [mailto:constantin.wei...@hhu.de]
Sent: Wednesday, September 28, 2016 12:55 PM
To: r-help@r-project.org
Cc: Doran, Harold
Subject: Re: [R] Faster Subsetting
I just mod
tmp2 <- as.data.table(tmp) # data.table

system.time(replicate(500, tmp2[which(tmp$id == idList[1]),]))

system.time(replicate(500, subset(tmp2, id == idList[1])))

-----Original Message-----
From: Dominik Schneider [mailto:dosc3...@colorado.edu]
Sent: Wednesday, September 28, 2016 12:27 PM
To: Doran, Harold
Cc: r-help@r-project.org
Subject: Re: [R] Faster Subsetting
Hello,
If you work with a matrix instead of a data.frame, it usually runs
faster, but your column vectors must all be numeric.
### Fast, but not fast enough
system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
   user  system elapsed
   0.05    0.00    0.04
### Not fast at all, a
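A sketch of the matrix version being recommended (both columns of tmp are
numeric here, so as.matrix() yields a numeric matrix; this is not the
author's exact code):

tmpMat <- as.matrix(tmp)   # numeric matrix with columns "id" and "foo"
system.time(replicate(500, tmpMat[tmpMat[, "id"] == idList[1], ]))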