Re: [R] data frame subset too slow

Duke Thu, 30 Dec 2010 08:29:27 -0800

Hi Jim,

Is it really a problem for me to use [1] instead of [[1]]? Will this make it run slower? Also, if I use dat$V1 %in% list$V1, will that be fine?
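
For reference, here is a minimal sketch of how the three forms of extraction differ. The toy data frame and the column name V1 are assumptions made for illustration; they are not the real contents of test.txt.

```r
# Toy data frame; the column names V1 and V2 are assumed for illustration only.
dat <- data.frame(V1 = c("Xkr4", "Rp1", "Sox17"),
                  V2 = c(3204562, 4280926, 4481008),
                  stringsAsFactors = FALSE)

one_col_df <- dat[1]    # single bracket: still a data frame (a list of columns)
col_vec    <- dat[[1]]  # double bracket: the column itself, a character vector
col_named  <- dat$V1    # $ extracts the same vector, by name

class(one_col_df)              # "data.frame"
class(col_vec)                 # "character"
identical(col_vec, col_named)  # TRUE
```

So dat$V1 and dat[[1]] should behave the same way in a %in% test, provided the first column really is named V1.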


Anyway, my data and list are basically gene lists (tab delimited):

$ head test.txt

Xkr4 chr1 - 3204562 3661579 3206102 3661429 3 3204562,3411782,3660632, 3207049,3411982,3661579,
Rp1 chr1 - 4280926 4399322 4283061 4399268 4 4280926,4341990,4342282,4399250, 4283093,4342162,4342918,4399322,
Rp1_2 chr1 - 4333587 4350395 4334680 4342906 4 4333587,4341990,4342282,4350280, 4340172,4342162,4342918,4350395,
Sox17 chr1 - 4481008 4486494 4481796 4483487 5 4481008,4483180,4483852,4485216,4486371, 4482749,4483547,4483944,4486023,4486494,
Mrpl15 chr1 - 4763278 4775807 4764532 4775758 5 4763278,4767605,4772648,4774031,4775653, 4764597,4767729,4772814,4774186,4775807,
Mrpl15_2 chr1 - 4763278 4775807 4775807 4775807 4 4763278,4767605,4772648,4775653, 4764597,4767729,4772814,4775807,

$ head list.txt
GeneNames    Chr    Start    End
0610007C21Rik    chr5    31351012    31356996
0610007L01Rik    chr5    130695613    130719635
0610007L01Rik_2    chr5    130698204    130719635
0610007P08Rik    chr13    63916627    64001609
0610007P08Rik_2    chr13    63916641    63970963
0610007P14Rik    chr12    87156404    87165495
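
As a rough sketch, the list.txt rows above can be read and inspected with str() as follows. The rows are inlined through the text argument of read.table so that no file is needed; stringsAsFactors = FALSE is my own choice here and is not part of the original import code.

```r
# The header and six rows shown by `head list.txt`, inlined with tab separators.
rows <- c(
  "GeneNames\tChr\tStart\tEnd",
  "0610007C21Rik\tchr5\t31351012\t31356996",
  "0610007L01Rik\tchr5\t130695613\t130719635",
  "0610007L01Rik_2\tchr5\t130698204\t130719635",
  "0610007P08Rik\tchr13\t63916627\t64001609",
  "0610007P08Rik_2\tchr13\t63916641\t63970963",
  "0610007P14Rik\tchr12\t87156404\t87165495"
)

# Keep gene names as character strings rather than factors (an assumption).
lst <- read.table(text = rows, header = TRUE, sep = "\t",
                  stringsAsFactors = FALSE)

str(lst)  # shows column names and types
```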

Thanks,

D.

On 12/30/10 11:13 AM, jim holtman wrote:

You should be using dat[[1]].  Here is an example with 80000 rows that
takes about 0.02 seconds to get the subset.

Provide an 'str' of what your data looks like.

n <- 80000  # rows to create
dat <- data.frame(sample(1:200, n, TRUE), runif(n), runif(n), runif(n), runif(n))
lst <- data.frame(sample(1:100, n, TRUE), runif(n), runif(n), runif(n), runif(n))
str(dat)

'data.frame':   80000 obs. of  5 variables:
  $ sample.1.200..n..TRUE.: int  39 116 69 163 51 125 144 32 28 4 ...
  $ runif.n.              : num  0.519 0.793 0.549 0.77 0.272 ...
  $ runif.n..1            : num  0.691 0.89 0.783 0.467 0.357 ...
  $ runif.n..2            : num  0.705 0.254 0.584 0.998 0.279 ...
  $ runif.n..3            : num  0.873 1 0.678 0.702 0.455 ...

str(lst)

'data.frame':   80000 obs. of  5 variables:
  $ sample.1.100..n..TRUE.: int  38 83 38 70 77 44 81 55 32 1 ...
  $ runif.n.              : num  0.0621 0.7374 0.074 0.4281 0.0516 ...
  $ runif.n..1            : num  0.879 0.294 0.146 0.884 0.58 ...
  $ runif.n..2            : num  0.648 0.745 0.825 0.507 0.799 ...
  $ runif.n..3            : num  0.2523 0.1679 0.9728 0.0478 0.0967 ...

system.time({
    dat.sub <- dat[dat[[1]] %in% lst[[1]], ]
})
    user  system elapsed
    0.02    0.00    0.01

str(dat.sub)

'data.frame':   39803 obs. of  5 variables:
  $ sample.1.200..n..TRUE.: int  39 69 51 32 28 4 69 3 48 69 ...
  $ runif.n.              : num  0.5188 0.5494 0.2718 0.5566 0.0893 ...
  $ runif.n..1            : num  0.691 0.783 0.357 0.619 0.717 ...
  $ runif.n..2            : num  0.705 0.584 0.279 0.789 0.192 ...
  $ runif.n..3            : num  0.873 0.678 0.455 0.843 0.383 ...
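
A reproducible sketch of the same idea follows. The seed, the column names key and value, and the reduced number of columns are assumptions added for illustration. It also checks the mask against the documented definition of %in% in terms of match(). Because the keys in lst only cover 1 to 100 while those in dat cover 1 to 200, roughly half of the rows are expected to survive, which agrees with the 39803 of 80000 rows in the output above.

```r
set.seed(42)  # arbitrary seed, added only to make the sketch reproducible
n <- 80000    # rows to create

dat <- data.frame(key = sample(1:200, n, TRUE), value = runif(n))
lst <- data.frame(key = sample(1:100, n, TRUE), value = runif(n))

# Logical mask with one entry per row of dat.
keep_in <- dat[[1]] %in% lst[[1]]

# %in% is documented as match(x, table, nomatch = 0L) > 0L.
keep_match <- match(dat[[1]], lst[[1]], nomatch = 0L) > 0L

identical(keep_in, keep_match)  # TRUE

dat.sub <- dat[keep_in, ]
nrow(dat.sub) == sum(keep_in)   # TRUE
all(dat.sub$key %in% lst$key)   # TRUE
```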

On Thu, Dec 30, 2010 at 10:23 AM, Duke <duke.li...@gmx.com> wrote:

Hi all,

First, I don't have much experience with R, so be gentle. I am dealing with
a dataset (tens of thousands of lines, each with about 10 columns of data).
I have to create subsets of this data based on certain conditions (for
example, rows whose first column also appears in another dataset). Here is
how I did it:

# import data
dat<- read.table( "test.txt", header=TRUE, fill=TRUE, sep="\t" )
list<- read.table( "list.txt", header=TRUE, fill=TRUE, sep="\t" )
# create sub data
subdat<- dat[dat[1] %in% list[1],]

The third line is meant to create a new data frame containing the rows of
dat whose first-column value also appears in the first column of list. The
code runs fine on small test data. When I tried it with my real data (~80k
lines, ~15 MB), it took forever (a few hours). I don't know why it takes
that long, but I don't think it should. Even with a for loop in C++, I
think I could get this done in a few minutes.
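
A small sketch of what the two forms of the test return may help here. The toy gene table below is an assumption for illustration, and the second data frame is called lst rather than list to avoid masking the base function list(). With single brackets, %in% compares the two one-column data frames as whole list elements and returns a single logical value, not one value per row. The extra work of converting each whole column for that comparison may also contribute to the long run time, though that is a guess rather than something measured here.

```r
# Toy stand-ins for the two tables (assumed data, not the real files).
dat <- data.frame(gene  = c("Xkr4", "Rp1", "Sox17", "Mrpl15"),
                  start = c(3204562, 4280926, 4481008, 4763278),
                  stringsAsFactors = FALSE)
lst <- data.frame(gene = c("Rp1", "Mrpl15", "0610007C21Rik"),
                  stringsAsFactors = FALSE)

whole_cols <- dat[1] %in% lst[1]      # compares one-column data frames
per_row    <- dat[[1]] %in% lst[[1]]  # compares the gene names themselves

length(whole_cols)  # 1
length(per_row)     # 4

subdat <- dat[per_row, ]
subdat$gene         # "Rp1" "Mrpl15"
```

A length-one mask is recycled across all rows, so the single-bracket version keeps either every row or none, which is not the intended subset.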

Does anyone have any ideas, advice, or suggestions?

Thanks so much in advance and Happy New Year to all of you.

D.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

