Actually, there are several ways to subset the first column:
[1]
[[1]]
[,1]
$V1
Please let me know which one is the fastest (and the most commonly
used). Thanks.
D.
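
For readers comparing the four forms listed above, here is a minimal sketch of what each one returns. The toy data frame and its column names are made up for illustration; V1 simply mimics the name read.table assigns to a first column when no header is read.

```r
# Toy data frame (hypothetical values, not the poster's real data)
dat <- data.frame(V1 = c("Xkr4", "Rp1", "Sox17"),
                  V2 = c("chr1", "chr1", "chr1"),
                  stringsAsFactors = FALSE)

a  <- dat[1]     # a one-column data frame (still a list)
b  <- dat[[1]]   # the column itself, as a plain vector
c1 <- dat[, 1]   # the same vector (drop = TRUE is the default)
d  <- dat$V1     # the same vector, selected by name

class(a)          # "data.frame"
class(b)          # "character"
identical(b, c1)  # TRUE
identical(b, d)   # TRUE
```

So [[1]], [,1] and $V1 all hand %in% a vector, while [1] hands it a data frame, which is the form to avoid here.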
On 12/30/10 11:28 AM, Duke wrote:
Hi Jim,
Is it really a problem to use [1] instead of [[1]]? Will it make the
code run slower? Also, if I use dat$V1 %in% list$V1, will that be fine?
Anyway, my data and list are basically gene lists (tab delimited):
$ head test.txt
Xkr4 chr1 - 3204562 3661579 3206102 3661429 3 3204562,3411782,3660632, 3207049,3411982,3661579,
Rp1 chr1 - 4280926 4399322 4283061 4399268 4 4280926,4341990,4342282,4399250, 4283093,4342162,4342918,4399322,
Rp1_2 chr1 - 4333587 4350395 4334680 4342906 4 4333587,4341990,4342282,4350280, 4340172,4342162,4342918,4350395,
Sox17 chr1 - 4481008 4486494 4481796 4483487 5 4481008,4483180,4483852,4485216,4486371, 4482749,4483547,4483944,4486023,4486494,
Mrpl15 chr1 - 4763278 4775807 4764532 4775758 5 4763278,4767605,4772648,4774031,4775653, 4764597,4767729,4772814,4774186,4775807,
Mrpl15_2 chr1 - 4763278 4775807 4775807 4775807 4 4763278,4767605,4772648,4775653, 4764597,4767729,4772814,4775807,
$ head list.txt
GeneNames Chr Start End
0610007C21Rik chr5 31351012 31356996
0610007L01Rik chr5 130695613 130719635
0610007L01Rik_2 chr5 130698204 130719635
0610007P08Rik chr13 63916627 64001609
0610007P08Rik_2 chr13 63916641 63970963
0610007P14Rik chr12 87156404 87165495
Thanks,
D.
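
As a rough sketch of how the two gene tables shown above could be read and matched: the inline strings below are hypothetical, abbreviated stand-ins for test.txt and list.txt (only a few columns, and gene names chosen so that some overlap). It assumes test.txt really has no header row, as the head output suggests, in which case header=FALSE is needed to get V1, V2, ... as column names. The second table is called genes here only to avoid reusing the name of the base function list.

```r
# Hypothetical, abbreviated stand-ins for the two tab-delimited files
test_txt <- paste0("Xkr4\tchr1\t-\t3204562\t3661579\n",
                   "Rp1\tchr1\t-\t4280926\t4399322\n",
                   "Sox17\tchr1\t-\t4481008\t4486494\n")
list_txt <- paste0("GeneNames\tChr\tStart\tEnd\n",
                   "Xkr4\tchr1\t3204562\t3661579\n",
                   "Sox17\tchr1\t4481008\t4486494\n")

# No header row in the first table, so columns are named V1, V2, ...
dat <- read.table(text = test_txt, header = FALSE, sep = "\t",
                  stringsAsFactors = FALSE)
# The second table starts with a header row
genes <- read.table(text = list_txt, header = TRUE, sep = "\t",
                    stringsAsFactors = FALSE)

# One logical per row of dat: is this gene name present in genes?
keep <- dat$V1 %in% genes$GeneNames
subdat <- dat[keep, ]
subdat$V1   # "Xkr4" "Sox17"
```

With real files, text = ... would be replaced by the file name; the rest stays the same.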
On 12/30/10 11:13 AM, jim holtman wrote:
You should be using dat[[1]]. Here is an example with 80000 rows that
takes about 0.02 seconds to get the subset.
Please also provide the 'str' output showing what your data looks like.
n <- 80000 # rows to create
dat <- data.frame(sample(1:200, n, TRUE), runif(n), runif(n),
                  runif(n), runif(n))
lst <- data.frame(sample(1:100, n, TRUE), runif(n), runif(n),
                  runif(n), runif(n))
str(dat)
'data.frame': 80000 obs. of 5 variables:
$ sample.1.200..n..TRUE.: int 39 116 69 163 51 125 144 32 28 4 ...
$ runif.n. : num 0.519 0.793 0.549 0.77 0.272 ...
$ runif.n..1 : num 0.691 0.89 0.783 0.467 0.357 ...
$ runif.n..2 : num 0.705 0.254 0.584 0.998 0.279 ...
$ runif.n..3 : num 0.873 1 0.678 0.702 0.455 ...
str(lst)
'data.frame': 80000 obs. of 5 variables:
$ sample.1.100..n..TRUE.: int 38 83 38 70 77 44 81 55 32 1 ...
$ runif.n. : num 0.0621 0.7374 0.074 0.4281 0.0516 ...
$ runif.n..1 : num 0.879 0.294 0.146 0.884 0.58 ...
$ runif.n..2 : num 0.648 0.745 0.825 0.507 0.799 ...
$ runif.n..3 : num 0.2523 0.1679 0.9728 0.0478 0.0967 ...
system.time({
    dat.sub <- dat[dat[[1]] %in% lst[[1]], ]
})
user system elapsed
0.02 0.00 0.01
str(dat.sub)
'data.frame': 39803 obs. of 5 variables:
$ sample.1.200..n..TRUE.: int 39 69 51 32 28 4 69 3 48 69 ...
$ runif.n. : num 0.5188 0.5494 0.2718 0.5566 0.0893 ...
$ runif.n..1 : num 0.691 0.783 0.357 0.619 0.717 ...
$ runif.n..2 : num 0.705 0.584 0.279 0.789 0.192 ...
$ runif.n..3 : num 0.873 0.678 0.455 0.843 0.383 ...
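
On the question of which vector form is fastest, a sketch along the lines of the example above can time [[1]], [,1] and $ side by side. The seed, the shorter column names (id, x, y) and the reduced column count are assumptions made for readability. Timings depend on the machine, so no numbers are claimed here; on data of this size all three are expected to be in the same small range.

```r
set.seed(1)   # assumed seed, only so repeated runs use the same data
n <- 80000    # rows to create, as in the example above
dat <- data.frame(id = sample(1:200, n, TRUE), x = runif(n))
lst <- data.frame(id = sample(1:100, n, TRUE), y = runif(n))

# Each call subsets dat by membership of its first column in lst's first column
t_dbl    <- system.time(r1 <- dat[dat[[1]] %in% lst[[1]], ])
t_comma  <- system.time(r2 <- dat[dat[, 1] %in% lst[, 1], ])
t_dollar <- system.time(r3 <- dat[dat$id %in% lst$id, ])

# All three forms select exactly the same rows
identical(r1, r2) && identical(r2, r3)

# Elapsed seconds for each form (machine dependent)
c(double_bracket = t_dbl[["elapsed"]],
  comma          = t_comma[["elapsed"]],
  dollar         = t_dollar[["elapsed"]])
```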
On Thu, Dec 30, 2010 at 10:23 AM, Duke<duke.li...@gmx.com> wrote:
Hi all,
First, I don't have much experience with R, so please be gentle. I am
dealing with a dataset (tens of thousands of lines, each with ~10
columns of data). I have to create subsets of this data based on
certain conditions (for example, rows whose first column matches the
first column of another dataset). Here is how I did it:
# import data
dat<- read.table( "test.txt", header=TRUE, fill=TRUE, sep="\t" )
list<- read.table( "list.txt", header=TRUE, fill=TRUE, sep="\t" )
# create sub data
subdat<- dat[dat[1] %in% list[1],]
The last line is meant to create a new data frame with the rows whose
first column appears in both dat and list. The code runs just fine
with small test data. When I tried it with my real data (~80k lines,
~15 MB), it took forever (a few hours). I don't know why it takes that
long, but I think it shouldn't. Even with a for loop in C++, I think I
could get this done in, say, a few minutes.
Does anyone have any ideas, advice, or suggestions?
Thanks so much in advance and Happy New Year to all of you.
D.
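
To illustrate the difference between the subsetting line in the question and the vector form, here is a small sketch on made-up data (the frames and the column names gene, x, y are hypothetical). As far as the documentation of match() describes it, lists are converted to character before matching, so dat[1] %in% lst[1] compares each whole column as a single element rather than comparing values row by row.

```r
dat <- data.frame(gene = c("a", "b", "c", "d"), x = 1:4,
                  stringsAsFactors = FALSE)
lst <- data.frame(gene = c("b", "d"), y = 1:2,
                  stringsAsFactors = FALSE)

by_frame  <- dat[1] %in% lst[1]       # data frame vs data frame
by_vector <- dat[[1]] %in% lst[[1]]   # vector vs vector

length(by_frame)    # 1: a single logical, recycled over all rows when indexing
length(by_vector)   # 4: one logical per row of dat
dat[by_vector, ]    # the rows for genes "b" and "d"
```

Because the single logical is recycled, indexing with by_frame returns either every row or none, which is not the intended row-wise match.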
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.