Hi all,

First, I don't have much experience with R, so please be gentle. I am working with a dataset of tens of thousands of lines, each with about 10 columns of data. I need to create subsets of this data based on certain conditions (for example, rows whose first column also appears in the first column of another dataset). Here is how I did it:

# import data
dat <- read.table( "test.txt", header=TRUE, fill=TRUE, sep="\t" )
list <- read.table( "list.txt", header=TRUE, fill=TRUE, sep="\t" )
# create sub data
subdat <- dat[dat[1] %in% list[1],]

The last line (the subdat assignment) is meant to create a new data frame containing the rows of dat whose first-column value also appears in the first column of list. The code seems fine, as it runs without trouble on small test data. With my real data (~80k lines, ~15 MB), however, it takes a few hours. I don't know why it takes that long, but I don't think it should. Even with a plain for loop in C++, I would expect this to finish in a few minutes.
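
To make the intent concrete, here is a minimal sketch of the comparison I am after, on toy data. The data frames and the names keys and keep are made up for illustration and stand in for my real files; I assume the first column of each table holds the matching key. The sketch uses [[1]] to pull the column out as a plain vector, whereas [1] returns a one-column data frame. Whether that difference explains the slowness on the real data is only a guess; I have not measured it.

```r
# Toy stand-ins for test.txt and list.txt (hypothetical data, not the real files)
dat <- data.frame(id = c("a", "b", "c", "d"),
                  value = 1:4,
                  stringsAsFactors = FALSE)
keys <- data.frame(id = c("b", "d", "x"),
                   stringsAsFactors = FALSE)

# dat[1] is a one-column data frame; dat[[1]] is the column itself as a vector.
# Applied to the two vectors, %in% returns one TRUE/FALSE per row of dat.
keep <- dat[[1]] %in% keys[[1]]

# Keep only the rows whose first-column value appears in keys
subdat <- dat[keep, ]
print(subdat)
```

The second table is called keys here rather than list only to avoid reusing the name of the built-in list() function.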

Does anyone have any ideas, advice, or suggestions?

Thanks so much in advance and Happy New Year to all of you.

D.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
