On 11/2/2009 2:03 PM, Gene Leynes wrote:
I've been trying to figure out how to read in a large file for a few days
now, and after extensive research I'm still not sure what to do.
I have a large comma delimited text file that contains 59 fields in each
record.
There is also a header every 121 records.
You can open the connection before reading, then read in blocks of lines
and process those. You don't need to reopen it every time. For example,
ff <- file(fname, open="rt") # rt is read text
for (block in 1:nblocks) {
  x <- readLines(ff, n=121)
  # process this block
}
close(ff)
Duncan Murdoch
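A slightly fuller sketch of the loop above, which stops at end of file instead of needing nblocks in advance. It assumes each block is one header line followed by 121 comma-separated numeric records (122 lines in all); read_blocks, block_size and process are names made up for this illustration, not anything from the original code.

```r
# Hypothetical helper: read a file block by block from ONE open connection.
# Assumption: each block is 1 header line followed by 121 data records.
read_blocks <- function(fname, process, block_size = 122) {
  ff <- file(fname, open = "rt")
  on.exit(close(ff))                   # close the connection even on error
  results <- list()
  repeat {
    x <- readLines(ff, n = block_size) # continues where the last call stopped
    if (length(x) == 0) break          # end of file reached
    recs <- x[-1]                      # drop this block's header line
    recs <- recs[recs != ""]           # drop blank lines
    if (length(recs) == 0) next
    block <- read.csv(text = recs, header = FALSE)
    results[[length(results) + 1]] <- process(block)
  }
  results
}
```

For example, read_blocks(fname, function(block) block[[26]]) would return the 26th column of every block as a list. Because the connection stays open, each readLines call costs the same no matter how far into the file it is.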
This function works well for a smallish number of records:
getcsv <- function(fname) {
  ff <- file(description = fname)
  x <- readLines(ff)
  close(ff)                   # close only this connection, not every open one
  x <- x[x != ""]             # REMOVE BLANKS
  x <- x[grep("^[-0-9]", x)]  # REMOVE ALL TEXT (keep lines starting with a number)
  spl <- strsplit(x, ",")     # THIS PART IS SLOW, BUT MANAGEABLE
  # convert each record to numeric, dropping fields that are not numbers
  xx <- t(sapply(spl, function(fields) as.vector(na.omit(as.numeric(fields)))))
  return(xx)
}
It's not elegant, but it works.
For 121,000 records it completes in 2.3 seconds.
For 121,000*5 records it completes in 63 seconds.
For 121,000*10 records it doesn't complete.
When I try other methods to read the file in chunks (using scan), the
process breaks down because I have to start at the beginning of the file on
every iteration.
For example:
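fnn <- function(n, col) {
  a <- 122 * (n - 1) + 2  # first data line of block n (1 header + 121 records per block)
  xx <- scan(fname, skip = a - 1, nlines = 121, sep = ",", quiet = TRUE, what = character(0))
  xx <- xx[xx != ""]
  xx <- matrix(xx, ncol = 49, byrow = TRUE)
  xx[, col]
}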
system.time(sapply(1:10, fnn, col=26))    # 0.31 seconds
system.time(sapply(91:100, fnn, col=26))  # 1.09 seconds
system.time(sapply(901:910, fnn, col=26)) # 5.78 seconds
Even though I'm only getting the 26th column for 10 sets of records, it
takes a lot longer the further into the file I go.
How can I tell scan to pick up where it left off, without it starting at the
beginning? There must be a good example somewhere.
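One possibility, sketched under the same layout that fnn above assumes (one header line plus 121 records per block): if scan is given an already open connection rather than a file name, it reads from the connection's current position, so skip only needs to step over the current block's header. The name read_col_blocks and its nfields argument are hypothetical, and this has not been tried on the real file.

```r
# Hypothetical sketch: extract one column from consecutive blocks by calling
# scan() repeatedly on an open connection.
# Assumptions: 1 header line + nrec records per block; nfields non-empty
# fields per record (49 is taken from fnn above).
read_col_blocks <- function(fname, nblocks, col, nrec = 121, nfields = 49) {
  ff <- file(fname, open = "rt")
  on.exit(close(ff))                 # close the connection even on error
  out <- vector("list", nblocks)
  for (i in seq_len(nblocks)) {
    # skip = 1 is relative to the current position: it drops only the
    # header of the block we are about to read
    xx <- scan(ff, skip = 1, nlines = nrec, sep = ",", quiet = TRUE,
               what = character(0))
    xx <- xx[xx != ""]
    if (length(xx) == 0) break       # ran out of data
    xx <- matrix(xx, ncol = nfields, byrow = TRUE)
    out[[i]] <- xx[, col]
  }
  out
}
```

A call such as read_col_blocks(fname, nblocks = 10, col = 26) reads the first ten blocks in one pass. To start at block 901, the earlier lines still have to be read once (for instance with readLines(ff, n = 122 * 900) before the loop), but they are not re-read for every block.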
I have done a lot of research (in fact, thank you to Michael J. Crawley and
others for your help thus far).
Thanks,
Gene
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.