Thanks, Henrik. It seems to fit my needs. One question about the argument length.out=0.10*n: does it take out 10% of the rows "randomly"? I found that it basically takes every 10th row, through to the end of the file, if I set length.out=0.1*n, and every 100th row if I set length.out=0.01*n. I couldn't find this information in the documentation.
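
The every-10th-row pattern is what base R's seq() would be expected to give: with length.out it returns evenly spaced values between from and to, not a random draw. A minimal sketch of the difference follows. The row count n is made up for illustration, and the rounding to whole row numbers and the use of sample.int() for a random subset are assumptions added here, not something suggested in the thread.

```r
# Hypothetical row count, chosen only for illustration.
n <- 1000L
k <- round(0.10 * n)   # number of rows wanted (10%)

# seq() with length.out gives evenly spaced (possibly fractional) values.
# Rounding them to whole row numbers is an assumption made for this sketch.
rows_even <- round(seq(from = 1, to = n, length.out = k))
head(rows_even)        # roughly every 10th row, always the same rows

# A random 10% instead: draw k distinct row numbers, then sort them
# so they are in file order.
set.seed(1)            # only to make the illustration reproducible
rows_random <- sort(sample.int(n, size = k))
head(rows_random)

# Untested assumption: either vector could then be supplied as 'rows', e.g.
# data <- readDataFrame(db, rows = rows_random)
```

The evenly spaced version is deterministic, so repeated runs read the same rows. The sampled version changes from run to run unless a seed is set.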

Stephen HK Wong
Stanford, California 94305-5324

----- Original Message -----
From: Henrik Bengtsson <h...@biostat.ucsf.edu>
To: Stephen HK Wong <hon...@stanford.edu>
Cc: r-help@r-project.org
Sent: Thu, 18 Sep 2014 18:33:15 -0700 (PDT)
Subject: Re: [R] read.table() 1Gb text dataframe

As a start, make sure you specify the 'colClasses' argument. BTW, using that you can even go to the extreme and read one column at a time, if it comes down to that.

To read a 10% subset of the rows, you can use R.filesets as:

  library(R.filesets)
  db <- TabularTextFile(pathname)
  n <- nbrOfRows(db)
  data <- readDataFrame(db, rows=seq(from=1, to=n, length.out=0.10*n))

It is also useful to specify 'colClasses' here. In addition to specifying them ordered by column, as for read.table(), you can also specify them by column names (or regular expressions of the column names), e.g.

  data <- readDataFrame(db,
            colClasses=c("*"="NULL", "(x|y)"="integer",
                         outcome="numeric", "id"="character"),
            rows=seq(from=1, to=n, length.out=0.10*n))

That 'colClasses' specifies that the default is to drop all columns, to read columns 'x' and 'y' as integers, and so on.

BTW, if you know 'n' upfront you can skip the setup of TabularTextFile and just do:

  data <- readDataFrame(pathname, rows=seq(from=1, to=n, length.out=0.10*n))

Hope this helps

Henrik

On Thu, Sep 18, 2014 at 4:48 PM, Stephen HK Wong <hon...@stanford.edu> wrote:
> Dear All,
>
> I have a tab-delimited table of 4 columns and many millions of rows.
> I don't have enough memory to read.table in that 1 Gb file.
> And actually I have 12 text files like that. Is there a way that I can just
> randomly read.table() in 10% of rows ? I was able to do that using colbycol
> package, but it is not available. Many thanks!!
>
> Stephen HK Wong
> Stanford, California 94305-5324
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.