Re: [R] Speeding reading of large file

Fisher Dennis Wed, 28 Nov 2012 12:08:33 -0800

An interesting approach -- I lose the column names (which I need) but I could 
get them with something cute such as:
        1.  read the first few lines only with readLines(FILENAME, n=10)
        2.  use your approach to read.table -- this will grab the column names
        3.  replace the headers in the full version with the correct column 
names


Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com

On Nov 28, 2012, at 11:32 AM, David L Carlson wrote:

> Using your first approach, this should be faster 
> 
> raw <- readLines(con=filename)
> dta <- read.table(text=raw[!grepl("[A:DF:Z]" ,raw)], header=FALSE)
> 
> ----------------------------------------------
> David L Carlson
> Associate Professor of Anthropology
> Texas A&M University
> College Station, TX 77843-4352
> 
>> -----Original Message-----
>> From: r-help-boun...@r-project.org [mailto:r-help-bounces@r-
>> project.org] On Behalf Of Fisher Dennis
>> Sent: Wednesday, November 28, 2012 11:43 AM
>> To: r-help@r-project.org
>> Subject: [R] Speeding reading of large file
>> 
>> R 2.15.1
>> OS X and Windows
>> 
>> Colleagues,
>> 
>> I have a file that looks that this:
>> TABLE NO.  1
>> PTID        TIME        AMT         FORM        PERIOD      IPRED
>> CWRES       EVID        CP          PRED        RES         WRES
>>  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> 0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
>> 0.0000E+00
>>  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> 3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00  0.0000E+00
>> 0.0000E+00
>>  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00  0.0000E+00
>> 0.0000E+00
>>  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00  0.0000E+00
>> 0.0000E+00
>>  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01  0.0000E+00
>> 0.0000E+00
>>  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01  0.0000E+00
>> 0.0000E+00
>>  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01  0.0000E+00
>> 0.0000E+00
>>  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01  0.0000E+00
>> 0.0000E+00
>> 
>> The file is reasonably large (> 10^6 lines) and the two line header is
>> repeated periodically in the file.
>> I need to read this file in as a data frame.  Note that the number of
>> columns, the column headers, and the number of replicates of the
>> headers are not known in advance.
>> 
>> I have tried two approaches to this:
>>      First Approach:
>>              1.  readLines(FILENAME) to read in the file
>>              2.  use grep to find the repeat headers; strip out the
>> repeat headers
>>              3.  write() the object to tempfile, read in that temporary
>> file using read.table(tempfile, header=TRUE, skip=1) [an alternative is
>> to use textConnection but that does not appear to speed things]
>> 
>>      Second Approach:
>>              1.  TEMP        <- read.table(FILENAME, header=TRUE, skip=1,
>> fill=TRUE, as.is=TRUE)
>>              2.  get rid of the errant entries with:
>>                      TEMP[!is.na(as.numeric(TEMP[,1])),]
>>              3.  reading of the character entries forced all columns to
>> character mode.  Therefore, I convert each column to numeric:
>>                      for (COL in 1:ncol(TEMP)) TEMP[,COL] <-
>> as.numeric(TEMP[,COL])
>> The second approach is ~ 20% faster than the first.  With the second
>> approach, the conversion to numeric occupies 50% of the elapsed time.
>> 
>> Is there some approach that would be much faster?  For example, would a
>> vectorized approach to conversion to numeric improve throughput?  Or,
>> is there some means to ensure that all data are read as numeric (I
>> tried to use colClasses but that triggered an error when the text
>> string was encountered).
>> 
>> ############################
>> A dput version of the data is:
>> c("TABLE NO.  1", " PTID        TIME        AMT         FORM
>> PERIOD      IPRED       CWRES       EVID        CP          PRED
>> RES         WRES",
>> "  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> 0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> 3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01  0.0000E+00
>> 0.0000E+00"
>> )
>> 
>> This can be assembled into a large dataset and written to a file named
>> FILENAME with the following code:
>> cat(c("TABLE NO.  1", " PTID        TIME        AMT         FORM
>> PERIOD      IPRED       CWRES       EVID        CP          PRED
>> RES         WRES",
>> "  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> 0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> 3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01  0.0000E+00
>> 0.0000E+00"
>> )[rep(1:10, 1000)], file="FILENAME", sep="\n")
>> 
>> 
>> Dennis
>> 
>> 
>> Dennis Fisher MD
>> P < (The "P Less Than" Company)
>> Phone: 1-866-PLessThan (1-866-753-7784)
>> Fax: 1-866-PLessThan (1-866-753-7784)
>> www.PLessThan.com
>> 
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-
>> guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Speeding reading of large file

Reply via email to