By the way, if you use the H2 database with sqldf then there is a second way to read files into R using sqldf:
# 1. run your perl program outside of R to create myfile.csv, say.
# 2. install java from http://java.sun.com
#    and then install the RH2 package from CRAN
install.packages("RH2")
# 3. load sqldf and RH2
#    sqldf automatically uses the H2 database if RH2 is loaded
library(RH2)
library(sqldf)
# 4. read the file using sqldf, making use of the CSVREAD function in H2
DF <- sqldf("select * from CSVREAD('myfile.csv')")

On Sat, Feb 6, 2010 at 8:37 PM, Gabor Grothendieck <ggrothendi...@gmail.com> wrote:
> file= is the input data file. filter= is just a command string that
> specifies a program to run (not a data file).
>
> 1. If Filename.tmp is the name of a temporary file (that it creates),
> it runs a batch command similar to this:
> paste("cmd /c", filter, "<", file, ">", Filename.tmp)
>
> 2. Then it reads Filename.tmp into the database (which it creates for
> you) and does this without involving R, and
>
> 3. finally it reads the table in the database that was created into R
> as an R data frame, and destroys the database.
>
> On Sat, Feb 6, 2010 at 7:53 PM, Vadlamani, Satish {FLNA}
> <satish.vadlam...@fritolay.com> wrote:
>> Gabor:
>> It did suppress the message now and I was able to load the data. Question:
>>
>> 1. test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat", filter="perl parse_3wkout.pl")
>>
>> In the statement above, should the file name in file= and the file name that
>> the perl script uses through the filter= command be the same? I would think
>> not. I would say that if filter= is passed to the statement, then the
>> file name should be ignored. Is this how it works?
>>
>> Thanks.
>> Satish
>>
>> -----Original Message-----
>> From: Gabor Grothendieck [mailto:ggrothendi...@gmail.com]
>> Sent: Saturday, February 06, 2010 4:58 PM
>> To: Vadlamani, Satish {FLNA}
>> Cc: r-help@r-project.org
>> Subject: Re: [R] Reading large files
>>
>> I have uploaded another version which suppresses display of the error
>> message but otherwise works the same. Omitting the redundant
>> arguments we have:
>>
>> library(sqldf)
>> # next line is only needed once per session to read in the devel version
>> source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")
>>
>> test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat", filter="perl parse_3wkout.pl")
>>
>> On Sat, Feb 6, 2010 at 5:48 PM, Vadlamani, Satish {FLNA}
>> <satish.vadlam...@fritolay.com> wrote:
>>> Gabor:
>>> Please see the results below. Sourcing your new R script worked (although
>>> with the same error message). If I put in the eol="\n" option, it adds a "\r"
>>> to the last column, so I took out the eol option below. This is just some
>>> more feedback for you.
>>>
>>> I am thinking that I will just do an inline edit in Perl (that is, create
>>> the csv file through Perl by overwriting the current file) and then use
>>> read.csv.sql without the filter= option. This seems to be more tried and
>>> tested. If you have any suggestions, please let me know. Thanks.
>>> Satish
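A minimal sketch of that two-step idea, assuming parse_3wkout.pl reads stdin and writes csv records to stdout; the output file name (and writing a new .csv rather than overwriting in place) is made up here for illustration:

library(sqldf)
# step 1: run the Perl conversion first; shell() is used since this thread is on Windows
shell("perl parse_3wkout.pl < 3wkoutstatfcst_small.dat > 3wkoutstatfcst_small.csv")
# step 2: read the already-converted csv, no filter= argument needed
test_df <- read.csv.sql("3wkoutstatfcst_small.csv")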
>>>
>>> BEFORE SOURCING YOUR NEW R SCRIPT
>>>> test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl")
>>> Error in readRegistry(key, maxdepth = 3) :
>>>   Registry key 'SOFTWARE\R-core' not found
>>>> test_df
>>> Error: object 'test_df' not found
>>>
>>> AFTER SOURCING YOUR NEW R SCRIPT
>>>> source("f:/dp_modeling_team/downloads/R/sqldf.R")
>>>> test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl")
>>> Error in readRegistry(key, maxdepth = 3) :
>>>   Registry key 'SOFTWARE\R-core' not found
>>> In addition: Warning messages:
>>> 1: closing unused connection 5 (3wkoutstatfcst_small.dat)
>>> 2: closing unused connection 4 (3wkoutstatfcst_small.dat)
>>> 3: closing unused connection 3 (3wkoutstatfcst_small.dat)
>>>> test_df
>>>   allgeo area1 zone dist ccust1 whse bindc ccust2 account area2 ccust3
>>> 1      A     4    1   37     99 4925  4925     99      99     4     99
>>> 2      A     4    1   37     99 4925  4925     99      99     4     99
>>>
>>> -----Original Message-----
>>> From: Gabor Grothendieck [mailto:ggrothendi...@gmail.com]
>>> Sent: Saturday, February 06, 2010 4:28 PM
>>> To: Vadlamani, Satish {FLNA}
>>> Cc: r-help@r-project.org
>>> Subject: Re: [R] Reading large files
>>>
>>> The software attempts to read the registry and temporarily augment the
>>> path in case you have Rtools installed, so that the filter can access
>>> all the tools that Rtools provides. I am not sure why it's failing on
>>> your system, but there are evidently some differences between systems
>>> here, and I have added some code to trap and bypass that portion in
>>> case it fails. I have added the new version to the svn repository, so
>>> try this:
>>>
>>> library(sqldf)
>>> # overwrite with development version
>>> source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")
>>> # your code to call read.csv.sql
>>>
>>> On Sat, Feb 6, 2010 at 5:18 PM, Vadlamani, Satish {FLNA}
>>> <satish.vadlam...@fritolay.com> wrote:
>>>>
>>>> Gabor:
>>>> Here is the update. As you can see, I got the same error as below in 1.
>>>>
>>>> 1. Error
>>>> test_df <- read.csv.sql(file="out_small.txt", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", eol="\n")
>>>> Error in readRegistry(key, maxdepth = 3) :
>>>>   Registry key 'SOFTWARE\R-core' not found
>>>>
>>>> 2. But the loading of the bigger file was successful, as you can see below:
>>>> 857 MB, 333,250 rows, 227 columns. This is good.
>>>>
>>>> I will have to just do an inline edit in Perl and change the file to csv
>>>> from within R and then call read.csv.sql.
>>>>
>>>> If you have any suggestions to fix 1, I would like to try them.
>>>>
>>>> system.time(test_df <- read.csv.sql(file="out.txt"))
>>>>    user  system elapsed
>>>>  192.53   15.50  213.68
>>>> Warning message:
>>>> closing unused connection 3 (out.txt)
>>>>
>>>> Thanks again.
>>>> Satish
>>>>
>>>> -----Original Message-----
>>>> From: Gabor Grothendieck [mailto:ggrothendi...@gmail.com]
>>>> Sent: Saturday, February 06, 2010 3:02 PM
>>>> To: Vadlamani, Satish {FLNA}
>>>> Cc: r-help@r-project.org
>>>> Subject: Re: [R] Reading large files
>>>>
>>>> Note that you can shorten #1 to read.csv.sql("out.txt") since your
>>>> other arguments are the default values.
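The two forms should behave the same here, since the sql, header, sep and dbname values spelled out in #1 (quoted further down) are read.csv.sql's defaults; a minimal side-by-side sketch:

library(sqldf)
# call with the default values written out, as in #1
test_df <- read.csv.sql(file = "out.txt", sql = "select * from file",
                        header = TRUE, sep = ",", dbname = tempfile())
# equivalent shortened call relying on those defaults
test_df <- read.csv.sql("out.txt")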
>>>> For the second one, use read.csv.sql, eliminate the arguments that are
>>>> defaults anyway (they should not cause a problem, but they are error prone) and
>>>> add an explicit eol= argument, since SQLite can have problems with end
>>>> of line in some cases. Also test out your perl script separately from
>>>> R first to ensure that it works:
>>>>
>>>> test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat", filter="perl parse_3wkout.pl", eol = "\n")
>>>>
>>>> SQLite has some known problems with end of line, so try it with and
>>>> without the eol= argument just in case. When I just made up the
>>>> following gawk example I noticed that I did need to specify the eol=
>>>> argument.
>>>>
>>>> Also, I have added a complete example using gawk as Example 13c on the
>>>> home page just now:
>>>> http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql
>>>>
>>>> On Sat, Feb 6, 2010 at 3:52 PM, Vadlamani, Satish {FLNA}
>>>> <satish.vadlam...@fritolay.com> wrote:
>>>>> Gabor:
>>>>>
>>>>> I had success with the following.
>>>>> 1. I created a csv file called "out.txt" with a perl script. Then I ran the
>>>>> following successfully:
>>>>> library("sqldf")
>>>>> test_df <- read.csv.sql(file="out.txt", sql = "select * from file", header = TRUE, sep = ",", dbname = tempfile())
>>>>>
>>>>> 2. I did not have success with the following. Could you tell me what I
>>>>> may be doing wrong? I could paste the perl script if necessary. From the
>>>>> perl script, I am reading the file, creating the csv record and printing
>>>>> each record one by one and then exiting.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Did not have success with the below:
>>>>> #test_df <- read.csv2.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", dbname = tempfile())
>>>>> test_df
>>>>>
>>>>> Error message below:
>>>>> test_df <- read.csv2.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", dbname = tempfile())
>>>>> Error in readRegistry(key, maxdepth = 3) :
>>>>>   Registry key 'SOFTWARE\R-core' not found
>>>>> In addition: Warning messages:
>>>>> 1: closing unused connection 14 (3wkoutstatfcst_small.dat)
>>>>> 2: closing unused connection 13 (3wkoutstatfcst_small.dat)
>>>>> 3: closing unused connection 11 (3wkoutstatfcst_small.dat)
>>>>> 4: closing unused connection 9 (3wkoutstatfcst_small.dat)
>>>>> 5: closing unused connection 3 (3wkoutstatfcst_small.dat)
>>>>>> test_df <- read.csv2.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", dbname = tempfile())
>>>>> Error in readRegistry(key, maxdepth = 3) :
>>>>>   Registry key 'SOFTWARE\R-core' not found
>>>>>
>>>>> -----Original Message-----
>>>>> From: Gabor Grothendieck [mailto:ggrothendi...@gmail.com]
>>>>> Sent: Saturday, February 06, 2010 12:14 PM
>>>>> To: Vadlamani, Satish {FLNA}
>>>>> Cc: r-help@r-project.org
>>>>> Subject: Re: [R] Reading large files
>>>>>
>>>>> No.
>>>>>
>>>>> On Sat, Feb 6, 2010 at 1:01 PM, Vadlamani, Satish {FLNA}
>>>>> <satish.vadlam...@fritolay.com> wrote:
>>>>>> Gabor:
>>>>>> Can I pass colClasses as a vector to read.csv.sql? Thanks.
>>>>>> Satish
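Given the "No." above, one workaround (not from the thread, purely illustrative) is to coerce the column classes after read.csv.sql has returned the data frame:

library(sqldf)
test_df <- read.csv.sql("out.txt")
# column names are taken from the sample output shown earlier in the thread
test_df$allgeo <- as.character(test_df$allgeo)
test_df$zone   <- as.integer(test_df$zone)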
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Gabor Grothendieck [mailto:ggrothendi...@gmail.com]
>>>>>> Sent: Saturday, February 06, 2010 9:41 AM
>>>>>> To: Vadlamani, Satish {FLNA}
>>>>>> Cc: r-help@r-project.org
>>>>>> Subject: Re: [R] Reading large files
>>>>>>
>>>>>> It's just any Windows batch command string that filters stdin to
>>>>>> stdout. What the command consists of should not be important. An
>>>>>> invocation of perl that runs a perl script that filters stdin to
>>>>>> stdout might look like this:
>>>>>> read.csv.sql("myfile.dat", filter = "perl myprog.pl")
>>>>>>
>>>>>> For an actual example, see the source of read.csv2.sql, which defaults
>>>>>> to using a Windows vbscript program as a filter.
>>>>>>
>>>>>> On Sat, Feb 6, 2010 at 10:16 AM, Vadlamani, Satish {FLNA}
>>>>>> <satish.vadlam...@fritolay.com> wrote:
>>>>>>> Jim, Gabor:
>>>>>>> Thanks so much for the suggestions where I can use read.csv.sql and
>>>>>>> embed Perl (or gawk). I just want to mention that I am running on
>>>>>>> Windows. I am going to read the documentation for the filter argument and
>>>>>>> see if it can take a decent sized Perl script and then use its output
>>>>>>> as input.
>>>>>>>
>>>>>>> Suppose that I write a Perl script that parses this fwf file and
>>>>>>> creates a CSV file. Can I embed this within the read.csv.sql call? Or
>>>>>>> can it only be a statement or something? If you know the answer, please
>>>>>>> let me know. Otherwise, I will try a few things and report back the
>>>>>>> results.
>>>>>>>
>>>>>>> Thanks again.
>>>>>>> Satish
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: jim holtman [mailto:jholt...@gmail.com]
>>>>>>> Sent: Saturday, February 06, 2010 6:16 AM
>>>>>>> To: Gabor Grothendieck
>>>>>>> Cc: Vadlamani, Satish {FLNA}; r-help@r-project.org
>>>>>>> Subject: Re: [R] Reading large files
>>>>>>>
>>>>>>> In perl the 'unpack' command makes it very easy to parse fixed-field data.
>>>>>>>
>>>>>>> On Fri, Feb 5, 2010 at 9:09 PM, Gabor Grothendieck
>>>>>>> <ggrothendi...@gmail.com> wrote:
>>>>>>>> Note that the filter= argument on read.csv.sql can be used to pass the
>>>>>>>> input through a filter written in perl, [g]awk or another language.
>>>>>>>> For example: read.csv.sql(..., filter = "gawk -f myfilter.awk")
>>>>>>>>
>>>>>>>> gawk has the FIELDWIDTHS variable for automatically parsing fixed
>>>>>>>> width fields, e.g.
>>>>>>>> http://www.delorie.com/gnu/docs/gawk/gawk_44.html
>>>>>>>> making this very easy, but perl or whatever you are most used to would
>>>>>>>> be fine too.
>>>>>>>>
>>>>>>>> On Fri, Feb 5, 2010 at 8:50 PM, Vadlamani, Satish {FLNA}
>>>>>>>> <satish.vadlam...@fritolay.com> wrote:
>>>>>>>>> Hi Gabor:
>>>>>>>>> Thanks. My files are all in fixed width format. There are a lot of
>>>>>>>>> them. It would take me some effort to convert them to CSV. I guess
>>>>>>>>> this cannot be avoided? I can write some Perl scripts to convert
>>>>>>>>> fixed width format to CSV format and then start with your suggestion.
>>>>>>>>> Could you let me know your thoughts on the approach?
>>>>>>>>> Satish
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Gabor Grothendieck [mailto:ggrothendi...@gmail.com]
>>>>>>>>> Sent: Friday, February 05, 2010 5:16 PM
>>>>>>>>> To: Vadlamani, Satish {FLNA}
>>>>>>>>> Cc: r-help@r-project.org
>>>>>>>>> Subject: Re: [R] Reading large files
>>>>>>>>>
>>>>>>>>> If your problem is just how long it takes to load the file into R, try
>>>>>>>>> read.csv.sql in the sqldf package.
>>>>>>>>> A single read.csv.sql call can
>>>>>>>>> create an SQLite database and table layout for you, read the file into
>>>>>>>>> the database (without going through R, so R can't slow this down),
>>>>>>>>> extract all or a portion into R based on the sql argument you give it,
>>>>>>>>> and then remove the database. See the examples on the home page:
>>>>>>>>> http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql
>>>>>>>>>
>>>>>>>>> On Fri, Feb 5, 2010 at 2:11 PM, Satish Vadlamani
>>>>>>>>> <satish.vadlam...@fritolay.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Matthew:
>>>>>>>>>> If it is going to help, here is the explanation. I have an end state in
>>>>>>>>>> mind. It is given below under the "End State" header. In order to get
>>>>>>>>>> there, I need to start somewhere, right? I started with an 850 MB file
>>>>>>>>>> and could not load it in what I think is a reasonable time (I waited
>>>>>>>>>> for an hour).
>>>>>>>>>>
>>>>>>>>>> There are references to 64 bit. How will that help? It is a 4GB RAM
>>>>>>>>>> machine and there is no paging activity when loading the 850 MB file.
>>>>>>>>>>
>>>>>>>>>> I have seen other threads on the same types of questions. I did not
>>>>>>>>>> see any clear cut answers or errors that I could have been making in
>>>>>>>>>> the process. If I am missing something, please let me know. Thanks.
>>>>>>>>>> Satish
>>>>>>>>>>
>>>>>>>>>> End State
>>>>>>>>>>> Satish wrote: "at one time I will need to load say 15GB into R"
>>>>>>>>>>
>>>>>>>>>> -----
>>>>>>>>>> Satish Vadlamani
>>>>>>>>>> --
>>>>>>>>>> View this message in context:
>>>>>>>>>> http://n4.nabble.com/Reading-large-files-tp1469691p1470667.html
>>>>>>>>>> Sent from the R help mailing list archive at Nabble.com.
>>>>>>>
>>>>>>> --
>>>>>>> Jim Holtman
>>>>>>> Cincinnati, OH
>>>>>>> +1 513 646 9390
>>>>>>>
>>>>>>> What is the problem that you are trying to solve?

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.