On Mon, 25 Aug 2008, Roland Rau wrote:

Hi,

Jason Thibodeau wrote:
 I am attempting to perform some simple data manipulation on a large data
 set. I have a snippet of the whole data set, and my small snippet is 2GB
 in CSV.

 Is there a way I can read my csv, select a few columns, and write it to an
 output file in real time? This is what I do right now to a small test
 file:

 data <- read.csv('data.csv', header = FALSE)

 data_filter <- data[c(1,3,4)]

 write.table(data_filter, file = "filter_data.csv", sep = ",", row.names =
 FALSE, col.names = FALSE)

In this case, I think R is not the best tool for the job. I would rather suggest using an implementation of the awk language (e.g. gawk). I just tried the following on WinXP, with a zipped file (87MB zipped, 1.2GB unzipped) piped into gawk:
unzip -p myzipfile.zip | gawk '{print $1, $3, $4}' > myfiltereddata.txt

Or

unzip -p myzipfile.zip | cut -d, -f1,3,4 > myfiltereddata.txt

But beware that both this and Roland's solution split on every comma, including commas inside quoted fields, so they will return

        a,c",d

for an input line consisting of

        a,"b,c",d,e,f
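
One possible workaround for the quoted-comma case is sketched below. It is only an illustration: the function name csv_select is made up for this example, it uses plain awk to walk each line character by character, and it assumes that fields contain no embedded newlines and no escaped ("") quotes.

```shell
# Hypothetical helper: print CSV columns 1, 3 and 4, treating commas inside
# double quotes as part of the field rather than as separators.
# Assumption: no embedded newlines and no doubled ("") quotes in the data.
csv_select() {
  awk '
    {
      split("", f)                 # clear fields left over from the previous line
      n = 0; field = ""; inq = 0
      len = length($0)
      for (i = 1; i <= len; i++) {
        ch = substr($0, i, 1)
        if (ch == "\"") {          # toggle the in-quotes state, keep the quote
          inq = !inq
          field = field ch
        } else if (ch == "," && !inq) {
          f[++n] = field           # an unquoted comma ends the current field
          field = ""
        } else {
          field = field ch
        }
      }
      f[++n] = field               # last field on the line
      print f[1] "," f[3] "," f[4]
    }'
}

printf '%s\n' 'a,"b,c",d,e,f' | csv_select
```

For the input line above this prints a,d,e. Scanning character by character is slower than letting awk or cut split the line, so it is probably only worth it when the data really does contain quoted commas.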

HTH,

Chuck

and it took about 90 seconds.

Please note that you might need to specify your delimiters, i.e. the field separator (FS) and the output field separator (OFS). Set them in a BEGIN block so that they already apply to the first line:
gawk 'BEGIN {FS=","; OFS=","} {print $1, $3, $4}' data.csv > filter_data.csv
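
A quick way to check the separator settings is to run them on a two-line sample. The sketch below is only an illustration: the function name select_cols and the sample rows are made up, and it calls plain awk on the assumption that gawk behaves the same way here. BEGIN runs before the first line is read, whereas an FS assignment in an ordinary action only takes effect from the following line.

```shell
# Hypothetical wrapper: keep columns 1, 3 and 4 of comma-separated input.
# FS and OFS are set in BEGIN so that the first record is split on commas too.
select_cols() {
  awk 'BEGIN {FS=","; OFS=","} {print $1, $3, $4}'
}

# Made-up sample rows standing in for data.csv.
printf '%s\n' '1,2,3,4,5' '6,7,8,9,10' | select_cols
```

This prints 1,3,4 followed by 6,8,9, with the first line split on commas just like the second.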

I hope this helps (despite not encouraging the usage of R),
Roland

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:[EMAIL PROTECTED]                  UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
