On Mon, 25 Aug 2008, Roland Rau wrote:
Hi,
Jason Thibodeau wrote:
I am attempting to perform some simple data manipulation on a large data
set. I have a snippet of the whole data set, and my small snippet is 2GB
in CSV.
Is there a way I can read my CSV, select a few columns, and write it to an
output file in real time? This is what I do right now to a small test file:
data <- read.csv('data.csv', header = FALSE)
data_filter <- data[c(1,3,4)]
write.table(data_filter, file = "filter_data.csv", sep = ",",
            row.names = FALSE, col.names = FALSE)
In this case, I think R is not the best tool for the job. I would rather
suggest using an implementation of the awk language (e.g. gawk).
I just tried the following on WinXP with a zipped file (87MB zipped, 1.2GB
unzipped) piped into gawk:
unzip -p myzipfile.zip | gawk '{print $1, $3, $4}' > myfiltereddata.txt
Or
unzip -p myzipfile.zip | cut -d, -f1,3,4 > myfiltereddata.txt
But beware that both this and Roland's solution (with a comma as the field
separator) split on every comma, quoted or not, and will return
a,c",d
for an input line consisting of
a,"b,c",d,e,f
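The caveat can be reproduced on that one sample line. The following is only a sketch, assuming a POSIX shell with cut and awk available and no line breaks inside quoted fields; the function names naive, quote_aware and split_csv are made up for illustration and do not come from this thread. The awk part walks each line character by character and treats a comma as a separator only when it is outside double quotes.

```shell
#!/bin/sh
# Naive selection: cut treats every comma as a delimiter, quoted or not.
naive() {
  cut -d, -f1,3,4
}

# Quote-aware sketch in POSIX awk (hypothetical helper, not from the thread).
# Assumption: quoted fields never contain a line break.
quote_aware() {
  awk '
  # Split one CSV line into the global array fld and return the field count.
  # Quote characters are kept in the field text, so the output stays valid CSV.
  function split_csv(line,    n, i, c, inq, field) {
    split("", fld)                       # clear fields left from the previous line
    n = 0; inq = 0; field = ""
    for (i = 1; i <= length(line); i++) {
      c = substr(line, i, 1)
      if (c == "\"") {                   # a quote toggles the inside-quotes state
        inq = !inq
        field = field c
      } else if (c == "," && !inq) {     # only unquoted commas end a field
        fld[++n] = field
        field = ""
      } else {
        field = field c
      }
    }
    fld[++n] = field
    return n
  }
  { split_csv($0); print fld[1] "," fld[3] "," fld[4] }
  '
}

printf 'a,"b,c",d,e,f\n' | naive         # splits inside the quoted field
printf 'a,"b,c",d,e,f\n' | quote_aware   # keeps "b,c" together as column 2
```

For data with embedded line breaks or other CSV dialects, a real CSV parser is the safer choice.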
HTH,
Chuck
The gawk version took about 90 seconds.
Please note that you might need to specify your delimiters, i.e. the field
separator (FS) and the output field separator (OFS). Set them in a BEGIN
block so that they already apply to the first line:
gawk 'BEGIN {FS=","; OFS=","} {print $1, $3, $4}' data.csv > filter_data.csv
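Before running this on the full file, it may be worth checking the separator settings on a couple of made-up lines. A minimal sketch, assuming a POSIX shell and any POSIX awk (gawk works the same way here); select_cols is a hypothetical name and the sample rows are invented. FS and OFS are assigned in BEGIN because awk splits a record before it runs the main action, so an assignment made there would only take effect from the second line on.

```shell
#!/bin/sh
# Keep columns 1, 3 and 4 of a comma-separated stream read from stdin.
# FS and OFS are set in BEGIN so they apply from the first record on.
select_cols() {
  awk 'BEGIN { FS = ","; OFS = "," } { print $1, $3, $4 }'
}

# Tiny sample input, made up for illustration.
printf '1,2,3,4,5\n6,7,8,9,10\n' | select_cols
```

Because awk processes its input line by line, the same function can sit at the end of a pipe such as unzip -p without the whole file being held in memory.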
I hope this helps (despite not encouraging the usage of R),
Roland
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:[EMAIL PROTECTED] UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901