On Jul 14, 2013, at 10:57 AM, David Winsemius wrote:

>
> On Jul 14, 2013, at 9:48 AM, Houhou Li wrote:
>
>> Hi,
>>
>> I have several really big data files in csv format like this: the first line
>> is the header, the second to fourth lines have info about the file and are
>> the lines I need to skip (data in the 2nd-4th lines do not correspond to the
>> variable names in the header), from the fifth line, real data begins, but
>> the last line is not a data line, it's the string "Done" instead of a normal
>> EOF character. All data is numeric. I tried to use read.table(), read.csv()
>> with colClasses="numeric" and scan(), but couldn't make them work. Can
>> anyone help me? How can I get rid of the last line "Done" automatically? I
>> would like to use an R script to do it automatically, not to do formatting in
>> Excel then read back to R. Thank you very much, here is an example of the
>> data:
>
> Deleting the last line in Excel would not make sense unless this is already
> data in Excel. Better would be to use a text editor. Less likely to corrupt
> the data.
>
>>
>> Tag,X,Y,BlobRegion,swaths,fr_int_20,fr_int_60,i60,RawTothgt,RawHtlc,RawRad20,RawRad40,RawRad60,RawRad80,CCV,BlobPerim,n_pts,n_pts_i255,vts,vts2,vtg,home,sum_ht,sum_ht_sq,dcch,dcch2,nb_ccv,n_nb,nb_sum_hts,nb_sum_hts2,z_tip_dist,nb_MassLen,n_f_rtns20,n_f_rtns60,max_fl_pt_count,loreyrawht,p00ile_cm,p25ile_cm,p50ile_cm,p75ile_cm,iq25,iq50,iq75,mean_intns
>> 01_24_2013.001,SF12
>> 5413
>> 509627.82, 4869704.98, 509999.83, 4869999.98
>> 123,509692.55,4869856.64,18,0,80.53,81.03,84,36.2100,17.1521,4.0359,4.0359,3.8881,2.9217,1737.13,31.42,210,210,0.828,0.955,0.281,28.50,5746.46,163727.12,0.764,1.000,1147.23,33,769.16,19024.42,0.01,0.09,174,163,174,34.90,140,2369,2849,3157,33,81,110,71.59
>> 159,509679.19,4869855.54,18,0,77.62,78.97,75,30.4000,11.2000,2.5319,2.5129,2.3365,1.8315,3248.82,21.42,90,90,0.877,0.936,0.589,22.91,2000.74,46861.45,0.691,0.999,1772.06,14,365.47,10233.32,0.04,0.68,81,66,81,33.29,905,1869,2272,2633,55,82,98,71.62
>
> Read the first line with readLines using n=1 saving as 'colnams'
> Read the dat <- read.table( ... with skip=4, sep=",", and fill = TRUE
> Delete last line holding "Done" and a large number of NA's
> names(dat) <- scan(text=colnams, what=character(0), sep="," )
>
> (Tested. Expected results achieved.)

# Each record must sit on a single line inside the string; the mailer's
# line wrapping has been undone here.
Lines <- "Tag,X,Y,BlobRegion,swaths,fr_int_20,fr_int_60,i60,RawTothgt,RawHtlc,RawRad20,RawRad40,RawRad60,RawRad80,CCV,BlobPerim,n_pts,n_pts_i255,vts,vts2,vtg,home,sum_ht,sum_ht_sq,dcch,dcch2,nb_ccv,n_nb,nb_sum_hts,nb_sum_hts2,z_tip_dist,nb_MassLen,n_f_rtns20,n_f_rtns60,max_fl_pt_count,loreyrawht,p00ile_cm,p25ile_cm,p50ile_cm,p75ile_cm,iq25,iq50,iq75,mean_intns
01_24_2013.001,SF12
5413
509627.82, 4869704.98, 509999.83, 4869999.98
123,509692.55,4869856.64,18,0,80.53,81.03,84,36.2100,17.1521,4.0359,4.0359,3.8881,2.9217,1737.13,31.42,210,210,0.828,0.955,0.281,28.50,5746.46,163727.12,0.764,1.000,1147.23,33,769.16,19024.42,0.01,0.09,174,163,174,34.90,140,2369,2849,3157,33,81,110,71.59
159,509679.19,4869855.54,18,0,77.62,78.97,75,30.4000,11.2000,2.5319,2.5129,2.3365,1.8315,3248.82,21.42,90,90,0.877,0.936,0.589,22.91,2000.74,46861.45,0.691,0.999,1772.06,14,365.47,10233.32,0.04,0.68,81,66,81,33.29,905,1869,2272,2633,55,82,98,71.62
Done"

# Grab the header line only
colnams <- readLines(textConnection(Lines), n=1)
scan(text=colnams, what=character(0), sep="," )  # check scan code
# (output snipped)

# Skip the header plus the 3 info lines; fill=TRUE pads the short "Done" row with NA
dat <- read.table( text=Lines, skip=4, sep=",", fill = TRUE)
# Drop the last row, which holds "Done" and the NA padding
dat <- dat[-NROW(dat), ]
# Attach the column names taken from the header line
names(dat) <- scan(text=colnams, what=character(0), sep="," )
# Read 44 items
# "Done" kept the first column from being read as numeric; convert it back
dat$Tag <- as.numeric(as.character(dat$Tag))
dat

> --
> David
>
>
> David Winsemius
> Alameda, CA, USA
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
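
Since the original question was about several files on disk, here is a possible file-based variant of the same idea, offered only as a sketch. It removes the trailing "Done" line and the three info lines before parsing, so that colClasses="numeric" can be applied to every column. The function name read_done_csv and its arguments n_info and sentinel are made up for this illustration and do not come from the thread. It assumes each file fits in memory as text, and the demo file is a cut-down copy of the sample (first four columns only).

```r
# Hypothetical helper (name and arguments are illustrative, not from the thread).
# Assumed layout: header line, n_info info lines, numeric data, trailing "Done".
read_done_csv <- function(path, n_info = 3, sentinel = "Done") {
  # warn = FALSE silences the "incomplete final line" warning if "Done"
  # is not followed by a newline
  lines <- readLines(path, warn = FALSE)

  # Drop the last line only if it really is the sentinel
  if (length(lines) > 0 && trimws(lines[length(lines)]) == sentinel) {
    lines <- lines[-length(lines)]
  }

  # Keep the header (line 1), drop the info lines right after it
  keep <- lines[setdiff(seq_along(lines), 1 + seq_len(n_info))]

  # With "Done" gone, every column can be forced to numeric
  read.csv(text = keep, header = TRUE, colClasses = "numeric")
}

# Demo on a temporary file holding a shortened copy of the sample data
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Tag,X,Y,BlobRegion",
  "01_24_2013.001,SF12",
  "5413",
  "509627.82, 4869704.98, 509999.83, 4869999.98",
  "123,509692.55,4869856.64,18",
  "159,509679.19,4869855.54,18",
  "Done"
), tmp)

dat <- read_done_csv(tmp)
str(dat)
```

For a directory of such files, something like lapply(list.files(pattern = "\\.csv$"), read_done_csv) should give a list of data frames, provided all files follow the same layout.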