On 11 February 2011 19:39, Ben Bolker <bbol...@gmail.com> wrote:
> [snip]
>
> What is dangerous/confusing is that R silently **wraps** longer lines if
> fill=TRUE (which is the default for read.csv). I encountered this when
> working with a colleague on a long, messy CSV file that had some phantom
> extra fields in some rows, which then turned into empty lines in the
> data frame.
>

As a matter of fact, this is exactly what happened to a colleague of mine
yesterday, and it caused her quite a bit of trouble. On the other hand, it
could also be considered a 'bug' in the csv file. Although no formal
specification exists for the csv format, RFC 4180 [1] indicates that 'each
line should contain the same number of fields throughout the file'.

[1] http://tools.ietf.org/html/rfc4180

Best wishes,

Laurent

> Here is an example and a workaround that runs count.fields on the
> whole file to find the maximum column length and set col.names
> accordingly. (It assumes you don't already have a file named "test.csv"
> in your working directory ...)
>
> I haven't dug in to try to write a patch for this -- I wanted to test
> the waters and see what people thought first, and I realize that
> read.table() is a very complicated piece of code that embodies a lot of
> tradeoffs, so there could be lots of different approaches to trying to
> mitigate this problem. I appreciate very much how hard it is to write a
> robust and general function to read data files, but I also think it's
> really important to minimize the number of traps in read.table(), which
> will often be the first part of R that new users encounter ...
>
> A quick fix for this might be to allow the number of lines analyzed
> for length to be settable by the user, or to allow a settable 'maxcols'
> parameter, although those would only help in the case where the user
> already knows there is a problem.
>
> cheers
> Ben Bolker
>
> ===============
> writeLines(c("A,B,C,D",
>              "1,a,b,c",
>              "2,f,g,c",
>              "3,a,i,j",
>              "4,a,b,c",
>              "5,d,e,f",
>              "6,g,h,i,j,k,l,m,n"),
>            con=file("test.csv"))
>
>
> read.csv("test.csv")
> try(read.csv("test.csv",fill=FALSE))
>
> ## assumes header=TRUE, fill=TRUE; should be a little more careful
> ## with comment, quote arguments (possibly explicit)
> ## ... contains information about quote, comment.char, sep
> Read.csv <- function(fn,sep=",",...) {
>   colnames <- scan(fn,nlines=1,what="character",sep=sep,...)
>   ncolnames <- length(colnames)
>   maxcols <- max(count.fields(fn,sep=sep,...))
>   if (maxcols>ncolnames) {
>     colnames <- c(colnames,paste("V",(ncolnames+1):maxcols,sep=""))
>   }
>   ## assumes you don't have any other columns labeled "V[large number]"
>   read.csv(fn,...,col.names=colnames)
> }
>
> Read.csv("test.csv")
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

--
[ Laurent Gatto | slashhome.be ]
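
Since RFC 4180 expects every line to have the same number of fields, another option is to check the file before reading it rather than to pad the column names. Below is a minimal sketch of that idea, not a patch for read.table(): the helper name ragged_rows is hypothetical, it assumes the first line is the header and that '"' is the only quote character, and it writes to a temporary file instead of "test.csv".

```r
## Hypothetical helper (not part of base R): return the positions of the
## lines whose field count differs from that of the first (header) line.
ragged_rows <- function(fn, sep = ",", quote = "\"") {
    ## count.fields() gives one count per non-blank line; it returns NA for
    ## lines that sit inside a multi-line quoted field, so those are ignored
    n <- count.fields(fn, sep = sep, quote = quote, comment.char = "")
    which(!is.na(n) & n != n[1])
}

## Same data as the example quoted above, written to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("A,B,C,D",
             "1,a,b,c",
             "2,f,g,c",
             "3,a,i,j",
             "4,a,b,c",
             "5,d,e,f",
             "6,g,h,i,j,k,l,m,n"),
           tmp)

bad <- ragged_rows(tmp)
if (length(bad) > 0) {
    message("lines with an unexpected number of fields: ",
            paste(bad, collapse = ", "))
}
```

Because blank lines are skipped by count.fields() by default, the reported positions count non-blank lines only. Whether to stop, warn, or fall back to the padding workaround when bad is non-empty is left to the caller.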