On Jan 28, 2014, at 8:56 PM, David Winsemius wrote:

On Jan 28, 2014, at 8:43 PM, andrewH wrote:
> Hi Folks!
>
> I have been writing a small set of utilities for dealing with files that
> are hard to open correctly for one reason or another, especially because
> they are too big for memory, non-rectangular, or contain odd characters
> or unexpected encodings, or all of these things together. Today it
> suddenly hit me that this has probably been done, done better, and
> upgraded to package form a dozen times already. There were pointers to a
> couple of functions useful in this regard in the core R Data
> Import/Export manual, but my efforts to come up with search terms that
> would turn up such packages were unsuccessful.

I don't know of a package to do that. You know the quote from that Russian
author whose name I am forgetting (in "Anna Karenina", perhaps) about happy
families being all the same but unhappy families being impossible to
classify. I think it applies to datasets as well: there are too many
different dataset pathologies to allow a neat packaged approach.

My approach has been to study the options of read.table very carefully and,
if that is insufficient, to look at readLines or scan as alternatives. It is
very useful to run count.fields with different settings of its `quote` and
`comment.char` parameters; wrapping the result in table() can deliver a very
compact, useful summary (minimal sketches of these techniques appear at the
end of this thread). And don't forget to search the archives if you have a
regular but non-rectangular arrangement.

David Winsemius
Alameda, CA, USA

Thanks, David! You have quickly summarized a set of techniques that took me
a long time to learn (much of it spent disentangling the truth from various
misconceptions about the data-reading process). I don't think I have much to
add to your list, but as always, effectiveness depends on correct
implementation, and I have made a _lot_ of mistakes trying to implement
these techniques in the past.

Moreover, all of these things become much more complicated if the file is
too big to simply read into a data frame. I am working with Census records
right now, and my primary data file is a 14 GB CSV that had me tearing my
hair out trying to read it and pull out the variables I needed at any given
moment. I finally did get it read and the right subset extracted, but it was
a pretty empirical process: I would just keep trying things that didn't work
until I found something that did, often without quite understanding why my
previous attempts had failed. I know that if I have to do this again six
months from now, I will have no idea how I did it.

So I wanted to reduce the things that worked to functions and set up a sort
of decision tree that I could work through to find and correct at least the
more common problems. But I was hoping -- am still hoping, actually -- to
find that someone else has already done this, so I could get back to my real
work. It seems like the sort of thing that could easily be buried in the
100+ pages of documentation of one of the big utility packages like Hmisc,
MASS, or car.

I have often wished there were a data manipulation and import/export task
view, with a purview covering things like what I am talking about here, the
contents of Phil Spector's book, and packages like Hadley Wickham's plyr.

Warmest regards,
andrewH
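A minimal sketch of the count.fields()/table() diagnostic David describes.
The file name "messy.csv" is a hypothetical stand-in for your own file:

    ## Tabulate how many fields count.fields() sees on each line under the
    ## default quoting rules. A clean rectangular CSV yields a single table
    ## entry; a spread of counts means some lines parse differently.
    table(count.fields("messy.csv", sep = ","))

    ## Re-run with quoting disabled and no comment character. Comparing the
    ## two tables often reveals whether stray quote or '#' characters are
    ## what is breaking the parse.
    table(count.fields("messy.csv", sep = ",", quote = "", comment.char = ""))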
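Building on the readLines/scan suggestion, one way to inspect only the
offending lines (same hypothetical file name; note this reads the whole file
into memory, so it suits files that fit comfortably, not the 14 GB case):

    ## Find lines whose field count differs from the most common one and
    ## pull just those out for inspection. count.fields() can report NA for
    ## lines caught inside a quoted string that spans lines, so treat NA as
    ## suspect too.
    n <- count.fields("messy.csv", sep = ",")
    expected <- as.integer(names(which.max(table(n))))
    bad <- which(is.na(n) | n != expected)
    readLines("messy.csv")[bad]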
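For the too-big-for-memory problem andrewH raises, a sketch of one common
base-R approach: read the file in chunks over a connection, and use
colClasses = "NULL" so read.csv skips unwanted columns rather than reading
and then dropping them. The file name, chunk size, and the retained column
names ("AGEP", "PINCP") are hypothetical placeholders, not from the thread:

    con <- file("census.csv", open = "r")
    ## Read the header line; scan() leaves a caller-opened connection open.
    header <- scan(con, what = "", sep = ",", nlines = 1, quiet = TRUE)
    ## NA = read the column with default conversion; "NULL" = skip it.
    classes <- ifelse(header %in% c("AGEP", "PINCP"), NA, "NULL")

    pieces <- list()
    repeat {
      chunk <- tryCatch(
        read.csv(con, header = FALSE, nrows = 100000,
                 col.names = header, colClasses = classes,
                 stringsAsFactors = FALSE),
        error = function(e) NULL)  # read.csv errors when no lines remain
      if (is.null(chunk)) break
      pieces[[length(pieces) + 1]] <- chunk
      if (nrow(chunk) < 100000) break  # short chunk: end of file reached
    }
    close(con)
    wanted <- do.call(rbind, pieces)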