On 2020-10-04 19:02, John Denker via gnumeric-list wrote:
> > The first rule of csv files is "don't use csv files".
>
> That scares me. In just one of my directories, I just now counted
> two dozen .csv files created in the last 24 hours. A total of 12
> megabytes today, just in this one directory. There are others.
>
> My professional life depends on .csv files that I get from various
> sources. Data is available to me in that format, and often no other.
>
> Very often I need to do calculations that can't be done in a
> spreadsheet, so I export the data, krunch it using thousands of
> lines of C++ and/or perl, and then import it again.

CSV files come with lots of potential issues, mostly revolving around
a lack of standardization:

- encoding may or may not be specified (is this UTF-8? UTF-16? UTF-32?
  Latin1? Windows-1252? any of a gazillion other encodings?)

- how do you quote the quote character? (doubling it, escaping it with
  a backslash, encoding it with some other escape method, ...)

- does it distinguish between an empty value and an empty quoted
  value? (sometimes the former means Null while the latter means an
  empty string; other times they're the same)

- should one expect headers? If so, does case matter? Does order
  matter? (I often have columns move around, but if accessed by
  header, they're adequately consistent)

- can more than one column have the same header?

- what should happen if a row has fewer entries than the header row?

- what should happen if a row has *more* entries than the header row?

- what should happen if there's no header row, but rows don't have
  the same number of columns?

- parsing with some tools like awk(1) can become tedious when the
  comma delimiter can appear within the data (so you have to
  special-case the quoting)

- is the end-of-line character a Unix "LF", a DOS "CR/LF", an old Mac
  "CR", or the largely-unused Record Separator (RS=0x1E)?

- what happens if data contains newlines in it? Does odd quoting mean
  that the row is continued on the next line?

- sometimes things are called CSV when they use alternate delimiters
  such as tab (though those are often called TSV files), pipe, colon,
  or whatever other delimiter character comes up on a whim

- the data is largely 2D only, so there's no mechanism for including
  multiple sheets of data other than multiple files

None of these is necessarily a deal-breaker. I deal with processing
hundreds of MB (maybe even GB) of CSV files each month using Python &
awk, but the road is paved with the above perils.

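To make a few of those questions concrete, here's a minimal sketch of
the kind of thing I mean, using Python's standard csv module. The
function name read_rows and the particular answers it hard-codes
(comma delimiter, doubled quotes, unique headers, no ragged rows) are
assumptions for illustration, not a description of any particular
file you might receive.

```python
import csv
import io


def read_rows(stream):
    """Yield one dict per data row, failing loudly on surprises.

    Assumed dialect (hypothetical; adjust to your data): comma
    delimiter, double-quote quoting, quotes escaped by doubling.
    """
    reader = csv.reader(
        stream,
        delimiter=",",
        quotechar='"',
        doublequote=True,  # a literal quote inside a field is written ""
        strict=True,       # raise csv.Error on malformed quoting
    )
    try:
        headers = next(reader)
    except StopIteration:
        return  # empty input: no header row, no data
    if len(set(headers)) != len(headers):
        raise ValueError(f"duplicate headers: {headers}")
    for row in reader:
        # Refuse rows with fewer or more entries than the header row.
        if len(row) != len(headers):
            raise ValueError(
                f"line {reader.line_num}: expected {len(headers)} "
                f"fields, got {len(row)}"
            )
        # Access by header name, so column order can move around.
        yield dict(zip(headers, row))


# A field with an embedded comma, a doubled quote, and a newline.
sample = 'name,note\r\n"Smith, J","said ""hi""\nthen left"\r\n'
for record in read_rows(io.StringIO(sample)):
    print(record["name"], repr(record["note"]))
```

For a real file you'd pin the encoding and line endings down as well,
along the lines of open(path, encoding="utf-8", newline=""). Note that
this sketch does not distinguish an empty value from an empty quoted
value; as far as I know the reader hands back an empty string for both
by default.
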
If you know the answers to the questions above for your data, or
haven't hit any of those issues, and you know that the file format is
predictable, then I would treat "don't use CSV files" as more of an
admonition to know what you're doing, and a reminder that if something
breaks, you get to keep all the pieces.

CSV is an unfortunately underdefined (but common) means of transmitting
data. There are better ways, but <opinion class=controversial>like PHP,
JavaScript, and MySQL, CSV files are used because they're popular, not
because they're particularly good; I use PHP, JavaScript, MySQL, and
CSV files for their ubiquity, not their excellence.</opinion>

So use guilt-free, but use with caution.

-tkc
