On Sun, Oct 11, 2020 at 12:03 PM Morten Welinder <[email protected]> wrote:
>
> TL;DR: Avoid csv files if you can.
I can't and don't want to because nothing else fills the same hole. More
useful than advice to avoid would be a set of canonical rules for producing
CSV files that don't cause problems. I see this:
https://www.w3.org/TR/csvw-ucr/
There's also this, but it seems too basic:
https://www.thoughtspot.com/6-rules-creating-valid-csv-files

Britton

> This note describes what Gnumeric does with csv files and why csv files
> are a bad idea to begin with.
>
> Note: this is about unattended import of csv files. With the configurable
> text importer, Gnumeric can be convinced to import just about any
> reasonable text-in-columns file.
>
> There are three phases to csv file import:
>
> 1. Character set resolution.
> 2. The syntactic level -- chopping the file into lines and lines into fields.
> 3. The semantic level -- making sense of fields.
>
>
> Phase 1: Character Set Resolution
>
> Most people aren't aware, nor do they need to be aware, of this level.
> It governs how the bytes in the file are turned into characters. Most
> text you find on a Linux system will be UTF-8 encoded -- including a
> very large fraction that is just ASCII -- but Windows-originated files
> will occasionally have UTF-16 encoding, either little or big endian,
> with or without a BOM marker. The Windows files will use 2 bytes per
> character while UTF-8 files use a variable length, mostly 1 byte per
> character for western text.
>
> The reason you don't have to think about this is that detection is
> mostly automatic and that the automatic system rarely fails. The only
> reason to intervene would be if you got ahold of some 1990s non-ASCII
> file with, typically, Eastern-European text in it.
>
> The Gnumeric solution here is to map everything to UTF-8, but for the
> purposes here you might as well think "ASCII".
>
>
> Phase 2: The syntactic level
>
> This level handles chopping a file into lines and the resulting lines
> into fields. This is the (only) level that RFC-4180 has an opinion on.
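The BOM-based detection described in Phase 1 above can be sketched in a few
lines of Python. This is not Gnumeric's actual code, just an illustration of
the idea; real detectors also use statistics on the byte stream:

```python
def detect_encoding(raw: bytes) -> str:
    """Guess a text file's encoding from its byte-order mark (BOM),
    falling back to UTF-8 (of which plain ASCII is a subset)."""
    # UTF-16 BOMs: FF FE (little endian) or FE FF (big endian).
    # Python's "utf-16" codec reads the BOM and picks the endianness itself.
    if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    # UTF-8 BOM: EF BB BF; the "utf-8-sig" codec strips the marker.
    if raw.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    return "utf-8"

raw = b"\xff\xfea\x00,\x00b\x00"        # UTF-16-LE "a,b" with BOM
text = raw.decode(detect_encoding(raw))  # -> "a,b"
```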
>
> Let's start with lines. RFC-4180 specifies that lines are terminated
> with \r\n. In particular, if a file uses unix-style \n only, then as
> far as RFC-4180 is concerned the whole file is just one very long line.
> Since \n files occur often in the real world, following RFC-4180
> strictly is clearly not a viable option.
>
> The Gnumeric solution is to accept \r\n, \n, and even the old Mac \r as
> line terminators. Moreover, the last line may omit the terminator. The
> line part of the syntactic level rarely causes problems.
>
> The fields part of the syntactic level is where the wild west starts.
> Here are some samples to think of, with comments that aren't part of
> the files.
>
> 1,2,3                    # Easy
> 4,5,6
>
> 100.22,222.34,-123       # Still easy
> 444,,123.45
>
> "111,22",222,"22,33"     # Not too bad, typical in Europe
> "1,22",,"222,11"
>
> 111,22;222;22,33         # Same as above other than separators
> 1,22;;222,11             # (And, yes, still called csv even though the
>                          # separator isn't comma!)
>
> "Quote ""this""",text    # Doubled quote in quoted text for embedded quote
>
> "Quote \"this\"",text    # Escaped quote in quoted text (not handled
>                          # by Gnumeric)
>
> "Multi                   # Embedded line breaks in text
> line
> field",text
>
> Foo,Bar"Baz,Bof          # Quote inside unquoted field
>
> I don't recall if we have seen files with single quotes. I certainly
> wouldn't rule it out offhand.
>
> It should be fairly clear that if you get the quoting style and/or the
> field separator wrong, then you are stuck with a pile of garbage.
>
> The Gnumeric solution here is to assume the RFC-4180 quoting style,
> then look at the text in the file, notably on a line starting with a
> quote if there is any, and make a guess at the field separator. In
> practice it seems to work.
>
> Let's assume we got it right, and move on. Note that we are now leaving
> behind anything that RFC-4180 can help us with.
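For what it's worth, Python's standard library ships a separator guesser that
works along the same lines: assume RFC-4180-style quoting, then infer the
delimiter from the text itself. It isn't Gnumeric's guesser, but it shows the
technique on the "European" sample above:

```python
import csv
import io

# Semicolon-separated, decimal-comma numbers quoted -- the fourth sample
# from the list above.
sample = '"1,22";;"222,11"\n"111,22";222;"22,33"\n'

# Sniffer looks at where quotes and candidate separators sit next to each
# other and guesses the dialect (delimiter, quote character).
dialect = csv.Sniffer().sniff(sample)

rows = list(csv.reader(io.StringIO(sample), dialect))
# rows == [['1,22', '', '222,11'], ['111,22', '222', '22,33']]
```

Get the guess wrong (say, comma as separator here) and, as Morten says, you
are stuck with a pile of garbage.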
>
>
> Phase 3: The semantic level
>
> This is the phase that takes a, roughly, rectangular array of text bits
> and tries to make sense of it.
>
> VERY IMPORTANT NOTE: Gnumeric (and other spreadsheets) interpret data
> when it is entered. If we determine that something is a number, then it
> will be stored as a number and it does not matter whether it was
> entered as "100", "1e2", "00100" or "100.000". We do not carry the
> string it came from around (and depending on context such a string may
> not even exist in the first place). This means that we know the meaning
> of the data and that it does not change if, say, the spreadsheet is
> later loaded somewhere in Europe where "100.000" could be a hundred
> thousand. Oh, don't get me started on dates!
>
> If this is not the behaviour you want, then you must arrange for that
> particular piece of data to be interpreted as a string, not a number or
> anything else. This is one of the major pitfalls of csv files, and the
> blame falls squarely on the csv format because it has no way of saying
> "that's a string!"
>
> The use of quotes in csv files is *not* an indication that a piece of
> data is a string. Several reasons: (1) in decimal-comma locales most
> numbers have to be quoted; (2) most csv producers quote everything;
> (3) I know of no csv-producing programs that actually try to use quotes
> for that purpose, and I know of no csv-consuming programs that try to
> interpret files that way.
>
> The Gnumeric solution here is to interpret the text as if it were
> entered in a cell. This has both good and bad aspects:
>
> 1. A single initial quote can be used to interpret the rest as text.
>    Occasionally useful; rarely a problem.
>
> 2. An initial "=" can be used to force interpretation as a formula.
>    Occasionally useful; rarely a problem.
>
> 3. Numbers in all kinds of formats are understood. ("100", "100.22",
>    "100,100.22", "$100.22", "1e+09", ...)
>    See below for decimal-comma versus decimal-point.
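A toy version of the interpret-on-entry behaviour, to make the point
concrete. This is my sketch, not Gnumeric's importer; it only handles plain
decimal-point numbers, but it shows how "100", "1e2" and "00100" collapse to
the same value while the originating string is discarded:

```python
def parse_cell(text: str):
    """Interpret a csv field roughly the way a spreadsheet might:
    a leading apostrophe forces text, otherwise try a number,
    otherwise keep the string."""
    if text.startswith("'"):
        return text[1:]        # forced string; the apostrophe is stripped
    try:
        return float(text)     # "100", "1e2", "00100" all become 100.0
    except ValueError:
        return text            # anything else stays a string

# The zip code survives only if it was protected with an apostrophe:
parse_cell("00142")    # -> 100.42? No: 142.0 -- leading zeros are gone
parse_cell("'00142")   # -> "00142"
```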
>
> 4. Dates get interpreted. ("21-Jan-2020", "2002-09-09", "Jan22",
>    "1/2/22", ...) This is a source of problems, especially with short
>    formats that do not contain the year. See below for m/d/y versus
>    d/m/y considerations.
>
> 5. Anything that doesn't get interpreted otherwise becomes a string.
>
>
> Gnumeric looks at entire columns in order to guess what date format is
> being used. If you have "1/2/2000" and later "31/3/2000" in the same
> column, then we deduce that the format is d/m/y and the former gets
> interpreted as 2000-02-01. If you had instead had "3/31/2000" in the
> same column, you would get 2000-01-02.
>
> Similarly, Gnumeric looks at whole columns in order to tell whether
> "100,200" is a little more than one hundred thousand (i.e., the comma
> is a thousands separator) or a little more than one hundred (i.e., the
> comma is a decimal separator).
>
> If a column's actual text doesn't resolve which format is being used,
> the locale's convention is used. That can mean data corruption if your
> "1/2/2000" gets interpreted the wrong way. If the column is
> inconsistent, you have my sympathy.
>
>
> Conclusions:
>
> You should read all of the above as "Gnumeric makes a guess as to what
> the data in the csv file means". The need to make a guess is the fault
> of the csv file format. Other programs have to make guesses too, and
> the results may be different from Gnumeric's.
>
> There are certainly corner cases where one could argue that Gnumeric
> makes the wrong guess. There are also files for which two different
> reasonable guesses are possible, i.e., no matter what guess you make
> you are wrong!
>
> When Gnumeric interprets a csv or other text import the wrong way
> (i.e., differently from what you want), it is generally a mistake to
> try to fix the data in the sheet using search-and-replace or hand
> editing. The right way is to invoke the configurable GUI text importer
> and set the right format.
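The column-wide d/m/y deduction described above can be sketched like this.
Again a toy, assuming every cell is x/y/z with a four-digit year; Gnumeric's
real heuristics are more involved:

```python
def deduce_date_order(column):
    """Decide d/m/y vs m/d/y for a whole column of x/y/z dates:
    a component greater than 12 can only be a day of the month."""
    for cell in column:
        first, second, _year = (int(part) for part in cell.split("/"))
        if first > 12:
            return "d/m/y"     # e.g. "31/3/2000" pins the day first
        if second > 12:
            return "m/d/y"     # e.g. "3/31/2000" pins the day second
    return None                # ambiguous; fall back to locale convention

deduce_date_order(["1/2/2000", "31/3/2000"])   # -> "d/m/y"
deduce_date_order(["1/2/2000", "3/31/2000"])   # -> "m/d/y"
deduce_date_order(["1/2/2000"])                # -> None, locale decides
```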
> A possible exception to this is US postal codes ("zip codes"), for
> which you can probably get away with using a format of "00000" to
> restore the leading zeros.
>
> I always advise people not to use csv files if possible. When that is
> not possible, my advice is to keep csv files as simple as possible.
> That means:
>
> 1. RFC-4180 syntax. Avoid multi-line fields.
> 2. Simple numbers only, no thousands separators and no currency.
> 3. Dates in 2020-Oct-11 or 2020-10-11 format.
> 4. No strings that can be interpreted as numbers -- watch out for the
>    gene name "MARCH1", zip code "00142", and "1e2".
> 5. Don't start strings with an equal sign or a single quote.
>
> (4) is tricky because it is vague.
>
> If you can't do this, you're on thin ice and should identify the
> problem correctly -- csv is deficient -- and move to another format.
>
> If you have placed "MARCH1" in a csv file and the information "it's a
> gene name" elsewhere (in your head or some other file), then you really
> should not blame Excel, Gnumeric, or anyone else when your data gets
> mangled.
>
> Morten
> _______________________________________________
> gnumeric-list mailing list
> [email protected]
> https://mail.gnome.org/mailman/listinfo/gnumeric-list
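As a producer, rules (1)-(3) at least are easy to follow mechanically. A
small sketch using Python's stdlib csv writer (rules (4) and (5) concern
your data itself and no writer can enforce them):

```python
import csv
import datetime
import io

buf = io.StringIO()
# RFC-4180 wants \r\n line endings; QUOTE_MINIMAL is the writer's default,
# so fields are only quoted when they actually need it.
writer = csv.writer(buf, lineterminator="\r\n")

writer.writerow(["item", "date", "price"])
# Render dates as ISO text yourself (rule 3) and keep numbers plain,
# no thousands separator, no currency sign (rule 2).
writer.writerow(["widget", datetime.date(2020, 10, 11).isoformat(), 19.99])

# buf.getvalue() == "item,date,price\r\nwidget,2020-10-11,19.99\r\n"
```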
