On Sun, Oct 11, 2020 at 12:03 PM Morten Welinder <[email protected]> wrote:
>
> TL;DR: Avoid csv files if you can.

I can't and don't want to because nothing else fills the same hole.
More useful than advice to avoid would be a set of canonical rules for producing
CSV files that don't cause problems.

I see this https://www.w3.org/TR/csvw-ucr/

There's this but it seems too basic:
https://www.thoughtspot.com/6-rules-creating-valid-csv-files

Britton

>
> This note described what Gnumeric does with csv files and why csv files are
> a bad idea to begin with.
>
> Note: this is about unattended import of csv files.  With the configurable
> text importer, Gnumeric can be convinced to import just about any reasonable
> text-in-columns file.
>
> There are three phases to csv file import:
>
> 1. Character set resolution.
> 2. The syntactic level -- chopping the file into lines and lines into fields.
> 3. The semantic level -- making sense of fields.
>
>
> Phase 1: Character Set Resolution
>
> Most people aren't aware, nor do they need to be aware, of this level.
> It governs how the bytes in the file are turned into characters.  Most text
> you find on a Linux system will be UTF-8 encoded -- including a very large
> fraction that is just ASCII -- but Windows-originated files will occasionally
> have UTF-16 encoding, either little or big endian, with or without a BOM.
> The Windows files will use 2 bytes per character while UTF-8 files use a
> variable length, mostly 1 byte per character for western text.
>
> The reason you don't have to think about this is that detection is mostly
> automatic and that the automatic system rarely fails.  The only reason to
> intervene would be if you got ahold of some 1990s non-ASCII file with,
> typically, Eastern-European text in it.
>
> The Gnumeric solution here is to map everything to UTF-8, but for the
> purposes here you might as well think "ASCII".
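The detection Morten describes can be sketched as a BOM check with a UTF-8 fallback.  A toy Python illustration (not Gnumeric's actual detector; real detection also has to cope with BOM-less UTF-16 and legacy 8-bit encodings):

```python
# Guess a text file's encoding from its leading bytes (BOM sniffing).
# Toy illustration only, not Gnumeric's real detection code.
def guess_encoding(raw: bytes) -> str:
    if raw.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"   # UTF-8 with BOM
    if raw.startswith(b"\xff\xfe"):
        return "utf-16-le"   # little-endian BOM
    if raw.startswith(b"\xfe\xff"):
        return "utf-16-be"   # big-endian BOM
    return "utf-8"           # plain ASCII is valid UTF-8 too

print(guess_encoding(b"\xff\xfe" + "a,b,c".encode("utf-16-le")))  # utf-16-le
print(guess_encoding(b"a,b,c"))                                   # utf-8
```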
>
>
> Phase 2: The syntactic level
>
> This level handles chopping a file into lines and the resulting lines into
> fields.  This is the (only) level that RFC-4180 has an opinion on.
>
> Let's start with lines.  RFC-4180 specifies that lines are terminated with
> \r\n.  In particular, if a file uses unix-style \n only, then as far as
> RFC-4180 is concerned the whole file is only one very long line.  Since \n
> files occur often in the real world, following RFC-4180 strictly is clearly
> not a viable option.
>
> The Gnumeric solution is to accept \r\n, \n, and even the old Mac \r as line
> terminators.  Moreover, the last line may omit the terminator.  The line part
> of the syntactic level rarely causes problems.
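As it happens, Python's str.splitlines() implements essentially this policy (\r\n, \n, bare \r, optional final terminator -- plus a few extra Unicode line breaks), so the behaviour is easy to try out:

```python
# Mixed terminators and a missing final newline, as tolerated above.
text = "a,b\r\nc,d\ne,f\rg,h"
print(text.splitlines())  # ['a,b', 'c,d', 'e,f', 'g,h']
```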
>
> The fields part of the syntactic level is where the wild west starts.  Here
> are some samples to think of, with comments that aren't part of the files.
>
>    1,2,3                 # Easy
>    4,5,6
>
>    100.22,222.34,-123    # Still easy
>    444,,123.45
>
>    "111,22",222,"22,33"  # Not too bad, typical in Europe
>    "1,22",,"222,11"
>
>    111,22;222;22,33      # Same as above other than terminators
>    1,22;;222,11          # (And, yes, still called csv even though the 
> separator
>                          # isn't comma!)
>
>    "Quote ""this""",text # Double quote in quoted text for embedded quote
>
>    "Quote \"this\"",text # Escaped quote in quoted text (not handled
> by Gnumeric)
>
>    "Multi                # Embedded line breaks in text
>    line
>    field",text
>
>    Foo,Bar"Baz,Bof       # Quote inside unquoted field
>
> I don't recall if we have seen files with single quotes.  I certainly
> wouldn't rule it out offhand.
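For what it's worth, Python's csv module defaults to the doubled-quote convention from the samples above and needs an explicit dialect for the backslash style -- a small demonstration, nothing to do with Gnumeric's parser:

```python
import csv
import io

# RFC-4180 doubled quotes -- the csv module's default dialect:
row = next(csv.reader(io.StringIO('"Quote ""this""",text')))
print(row)   # ['Quote "this"', 'text']

# Backslash-escaped quotes need doublequote=False plus an escapechar:
row2 = next(csv.reader(io.StringIO('"Quote \\"this\\"",text'),
                       doublequote=False, escapechar="\\"))
print(row2)  # ['Quote "this"', 'text']
```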
>
> It should be fairly clear that if you get the quoting style and/or the field
> separator wrong, then you are stuck with a pile of garbage.
>
> The Gnumeric solution here is to assume the RFC-4180 quoting style, and then
> look at the text in the file, notably on a line starting with a quote if
> there is any, and make a guess at the field separator.  In practice it seems
> to work.
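Python's csv.Sniffer makes a comparable guess from a sample of the text; like any such heuristic it usually works and can still be fooled:

```python
import csv

# Guess the field separator of the semicolon-separated sample above.
sample = '"111,22";222;"22,33"\n"1,22";;"222,11"\n'
dialect = csv.Sniffer().sniff(sample, delimiters=",;")
print(dialect.delimiter)  # ;
```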
>
> Let's assume we got it right, and move on.  Note that we are now leaving
> behind anything that RFC-4180 can help us with.
>
>
> Phase 3: The semantic level
>
> This is the phase that takes a, roughly, rectangular array of text bits and
> tries to make sense of it.
>
> VERY IMPORTANT NOTE: Gnumeric (and other spreadsheets) interpret data as it
> is entered.  If we determine that something is a number, then it will be
> stored as a number and it does not matter whether it was entered as "100",
> "1e2", "00100" or "100.000".  We do not carry the string it came from around
> (and depending on context such a string may not even exist in the first
> place).  This means that we know the meaning of the data and that it does
> not change if, say, the spreadsheet is later loaded somewhere in Europe
> where "100.000" could be a hundred thousand.  And don't get me started on
> dates!
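The interpretation-on-entry point is easy to demonstrate outside any spreadsheet: in a decimal-point locale all four spellings collapse to the same stored number, and the original text is gone.

```python
# Four different spellings, one value; the source string is not kept.
values = [float(text) for text in ("100", "1e2", "00100", "100.000")]
print(values)  # [100.0, 100.0, 100.0, 100.0]
```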
>
> If this is not the behaviour you want, then you must arrange for that
> particular piece of data to be interpreted as a string, not a number or
> anything else.  This is one of the major pitfalls of csv files, and the
> blame falls squarely on the csv format because it has no way of saying
> "that's a string!"
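This is also visible at the parser level: a csv parser hands back nothing but strings, and any typing is the consumer's own guesswork (Python's stdlib shown here purely as an example):

```python
import csv
import io

# The parser attaches no types: every field comes back as a string, and
# whether "00142" survives as text is entirely up to the importer.
rows = list(csv.reader(io.StringIO("00142,MARCH1,1e2\n")))
print(rows)  # [['00142', 'MARCH1', '1e2']]
```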
>
> The use of quotes in csv files is *not* an indication that a piece of data
> is a string.  Several reasons: (1) in decimal-comma locales most numbers
> have to be quoted; (2) most csv producers quote everything; (3) I know of no
> csv-producing programs that actually try to use quotes for that purpose and
> I know of no csv-consuming programs that try to interpret files that way.
>
> The Gnumeric solution here is to interpret the text as if it had been
> entered in a cell.  This has both good and bad aspects:
>
> 1. A single initial quote can be used to interpret the rest as text.
>    Occasionally useful; rarely a problem.
>
> 2. An initial "=" can be used to force interpretation as a formula.
>    Occasionally useful; rarely a problem.
>
> 3. Numbers in all kinds of formats are understood.  ("100", "100.22",
>    "100,100.22", "$100.22", "1e+09", ...)
>    See below for decimal-comma versus decimal-point.
>
> 4. Dates get interpreted.  ("21-Jan-2020", "2002-09-09", "Jan22", "1/2/22",
>    ...)
>    This is a source of problems, especially with short formats that do not
>    contain the year.  See below for m/d/y versus d/m/y considerations.
>
> 5. Anything that doesn't get interpreted otherwise becomes a string.
>
>
> Gnumeric looks at entire columns in order to guess what date format is being
> used.  If you have "1/2/2000" and later "31/3/2000" in the same column, then
> we deduce that the format is d/m/y and the former gets interpreted as
> 2000-02-01.  If you had instead had "3/31/2000" in the same column, you
> would get 2000-01-02.
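A toy version of this column-wide deduction -- try one format, fall back to the other when some value in the column rules it out (the ordering and formats here are my own choices, not Gnumeric's):

```python
from datetime import datetime

# Deduce a column's date format: the first candidate format that parses
# every value in the column wins; None means neither fits.
def deduce_format(column):
    for fmt in ("%m/%d/%Y", "%d/%m/%Y"):
        try:
            for value in column:
                datetime.strptime(value, fmt)
            return fmt
        except ValueError:
            continue          # some value ruled this format out
    return None

print(deduce_format(["1/2/2000", "31/3/2000"]))  # %d/%m/%Y
print(deduce_format(["1/2/2000", "3/31/2000"]))  # %m/%d/%Y
```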
>
> Similarly, Gnumeric looks at whole columns in order to tell whether "100,200"
> is a little more than one hundred thousand (i.e., the comma is a thousands
> separator) or a little more than one hundred (i.e., the comma is a decimal
> separator).
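A similarly crude sketch of the comma question (my own toy rules, not Gnumeric's): a value with two commas, or with a comma group that is not exactly three digits wide, settles it; otherwise the column is ambiguous.

```python
# Decide whether "," in a numeric column is a thousands separator (True),
# a decimal separator (False), or ambiguous (None).
def comma_is_thousands(column):
    for value in column:
        parts = value.split(",")
        if len(parts) > 2:
            return True       # "1,234,567": must be thousands
        if len(parts) == 2 and len(parts[1]) != 3:
            return False      # "100,2": must be decimal
    return None               # ambiguous; fall back to the locale

print(comma_is_thousands(["100,200", "1,234,567"]))  # True
print(comma_is_thousands(["100,2", "7,5"]))          # False
print(comma_is_thousands(["100,200"]))               # None
```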
>
> If a column's actual text doesn't resolve which format is being used, the
> locale's convention is used.  That can mean data corruption if your
> "1/2/2000" gets interpreted the wrong way.  If the column is inconsistent,
> you have my sympathy.
>
>
> Conclusions:
>
> You should read all of the above as "Gnumeric makes a guess as to what the
> data in the csv file means".  The need to make a guess is the fault of the
> csv file format.  Other programs have to make guesses too, and the results
> may be different from Gnumeric's.
>
> There are certainly corner cases where one could argue that Gnumeric makes the
> wrong guess.  There are also files for which two different reasonable guesses
> are possible, i.e., no matter what guess you make you are wrong!
>
> When Gnumeric interprets csv or other text import the wrong way (i.e.,
> differently from what you want), it is generally a mistake to try to fix the
> data in the sheet using search-and-replace or hand editing.  The right way is
> to invoke the configurable gui text importer and set the right format.
> A possible exception to this is US postal codes ("zip codes"), which you
> can probably restore by applying a format of "00000".
>
> I always advise people not to use csv files if possible.  When that is not
> possible, my advice is to keep csv files as simple as possible.  That means:
>
> 1. RFC-4180 syntax.  Avoid multi-line fields.
> 2. Simple numbers only, no thousands separator and no currency.
> 3. Dates in 2020-Oct-11 or 2020-10-11 format.
> 4. No strings that can be interpreted as numbers -- watch out for the gene
>    name "MARCH1", zip code "00142", and "1e2".
> 5. Don't start strings with an equal sign or a single quote.
>
> (4) is tricky because it is vague.
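Rules 1-3 are mechanical enough to automate.  A sketch with Python's stdlib writer -- and note that nothing here can protect "MARCH1"; that is rule 4's problem and exactly the format deficiency being described:

```python
import csv
import io

# Emit RFC-4180 syntax: \r\n terminators, minimal quoting, plain
# numbers, ISO-style dates.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
writer.writerow(["gene", "zip", "date", "count"])
writer.writerow(["MARCH1", "00142", "2020-10-11", 100])
print(buf.getvalue())
```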
>
> If you can't do this, you're on thin ice and should identify the problem
> correctly -- csv is deficient -- and move to another format.
>
> If you have placed "MARCH1" in a csv file and the information "it's a gene
> name" elsewhere (in your head or some other file) then you really should not
> blame Excel, Gnumeric, or anyone else when your data gets mangled.
>
> Morten
> _______________________________________________
> gnumeric-list mailing list
> [email protected]
> https://mail.gnome.org/mailman/listinfo/gnumeric-list