On 03/19/2018 02:23 PM, Detlef Steuer wrote:
Dear friends,
I stumbled upon behaviour of read.delim which I would consider a bug
or at least an inconsistency that should be improved upon.
Recently we had to work with data that used "", two double quotes, as
the symbol to start and end character input.
Essentially the data looked like this:
data.csv
========
V1, V2, V3
""data"", 3, """"
The last sequence of """" indicating a missing value.
After processing the quotes, this is internally parsed as
data 3 "
which I think is correct; in particular, """" represents a single quote
character, and this conforms to RFC 4180. "" in contrast represents
an empty string.
Based on my reading of RFC 4180, ""data"" is not a valid field, but not
every CSV file follows that RFC, and R supports this pattern as expected
in your data. So you should be fine here.
One obvious solution to read in this data is using some gsub(),
but that's not the point I want to make.
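For completeness, a minimal sketch of the gsub() workaround mentioned above. It assumes the file contents are held in a character vector (the variable names are only illustrative) and that every pair of double quotes in the data is meant as one ordinary quote character:

```r
# Illustrative in-memory copy of data.csv from this message.
lines <- c('V1, V2, V3',
           '""data"", 3, """"')

# Collapse each pair of double quotes into a single one, so that
# ""data"" becomes "data" and """" becomes "" (an empty quoted field).
fixed <- gsub('""', '"', lines, fixed = TRUE)

# strip.white = TRUE removes the blanks that follow each comma.
d <- read.csv(text = fixed, strip.white = TRUE, stringsAsFactors = FALSE)
str(d)
```

This only helps when the doubled quotes are used consistently; a file that also contains genuine escaped quotes inside fields would need a more careful substitution.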
Consider this case we found during tests:
test.csv
========
V1, V2, V3, V4
"""", """", 3, ""
and read it with
read.delim("test.csv", sep=",", header=TRUE, na.strings="\"")
After processing the quotes, this is internally parsed as
" " 3 <empty_string>
which is again, I think, correct (and conforms to RFC 4180).
you get the following
V1 V2 V3 V4
1 NA " 3 NA
(and a warning)
I do not get the warning on my system. The reason why the second " is
not translated to NA by na.strings is the white space after the comma in
the CSV file. This works more consistently:
> read.delim("test.csv", sep=",", header=TRUE, na.strings="\"",
strip.white=TRUE)
V1 V2 V3 V4
1 NA NA 3 NA
If one needed to differentiate between " and <empty_string>, then it
might be necessary to run without the na.strings argument.
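To try both calls side by side, here is a small sketch that recreates test.csv in a temporary file (the path and variable names are only illustrative) and reads it with and without strip.white. The printed results may differ between R versions, so none are asserted here:

```r
# Recreate test.csv from this thread in a temporary file.
path <- tempfile(fileext = ".csv")
writeLines(c('V1, V2, V3, V4',
             '"""", """", 3, ""'), path)

# As posted: the fields after each comma start with a blank.
d1 <- read.delim(path, sep = ",", header = TRUE, na.strings = "\"")

# With strip.white = TRUE the white space is removed before the
# comparison against na.strings.
d2 <- read.delim(path, sep = ",", header = TRUE, na.strings = "\"",
                 strip.white = TRUE)

print(d1)
print(d2)
unlink(path)
```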
Best
Tomas
I would have expected either an error message or at
least the same result for both occurrences of """" in the
input file.
(the setting na.strings="\"" turned out to be working for
a colleague and his specific data, while I think it shouldn't)
My main concern is the different interpretation for the two """"
sequences.
Real bug? Minor inconsistency? I don't know.
All the best
Detlef
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel