This is a very odd error I'm hitting using read_xlsx from the readxl package
(version 1.3.1) with R version 3.6.2 (2019-12-12) , platform
x86_64-pc-linux-gnu (and updated Ubuntu 18.04). I have some largeish Excel
spreadsheets that contain clinical data. I can't share the entire raw data but
I think I can share the specific problem columns as Excel files, but not via
the list as I'm sure it rightly rejects such attachments.
The particular column contains entries like
1
1, 14
1.14
That's to say it's a column that can have empty cells, or entries which should
be integers (a limited range of them) but cells may have multiple integers and
the data entry means that people use various separators, commas, full stops and
occasionally semi-colons or colons and all with or without various amounts of
space.
I thought this would be easy to handle but this illustrates the issue I'm
hitting:
> unique(read_xlsx("Book1.xlsx", col_types = "text"))
# A tibble: 18 x 1
NOWARN
<chr>
1 NA
2 14
3 8,12,14
4 13
5 58
6 9
7 9.1300000000000008
8 11
9 11.14
10 10
11 10.14
12 9.14
13 13.14
14 9 ,13
15 9.11
16 1
17 1.1399999999999999
18 1, 14
That's reading from a single column, 981 row (including column header) Excel
xlsx file in an up to date Windoze 10 Professional running in a VM on the
Ubuntu machine.
I created that file (which I can share) by copying the data from the full file
to a new Excel spreadsheet (M$ Orifice "Professional Plus 2019" "Version 1912"
"Build 12325.20344 Click-to-run" to an empty new Excel file and using the
default save_as. The clinical data files were created in, and updated in,
versions of Excel that I can't access but the file was certainly created first
between two years and three months before now so probably with different
versions of Excel and probably in a Spanish or Catalan M$ locale.
The weird thing is that looking at the Excel cells that created those
"9.1300000000000008" and "1.1399999999999999" entries they show "9.13" and
"1.14" (respectively!). They continue to show those values plus many trailing
zeroes if I use Excel formatting to ask for 20 decimal places (I get less of
course, but no arbitrary terminal rounding digit).
It appears to me that read_xlxs() is only applying the "col_types = "text""
argument _after_ reading the column freely, reading each cell guessing the type
by its contents and so ending up with numeric values for "9.13" and "1.14"
which are then picking up rounding errors and being forced to character after
that. I say that the reading would appear to be free across all cells in the
column as there are entries of "8, 12, 14" coming before these problem entries:
> tmp <- read_xlsx("Book1.xlsx", col_types = "text")
> grep("1.1399999999999999", tmp$NOWARN, fixed = TRUE)
[1] 932 948 954
> grep("9.1300000000000008", tmp$NOWARN, fixed = TRUE)
[1] 73 189 190 271 272 390 511 645 686 710 744 830 899
> tmp$NOWARN[20]
[1] "8,12,14"
This seems completely bizarre to me. I find it very hard to believe that
read_xlsx() would guess content class (type) freely by for each individual
entry and only apply the col_types argument after doing that as that would seem
likely to be incredibly inefficient for really big spreadsheets. It seems
equally hard to believe that it would then create rounding errors (for some
guessed numerics like 9.13 and 1.14 but not for others like 11.4). However, my
guess would appear to fit the results and I am only guessing because I'm sure
my programming comprehension isn't good enough to read into the sources to
actually work out how the function works.
To make things more interesting, and to suggest that at least some of the
problem is with Excel is that when I use LibreOffice (in Ubuntu) created a
Excel file in the same way, i.e. open the clinical Excel file but in
LibreOffice, copy and paste the same column into a new LibreOffice calc
spreadsheet and save as xlsx, tmp.xlsx, I get this:
> unique(read_xlsx("tmp.xlsx", col_types = "text"))
# A tibble: 18 x 1
NOWARN
<chr>
1 NA
2 14
3 8,12,14
4 13
5 58
6 9
7 9.13
8 11
9 11.14
10 10
11 10.14
12 9.14
13 13.14
14 9 ,13
15 9.11
16 1
17 1.14
18 1, 14
Exactly what I think I should be seeing. I was working in Rstudio but get
exactly the same in a new R terminal session with only readxl loaded so I don't
think this is any weird environment or other clash.
Obviously I can, though not terribly easily for a fully generic fix, catch
these weird rounding errors and correct them, I am sure can also report this as
a suspected bug to the maintainer through the github issues system but I wanted
to check here whether anyone could see something I'm missing as I'm really a
(clinically retired) therapist and doctor, now full time researcher and I'm not
a professional statistician or programmer.
TIA,
Chris
--
Chris Evans <[email protected]> Visiting Professor, University of Sheffield
<[email protected]>
I do some consultation work for the University of Roehampton
<[email protected]> and other places
but <[email protected]> remains my main Email address. I have a work web site
at:
https://www.psyctc.org/psyctc/
and a site I manage for CORE and CORE system trust at:
http://www.coresystemtrust.org.uk/
I have "semigrated" to France, see:
https://www.psyctc.org/pelerinage2016/semigrating-to-france/
That page will also take you to my blog which started with earlier joys in
France and Spain!
If you want to book to talk, I am trying to keep that to Thursdays and my diary
is at:
https://www.psyctc.org/pelerinage2016/ceworkdiary/
Beware: French time, generally an hour ahead of UK.
______________________________________________
[email protected] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.