Re: [R] read_xlsx(readxl) apparently mangling some data input

Chris Evans Tue, 04 Feb 2020 06:19:57 -0800

Thanks. Probably should have said that the spreadsheets were originally created 
with the type of the data carefully defined to try to minimise 
mess (though I hadn't foreseen this little issue). Unfortunately that 
formatting has clearly been overridden in at least some of the data collection 
sites since then.


I have just rechecked and setting the format of the column from row 2 to the 
end as "text" in Excel does, as you'd hope, prevent Excel converting 
"9.13" being converted to numeric with the, I now realise, inevitable rounding 
consequence. Clearly the spreadsheets have had formatting overridden
at the data collection sites which created this problem. 

Maybe there's a way in Excel to allow end users to enter data but not change 
the formatting. Yes, for anyone following this for whom it might be useful,
you can do this and clearly I should have done: 
   https://www.extendoffice.com/documents/excel/1277-excel-protect-format.html

While I'm going a bit off piste or OT here, but following this line, does 
anyone recommend an open source online data entry and validation system that
can have a multidimensional database structure?  I'm trying to move things from 
Excel to LimeSurvey which isn't all that friendly but seems powerful, 
genuinely FLOSS and has proved very reliable, fast and stable for our purposes. 
 However, I don't think it can be adapted from flat form data to RMDBS
form data.  Any good options out there?

Thanks cpolwart and also TIA if anyone has suggestions for that tangential 
question (to me off list and I'll summarise if there's much and people 
asking for it).

Chris

----- Original Message -----
> From: cpolw...@chemo.org.uk
> To: "Chris Evans" <chrish...@psyctc.org>
> Cc: "R-help Mailing List" <r-help@r-project.org>
> Sent: Tuesday, 4 February, 2020 13:32:20
> Subject: Re: [R] read_xlsx(readxl) apparently mangling some data input

> This may not be possible or practical, bit can you tell excel it's a 
> character.
> The normal method of doing that is to add an ' at the front of the cell. So in
> Excel say column G has the "numbers" in column Z you create a new row with 
> ="'"
> & TEXT (G1)
> (Untested)

> Then use column Z in R.

> The other question might be simply how each cell is formatted. It you have
> people using any character to seperate numbers I bet some cells are formatted
> as text and some as numbers... Hence R may be respecting that.

> Other thoughts would be to clean it in excel (a horrific thought but something
> like

> = REPLACE ( TEXT (G1), SEARCH (".", TEXT (G1)), 1, ";" )

> Would replace first dot with ; meaning it won't look like a number. You'd 
> still
> have second dots but you can address that in R

> Paste values only and format as text (general) could help test that.

> The "rounding" issue is caused by R treating it as a floating point number as
> binary then converting it. You may already be aware of that. Once R thinks it
> is as.numeric(9.13) you can't stop that. So you need R never to think it is
> numeric.

> On 4 Feb 2020 12:07, Chris Evans <chrish...@psyctc.org> wrote:

>> This is a very odd error I'm hitting using read_xlsx from the readxl package
>> (version 1.3.1) with R version 3.6.2 (2019-12-12) , platform
>> x86_64-pc-linux-gnu (and updated Ubuntu 18.04). I have some largeish Excel
>> spreadsheets that contain clinical data. I can't share the entire raw data 
>> but
>> I think I can share the specific problem columns as Excel files, but not via
>> the list as I'm sure it rightly rejects such attachments.

>> The particular column contains entries like
>> 1
>> 1, 14

>> 1.14

>> That's to say it's a column that can have empty cells, or entries which 
>> should
>> be integers (a limited range of them) but cells may have multiple integers 
>> and
>> the data entry means that people use various separators, commas, full stops 
>> and
>> occasionally semi-colons or colons and all with or without various amounts of
>> space.

>> I thought this would be easy to handle but this illustrates the issue I'm
>> hitting:

>> > unique(read_xlsx("Book1.xlsx", col_types = "text"))
>> # A tibble: 18 x 1
>> NOWARN
>> <chr>
>> 1 NA
>> 2 14
>> 3 8,12,14
>> 4 13
>> 5 58
>> 6 9
>> 7 9.1300000000000008
>> 8 11
>> 9 11.14
>> 10 10
>> 11 10.14
>> 12 9.14
>> 13 13.14
>> 14 9 ,13
>> 15 9.11
>> 16 1
>> 17 1.1399999999999999
>> 18 1, 14

>> That's reading from a single column, 981 row (including column header) Excel
>> xlsx file in an up to date Windoze 10 Professional running in a VM on the
>> Ubuntu machine.

>> I created that file (which I can share) by copying the data from the full 
>> file
>> to a new Excel spreadsheet (M$ Orifice "Professional Plus 2019" "Version 
>> 1912"
>> "Build 12325.20344 Click-to-run" to an empty new Excel file and using the
>> default save_as. The clinical data files were created in, and updated in,
>> versions of Excel that I can't access but the file was certainly created 
>> first
>> between two years and three months before now so probably with different
>> versions of Excel and probably in a Spanish or Catalan M$ locale.

>> The weird thing is that looking at the Excel cells that created those
>> "9.1300000000000008" and "1.1399999999999999" entries they show "9.13" and
>> "1.14" (respectively!). They continue to show those values plus many trailing
>> zeroes if I use Excel formatting to ask for 20 decimal places (I get less of
>> course, but no arbitrary terminal rounding digit).

>> It appears to me that read_xlxs() is only applying the "col_types = "text""
>> argument _after_ reading the column freely, reading each cell guessing the 
>> type
>> by its contents and so ending up with numeric values for "9.13" and "1.14"
>> which are then picking up rounding errors and being forced to character after
>> that. I say that the reading would appear to be free across all cells in the
>> column as there are entries of "8, 12, 14" coming before these problem 
>> entries:

>> > tmp <- read_xlsx("Book1.xlsx", col_types = "text")
>> > grep("1.1399999999999999", tmp$NOWARN, fixed = TRUE)
>> [1] 932 948 954
>> > grep("9.1300000000000008", tmp$NOWARN, fixed = TRUE)
>> [1] 73 189 190 271 272 390 511 645 686 710 744 830 899
>> > tmp$NOWARN[20]
>> [1] "8,12,14"

>> This seems completely bizarre to me. I find it very hard to believe that
>> read_xlsx() would guess content class (type) freely by for each individual
>> entry and only apply the col_types argument after doing that as that would 
>> seem
>> likely to be incredibly inefficient for really big spreadsheets. It seems
>> equally hard to believe that it would then create rounding errors (for some
>> guessed numerics like 9.13 and 1.14 but not for others like 11.4). However, 
>> my
>> guess would appear to fit the results and I am only guessing because I'm sure
>> my programming comprehension isn't good enough to read into the sources to
>> actually work out how the function works.

>> To make things more interesting, and to suggest that at least some of the
>> problem is with Excel is that when I use LibreOffice (in Ubuntu) created a
>> Excel file in the same way, i.e. open the clinical Excel file but in
>> LibreOffice, copy and paste the same column into a new LibreOffice calc
>> spreadsheet and save as xlsx, tmp.xlsx, I get this:

>> > unique(read_xlsx("tmp.xlsx", col_types = "text"))
>> # A tibble: 18 x 1
>> NOWARN
>> <chr>
>> 1 NA
>> 2 14
>> 3 8,12,14
>> 4 13
>> 5 58
>> 6 9
>> 7 9.13
>> 8 11
>> 9 11.14
>> 10 10
>> 11 10.14
>> 12 9.14
>> 13 13.14
>> 14 9 ,13
>> 15 9.11
>> 16 1
>> 17 1.14
>> 18 1, 14

>> Exactly what I think I should be seeing. I was working in Rstudio but get
>> exactly the same in a new R terminal session with only readxl loaded so I 
>> don't
>> think this is any weird environment or other clash.

>> Obviously I can, though not terribly easily for a fully generic fix, catch 
>> these
>> weird rounding errors and correct them, I am sure can also report this as a
>> suspected bug to the maintainer through the github issues system but I wanted
>> to check here whether anyone could see something I'm missing as I'm really a
>> (clinically retired) therapist and doctor, now full time researcher and I'm 
>> not
>> a professional statistician or programmer.

>> TIA,

>> Chris

>> --
>> Chris Evans <ch...@psyctc.org> Visiting Professor, University of Sheffield
>> <chris.ev...@sheffield.ac.uk>
>> I do some consultation work for the University of Roehampton
>> <chris.ev...@roehampton.ac.uk> and other places
>> but <ch...@psyctc.org> remains my main Email address. I have a work web site 
>> at:
>> https://www.psyctc.org/psyctc/
>> and a site I manage for CORE and CORE system trust at:
>> http://www.coresystemtrust.org.uk/
>> I have "semigrated" to France, see:
>> https://www.psyctc.org/pelerinage2016/semigrating-to-france/
>> That page will also take you to my blog which started with earlier joys in
>> France and Spain!

>> If you want to book to talk, I am trying to keep that to Thursdays and my 
>> diary
>> is at:
>> https://www.psyctc.org/pelerinage2016/ceworkdiary/
>> Beware: French time, generally an hour ahead of UK.

>> ______________________________________________
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

-- 
Chris Evans <ch...@psyctc.org> Visiting Professor, University of Sheffield 
<chris.ev...@sheffield.ac.uk>
I do some consultation work for the University of Roehampton 
<chris.ev...@roehampton.ac.uk> and other places
but <ch...@psyctc.org> remains my main Email address.  I have a work web site 
at:
   https://www.psyctc.org/psyctc/
and a site I manage for CORE and CORE system trust at:
   http://www.coresystemtrust.org.uk/
I have "semigrated" to France, see: 
   https://www.psyctc.org/pelerinage2016/semigrating-to-france/ 
That page will also take you to my blog which started with earlier joys in 
France and Spain!

If you want to book to talk, I am trying to keep that to Thursdays and my diary 
is at:
   https://www.psyctc.org/pelerinage2016/ceworkdiary/
Beware: French time, generally an hour ahead of UK.

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] read_xlsx(readxl) apparently mangling some data input

Reply via email to