This list can be priceless (and has taught me so much over, hm, over a decade certainly now!)
Thanks both: makes perfect sense (of course) and shows my naivety in the way I was thinking about this. I'm intrigued that it's LibreOffice actually using lower precision that avoids the issue that was puzzling me, again, makes perfect sense. Further, Peter's ix <- !(is.na(xn)) & xn%%1 != 0 is delicious and exactly the sort of thing I don't see unaided. I know I would have done something horribly more clumsy. It's also the sort of little revelation about the potential power of %% that I think I _will_ remember and no doubt find myself using again in the future. Thanks both, huge help to me and, as I suspected, a wasteful github issue report prevented! Chris ----- Original Message ----- > From: "PIKAL Petr" <petr.pi...@precheza.cz> > To: "Chris Evans" <chrish...@psyctc.org>, "R-help Mailing List" > <r-help@r-project.org> > Sent: Tuesday, 4 February, 2020 13:39:31 > Subject: RE: read_xlsx(readxl) apparently mangling some data input > Hi > > Floating point representation > > I prepared excel file with arbitrary first row and second row > > 45.65 and 45.65/5 > > The division result should be 9.13 (exactly), but based on floation point > representation in binary computers (FAQ 7.31) it results in 9.129999999... > However Excel shows exact value (9.13) although internally it stores this > 9.129999.. Probably they do not want to disturb its audience. > > Therefore read_xlsx reads it correctly > >> temp <- read_xlsx(file.choose()) >> temp > # A tibble: 2 x 2 > a1 a2 > <dbl> <chr> > 1 12 8,8,10 > 2 45.6 9.129999999999999 >> as.data.frame(temp) > a1 a2 > 1 12.00 8,8,10 > 2 45.65 9.129999999999999 > > Cheers > Petr > >> -----Original Message----- >> From: R-help <r-help-boun...@r-project.org> On Behalf Of Chris Evans >> Sent: Tuesday, February 4, 2020 1:07 PM >> To: R-help Mailing List <r-help@r-project.org> >> Subject: [R] read_xlsx(readxl) apparently mangling some data input >> >> This is a very odd error I'm hitting using read_xlsx from the readxl > package >> (version 1.3.1) with R version 3.6.2 (2019-12-12) , platform > x86_64-pc-linux- >> gnu (and updated Ubuntu 18.04). I have some largeish Excel spreadsheets >> that contain clinical data. I can't share the entire raw data but I think > I can >> share the specific problem columns as Excel files, but not via the list as > I'm >> sure it rightly rejects such attachments. >> >> The particular column contains entries like >> 1 >> 1, 14 >> >> 1.14 >> >> That's to say it's a column that can have empty cells, or entries which > should >> be integers (a limited range of them) but cells may have multiple integers >> and the data entry means that people use various separators, commas, full >> stops and occasionally semi-colons or colons and all with or without > various >> amounts of space. >> >> I thought this would be easy to handle but this illustrates the issue I'm >> hitting: >> >> > unique(read_xlsx("Book1.xlsx", col_types = "text")) >> # A tibble: 18 x 1 >> NOWARN >> <chr> >> 1 NA >> 2 14 >> 3 8,12,14 >> 4 13 >> 5 58 >> 6 9 >> 7 9.1300000000000008 >> 8 11 >> 9 11.14 >> 10 10 >> 11 10.14 >> 12 9.14 >> 13 13.14 >> 14 9 ,13 >> 15 9.11 >> 16 1 >> 17 1.1399999999999999 >> 18 1, 14 >> >> That's reading from a single column, 981 row (including column header) >> Excel xlsx file in an up to date Windoze 10 Professional running in a VM > on >> the Ubuntu machine. >> >> I created that file (which I can share) by copying the data from the full > file to >> a new Excel spreadsheet (M$ Orifice "Professional Plus 2019" "Version > 1912" >> "Build 12325.20344 Click-to-run" to an empty new Excel file and using the >> default save_as. The clinical data files were created in, and updated in, >> versions of Excel that I can't access but the file was certainly created > first >> between two years and three months before now so probably with different >> versions of Excel and probably in a Spanish or Catalan M$ locale. >> >> The weird thing is that looking at the Excel cells that created those >> "9.1300000000000008" and "1.1399999999999999" entries they show "9.13" >> and "1.14" (respectively!). They continue to show those values plus many >> trailing zeroes if I use Excel formatting to ask for 20 decimal places (I > get less >> of course, but no arbitrary terminal rounding digit). >> >> It appears to me that read_xlxs() is only applying the "col_types = > "text"" >> argument _after_ reading the column freely, reading each cell guessing the >> type by its contents and so ending up with numeric values for "9.13" and >> "1.14" which are then picking up rounding errors and being forced to >> character after that. I say that the reading would appear to be free > across all >> cells in the column as there are entries of "8, 12, 14" coming before > these >> problem entries: >> >> > tmp <- read_xlsx("Book1.xlsx", col_types = "text") >> > grep("1.1399999999999999", tmp$NOWARN, fixed = TRUE) >> [1] 932 948 954 >> > grep("9.1300000000000008", tmp$NOWARN, fixed = TRUE) >> [1] 73 189 190 271 272 390 511 645 686 710 744 830 899 >> > tmp$NOWARN[20] >> [1] "8,12,14" >> >> This seems completely bizarre to me. I find it very hard to believe that >> read_xlsx() would guess content class (type) freely by for each individual >> entry and only apply the col_types argument after doing that as that would >> seem likely to be incredibly inefficient for really big spreadsheets. It > seems >> equally hard to believe that it would then create rounding errors (for > some >> guessed numerics like 9.13 and 1.14 but not for others like 11.4). > However, >> my guess would appear to fit the results and I am only guessing because > I'm >> sure my programming comprehension isn't good enough to read into the >> sources to actually work out how the function works. >> >> To make things more interesting, and to suggest that at least some of the >> problem is with Excel is that when I use LibreOffice (in Ubuntu) created a >> Excel file in the same way, i.e. open the clinical Excel file but in > LibreOffice, >> copy and paste the same column into a new LibreOffice calc spreadsheet >> and save as xlsx, tmp.xlsx, I get this: >> >> > unique(read_xlsx("tmp.xlsx", col_types = "text")) >> # A tibble: 18 x 1 >> NOWARN >> <chr> >> 1 NA >> 2 14 >> 3 8,12,14 >> 4 13 >> 5 58 >> 6 9 >> 7 9.13 >> 8 11 >> 9 11.14 >> 10 10 >> 11 10.14 >> 12 9.14 >> 13 13.14 >> 14 9 ,13 >> 15 9.11 >> 16 1 >> 17 1.14 >> 18 1, 14 >> >> Exactly what I think I should be seeing. I was working in Rstudio but get >> exactly the same in a new R terminal session with only readxl loaded so I >> don't think this is any weird environment or other clash. >> >> Obviously I can, though not terribly easily for a fully generic fix, catch > these >> weird rounding errors and correct them, I am sure can also report this as > a >> suspected bug to the maintainer through the github issues system but I >> wanted to check here whether anyone could see something I'm missing as >> I'm really a (clinically retired) therapist and doctor, now full time > researcher >> and I'm not a professional statistician or programmer. >> >> TIA, >> >> Chris >> >> >> >> -- >> Chris Evans <ch...@psyctc.org> Visiting Professor, University of Sheffield >> <chris.ev...@sheffield.ac.uk> I do some consultation work for the >> University of Roehampton <chris.ev...@roehampton.ac.uk> and other >> places but <ch...@psyctc.org> remains my main Email address. I have a >> work web site at: >> https://www.psyctc.org/psyctc/ >> and a site I manage for CORE and CORE system trust at: >> http://www.coresystemtrust.org.uk/ >> I have "semigrated" to France, see: >> https://www.psyctc.org/pelerinage2016/semigrating-to-france/ >> That page will also take you to my blog which started with earlier joys in >> France and Spain! >> >> If you want to book to talk, I am trying to keep that to Thursdays and my >> diary is at: >> https://www.psyctc.org/pelerinage2016/ceworkdiary/ >> Beware: French time, generally an hour ahead of UK. >> >> ______________________________________________ >> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide http://www.R-project.org/posting- >> guide.html > > and provide commented, minimal, self-contained, reproducible code. -- Chris Evans <ch...@psyctc.org> Visiting Professor, University of Sheffield <chris.ev...@sheffield.ac.uk> I do some consultation work for the University of Roehampton <chris.ev...@roehampton.ac.uk> and other places but <ch...@psyctc.org> remains my main Email address. I have a work web site at: https://www.psyctc.org/psyctc/ and a site I manage for CORE and CORE system trust at: http://www.coresystemtrust.org.uk/ I have "semigrated" to France, see: https://www.psyctc.org/pelerinage2016/semigrating-to-france/ That page will also take you to my blog which started with earlier joys in France and Spain! If you want to book to talk, I am trying to keep that to Thursdays and my diary is at: https://www.psyctc.org/pelerinage2016/ceworkdiary/ Beware: French time, generally an hour ahead of UK. ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.