Re: [Rd] locales and readLines
I think you need to delimit a bit more what you want to do. It is difficult in general to tell what encoding a text file is in, and very much harder if this is a data file containing only a small proportion of non-ASCII text, which might not even be words in a human language (but abbreviations or acronyms).

If you have experience with systems that do try to guess (e.g. Unix 'file') you will know that they are pretty fallible. There are Perl modules available, for example: I checked Encode::Guess which says

  · Because of the algorithm used, ISO-8859 series and other single-byte encodings do not work well unless either one of ISO-8859 is the only one suspect (besides ascii and utf8).

  · Do not mix national standard encodings and the corresponding vendor encodings.

  It is, after all, just a guess. You should always be explicit when it comes to encodings. But there are some, especially Japanese, environment that guess-coding is a must. Use this module with care.

I think you may have missed that the main way to specify an encoding for a file is

  readLines(file("fn", encoding="latin2"))

and not the encoding arg to readLines (although the help page is quite clear that the latter does not re-encode). The latter only allows UTF-8 and latin1.

The author of a package that offers facilities to read non-ASCII text does need to offer the user a way to specify the encoding. I think suggesting that is 'an extra burden' is exceedingly negative: you could rather be thankful that R provides the facilities these days to do so. And if the package or its examples contains non-ASCII character strings, it is de rigueur for the author to consider how it might work on other people's systems.

Notice that source() already has some of the 'smarts' you are asking about if 'file' is a file and not a connection, and you could provide a similar wrapper for readLines. That is useful either when the user can specify a small set of possible encodings or when such a set can be deduced from the locale. If the concern is that the file might be UTF-8 or latin1, this is often a good guess (latin1 files can be valid UTF-8 but rarely are). However, if you have Russian text which might be in one of the several 8-bit encodings, the only way I know to decide which is to see if they make sense (and if they are acronyms, they may in all the possible encodings).

BTW, to guess an encoding you really need to process all the input, so this is not appropriate for general connections, and for large files it might be better to do it external to R, e.g. via Perl etc.

I would say minimal good practice would be to

  - allow the user to specify the encoding of text files.
  - ensure you have specified the encoding of all non-ASCII data in your package (which includes documentation, for example).

I'd leave guessing to others: as http://www.cs.tut.fi/~jkorpela/chars.html says,

  It is hopefully obvious from the preceding discussion that a sequence of octets can be interpreted in a multitude of ways when processed as character data. By looking at the octet sequence only, you cannot even know whether each octet presents one character or just part of a two-octet presentation of a character, or something more complicated. Sometimes one can guess the encoding, but data processing and transfer shouldn't be guesswork.

On Fri, 31 Aug 2007, Martin Morgan wrote:

> R-developers,
>
> I'm looking for some 'best practices', or perhaps an upstream solution (I have a deja vu about this, so sorry if it's already been asked). Problems occur when a file is encoded as latin1, but the user has a UTF-8 locale (or I guess more generally when the input locale does not match R's). Here are two examples from the Bioconductor help list:
>
> https://stat.ethz.ch/pipermail/bioconductor/2007-August/018947.html
>
> (the relevant command is library(GEOquery); gse <- getGEO('GSE94'))
>
> https://stat.ethz.ch/pipermail/bioconductor/2007-July/018204.html
>
> I think solutions are:
>
> * Specify the encoding in readLines.
>
> * Convert the input using iconv.
>
> * Tell the user to set their locale to match the input file (!)
>
> Unfortunately, these (1 & 2, anyway) place extra burden on the package author, to become educated about locales, the encoding conventions of the files they read, and to know how R deals with encodings.
>
> Are there other / better solutions? Any chance for some (additional) 'smarts' when reading files?
>
> Martin

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
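The "UTF-8 or latin1" wrapper idea from the reply above might be sketched as follows. This is only an illustration under stated assumptions: the helper name read_lines_utf8 and its arguments are hypothetical, it relies on validUTF8() (available in R 3.3.0 and later, so not in the R of this thread), and it converts with iconv() after reading rather than through file(..., encoding=), which remains the route recommended in the message. An explicit encoding supplied by the user always takes precedence over the guess.

```r
# Hypothetical helper (not part of base R): read lines and return them as UTF-8.
read_lines_utf8 <- function(path, encoding = NULL, fallback = "latin1") {
  x <- readLines(path, warn = FALSE)   # bytes are taken as-is, nothing is re-encoded
  if (!is.null(encoding)) {
    # The caller declared the encoding, so no guessing is done.
    return(iconv(x, from = encoding, to = "UTF-8"))
  }
  if (all(validUTF8(x))) {
    Encoding(x) <- "UTF-8"             # only a declaration; the bytes are unchanged
    x
  } else {
    # Not valid UTF-8: assume the fallback encoding (an assumption, not a detection).
    iconv(x, from = fallback, to = "UTF-8")
  }
}

# Demo: the same word written once as latin1 bytes and once as UTF-8 bytes.
f_latin1 <- tempfile()
f_utf8   <- tempfile()
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x0a)), f_latin1)      # e-acute as 0xE9
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9, 0x0a)), f_utf8)  # e-acute as 0xC3 0xA9

a <- read_lines_utf8(f_latin1)
b <- read_lines_utf8(f_utf8)
identical(charToRaw(a), charToRaw(b))
```

As the message warns, this only works because the candidate set is tiny; with several plausible 8-bit encodings (the Russian example) every candidate converts without error and the bytes alone cannot decide.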
[Rd] Typo in regex help page
Hi! I believe there is a typo in R/src/library/base/man/regex.Rd The 52nd line looks like: The metacharacters are in EREs are ... ^^^ Gregor
[Rd] read.spss converts string variables with value labels to <NA> (PR#9896)
Full_Name: Jan Hucin
Version: 2.5.1 (foreign 0.8-20)
OS: WinXP
Submission from: (NULL) (195.113.83.7)

When reading an SPSS file:
- containing some variable of type String
- with value labels on that variable
- and with a declaration of which values of that variable are considered to be missing,
I always get <NA> where digits were in the original SPSS file.

Example: Let's have in an SPSS file "some.sav" the variable A. The type of the variable is String of length 1. Let's have a value labeling: 1 = Yes, 2 = No, 8 = Invalid, 9 = Missing. Let's declare that value 9 is considered to be missing.

When this file is read by abc=read.spss("some.sav",use.value.labels=TRUE), we get <NA> in abc$A in the places where "1", "2" etc. were. Surprisingly, we get "N/A" (not <NA>!) in the place where the string "N/A" is.

If we specify use.value.labels=FALSE, then we get string values (such as "1", "2") but we lose value labels (Yes, No etc.).

Let me add that if the variable in the original SPSS file was of type Numeric (not String), there would be no problem.
[Rd] When 1+2 != 3 (PR#9895)
Full_Name: Marco Vicentini, University of Verona
Version: 2.4.1 & 2.5.1
OS: OsX & WinXP
Submission from: (NULL) (157.27.253.46)

When I test the equation 1 + 2 == 3, I obviously obtain the value TRUE. But when I tried the same with real numbers (i.e. 0.1 + 0.2 == 0.3), I obtained an unexpected FALSE. The online help offers some tricks for this problem. It suggests using identical(...), which again answers FALSE. Only with isTRUE(all.equal(0.3, 0.1 + 0.2)) can I obtain the value TRUE.

But the problem does not concern only the operator ==. Many other functions, among others sort, order, unique, duplicated and identical, are not able to deal with this problem. This is very dangerous because the online help gives no warning, and anybody can use these functions without expecting unusual results.

I think that the problem is due to how double numbers are stored by the C compiler.

In case it is useful, I have written two small functions (Unique and isEqual) which can deal with this problem of double numbers.

I also add some other conditions that show the same problem.
0.3 == 0.15 + 0.15
0.3 == 0.1 + 0.2
1 - 0.7 == 0.3
0.1 == 1 - 0.9

0.2 == 1 - 0.2 - 0.2 - 0.2 - 0.2
-0.2 == 1 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2

identical(0.3, 0.1 + 0.2)
all.equal(0.3, 0.1 + 0.2)

identical(-0.2, 1 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2)
all.equal(-0.2, 1 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2)

isTRUE(all.equal(-0.2, 1 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2))

-0.2 == 1 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2

a <- -0.2
b <- 1 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2

x <- c(a, b)
sprintf("%.15f", x)
sprintf("%.50f", x)

# Unique values of a numeric vector, treating nearly equal doubles as equal.
Unique <- function(x, digits = 8, fast = TRUE) {
  if (fast) {
    # Round to 'digits' decimal places before de-duplicating.
    unique(round(x * 10^digits)) / 10^digits
  } else {
    x <- sort(x)
    # seq_len() avoids the 1:0 trap when x has fewer than two elements.
    for (i in seq_len(max(length(x) - 1, 0)))
      if (isTRUE(all.equal(x[i], x[i + 1]))) x[i] <- NaN
    x[!is.nan(x)]
  }
}

# Element-wise test of whether 'object' equals the scalar 'x' within 'tol'.
isEqual <- function(object, x, tol = 1e-9) {
  if (!is.vector(object)) stop("Object must be a vector")
  if (is.character(object)) stop("Object can not be a character")
  # The original used is.real(), which no longer exists in current R.
  if (!is.double(x)) stop("x must be a real number")
  if (any(is.na(c(object, x)))) stop("NA is not supported")
  if (length(x) != 1) stop("length x must equal to 1")

  abs(object - x) < tol
  # .Call("isEqual",as.real(object),as.real(x),as.real(tol), PACKAGE="mvUtils")
}
Re: [Rd] When 1+2 != 3 (PR#9895)
On Mon, Sep 03, 2007 at 08:59:22AM +0200, [EMAIL PROTECTED] wrote:
> Full_Name: Marco Vicentini, University of Verona
> Version: 2.4.1 & 2.5.1
> OS: OsX & WinXP
> Submission from: (NULL) (157.27.253.46)
>
> When I test the equation 1 + 2 == 3, I obviously obtain the value TRUE.
> But when I tried the same with real numbers (i.e. 0.1 + 0.2 == 0.3), I
> obtained an unexpected FALSE. The online help offers some tricks for
> this problem. It suggests using identical(...), which again answers
> FALSE. Only with isTRUE(all.equal(0.3, 0.1 + 0.2)) can I obtain the
> value TRUE.

A rational number has a finite binary expansion iff its denominator (in lowest terms) is a power of 2. Numbers 0.1 and 0.2 are 1/10 and 1/5, so they have 5 in their denominator. Their binary expansion is

  0.1 = .0001100110011001100110011001100110...
  0.2 = .0011001100110011001100110011001100...

A double variable stores the numbers rounded to 53 significant binary digits. Hence, they are not exactly 0.1 and 0.2, as may be seen in

  formatC(0.1,digits=30) # [1] "0.100000000000000005551115123126"
  formatC(0.2,digits=30) # [1] "0.200000000000000011102230246252"

In order to compare numbers with some tolerance, the function all.equal may be used, which you also mention. See its help page, which specifies the tolerance to be .Machine$double.eps ^ 0.5.

> [...]
> I think that the problem is due to how double numbers are stored by the
> C compiler.

Not C compiler, but the hardware.

Petr Savicky.

> [...]
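The points in this reply can be checked directly in R. The following is a minimal sketch, not part of the original message; the helper near() is a hypothetical name introduced only for illustration, and its default tolerance simply mirrors the .Machine$double.eps ^ 0.5 value mentioned above.

```r
# The stored doubles are close to, but not exactly, the decimal values.
sprintf("%.20f", c(0.1, 0.2, 0.3, 0.1 + 0.2))

# Exact comparison fails, comparison with a tolerance succeeds.
exact  <- (0.1 + 0.2) == 0.3
approx <- isTRUE(all.equal(0.1 + 0.2, 0.3))

# Tolerance quoted from the all.equal() help page (about 1.5e-8).
tol <- sqrt(.Machine$double.eps)

# Hypothetical vectorised helper (not in base R): TRUE where |a - b| <= tol.
near <- function(a, b, tol = sqrt(.Machine$double.eps)) abs(a - b) <= tol
near(c(0.3, 1 - 0.7, 0.1), c(0.1 + 0.2, 0.3, 1 - 0.9))
```

An absolute tolerance like the one in near() is only reasonable for values of moderate size; all.equal() itself compares relative differences.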
Re: [Rd] When 1+2 != 3 (PR#9895)
On 03/09/2007 2:59 AM, [EMAIL PROTECTED] wrote:
> [...]
> But the problem does not concern only the operator ==. Many other
> functions, among others sort, order, unique, duplicated and identical,
> are not able to deal with this problem. This is very dangerous because
> the online help gives no warning, and anybody can use these functions
> without expecting unusual results.

FAQ 7.31 gives general help on this. Repeating it in every instance where it affects computations wouldn't make sense.

Please don't report unavoidable problems as bugs.

Duncan Murdoch

> [...]
[Rd] buglet?? in nlme:::corRatio documentation
[hoping to redeem myself for my last spurious bug report]

From ?corRatio:

  Letting d denote the range and n denote the nugget effect, the correlation between two observations a distance r apart is (r/d)^2/(1+(r/d)^2) when no nugget effect is present and (1-n)*(r/d)^2/(1+(r/d)^2) when a nugget effect is assumed.

This disagrees with the C code (corStruct.c)

  /* Rational class */
  static double ratio_corr(double val)
  {
      double val2 = val * val;
      return(1/(1+val2));
  }

and with common sense (correlation structures should start from 1 and reach zero for large distances; the structure listed in the documentation starts at 0 and goes to 1 [or (1-n)] for large distances) -- if you don't want to think about it, use R instead:

  curve(x^2/(1+x^2),from=0,to=5)
  curve(1/(1+x^2),add=TRUE,col=2,from=0)

What's odd, and makes me really nervous, is that the expression found in the documentation is also that found in Pinheiro and Bates 2000 (Table 5.2, p. 232). It's not listed in the errata for the first printing http://cm.bell-labs.com/cm/ms/departments/sia/project/nlme/MEMSS/Errata ; I have the second printing.

(I haven't dug out my geostats books to check this, but found at least one paper that cites the "correct" 1/(1+(r/d)^2) formula -- see below.)

cheers
  Ben Bolker

@ARTICLE{Ekstrom+2005,
  author = {Ekstr{\o}m, Claus T. and Bak, S{\o}ren and Rudemo, Mats},
  title = {Pixel-level Signal Modelling with Spatial Correlation for Two-Colour Microarrays},
  journal = {Statistical Applications in Genetics and Molecular Biology},
  year = {2005},
  volume = {4},
  number = {1},
  timestamp = {2007.09.03},
  url = {http://www.bepress.com/sagmb/vol4/iss1/art6}
}
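For readers who prefer numbers to curves, the two expressions can be tabulated side by side. This is only a sketch: the function names corr_doc and corr_c are hypothetical, and it assumes that the val passed to ratio_corr() is the scaled distance r/d, which the quoted C fragment does not itself show.

```r
# Expression printed in ?corRatio (no nugget): starts at 0, tends to 1.
corr_doc <- function(r, d = 1) (r / d)^2 / (1 + (r / d)^2)

# Expression computed by ratio_corr() in corStruct.c, assuming val = r/d.
corr_c <- function(r, d = 1) 1 / (1 + (r / d)^2)

r <- c(0, 0.5, 1, 2, 5, 100)
tab <- cbind(r = r, documented = corr_doc(r), c_code = corr_c(r))
tab

# The two expressions are complementary: they sum to 1 at every distance.
complementary <- isTRUE(all.equal(corr_doc(r) + corr_c(r), rep(1, length(r))))
complementary
```

Because the two columns always sum to 1, the documented expression behaves like one minus the correlation that the C code returns, which is consistent with the discrepancy described in the message.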
Re: [Rd] When 1+2 != 3 (PR#9895)
On 9/2/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> [...]
> If it may be useful, I have written two small functions (Unique and
> isEqual) which can deal with this problem of the double numbers.

Quiz: What about utility functions equalsE() and equalsPi()? ...together with examples illustrating when they return TRUE and when they return FALSE.

Cheers

/Henrik

> [...]
Re: [Rd] When 1+2 != 3 (PR#9895)
On 03-Sep-07 15:12:06, Henrik Bengtsson wrote:
> On 9/2/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>> [...]
>> If it may be useful, I have written two small functions
>> (Unique and isEqual)
>> which can deal with this problem of the double numbers.
>
> Quiz: What about utility functions equalsE() and equalsPi()?
> ...together with examples illustrating when they return TRUE and when
> they return FALSE.
>
> Cheers
>
> /Henrik

Well, if you guys want a Quiz: ... My favourite example of something which will probably never work on R (or any machine which implements fixed-length binary real arithmetic).

An iterated function scheme on [0,1] is defined by

  if 0 <= x <= 0.5 then next x = 2*x

  if 0.5 < x <= 1 then next x = 2*(1 - x)

in R:

  nextX <- function(x){ifelse(x<=0.5, 2*x, 2*(1-x))}

and try, e.g.,

  x<-3/7; for(i in (1:60)){x<-nextX(x); print(c(i,x))}

x = 0 is an absorbing state.
x = 1 -> x = 0
x = 1/2 -> 1 -> 0
...
(these work in R)

If K is an odd integer, and 0 < r < K, then

  x = r/K -> ... leads into a periodic set.

E.g. (see above) 3/7 -> 6/7 -> 2/7 -> 4/7 -> 6/7

All other numbers x outside these sets generate non-periodic sequences.

Apart from the case where initial x = 1/2^k, none of the above is true in R (e.g. the example above).

So can you devise an "isEqual" function which will make this work?

It's only Monday .. plenty of time!

Best wishes,
Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 03-Sep-07 Time: 17:32:38
-- XFMail --
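Why the 3/7 orbit cannot stay periodic in double arithmetic can be seen by recording the orbit instead of printing it. The sketch below reuses nextX from the message; the variable names path and first_zero are introduced here for illustration. The reasoning it relies on is an assumption about IEEE 754 doubles: every double is a fraction with a power-of-2 denominator, and both 2*x and 2*(1-x) are computed without rounding on [0,1], so each step shortens the binary expansion until the orbit lands on 1/2, then 1, then 0.

```r
nextX <- function(x) ifelse(x <= 0.5, 2 * x, 2 * (1 - x))

x <- 3/7
path <- numeric(60)
for (i in 1:60) {
  x <- nextX(x)
  path[i] <- x
}

# The first steps track 6/7, 2/7, 4/7, 6/7 closely ...
head(path, 4)

# ... but the orbit is eventually absorbed at 0 instead of cycling.
first_zero <- which(path == 0)[1]
first_zero
tail(path, 3)
```

No choice of tolerance in an "isEqual" function repairs this, because the values themselves have drifted away from the cycle by the time they reach 0.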
Re: [Rd] locales and readLines
Thank you very much for explaining this. I had indeed overlooked the use of encoding in 'file'. I also appreciate how unsatisfactory guessing at the encoding can be, and that scanning the entire file is not appropriate for large files or general connections.

Sorry that 'burden' came across as negative, more along the lines of 'burden of responsibility for handling the inputs the package developer implies they'll handle'. Much better than the burden of saying 'sorry, no can do'.

Thanks again,

Martin

Prof Brian Ripley <[EMAIL PROTECTED]> writes:

> [...]
Re: [Rd] When 1+2 != 3 (PR#9895)
Not sure if this counts but using the Ryacas package

> library(Ryacas)
> x <- Sym("x")
> Set(x, Sym(3)/7)
expression(3/7)
> cat(i, "0: "); print(x)
10 0: expression(3/7)
> for(i in 1:10) {
+ yacas("Set(x, If(x <= 1/2, 2*x, 2*(1-x)))")
+ cat(i, "i: "); print(x)
+ }
1 i: expression(6/7)
2 i: expression(2/7)
3 i: expression(4/7)
4 i: expression(6/7)
5 i: expression(2/7)
6 i: expression(4/7)
7 i: expression(6/7)
8 i: expression(2/7)
9 i: expression(4/7)
10 i: expression(6/7)

On 9/3/07, Ted Harding <[EMAIL PROTECTED]> wrote:
> [...]
Re: [Rd] When 1+2 != 3 (PR#9895)
On 03-Sep-07 19:25:58, Gabor Grothendieck wrote:
> Not sure if this counts but using the Ryacas package

Gabor, I'm afraid it doesn't count! (Though I didn't exclude it explicitly).

I'm not interested in the behaviour of the sequence with denominator = 7 particularly. The system is in fact an example of simulating chaotic systems on a computer.

For instance, one of the classic illustrations is

  next x = 2*x*(1-x)

for any real x. The question is, how does a finite-length binary representation behave?

Petr Savicky [privately] sent me a similar example:

  Starting with r/K:

    nextr <- function(r){ifelse(r<=K/2, 2*r, 2*(K-r))}

  "For K = 7 and r = 3, this yields r = 3, 6, 2, 4, 6, ... Dividing this by K=7, one gets the correct period with approximately correct numbers."

Best wishes,
Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 03-Sep-07 Time: 21:02:27
-- XFMail --
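Petr Savicky's variant can be run as quoted once K is defined. The sketch below is a small illustration; the variable name orbit is introduced here, and the choice of six iterations is arbitrary. It works because the numerators are small integers, which doubles represent exactly, so no rounding ever enters the iteration.

```r
K <- 7
nextr <- function(r) ifelse(r <= K/2, 2 * r, 2 * (K - r))

r <- 3
orbit <- r
for (i in 1:6) {
  r <- nextr(r)
  orbit <- c(orbit, r)
}

orbit       # exact numerators of the orbit
orbit / K   # approximate values of x = r/K, periodic as expected
```

The division by K is done only for display, so its rounding error is never fed back into the next step.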
Re: [Rd] Consistency of serialize(): please enlighten me
I have a couple of ideas - serialize() can store references (and some simple assignments are just stored as references until one tries to modify part of the copy, i.e. in a copy-on-write manner); occasionally, it will also store, as an attribute of the class name, the name of the package in which the class was defined. Maybe neither of these is the case, but what does a hexdump tell you? (just printing the result of rawToChar() to the console).

Henrik Bengtsson wrote:
> Forgot...
>
> On 8/31/07, Henrik Bengtsson <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I am puzzled with serialize(). It comes down to generating identical
>> hash codes for (apparently) identical objects using digest::digest(),
>> which in turn relies on serialize(). Here is an example illustrating
>> the issue:
>>
>> ser <- function(object, ...) {
>>   list(
>>     names = names(object),
>>     namesRaw = charToRaw(names(object)),
>>     ser = serialize(names(object), connection=NULL, ascii=FALSE)
>>   )
>> } # ser()
>>
>> # Object to be serialized
>> key <- key0 <- list(abc="Hello");
>>
>> # Store results
>> d <- list();
>>
>> # 1. As is
>> d[[1]] <- ser(key);
>>
>> # 2. Set names and redo (hardwired: identical to what's already there)
>> names(key) <- "abc";
>> d[[2]] <- ser(key);
>>
>> # 3. Set names and redo (generic: char->raw->char)
>> key <- key0;
>> names(key) <- sapply(names(key), FUN=function(name)
>>   rawToChar(charToRaw(name)));
>> d[[3]] <- ser(key);
>>
>> # All names are identical
>> for (kk in 2:length(d))
>>   stopifnot(identical(d[[1]]$names, d[[kk]]$names));
>>
>> # All raw names are identical
>> for (kk in 2:length(d))
>>   stopifnot(identical(d[[1]]$namesRaw, d[[kk]]$namesRaw));
>>
>> # But, the serialized names differ.
>> print(identical(d[[1]]$ser, d[[2]]$ser));
>> print(identical(d[[1]]$ser, d[[3]]$ser));
>> print(identical(d[[2]]$ser, d[[3]]$ser));
>
> With R version 2.6.0 Under development (unstable) (2007-08-23 r42614) I get:
> [1] TRUE
> [1] FALSE
> [1] FALSE
>
> and with R version 2.5.1 Patched (2007-07-19 r42284):
> [1] FALSE
> [1] FALSE
> [1] TRUE
>
>> So, it seems like there is some extra information in the names
>> attribute that is part of the serialization. Is it possible to show
>> they differ at the R level? What is that extra information?
>> Promises...?
>>
>> Please enlighten me.
>>
>> Henrik
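One way to act on the hexdump suggestion without leaving R is to compare two serializations byte by byte. The sketch below is only an illustration: the helper diff_bytes is a hypothetical name, and it is demonstrated on two deliberately different strings rather than on the objects from the example above, since what those differ in depends on the R version. Printing the raw vector, or using as.character() on it, gives the hex codes directly; rawToChar() on a binary serialization is likely to fail because of embedded nul bytes.

```r
# Hypothetical helper: positions and hex values at which two raw vectors differ.
diff_bytes <- function(a, b) {
  n <- min(length(a), length(b))
  pos <- which(a[seq_len(n)] != b[seq_len(n)])
  data.frame(pos = pos,
             a = as.character(a[pos]),
             b = as.character(b[pos]),
             stringsAsFactors = FALSE)
}

s1 <- serialize("abc", connection = NULL)
s2 <- serialize("abd", connection = NULL)

length(s1) == length(s2)
d <- diff_bytes(s1, s2)
d
```

Applied to d[[1]]$ser and d[[3]]$ser from the quoted example, the same helper would show where in the byte stream the two serializations part ways, which narrows down whether the difference sits in the string data or in the header and flag bytes around it.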