date:20070903

Re: [Rd] locales and readLines

2007-09-03 Thread Prof Brian Ripley

I think you need to delimit a bit more what you want to do.  It is 
difficult in general to tell what encoding a text file is in, and very 
much harder if this is a data file containing only a small proportion of 
non-ASCII text, which might not even be words in a human language (but 
abbreviations or acronyms).


If you have experience with systems that do try to guess (e.g. Unix 
'file') you will know that they are pretty fallible.  There are Perl 
modules available, for example: I checked Encode::Guess which says


   ·   Because of the algorithm used, ISO-8859 series and other single-
   byte encodings do not work well unless either one of ISO-8859 is
   the only one suspect (besides ascii and utf8).

   ·   Do not mix national standard encodings and the corresponding vendor
   encodings.

   It is, after all, just a guess.  You should alway be explicit when it
   comes to encodings.  But there are some, especially Japanese, environ-
   ment that guess-coding is a must.  Use this module with care.


I think you may have missed that the main way to specify an encoding for a 
file is


readLines(file("fn", encoding="latin2"))

and not the encoding arg to readLines (although the help page is quite 
clear that the latter does not re-encode).  The latter only allows UTF-8 
and latin1.
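
A minimal sketch of the two routes, assuming a throwaway temporary file that holds the latin1 bytes for "café" (the file, its contents and the variable names are illustrative and do not come from this thread). The first route declares the encoding on the connection; the second reads the bytes unchanged and converts them afterwards with iconv().

```r
# Illustrative input: the five bytes of "café\n" encoded as latin1.
tmp <- tempfile()
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x0a)), tmp)

# Route 1: declare the encoding on the connection, so the input is
# re-encoded to the native encoding while it is being read.
con <- file(tmp, encoding = "latin1")
viaConnection <- readLines(con)
close(con)

# Route 2: read the bytes unchanged, then convert explicitly.
bytesAsIs <- readLines(tmp, warn = FALSE)
viaIconv <- iconv(bytesAsIs, from = "latin1", to = "UTF-8")

print(viaConnection)
print(viaIconv)
unlink(tmp)
```

What route 1 prints depends on the session's locale, because the target is the native encoding; route 2 always yields UTF-8.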


The author of a package that offers facilities to read non-ASCII text does 
need to offer the user a way to specify the encoding.  I think suggesting 
that is 'an extra burden' is exceedingly negative: you could rather be 
thankful that R provides the facilities these days to do so.  And if the 
package or its examples contain non-ASCII character strings, it is de 
rigueur for the author to consider how it might work on other people's 
systems.


Notice that source() already has some of the 'smarts' you are asking about 
if 'file' is a file and not a connection, and you could provide a similar 
wrapper for readLines.  That is useful either when the user can specify a 
small set of possible encodings or when such a set can be deduced from the 
locale.  If the concern is that the file might be UTF-8 or latin1, guessing 
often works well (latin1 files can be valid UTF-8 but rarely are). 
However, if you have Russian text which might be in one of the several 
8-bit encodings, the only way I know to decide which is to see if they 
make sense (and if they are acronyms, they may in all the possible 
encodings).
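
One way such a wrapper could look, as a rough sketch only: it tries a short, caller-supplied list of candidate encodings in order and keeps the first one in which every line converts cleanly. The name readLinesTry and its argument are hypothetical, and this is not a copy of what source() does internally. As noted above, it cannot separate several 8-bit encodings from one another, since any byte sequence is valid in most of them.

```r
# Hypothetical helper: try candidate encodings in order, return UTF-8 text.
readLinesTry <- function(fn, encodings = c("UTF-8", "latin1")) {
  lines <- readLines(fn, warn = FALSE)        # bytes are read unchanged
  for (enc in encodings) {
    converted <- iconv(lines, from = enc, to = "UTF-8")
    # iconv() gives NA for input that cannot be converted from 'enc'
    if (!anyNA(converted) && all(validUTF8(converted)))
      return(converted)
  }
  stop("input is not valid in any of: ", paste(encodings, collapse = ", "))
}

tmp <- tempfile()
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x0a)), tmp)        # "café" as latin1
fromLatin1 <- readLinesTry(tmp)
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9, 0x0a)), tmp)  # "café" as UTF-8
fromUtf8 <- readLinesTry(tmp)
unlink(tmp)

print(identical(fromLatin1, fromUtf8))
```

Putting UTF-8 first matters: latin1 accepts every byte, so it has to be the last resort.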


BTW, to guess an encoding you really need to process all the input, so 
this is not appropriate for general connections, and for large files it 
might be better to do it external to R, e.g. via Perl etc.


I would say minimal good practice would be to

- allow the user to specify the encoding of text files.
- ensure you have specified the encoding of all non-ASCII data in your
  package (which includes documentation, for example).

I'd leave guessing to others: as
http://www.cs.tut.fi/~jkorpela/chars.html says,

  It is hopefully obvious from the preceding discussion that a sequence of
  octets can be interpreted in a multitude of ways when processed as
  character data. By looking at the octet sequence only, you cannot even
  know whether each octet presents one character or just part of a
  two-octet presentation of a character, or something more complicated.
  Sometimes one can guess the encoding, but data processing and transfer
  shouldn't be guesswork.



On Fri, 31 Aug 2007, Martin Morgan wrote:


R-developers,

I'm looking for some 'best practices', or perhaps an upstream solution
(I have a deja vu about this, so sorry if it's already been asked).
Problems occur when a file is encoded as latin1, but the user has a
UTF-8 locale (or I guess more generally when the input locale does not
match R's).  Here are two examples from the Bioconductor help list:

https://stat.ethz.ch/pipermail/bioconductor/2007-August/018947.html

(the relevant command is library(GEOquery); gse <- getGEO('GSE94'))

https://stat.ethz.ch/pipermail/bioconductor/2007-July/018204.html

I think solutions are:

* Specify the encoding in readLines.

* Convert the input using iconv.

* Tell the user to set their locale to match the input file (!)

Unfortunately, these (1 & 2, anyway) place extra burden on the package
author, to become educated about locales, the encoding conventions of
the files they read, and to know how R deals with encodings.

Are there other / better solutions? Any chance for some (additional)
'smarts' when reading files?

Martin



--
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK  Fax:  +44 1865 272595
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

[Rd] Typo in regex help page

2007-09-03 Thread Gregor Gorjanc

Hi!

I believe there is a typo in

R/src/library/base/man/regex.Rd

The 52nd line looks like:

The metacharacters are in EREs are ...
                   ^^^

Gregor


[Rd] read.spss converts string variables with value labels to <NA> (PR#9896)

2007-09-03 Thread honza

Full_Name: Jan Hucin
Version: 2.5.1 (foreign 0.8-20)
OS: WinXP
Submission from: (NULL) (195.113.83.7)


When reading an SPSS file:

- containing some variable of type String
- with value labels at that variable
- and with determination which values of that variable are considered to be
missing,

I always get <NA> where digits were in the original SPSS file.

Example:
Suppose the SPSS file "some.sav" contains the variable A, of type String
with length 1.
Suppose the value labels are: 1 = Yes, 2 = No, 8 = Invalid, 9 = Missing.
Suppose value 9 is declared to be missing.
When this file is read by abc=read.spss("some.sav",use.value.labels=TRUE), we
get <NA> in abc$A in the places where "1", "2" etc. were. Surprisingly, we get "N/A"
(not <NA>!) in the place where the string "N/A" is.

If we specify use.value.labels=FALSE, then we get string values (such as "1",
"2") but we lose value labels (Yes, No etc.).

Let me add that if the variable in the original SPSS file was of type Numeric
(not String), there would be no problem.


[Rd] When 1+2 != 3 (PR#9895)

2007-09-03 Thread marco . vicentini

Full_Name: Marco Vicentini, University of Verona
Version: 2.4.1 & 2.5.1
OS: OsX & WinXP
Submission from: (NULL) (157.27.253.46)


When I test the equation 1 + 2 == 3, I obviously obtain the
value TRUE. But when I tried the same with real numbers (i.e. 0.1 + 0.2 ==
0.3), I obtained an unexpected FALSE.
The online help offers some tricks for this problem. It suggests using
identical(...), which again answers FALSE. Only with isTRUE(all.equal(0.3, 0.1 +
0.2)) can I obtain the value TRUE.

But the problem does not concern only the operator ==. Many other functions,
among others sort, order, unique, duplicated and identical, are not able to deal with
this problem. This is very dangerous because the online help gives no warning,
and anybody can use these functions without expecting unusual results.

I think that the problem is due to how double numbers are stored by the C
compiler.

In case it is useful, I have written two small functions (Unique and isEqual)
which can deal with this problem of double numbers.

I also add some other conditions that show the same problem.

0.3 == 0.15 + 0.15
0.3 == 0.1 + 0.2
1 - 0.7 == 0.3
0.1 == 1 - 0.9

0.2 == 1 - 0.2 - 0.2 - 0.2 - 0.2
   -0.2 == 1 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2

identical (0.3, 0.1 + 0.2)
all.equal (0.3, 0.1 + 0.2)

identical (-0.2 , 1 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2)
all.equal (-0.2 , 1 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2)

isTRUE( all.equal (-0.2 , 1 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2) )


   -0.2 == 1 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2

a= -0.2 
b= 1 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2

x<-c(a,b)
sprintf("%.15f",x)
sprintf("%.50f",x)



Unique <- function(x, digits = 8, fast = TRUE) {
    if (fast) {
        # round to 'digits' decimal places, then drop exact duplicates
        unique(round(x * 10^digits)) / 10^digits
    } else {
        x <- sort(x)
        # mark each value that is all.equal() to its successor, then drop it
        for (i in seq_len(max(0, length(x) - 1)))
            if (isTRUE(all.equal(x[i], x[i + 1]))) x[i] <- NaN
        x[!is.nan(x)]
    }
}

isEqual <- function(object, x, tol = 1e-9) {
    if (!is.vector(object)) stop("Object must be a vector")
    if (is.character(object)) stop("Object can not be a character")
    # is.double() replaces is.real(), which current versions of R no longer provide
    if (!is.double(x)) stop("x must be a real number")
    if (any(is.na(c(object, x)))) stop("NA is not supported")
    if (length(x) != 1) stop("length x must equal to 1")

    # elementwise comparison within the absolute tolerance 'tol'
    abs(object - x) < tol
    # .Call("isEqual", as.double(object), as.double(x), as.double(tol), PACKAGE="mvUtils")
}


Re: [Rd] When 1+2 != 3 (PR#9895)

2007-09-03 Thread Petr Savicky

On Mon, Sep 03, 2007 at 08:59:22AM +0200, [EMAIL PROTECTED] wrote:
> Full_Name: Marco Vicentini, University of Verona
> Version: 2.4.1 & 2.5.1
> OS: OsX & WinXP
> Submission from: (NULL) (157.27.253.46)
> 
> 
> When I proceed to test the following equation 1 + 2 == 3, I obviously obtain 
> the
> value TRUE. But when I tryed to do the same using real number (i.e. 0.1 + 0.2 
> ==
> 0.3)  I obtained an unusual FALSE.
> In the online help there are some tricks for this problem. It suggests to use
> identical(...) which again answer FALSE. Only using isTRUE(all.equal(0.3, 0.1 
> +
> 0.2)) I can obtain the true value TRUE.

A rational number has a finite binary expansion iff its denominator is a power
of 2. Numbers 0.1 and 0.2 are 1/10 and 1/5, so they have 5 in their denominator.
Their binary expansion is
 0.1  = .0001100110011001100110011001100110...
 0.2  = .0011001100110011001100110011001100...
A double variable stores the numbers rounded to 53 significant binary digits.
Hence, they are not exactly 0.1 and 0.2, as may be seen in
  formatC(0.1,digits=30) # [1] "0.100000000000000005551115123126"
  formatC(0.2,digits=30) # [1] "0.200000000000000011102230246252"

In order to compare numbers with some tolerance, the function all.equal
may be used, which you also mention below. See its help page, which
specifies the tolerance to be  .Machine$double.eps ^ 0.5.
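
As a small illustration of the point above (a sketch, not part of the original message), the lines below compare the stored doubles exactly and then within the default tolerance of all.equal():

```r
x <- 0.1 + 0.2

print(x == 0.3)                      # exact comparison of the stored doubles
print(sprintf("%.17g", c(x, 0.3)))   # 17 significant digits show both stored values
print(isTRUE(all.equal(x, 0.3)))     # comparison within the default tolerance
print(abs(x - 0.3) < sqrt(.Machine$double.eps))
```

Wrapping all.equal() in isTRUE() matters because all.equal() returns a character description of the difference, not FALSE, when the values differ.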

> But the problem does not concern only the operator ==. Many other functions,
> among over:  sort, order, unique, duplicate, identical are not able to deal 
> with
> this problem. This is very dangerous because no advice are provide by the 
> online
> help, and anybody can use these functions no think to unusual results.
> 
> I think that the problem is due to how double number are store by the C
> compiler.

Not C compiler, but the hardware.

Petr Savicky.


Re: [Rd] When 1+2 != 3 (PR#9895)

2007-09-03 Thread Duncan Murdoch

On 03/09/2007 2:59 AM, [EMAIL PROTECTED] wrote:
> Full_Name: Marco Vicentini, University of Verona
> Version: 2.4.1 & 2.5.1
> OS: OsX & WinXP
> Submission from: (NULL) (157.27.253.46)
> 
> 
> When I proceed to test the following equation 1 + 2 == 3, I obviously obtain 
> the
> value TRUE. But when I tryed to do the same using real number (i.e. 0.1 + 0.2 
> ==
> 0.3)  I obtained an unusual FALSE.
> In the online help there are some tricks for this problem. It suggests to use
> identical(...) which again answer FALSE. Only using isTRUE(all.equal(0.3, 0.1 
> +
> 0.2)) I can obtain the true value TRUE.
> 
> But the problem does not concern only the operator ==. Many other functions,
> among over:  sort, order, unique, duplicate, identical are not able to deal 
> with
> this problem. This is very dangerous because no advice are provide by the 
> online
> help, and anybody can use these functions no think to unusual results.

The FAQ 7.31 gives general help on this.  Repeating it in every instance 
where it affects computations wouldn't make sense.

Please don't report unavoidable problems as bugs.

Duncan Murdoch


[Rd] buglet?? in nlme:::corRatio documentation

2007-09-03 Thread Ben Bolker


  [hoping to redeem myself for my last spurious bug report]

 From ?corRatio:

   Letting d denote the range and n denote the nugget effect, the
 correlation between two observations a distance r apart is
 (r/d)^2/(1+(r/d)^2) when no nugget effect is present and
 (1-n)*(r/d)^2/(1+(r/d)^2) when a nugget effect is  assumed.

  This disagrees with the C code (corStruct.c)

/* Rational class */

static double
ratio_corr(double val)
{
double val2 = val * val;
return(1/(1+val2));
}

  and with common sense (correlation structures should start from
1 and reach zero for large distances; the structure listed in the
documentation starts at 0 and goes to 1 [or (1-n)] for large distances) --
if you don't want to think about it, use R instead:
 
curve(x^2/(1+x^2),from=0,to=5)
curve(1/(1+x^2),add=TRUE,col=2,from=0)
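
The same comparison in numbers, as a sketch: docForm and codeForm are illustrative names, the nugget is left out, and the assumption is that the C routine receives val = r/d (distance over range).

```r
# Formula as printed in ?corRatio (no nugget); r = distance, d = range.
docForm <- function(r, d) (r / d)^2 / (1 + (r / d)^2)
# Formula computed by ratio_corr() in corStruct.c, assuming val = r / d.
codeForm <- function(r, d) 1 / (1 + (r / d)^2)

r <- c(0, 0.5, 1, 5, 100)
tab <- cbind(r, doc = docForm(r, 1), code = codeForm(r, 1))
print(tab)
print(tab[, "doc"] + tab[, "code"])   # the two forms are complements
```

The two expressions sum to 1 at every distance, so the documented form is one minus the computed one.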

  What's odd, and makes me really nervous, is that the expression found
in the documentation is also that found in Pinheiro and Bates 2000
(Table 5.2, p. 232).  It's not listed in the errata for the first printing
http://cm.bell-labs.com/cm/ms/departments/sia/project/nlme/MEMSS/Errata ;
I have the second printing.

  (I haven't dug out my geostats books to check this, but found at least
one paper that cites the "correct" 1/(1+(r/d)^2) formula; see below.)

 cheers
Ben Bolker

@ARTICLE{Ekstrom+2005,
  author = {Ekstr{\o}m, Claus T. and Bak, S{\o}ren and Rudemo, Mats},
  title = {Pixel-level Signal Modelling with Spatial Correlation for Two-Colour Microarrays},
  journal = {Statistical Applications in Genetics and Molecular Biology},
  year = {2005},
  volume = {4},
  number = {1},
  timestamp = {2007.09.03},
  url = {http://www.bepress.com/sagmb/vol4/iss1/art6}
}


Re: [Rd] When 1+2 != 3 (PR#9895)

2007-09-03 Thread Henrik Bengtsson

On 9/2/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Full_Name: Marco Vicentini, University of Verona
> Version: 2.4.1 & 2.5.1
> OS: OsX & WinXP
> Submission from: (NULL) (157.27.253.46)
>
>
> When I proceed to test the following equation 1 + 2 == 3, I obviously obtain 
> the
> value TRUE. But when I tryed to do the same using real number (i.e. 0.1 + 0.2 
> ==
> 0.3)  I obtained an unusual FALSE.
> In the online help there are some tricks for this problem. It suggests to use
> identical(...) which again answer FALSE. Only using isTRUE(all.equal(0.3, 0.1 
> +
> 0.2)) I can obtain the true value TRUE.
>
> But the problem does not concern only the operator ==. Many other functions,
> among over:  sort, order, unique, duplicate, identical are not able to deal 
> with
> this problem. This is very dangerous because no advice are provide by the 
> online
> help, and anybody can use these functions no think to unusual results.
>
> I think that the problem is due to how double number are store by the C
> compiler.
>
> If it may be usefull, I have written to small function (Unique and isEqual)
> which can deal with this problem of the double numbers.

Quiz: What about utility functions equalsE() and equalsPi()?
...together with examples illustrating when they return TRUE and when
they return FALSE.

Cheers

/Henrik


Re: [Rd] When 1+2 != 3 (PR#9895)

2007-09-03 Thread Ted Harding

On 03-Sep-07 15:12:06, Henrik Bengtsson wrote:
> On 9/2/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>> [...]
>> If it may be usefull, I have written to small function
>> (Unique and isEqual)
>> which can deal with this problem of the double numbers.
> 
> Quiz: What about utility functions equalsE() and equalsPi()?
> ...together with examples illustrating when they return TRUE and when
> they return FALSE.
> 
> Cheers
> 
> /Henrik

Well, if you guys want a Quiz: ... My favourite example
of something which will probably never work on R (or any
machine which implements fixed-length binary real arithmetic).

An iterated function scheme on [0,1] is defined by

  if 0 <= x <= 0.5 then next x = 2*x

  if 0.5 < x <= 1  then next x = 2*(1 - x)

in R:

  nextX <- function(x){ifelse(x<=0.5, 2*x, 2*(1-x))}

and try, e.g.,

 x<-3/7; for(i in (1:60)){x<-nextX(x); print(c(i,x))}

x = 0 is an absorbing state.
x = 1 -> x = 0
x = 1/2 -> 1 -> 0
...
(these work in R)

If K is an odd integer, and 0 < r < K, then

x = r/K ->  ... leads into a periodic set.

E.g. (see above) 3/7 -> 6/7 -> 2/7 -> 4/7 -> 6/7

All other numbers x outside these sets generate non-periodic
sequences.

Apart from the case where initial x = 1/2^k, none of the
above is true in R (e.g. the example above).

So can you devise an "isEqual" function which will make this
work?
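
A sketch of why the periodic orbit is lost in double arithmetic (the vector name path is illustrative). The stored value of 3/7 is a fraction with a power-of-two denominator, and both branches of nextX() are computed exactly on such values, so each step halves the denominator until the orbit reaches 0 and stays there.

```r
nextX <- function(x) ifelse(x <= 0.5, 2 * x, 2 * (1 - x))

x <- 3/7
path <- numeric(60)
for (i in 1:60) {
  x <- nextX(x)
  path[i] <- x
}

print(path[1:6])    # starts close to 6/7, 2/7, 4/7, ...
print(path[50:60])  # the tail of the orbit
print(x)            # value after 60 steps
```

No tolerance-based isEqual() repairs this, because the error is in the iterates themselves, not in a final comparison.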

It's only Monday .. plenty of time!
Best wishes,
Ted.


E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 03-Sep-07   Time: 17:32:38
-- XFMail --



Re: [Rd] locales and readLines

2007-09-03 Thread Martin Morgan

Thank you very much for explaining this. I had indeed overlooked the
use of encoding in 'file'. I also appreciate how unsatisfactory
guessing at the encoding can be, and that scanning the entire file is
not appropriate for large files or general connections.

Sorry that 'burden' came across as negative, more along the lines of
'burden of responsibility for handling the inputs the package
developer implies they'll handle'. Much better than the burden of
saying 'sorry, no can do'.

Thanks again,

Martin


Re: [Rd] When 1+2 != 3 (PR#9895)

2007-09-03 Thread Gabor Grothendieck

Not sure if this counts but using the Ryacas package

> library(Ryacas)
> x <- Sym("x")
> Set(x, Sym(3)/7)
expression(3/7)
> cat(i, "0: "); print(x)
10 0: expression(3/7)
> for(i in 1:10) {
+ yacas("Set(x, If(x <= 1/2, 2*x, 2*(1-x)))")
+ cat(i, "i: "); print(x)
+ }
1 i: expression(6/7)
2 i: expression(2/7)
3 i: expression(4/7)
4 i: expression(6/7)
5 i: expression(2/7)
6 i: expression(4/7)
7 i: expression(6/7)
8 i: expression(2/7)
9 i: expression(4/7)
10 i: expression(6/7)


Re: [Rd] When 1+2 != 3 (PR#9895)

2007-09-03 Thread Ted Harding

On 03-Sep-07 19:25:58, Gabor Grothendieck wrote:
> Not sure if this counts but using the Ryacas package

Gabor, I'm afraid it doesn't count! (Though I didn't
exclude it explicitly). I'm not interested in the behaviour
of the sequence with denominator = 7 particularly.
The system is in fact an example of simulating chaotic
systems on a computer.

For instance, one of the classic illustrations is

  next x = 2*x*(1-x)

for any real x. The question is, how does a finite-length
binary representation behave?

Petr Savicky [privately] sent me a similar example:
Starting with r/K:

nextr <- function(r){ifelse(r<=K/2, 2*r, 2*(K-r))}

  "For K = 7 and r = 3, this yields r = 3,  6,  2,  4,  6, ...
   Dividing this by K=7, one gets the correct period with
   approximately correct numbers."
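
A sketch of that integer variant: here K is passed as an argument, which is an assumption made so the example is self-contained (the quoted nextr() reads K from the enclosing environment), and the vector name numerators is illustrative. Small integers are stored exactly as doubles, so the period survives any number of steps.

```r
# Integer version of the tent map: track the numerator r of x = r/K.
nextr <- function(r, K) ifelse(r <= K / 2, 2 * r, 2 * (K - r))

K <- 7
r <- 3
numerators <- numeric(60)
for (i in 1:60) {
  r <- nextr(r, K)
  numerators[i] <- r
}

print(numerators[1:7])      # 6 2 4 repeating
print(numerators[60] / K)   # still on the periodic orbit after 60 steps
```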

Best wishes,
Ted.


E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 03-Sep-07   Time: 21:02:27
-- XFMail --


Re: [Rd] Consistency of serialize(): please enlighten me

2007-09-03 Thread Hin-Tak Leung

I have a couple of ideas - serialize() can store references (and some 
simple assignments are just stored as references until one tries to 
modify part of the copy, i.e. in a copy-on-write manner); occasionally,
it will also store the name of the package in which the class was defined,
as an attribute of the class name. Maybe neither of these is the case, but what
does a hexdump tell you? (Just print the result of rawToChar() to the
console.)
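
One way to look at the bytes, as a sketch: diffBytes is a hypothetical helper, and whether the two serializations below differ at all depends on the R version, as the results quoted below show. Raw vectors print as hex, and comparing them element by element locates any differing positions.

```r
# Positions at which two raw vectors differ (compared up to the shorter length).
diffBytes <- function(a, b) {
  n <- min(length(a), length(b))
  which(a[seq_len(n)] != b[seq_len(n)])
}

s1 <- serialize("Hello", connection = NULL, ascii = FALSE)
s2 <- serialize(rawToChar(charToRaw("Hello")), connection = NULL, ascii = FALSE)

print(s1)                          # raw vectors print as hex bytes
print(c(length(s1), length(s2)))   # a length mismatch is itself informative
print(diffBytes(s1, s2))           # integer(0) when the common prefix is identical
```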

Henrik Bengtsson wrote:
> Forgot...
> 
> On 8/31/07, Henrik Bengtsson <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I am puzzled by serialize().  It comes down to generating identical
>> hash codes for (apparently) identical objects using digest::digest(),
>> which in turn relies on serialize().  Here is an example illustrating
>> the issue:
>>
>> ser <- function(object, ...) {
>>   list(
>> names = names(object),
>> namesRaw = charToRaw(names(object)),
>> ser = serialize(names(object), connection=NULL, ascii=FALSE)
>>   )
>> } # ser()
>>
>> # Object to be serialized
>> key <- key0 <- list(abc="Hello");
>>
>> # Store results
>> d <- list();
>>
>> # 1. As is
>> d[[1]] <- ser(key);
>>
>> # 2. Set names and redo (hardwired: identical to what's already there)
>> names(key) <- "abc";
>> d[[2]] <- ser(key);
>>
>> # 3. Set names and redo (generic: char->raw->char)
>> key <- key0;
>> names(key) <- sapply(names(key), FUN=function(name) 
>> rawToChar(charToRaw(name)));
>> d[[3]] <- ser(key);
>>
>> # All names are identical
>> for (kk in 2:length(d))
>>   stopifnot(identical(d[[1]]$names, d[[kk]]$names));
>>
>> # All raw names are identical
>> for (kk in 2:length(d))
>>   stopifnot(identical(d[[1]]$namesRaw, d[[kk]]$namesRaw));
>>
>> # But, the serialized names differ.
>> print(identical(d[[1]]$ser, d[[2]]$ser));
>> print(identical(d[[1]]$ser, d[[3]]$ser));
>> print(identical(d[[2]]$ser, d[[3]]$ser));
> 
> With R version 2.6.0 Under development (unstable) (2007-08-23 r42614) I get:
> [1] TRUE
> [1] FALSE
> [1] FALSE
> 
> and with R version 2.5.1 Patched (2007-07-19 r42284):
> [1] FALSE
> [1] FALSE
> [1] TRUE
> 
>> So, it seems like there is some extra information in the names
>> attribute that is part of the serialization.  Is it possible to show
>> they differ at the R level?  What is that extra information?
>> Promises...?
>>
>> Please enlighten me.
>>
>> Henrik
>>
> 