[Rd] writeLines argument useBytes = TRUE still making conversions

2018-02-15 Thread Davor Josipovic
I think this behavior is inconsistent with the documentation:

  tmp <- 'é'
  tmp <- iconv(tmp, to = 'UTF-8')
  print(Encoding(tmp))
  print(charToRaw(tmp))
  tmpfilepath <- tempfile()
  writeLines(tmp, con = file(tmpfilepath, encoding = 'UTF-8'), useBytes = TRUE)

[1] "UTF-8"
[1] c3 a9

The file's raw contents as hex: c3 83 c2 a9

If I switch to useBytes = FALSE, the string is written correctly as c3 a9.

Any thoughts? This behavior is related to this issue: 
https://github.com/yihui/knitr/issues/1509



__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] missing extern in GraphicsBase.h

2018-02-15 Thread dmitrii . pasechnik
Dear all,
in src/include/GraphicsBase.h one has a declaration

int baseRegisterIndex;

the same as in src/main/devices.c

which causes problems on Solaris, see bug #17385, 
and other platforms with "unusual" linkers, see bug #16633.

By rights, a global variable like baseRegisterIndex should be defined
exactly once, in a *.c file and not in a header file. To use it
elsewhere in the code, the header declares it as extern
(as proposed on #17385).

Otherwise the behaviour is undefined: some linkers silently merge the
duplicate definitions, as if extern had been written, and some do not.
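
A minimal sketch of the declaration/definition split described above, collapsed into one translation unit so that it compiles on its own. The initial value -1 and the accessor get_base_register_index() are illustrative assumptions, not taken from the R sources or from the patch on #17385.

```c
/* GraphicsBase.h (sketch): declaration only, no storage is allocated.
 * Every file that includes the header sees this line. */
extern int baseRegisterIndex;

/* devices.c (sketch): the one and only definition, which allocates
 * storage. The initial value here is an illustrative assumption. */
int baseRegisterIndex = -1;

/* Hypothetical accessor, standing in for any other file that uses the
 * variable through the header's extern declaration. */
int get_base_register_index(void)
{
    return baseRegisterIndex;
}
```

With the header carrying only the extern line, each object file refers to the single definition in devices.c, so the result no longer depends on how a given linker treats duplicate definitions.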

May I humbly request attention to this bug? It is classified as
UNCONFIRMED, and reproducing the error on, say, Linux does take extra
effort, but it really is an obvious C bug that will rear its ugly head
again sooner or later.

Thanks,
Dmitrii





Re: [Rd] writeLines argument useBytes = TRUE still making conversions

2018-02-15 Thread Kevin Ushey
I suspect your UTF-8 string is being stripped of its encoding before
write, and so assumed to be in the system native encoding, and then
re-encoded as UTF-8 when written to the file. You can see something
similar with:

> tmp <- 'é'
> tmp <- iconv(tmp, to = 'UTF-8')
> Encoding(tmp) <- "unknown"
> charToRaw(iconv(tmp, to = "UTF-8"))
[1] c3 83 c2 a9
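
The byte pattern above is what one gets when each byte of the UTF-8 string is read as a single Latin-1 character and then encoded to UTF-8 again. A minimal C sketch of that step, assuming the native encoding is Latin-1 (the thread does not say which it is); latin1_to_utf8() is a hypothetical helper, not an R or iconv function.

```c
#include <stddef.h>

/* Hypothetical helper: treat every input byte as one Latin-1 code point
 * and write its UTF-8 encoding to out. Returns the number of bytes
 * written. The caller must provide room for 2 * n bytes in out. */
size_t latin1_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            /* ASCII range: one byte, unchanged */
            out[j++] = in[i];
        } else {
            /* U+0080..U+00FF: two-byte UTF-8 sequence */
            out[j++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[j++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return j;
}
```

Feeding it the two bytes c3 a9 (the UTF-8 form of 'é') gives the four bytes c3 83 c2 a9, matching the file contents reported in the original message.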

It's worth saying that:

file(..., encoding = "UTF-8")

means "attempt to re-encode strings as UTF-8 when writing to this
file". However, if you already know your text is UTF-8, then you
likely want to avoid opening a connection that might attempt to
re-encode the input. Conversely (assuming I'm understanding the
documentation correctly)

file(..., encoding = "native.enc")

means "assume that strings are in the native encoding, and hence
translation is unnecessary". Note that it does not mean "attempt to
translate strings to the native encoding".

Also note that writeLines(..., useBytes = FALSE) will explicitly
translate to the current encoding before sending bytes to the
requested connection. In other words, there are two locations where
translation might occur in your example:

   1) In the call to writeLines(),
   2) When characters are passed to the connection.

In your case, it sounds like translation should be suppressed at both steps.

I think this is documented correctly in ?writeLines (and also the
Encoding section of ?file), but the behavior may feel unfamiliar at
first glance.

Kevin




Re: [Rd] writeLines argument useBytes = TRUE still making conversions

2018-02-15 Thread Ista Zahn
On Thu, Feb 15, 2018 at 11:19 AM, Kevin Ushey  wrote:
> I suspect your UTF-8 string is being stripped of its encoding before
> write, and so assumed to be in the system native encoding, and then
> re-encoded as UTF-8 when written to the file. You can see something
> similar with:
>
> > tmp <- 'é'
> > tmp <- iconv(tmp, to = 'UTF-8')
> > Encoding(tmp) <- "unknown"
> > charToRaw(iconv(tmp, to = "UTF-8"))
> [1] c3 83 c2 a9
>
> It's worth saying that:
>
> file(..., encoding = "UTF-8")
>
> means "attempt to re-encode strings as UTF-8 when writing to this
> file". However, if you already know your text is UTF-8, then you
> likely want to avoid opening a connection that might attempt to
> re-encode the input. Conversely (assuming I'm understanding the
> documentation correctly)
>
> file(..., encoding = "native.enc")
>
> means "assume that strings are in the native encoding, and hence
> translation is unnecessary". Note that it does not mean "attempt to
> translate strings to the native encoding".

If all that is true I think ?file needs some attention. I've read it
several times now and I just don't see how it can be interpreted as
you've described it.

Best,
Ista




Re: [Rd] missing extern in GraphicsBase.h

2018-02-15 Thread Paul Murrell

Hi

I have committed the suggested "extern" patch.

Could you please confirm that this fixes the issue on Solaris (and
anything else you can test)?


Thanks!

Paul




--
Dr Paul Murrell
Department of Statistics
The University of Auckland
Private Bag 92019
Auckland
New Zealand
64 9 3737599 x85392
p...@stat.auckland.ac.nz
http://www.stat.auckland.ac.nz/~paul/



[Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y

2018-02-15 Thread Scott Ritchie
Hi,

I was unable to find a bug report for this with a cursory search, but would
like clarification on whether this is intended or unavoidable behaviour:

```{r}
# Create example data.frames
parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
                      sex=c("F", "M", "F", "M"),
                      age=c(41, 43, 36, 51))
children <- data.frame(parent=c("Sarah", "Max", "Qin"),
                       name=c("Oliver", "Sebastian", "Kai-lee"),
                       sex=c("M", "M", "F"),
                       age=c(5, 8, 7))

# merge() creates a duplicated "name" column:
merge(parents, children, by.x = "name", by.y = "parent")
```

Output:
```
   name sex.x age.x      name sex.y age.y
1   Max     M    43 Sebastian     M     8
2   Qin     F    36   Kai-lee     F     7
3 Sarah     F    41    Oliver     M     5
Warning message:
In merge.data.frame(parents, children, by.x = "name", by.y = "parent") :
  column name ‘name’ is duplicated in the result
```

Kind Regards,

Scott Ritchie

