Re: [Rd] writeLines argument useBytes = TRUE still making conversions

2018-02-19 Thread Tomas Kalibera


I think it is as Kevin described in an earlier response: the garbled 
output is because a UTF-8 encoded string is assumed to be in the native 
encoding (which happens not to be UTF-8 on the platform where this is 
observed) and is therefore converted to UTF-8 a second time.


I think the documentation is consistent with the observed behavior:


   tmp <- 'é'
   tmp <- iconv(tmp, to = 'UTF-8')
   print(Encoding(tmp))
   # [1] "UTF-8"
   print(charToRaw(tmp))
   # [1] c3 a9
   tmpfilepath <- tempfile()
   writeLines(tmp, con = file(tmpfilepath, encoding = 'UTF-8'), useBytes = TRUE)

Resulting file content as hex: c3 83 c2 a9
useBytes=TRUE in writeLines means that the UTF-8 string will be passed 
byte-by-byte to the connection. encoding="UTF-8" tells the connection to 
convert the bytes to UTF-8 (from the native encoding). So the second step 
converts a string that is assumed to be in the native encoding but is in 
fact already UTF-8.
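
A platform-independent sketch of those two steps using iconv directly (the 
garbling itself only appears where the native encoding is latin1, as on the 
Windows machine mentioned below):

   utf8 <- charToRaw(iconv('é', to = 'UTF-8'))
   # [1] c3 a9
   # Step 1 (useBytes = TRUE): these two bytes reach the connection unchanged.
   # Step 2 (encoding = 'UTF-8'): the connection re-encodes native -> UTF-8,
   # here treating each byte as a latin1 character:
   charToRaw(iconv(rawToChar(utf8), from = 'latin1', to = 'UTF-8'))
   # [1] c3 83 c2 a9  (the garbled bytes reported above)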


The documentation describes useBytes=TRUE as for expert use only: it 
can be useful for avoiding unnecessary conversions in some special 
cases, but one then has to make sure that no further conversions are 
attempted (so use "" as the encoding in "file", for instance). The long 
advice made short would be to not use useBytes=TRUE with writeLines, but 
to rely on the default behavior.
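
For completeness, a minimal sketch of the combination that does leave the 
bytes alone, following the advice above (the default encoding "" means 
native.enc, so no translation step runs; binary mode additionally avoids 
newline translation on Windows):

   tmp <- iconv('é', to = 'UTF-8')
   tmpfilepath <- tempfile()
   con <- file(tmpfilepath, open = "wb")  # default encoding "": no conversion
   writeLines(tmp, con = con, useBytes = TRUE)
   close(con)
   readBin(tmpfilepath, what = "raw", n = 8)
   # expected: c3 a9 0a  (the original UTF-8 bytes plus the newline)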


Tomas


On 02/17/2018 11:24 PM, Kevin Ushey wrote:

Of course, right after writing this e-mail I tested on my Windows
machine and did not see what I expected:


 > charToRaw(before)
 [1] c3 a9
 > charToRaw(after)
 [1] e9

so obviously I'm misunderstanding something as well.

Best,
Kevin

On Sat, Feb 17, 2018 at 2:19 PM, Kevin Ushey  wrote:

From my understanding, translation is implied in this passage of ?file (from
the Encoding section):

 The encoding of the input/output stream of a connection can be specified
 by name in the same way as it would be given to iconv: see that help page
 for how to find out what encoding names are recognized on your platform.
 Additionally, "" and "native.enc" both mean the ‘native’ encoding, that is
 the internal encoding of the current locale and hence no translation is
 done.

This is also hinted at in the documentation in ?readLines for its 'encoding'
argument, which has a different semantic meaning from the 'encoding' argument
as used with R connections:

 encoding to be assumed for input strings. It is used to mark character
 strings as known to be in Latin-1 or UTF-8: it is not used to re-encode
 the input. To do the latter, specify the encoding as part of the
 connection con or via options(encoding=): see the examples.

It might be useful to augment the documentation in ?file with something like:

 The 'encoding' argument is used to request the translation of strings when
 writing to a connection.

and, perhaps to further drive home the point about not translating when
encoding = "native.enc":

 Note that R will not attempt translation of strings when encoding is
 either "" or "native.enc" (the default, as per getOption("encoding")).
 This implies that attempting to write, for example, UTF-8 encoded content
 to a connection opened using "native.enc" will retain its original UTF-8
 encoding -- it will not be translated.

It is a bit surprising that 'native.enc' means "do not translate" rather than
"attempt translation to the encoding associated with the current locale", but
those are the semantics and they are not bound to change.

This is the code I used to convince myself of that case:

 conn <- file(tempfile(), encoding = "native.enc", open = "w+")

 before <- iconv('é', to = "UTF-8")
 cat(before, file = conn, sep = "\n")
 after <- readLines(conn)

 charToRaw(before)
 charToRaw(after)

with output:

 > charToRaw(before)
 [1] c3 a9
 > charToRaw(after)
 [1] c3 a9

Best,
Kevin


On Thu, Feb 15, 2018 at 9:16 AM, Ista Zahn  wrote:

On Thu, Feb 15, 2018 at 11:19 AM, Kevin Ushey  wrote:

I suspect your UTF-8 string is being stripped of its encoding before
write, and so assumed to be in the system native encoding, and then
re-encoded as UTF-8 when written to the file. You can see something
similar with:

 > tmp <- 'é'
 > tmp <- iconv(tmp, to = 'UTF-8')
 > Encoding(tmp) <- "unknown"
 > charToRaw(iconv(tmp, to = "UTF-8"))
 [1] c3 83 c2 a9

It's worth saying that:

 file(..., encoding = "UTF-8")

means "attempt to re-encode strings as UTF-8 when writing to this
file". However, if you already know your text is UTF-8, then you
likely want to avoid opening a connection that might attempt to
re-encode the input. Conversely (assuming I'm understanding the
documentation correctly)

 file(..., encoding = "native.enc")

means "assume that strings are in the native encoding, and hence
translation is unnecessary". Note that it does not mean "attempt to
translate strings to the native encoding".

If all that is true I think ?file needs some attention. I've read it
several times now and I just don't see how

Re: [Rd] readLines interaction with gsub different in R-dev

2018-02-19 Thread Tomas Kalibera

Thank you for the report and analysis. Now fixed in R-devel.
Tomas

On 02/17/2018 08:24 PM, William Dunlap via R-devel wrote:

I think the problem in R-devel happens when there are non-ASCII characters
in any
of the strings passed to gsub.

txt <- vapply(list(as.raw(c(0x41, 0x6d, 0xc3, 0xa9, 0x6c, 0x69, 0x65)),
as.raw(c(0x41, 0x6d, 0x65, 0x6c, 0x69, 0x61))), rawToChar, "")
txt
#[1] "Amélie" "Amelia"
Encoding(txt)
#[1] "unknown" "unknown"
gsub(perl=TRUE, "(\\w)(\\w)", "<\\L\\1\\U\\2>", txt)
#[1] ... (output lost in the archive; the <...> groups were stripped as HTML)
gsub(perl=TRUE, "(\\w)(\\w)", "<\\L\\1\\U\\2>", txt[1])
#[1] ...
gsub(perl=TRUE, "(\\w)(\\w)", "<\\L\\1\\U\\2>", txt[2])
#[1] ...

I can change the Encoding to "latin1" or "UTF-8" and get similar results
from gsub.


Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Sat, Feb 17, 2018 at 7:35 AM, Hugh Parsonage wrote:


| Confirmed for R-devel (current) on Ubuntu 17.10.  But ... isn't the regexp
| you use wrong, ie isn't R-devel giving the correct answer?

No, I don't think R-devel is correct (or at least not consistent with the
documentation). My interpretation of gsub("(\\w)", "\\U\\1", entry,
perl = TRUE) is "Take every word character and replace it with itself,
converted to uppercase."

Perhaps my example was too minimal. Consider the following:

R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
[1] "A"

R> gsub("(\\w)", "\\1", entry, perl = TRUE)
[1] "author: Amélie"  # OK, but very different to 'A', despite only not
                      # specifying uppercase

R> gsub("(\\w)", "\\U\\1", "author: Amelie", perl = TRUE)
[1] "AUTHOR: AMELIE"  # OK, but very different to 'A'

R> gsub("^(\\w+?): (\\w)", "\\U\\1\\E: \\2", entry, perl = TRUE)
[1] "AUTHOR"  # Where did everything after the first group go?

I should note the following example too:
R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE, useBytes = TRUE)
[1] "AUTHOR: AMéLIE"  # latin1 encoding
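
(An aside, not from the original thread: if the goal is simply to uppercase
the whole string, toupper() sidesteps the \U substitution machinery; in a
UTF-8 locale it handles the accented character as well.)

R> toupper(entry)
[1] "AUTHOR: AMÉLIE"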


A call to `readLines` (or possibly `scan()`, `read.table` and friends)
is essential to reproduce the problem.




On 18 February 2018 at 02:15, Dirk Eddelbuettel  wrote:

On 17 February 2018 at 21:10, Hugh Parsonage wrote:
| I was told to re-raise this issue with R-dev:
|
| In the documentation of R-dev and R-3.4.3, under ?gsub
|
| > replacement
| > ... For perl = TRUE only, it can also contain "\U" or "\L" to convert
| > the rest of the replacement to upper or lower case and "\E" to end
| > case conversion.

|
| However, the following code runs differently:
|
| tempf <- tempfile()
| writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
| entry <- readLines(tempf, encoding = "UTF-8")
| gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
|
|
| "AUTHOR: AMÉLIE"  # R-3.4.3
|
| "A"  # R-dev

Confirmed for R-devel (current) on Ubuntu 17.10.  But ... isn't the regexp
you use wrong, ie isn't R-devel giving the correct answer?

R> tempf <- tempfile()
R> writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
R> entry <- readLines(tempf, encoding = "UTF-8")
R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
[1] "A"
R> gsub("(\\w+)", "\\U\\1", entry, perl = TRUE)
[1] "AUTHOR"
R> gsub("(.*)", "\\U\\1", entry, perl = TRUE)
[1] "AUTHOR: AMÉLIE"
R>

Dirk

--
http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org



Re: [Rd] [parallel] fixes load balancing of parLapplyLB

2018-02-19 Thread Christian Krause
Dear R-Devel List,

I have installed R 3.4.3 with the patch applied on our cluster and ran a 
*real-world* job of one of our users to confirm that the patch works to my 
satisfaction. Here are the results.

The original was a series of jobs, all essentially doing the same stuff using 
bootstrapped data, so for the original there is more data and I show the 
arithmetic mean with standard deviation. The confirmation with the patched R 
was only a single instance of that series of jobs.

## Job Efficiency

The job efficiency is defined as (this is what the `qacct-efficiency` tool 
below does):

```
efficiency = cputime / cores / wallclocktime * 100%
```

In simpler words: how well did the job utilize its CPU cores. It shows the 
percentage of time the job was actually doing work, as opposed to the 
complement:

```
wasted = 100% - efficiency
```

... which, essentially, tells us how much of the resources were wasted, i.e. 
CPU cores just idling without being used by anyone. We care a lot about that 
because, for our scientific computing cluster, wasted resources are like 
burning money.
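
As a small R sketch of the two formulas above (function names are mine, not
part of any tool; cputime and wallclocktime in seconds, as reported by the
accounting):

```
efficiency <- function(cputime, cores, wallclocktime) {
  # percentage of the reserved core-seconds that were actually used
  100 * cputime / (cores * wallclocktime)
}
wasted <- function(cputime, cores, wallclocktime) {
  # percentage of the reserved core-seconds that idled
  100 - efficiency(cputime, cores, wallclocktime)
}
```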

### original

This takes the entire series from our job accounting database, filters the 
successful jobs, calculates the efficiency, and then shows the average and 
standard deviation of the efficiency:

```
$ qacct -j 4433299 | qacct-success | qacct-efficiency | meansd
n=945 ∅ 61.7276 ± 7.78719
```

This takes the entire series from our job accounting database, filters the 
successful jobs, calculates the efficiency, and does a sort of histogram-like 
binning before calculating mean and standard deviation (to get a more 
detailed impression of the distribution when the standard deviation of the 
previous command is comparatively high):

```
$ qacct -j 4433299 | qacct-success | qacct-efficiency | meansd-bin -w 10 | sort -gk1 | column -t
10  -  20  ->  n=3    ∅  19.216667           ±  0.9112811494447459
20  -  30  ->  n=6    ∅  26.418              ±  2.665996374091058
30  -  40  ->  n=12   ∅  35.115834           ±  2.8575783082671196
40  -  50  ->  n=14   ∅  45.35285714285715   ±  2.98623361591005
50  -  60  ->  n=344  ∅  57.114593023255814  ±  2.1922005551774415
60  -  70  ->  n=453  ∅  64.29536423841049   ±  2.8334788433963856
70  -  80  ->  n=108  ∅  72.95592592592598   ±  2.5219474143639276
80  -  90  ->  n=5    ∅  81.526              ±  1.2802265424525452
```

I attached an example graph from our monitoring system of a single instance 
to my previous mail. There you can see that the load balancing does not 
actually work, i.e. it behaves the same as `parLapply`. This is reflected in 
the job efficiency.

### patch applied

This is the single instance I used to confirm that the patch works:

```
$ qacct -j 4562202 | qacct-efficiency
97.36
```

The graph from our monitoring system is attached. As you can see, the load 
balancing works to a satisfying degree and the efficiency is well above 90%, 
which is what I had hoped for :-)

## Additional Notes

The list used in this job's `parLapplyLB` call is 5812 elements long. With 
the `splitList`-chunking from the patch, you get 208 lists of about 28 
elements (208 chunks of size 28). The job ran on 28 CPU cores and had a 
wallclock time of 120351.590 seconds, i.e. 33.43 hours. Thus, the function we 
apply takes about 580 seconds per list element, i.e. about 10 minutes. I 
suppose that, for such runtimes, we would get even better load balancing if 
we reduced the chunk size further, maybe even down to 1, pushing the 
efficiency even closer to 100%.
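
A sketch of that chunking, using the internal helper the patch builds on
(parallel:::splitList() is unexported, so its interface may change):

```
X <- seq_len(5812)
chunks <- parallel:::splitList(X, 208)
length(chunks)          # 208 chunks ...
range(lengths(chunks))  # ... of about 28 elements each
```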

Of course, for really short-running functions, a higher chunk size may be more 
efficient because of the overhead. In our case, the overhead is negligible and 
that is why the low chunk size works really well. In contrast, for smallish 
lists with short-running functions, you might not even need load balancing and 
`parLapply` suffices. It only becomes an issue when the runtime of the 
function is high and/or varying.

In our case, the entire runtime of the entire series of jobs was:

```
$ qacct -j 4433299 | awk '$1 == "wallclock" { sum += $2 } END { print sum, "seconds" }'
4.72439e+09 seconds
```

That's about 150 years on a single core, or 7.5 years on a 20-core server! 
Our user was constantly using about 500 cores, so this took about 110 days. 
If you compare this to my 97% efficiency example, the jobs could have been 
finished in 75 days instead ;-)
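
For the record, the arithmetic behind those numbers:

```
total <- 4.72439e9               # summed wallclock seconds of the series
total / (365 * 24 * 3600)        # ~150 years on a single core
total / (365 * 24 * 3600) / 20   # ~7.5 years on a 20-core server
total / 500 / (24 * 3600)        # ~109 days on ~500 cores
```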

## Upcoming Patch

If this patch gets applied to the R code base (and I hope it will :-)) my 
colleague and I will submit another patch that adds the chunk size as an 
optional parameter to all off the load balancing functions. With that 
parameter, users of these functions *can* decide for themselves which chunk 
size they prefer for their code. As mentioned before, the most efficient chunk 
size depends on the used functions runtime, which is the only thing R does not 
know and users really should be allowed to specify explicitly. The default of 
this new optional parameter would be the

Re: [Rd] [parallel] fixes load balancing of parLapplyLB

2018-02-19 Thread Henrik Bengtsson
Hi, I'm trying to understand the rationale for your proposed amount of
splitting and more precisely why that one is THE one.

If I put labels on your example numbers from one of your previous posts:

 nbrOfElements <- 97
 nbrOfWorkers <- 5

With these, there are two extremes in how you can split up the
processing in chunks such that all workers are utilized:

(A) Each worker, called multiple times, processes one element each time:

> nbrOfElements <- 97
> nbrOfWorkers <- 5
> nbrOfChunks <- nbrOfElements
> sapply(parallel:::splitList(1:nbrOfElements, nbrOfChunks), length)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[30] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[59] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[88] 1 1 1 1 1 1 1 1 1 1


(B) Each worker, called once, processes multiple elements:

> nbrOfElements <- 97
> nbrOfWorkers <- 5
> nbrOfChunks <- nbrOfWorkers
> sapply(parallel:::splitList(1:nbrOfElements, nbrOfChunks), length)
[1] 20 19 19 19 20

I understand that neither of these two extremes may be the best when
it comes to orchestration overhead and load balancing. Instead, the
best might be somewhere in-between, e.g.

(C) Each worker, called multiple times, processes multiple elements each time:

> nbrOfElements <- 97
> nbrOfWorkers <- 5
> nbrOfChunks <- nbrOfElements / nbrOfWorkers
> sapply(parallel:::splitList(1:nbrOfElements, nbrOfChunks), length)
 [1] 5 5 5 5 4 5 5 5 5 5 4 5 5 5 5 4 5 5 5 5

However, there are multiple alternatives between the two extremes, e.g.

> nbrOfChunks <- scale * nbrOfElements / nbrOfWorkers
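
For instance, with a hypothetical helper just to make the 'scale' knob
concrete (parallel:::splitList() is internal, so this is a sketch only):

> chunkLengths <- function(nbrOfElements, nbrOfWorkers, scale = 1.0) {
+   nbrOfChunks <- max(1, min(nbrOfElements,
+                             ceiling(scale * nbrOfElements / nbrOfWorkers)))
+   lengths(parallel:::splitList(seq_len(nbrOfElements), nbrOfChunks))
+ }
> chunkLengths(97, 5, scale = 1.0)  # ~20 chunks of 4-5 elements, as in (C)
> chunkLengths(97, 5, scale = Inf)  # 97 chunks of 1 element, as in (A)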

So, is there a reason why you argue for scale = 1.0 to be the optimal?

FYI, in future.apply::future_lapply(X, FUN, ...) there is a
'future.scheduling' scale factor(*) argument, where the default
future.scheduling = 1 corresponds to (B) and future.scheduling = +Inf
corresponds to (A). Using future.scheduling = 4 achieves the amount of
load balancing you propose in (C). (*) Different definition from the
above 'scale'. (Disclaimer: I'm the author)
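
A usage sketch, for reference (argument names as documented at the time;
plan() comes from the future package, and X and FUN stand in for the real
inputs):

> library(future.apply)
> plan(multisession, workers = 5)
> y <- future_lapply(X, FUN, future.scheduling = 4)  # ~4 chunks per worker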

/Henrik

On Mon, Feb 19, 2018 at 10:21 AM, Christian Krause wrote:
> Dear R-Devel List,
>
> [quoted message trimmed; the full text appears at the top of this thread]
> The graph fro