Re: [Rd] accelerating matrix multiply

2017-01-16 Thread Tomas Kalibera


Hi Robert,

thanks for the report and your suggestions on how to make the NaN checks 
faster.


Based on my experiments it seems that the "break" in the loop can 
actually have a positive impact on performance even in the common case 
when we don't have NaNs. With gcc on Linux (Core i7), where isnan is 
inlined, the "break" version uses a conditional jump while the "nobreak" 
version uses a conditional move. The conditional jump is faster because 
it benefits from branch prediction. Neither of the two versions is 
vectorized (only scalar SSE instructions are used).


How do you run R on the Xeon Phi? Do you offload the NaN checks to the 
Phi coprocessor? So far I have tried without offloading to the Phi; icc 
could vectorize the "nobreak" version, but its performance was the same 
as "break".


For my experiments I extracted the NaN checks into a function. This was 
the "break" version (same performance as the current code):


static __attribute__ ((noinline)) Rboolean hasNA(double *x, R_xlen_t n) {
    for (R_xlen_t i = 0; i < n; i++)
        if (ISNAN(x[i])) return TRUE;
    return FALSE;
}

And this was the "nobreak" version:

static __attribute__ ((noinline)) Rboolean hasNA(double *x, R_xlen_t n) {
    Rboolean has = FALSE;
    for (R_xlen_t i = 0; i < n; i++)
        if (ISNAN(x[i])) has = TRUE;
    return has;
}

Thanks,
Tomas

On 01/11/2017 02:28 PM, Cohn, Robert S wrote:

Do you have R code (including set.seed(.) if relevant) to show how to 
generate the large square matrices you've mentioned in the beginning? 
So we get to some reproducible benchmarks?


Hi Martin,

Here is the program I used. I only generate 2 random numbers and reuse them to 
make the benchmark run faster. Let me know if there is something I can do to 
help--alternate benchmarks, tests, experiments with compilers other than icc.

MKL LAPACK behavior is undefined for NaNs, so I left the check in and just 
made it more efficient on a CPU with SIMD. Thanks for looking at this.

set.seed (1)
m <- 30000
n <- 30000
A <- matrix (runif(2),nrow=m,ncol=n)
B <- matrix (runif(2),nrow=m,ncol=n)
print(typeof(A[1,2]))
print(A[1,2])

# Matrix multiply
system.time (C <- B %*% A)
system.time (C <- B %*% A)
system.time (C <- B %*% A)

-Original Message-
From: Martin Maechler [mailto:maech...@stat.math.ethz.ch]
Sent: Tuesday, January 10, 2017 8:59 AM
To: Cohn, Robert S 
Cc: r-devel@r-project.org
Subject: Re: [Rd] accelerating matrix multiply


Cohn, Robert S 
 on Sat, 7 Jan 2017 16:41:42 + writes:

I am using R to multiply some large (30k x 30k double) matrices on a
64 core machine (Xeon Phi).  I added some timers to src/main/array.c
to see where the time is going. All of the time is being spent in the
matprod function, and most of that time is spent in dgemm. 15 seconds is
spent in matprod in code that checks whether there are NaNs.

system.time (C <- B %*% A)

nancheck: wall time 15.240282s
dgemm: wall time 43.111064s
  matprod: wall time 58.351572s
 user   system  elapsed
2710.154   20.999   58.398

The NaN checking code is not being vectorized because of the early
exit when NaN is detected:

/* Don't trust the BLAS to handle NA/NaNs correctly: PR#4582
 * The test is only O(n) here.
 */
for (R_xlen_t i = 0; i < NRX*ncx; i++)
    if (ISNAN(x[i])) {have_na = TRUE; break;}
if (!have_na)
    for (R_xlen_t i = 0; i < NRY*ncy; i++)
        if (ISNAN(y[i])) {have_na = TRUE; break;}

I tried deleting the 'break'. By inspecting the asm code, I verified
that the loop was not being vectorized before, but now it is.
Total time goes down:

system.time (C <- B %*% A)
nancheck: wall time  1.898667s
dgemm: wall time 43.913621s
  matprod: wall time 45.812468s
 user   system  elapsed
2727.877   20.723   45.859

The break accelerates the case when there is a NaN, at the expense of
the much more common case when there isn't one. If a NaN is
detected, matprod doesn't call dgemm and instead uses its own matrix
multiply, which makes the NaN check time insignificant, so I doubt the
early exit provides any benefit.

I was a little surprised that the O(n^2) NaN check is costly compared to
the O(n^3) dgemm that follows (for n x n matrices). I think the reason
is that the NaN check is single-threaded and not vectorized, while my
machine can do 2048 floating point ops/cycle when you consider the
cores, dual issue, 8-way SIMD, and muladd, so the constant factor is
significant even for large matrices.

Would you consider deleting the breaks? I can submit a patch if that
will help. Thanks.

Robert

Thank you Robert for bringing the issue up ("again", possibly).
Within R core, some have seen somewhat similar timings on some platforms 
(gcc), but much less dramatic differences, e.g., on macOS with clang.

As seen in the source code you cite above, the current implementation was 
triggered by a nasty BLAS bug, which also showed up only on some 
platforms, possibly depending on runtime libraries in addition to the 
compilers used.

Do you have R code (including set.seed(.) if relevant) to show how to 
generate the large square matrices you've mentioned in the beginning? 
So we get to some reproducible benchmarks?

Re: [Rd] accelerating matrix multiply

2017-01-16 Thread Cohn, Robert S
Hi Tomas,

Can you share the full code for your benchmark, compiler options, and 
performance results so that I can try to reproduce them? There are a lot of 
variables that can affect the results. Private email is fine if it is too much 
for the mailing list.

I am measuring on Knights Landing (KNL), which was released in November. 
KNL is not a coprocessor, so no offload is necessary. R executes directly 
on the Phi, which looks like a multi-core machine with 64 cores.

Robert


[Rd] strptime("1","%m") returns NA

2017-01-16 Thread frederik
Hi R Devel,

I wrote some code which depends on 'strptime' being able to parse an
incomplete date, like this:

> base::strptime("2016","%Y")
[1] "2016-01-14 PST"

The above works - although it's odd that it takes the month and day
from Sys.time(). I might expect it to set them both to zero, as the GNU
libc strptime does on my system, or to use January 1, which would also
be reasonable.

When I specify the month, however, I get NA:

> base::strptime("2016-12","%Y-%m")
[1] NA
> base::strptime("1", "%m")
[1] NA

Any reason for this to be the case?

I reported a bug here:

https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17212

but I don't think I'm getting emails from Bugzilla, so it may be best to
ping me if anyone replies there.

I've just written a simple reimplementation of 'strptime' for my own
use; I hope this bug report may be useful to others.

Thank you,

Frederick

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] bug in rbind?

2017-01-16 Thread Krzysztof Banas
I suspect there may be a bug in base::rbind.data.frame

Below there is minimal example of the problem:

m <- matrix (1:12, 3)
dfm <- data.frame (c = 1 : 3, m = I (m))
str (dfm)

m.names <- m
rownames (m.names) <- letters [1:3]
dfm.names <- data.frame (c = 1 : 3, m = I (m.names))
str (dfm.names)

rbind (m, m.names)
rbind (m.names, m)
rbind (dfm, dfm.names)

#not working
rbind (dfm.names, dfm)

Error in rbind(deparse.level, ...) : replacement has length zero

rbind (dfm, dfm.names)$m


  [,1] [,2] [,3] [,4]
     1    4    7   10
     2    5    8   11
     3    6    9   12
a    1    4    7   10
b    2    5    8   11
c    3    6    9   12


