[Rd] strange behaviour of predict.lm

2020-03-16 Thread Moshe Olshansky via R-devel
Hello,
Below is my code:
> A <- matrix(rnorm(10*3),ncol=3)
> b <- runif(10)
> reg <- lm(b ~ A)
> A1 <- matrix(rnorm(5*3),ncol=3)
> A1 <- as.data.frame(A1)
> b1 <- predict(reg,A1)
Warning message:
'newdata' had 5 rows but variables found have 10 rows 

Instead of being a vector of length 5, b1 has length 10 and is identical to 
reg$fitted.values.
I do not think it should behave like this.
For lm I do not mind so much, since I can compute the predictions from 
reg$coefficients, but unfortunately this behaviour is "inherited" by other 
methods. When I fit a regression tree, predicting from the object without 
using the 'predict' method is less trivial.
Thank you,
Moshe.
P.S. Just in case:
> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 19.1

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
[1] C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.6.2 tools_3.6.2   
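
A possible explanation, offered as a sketch rather than a definitive diagnosis: the formula b ~ A refers to one variable named A (a matrix), while the data frame A1 has columns V1, V2 and V3. Because 'newdata' contains no variable called A, predict() finds the original 10-row A in the workspace, which is what the warning reports. Fitting through a data frame, so that the model terms match the column names of 'newdata', avoids this. The names train and V1..V3 and the set.seed() call below are assumptions made for illustration.

```r
set.seed(1)                              # only to make the sketch reproducible
A <- matrix(rnorm(10 * 3), ncol = 3)
b <- runif(10)

# as.data.frame() names the matrix columns V1, V2, V3
train <- data.frame(b = b, as.data.frame(A))
reg <- lm(b ~ V1 + V2 + V3, data = train)

A1 <- as.data.frame(matrix(rnorm(5 * 3), ncol = 3))   # also V1, V2, V3
b1 <- predict(reg, newdata = A1)
length(b1)                               # one prediction per row of A1

# Cross-check against the coefficients, as the message suggests doing by hand
b1_manual <- drop(cbind(1, as.matrix(A1)) %*% coef(reg))
all.equal(unname(b1), unname(b1_manual))
```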





[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] dist function in R is very slow

2017-06-17 Thread Moshe Olshansky via R-devel
Dear R developers,
I am visualising high-dimensional genomic data, and for this purpose I need to 
compute pairwise distances between many points in a high-dimensional space (say 
I have a matrix of 5,000 rows and 20,000 columns, so the result is a 
5,000x5,000 matrix or its upper triangle).

Computing this in R takes many hours (I am doing this on a Linux server with 
more than 100 GB of RAM, so memory is not the problem). When I write the matrix 
to disk, read it and compute the distances in C, write them to disk and read 
them into R, it takes 10 - 15 minutes (and I did not spend much time on 
optimising my C code).

The question is: why is the R function so slow? I understand that it calls C 
(or C++) to compute the distances. My suspicion is that the transposed matrix 
is passed to C, so that each time a distance between two columns of a matrix is 
computed, and since C stores matrices by rows this is very inefficient and 
causes many cache misses (my first C implementation was like this and I had to 
stop the run after an hour when it had failed to complete).

If my suspicion is correct, is it possible to rewrite the dist function so that 
it works faster on large matrices?

Best regards,
Moshe Olshansky
Monash University



Re: [Rd] dist function in R is very slow

2017-06-17 Thread Moshe Olshansky via R-devel
Hi Stefan,
Thank you very much for pointing me to the wordspace package. It does the job a 
bit faster than my C code but is 100 times more convenient.
By the way, since the tcrossprod function in the Matrix package is so fast, the 
Euclidean distance can be computed very fast:
euc_dist <- function(m) {
  mtm <- Matrix::tcrossprod(m)   # inner products between all pairs of rows
  sq <- rowSums(m * m)           # squared Euclidean norm of each row
  sqrt(outer(sq, sq, "+") - 2 * mtm)
}
It takes less than 50 seconds for my (dense) matrix of 5,054 rows and 12,803 
columns, while dist.matrix with method="euclidean" takes almost 10 minutes 
(which is still orders of magnitude faster than dist).
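
The function above relies on the identity |x - y|^2 = |x|^2 + |y|^2 - 2 x.y, applied to all pairs of rows at once. A small check against dist() can be sketched as follows. The name euc_dist_base and the test matrix are assumptions for illustration, and base tcrossprod() stands in for Matrix::tcrossprod() so that no extra package is needed. Only off-diagonal entries are compared, because rounding can leave tiny negative values on the diagonal before the square root.

```r
# Same computation as euc_dist, using base tcrossprod() (an assumption made
# here to keep the sketch free of package dependencies)
euc_dist_base <- function(m) {
  mtm <- tcrossprod(m)
  sq <- rowSums(m * m)
  sqrt(outer(sq, sq, "+") - 2 * mtm)
}

set.seed(42)
m <- matrix(rnorm(6 * 4), nrow = 6)      # 6 points in 4 dimensions

d_fast <- suppressWarnings(euc_dist_base(m))   # diagonal may yield NaN
d_ref <- as.matrix(dist(m))                    # reference: built-in dist()

# Largest disagreement over the off-diagonal entries
max_diff <- max(abs(d_fast - d_ref)[upper.tri(d_fast)])
max_diff
```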


  From: Stefan Evert 
 To: Moshe Olshansky  
Cc: R-devel Mailing List 
 Sent: Sunday, 18 June 2017, 2:33
 Subject: Re: [Rd] dist function in R is very slow
  

> On 17 Jun 2017, at 08:47, Moshe Olshansky via R-devel  
> wrote:
> 
> I am visualising high-dimensional genomic data, and for this purpose I need 
> to compute pairwise distances between many points in a high-dimensional 
> space (say I have a matrix of 5,000 rows and 20,000 columns, so the result 
> is a 5,000x5,000 matrix or its upper triangle). Computing this in R takes 
> many hours (I am doing this on a Linux server with more than 100 GB of RAM, 
> so memory is not the problem). When I write the matrix to disk, read it and 
> compute the distances in C, write them to disk and read them into R, it 
> takes 10 - 15 minutes (and I did not spend much time on optimising my C 
> code). The question is: why is the R function so slow? I understand that it 
> calls C (or C++) to compute the distances. My suspicion is that the 
> transposed matrix is passed to C, so that each time a distance between two 
> columns of a matrix is computed, and since C stores matrices by rows this is 
> very inefficient and causes many cache misses (my first C implementation was 
> like this and I had to stop the run after an hour when it had failed to 
> complete).

There are two main reasons for the relatively low speed of the built-in dist() 
function: (i) it operates on row vectors, which leads to many cache misses 
because matrices are stored by column in R (as you guessed); (ii) the function 
takes care to handle missing values correctly, which adds a relatively 
expensive test and conditional branch to each iteration of the inner loop.
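
Point (i) can be seen on a tiny matrix. The sketch below uses made-up values and only shows the storage order; it says nothing about the internals of dist() beyond what is described above.

```r
m <- matrix(1:6, nrow = 2)    # matrix() fills column by column

mem_order <- as.vector(m)     # elements in the order they sit in memory
row1 <- m[1, ]                # first row: every second element of mem_order

# dist() measures distances between rows, so each vector it reads is
# spread out in memory rather than contiguous
d <- dist(m)

# Transposing makes the former rows contiguous columns
col1_of_t <- t(m)[, 1]
```

Here mem_order is 1 2 3 4 5 6 while row1 is 1 3 5, so walking along a row skips over nrow(m) elements at each step.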

A faster implementation, which omits the NA test and can compute distances 
between column vectors, is available as dist.matrix() in the "wordspace" 
package.  However, it cannot be used with matrices that might contain NAs (and 
doesn't warn about such arguments).

If you want the best possible speed, use cosine similarity (or equivalently, 
angular distance).  The underlying cross product is very efficient with a 
suitable BLAS implementation.
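
As a minimal sketch of this suggestion, cosine similarity between rows can be obtained from a single cross product. The function names cosine_sim and angular_dist are hypothetical and are not the wordspace API; the angle is returned in radians, which is an assumption, and rows of zero norm would give NaN.

```r
cosine_sim <- function(m) {
  norms <- sqrt(rowSums(m * m))               # Euclidean norm of each row
  sim <- tcrossprod(m) / outer(norms, norms)  # one BLAS-backed cross product
  pmin(pmax(sim, -1), 1)                      # clamp rounding noise to [-1, 1]
}

# Angular distance in radians (assumed unit)
angular_dist <- function(m) acos(cosine_sim(m))

m <- rbind(c(1, 0), c(0, 1), c(1, 1))
ang <- angular_dist(m)
ang
```

The clamp matters because rounding can push a similarity slightly above 1, where acos() would return NaN.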

Best,
Stefan

   

Re: [Rd] dist function in R is very slow

2017-06-18 Thread Moshe Olshansky via R-devel
Hi Stefan,
You are right about the possible loss of accuracy when computing the Euclidean 
distance the way I did. In some cases one can probably even get a negative 
value under the square root (so I am setting all negative numbers to 0). To do 
what I did, one must know that it is all right in one's own case.

I tried wordspace.openmp with 8 threads and it reduces the time to just over 
2.5 minutes. This is more than enough for me.

I am not sure whether you have any chance of beating the speed of 
(t)crossprod, since it may be using a (complexity-wise) faster algorithm for 
matrix multiplication (maybe with FFT - I am not sure).
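
The clamping of negative values mentioned above could look roughly like this. The name euc_dist_clamped is hypothetical, base tcrossprod() is used instead of Matrix::tcrossprod() to keep the sketch dependency-free, and forcing the diagonal to zero is an extra step assumed here rather than taken from the message.

```r
euc_dist_clamped <- function(m) {
  sq <- rowSums(m * m)
  d2 <- outer(sq, sq, "+") - 2 * tcrossprod(m)   # squared distances
  d2[d2 < 0] <- 0     # rounding can push tiny squared distances below zero
  diag(d2) <- 0       # assumed extra step: self-distances are exactly zero
  sqrt(d2)
}

set.seed(7)
m <- matrix(rnorm(8 * 5), nrow = 8)
d_clamped <- euc_dist_clamped(m)
any_nan <- any(is.nan(d_clamped))
```

Clamping removes the NaN values but does not restore the lost accuracy for pairs of points that are very close together.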
Once again, thank you very much for your comments and help.



  From: Stefan Evert 
 To: Moshe Olshansky  
Cc: R-devel Mailing List 
 Sent: Monday, 19 June 2017, 2:23
 Subject: Re: [Rd] dist function in R is very slow
   

> By the way, since the tcrossprod function in the Matrix package is so fast, 
> the Euclidean distance can be computed very fast:

Indeed.

> euc_dist <- function(m) {mtm <- Matrix::tcrossprod(m); sq <- rowSums(m*m);  
> sqrt(outer(sq,sq,"+") - 2*mtm)}

There are two reasons why I didn't use this optimization in "wordspace":

1) It can be inaccurate for small distances between vectors of large Euclidean 
length because of loss of significance in the subtraction step.  This is not 
just a theoretical concern: I've seen data sets where this became a real 
problem.

2) It incurs substantial memory overhead for a large distance matrix. Your code 
allocates at least five matrices of this size: outer(…), mtm, 2 * mtm, outer(…) 
- 2*mtm, and the final result obtained by taking the square root.  [Actually, 
there is additional overhead for m*m (an even larger matrix) when computing the 
Euclidean norms, but this could be avoided with sq <- rowNorms(m, 
method="euclidean").]
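
Point 1 can be illustrated with two made-up vectors of large norm that lie very close together. The values are assumptions chosen so that the squared norms are near 1e16, where adjacent double-precision numbers are 2 apart, so the expanded formula cannot resolve a squared distance of 1e-6.

```r
x <- c(1e8, 1)
y <- c(1e8, 1.001)            # true distance is 0.001

# Direct formula: subtract first, then square
direct <- sqrt(sum((x - y)^2))

# Expanded formula used by the cross-product trick: three terms near 1e16
# are combined, and the small true value is lost in their rounding error
d2_expanded <- sum(x * x) + sum(y * y) - 2 * sum(x * y)

c(direct = direct, expanded_squared = d2_expanded, true_squared = 1e-6)
```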

I am usually more concerned about RAM than raw processing speed, so the package 
was designed to keep memory overhead as low as possible and allow users to work 
with realistic data sets on ordinary laptop computers.
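
For a sense of scale, the sizes involved can be worked out from the dimensions quoted in this thread (5,054 rows, 12,803 columns), assuming dense double-precision storage at 8 bytes per value. The figures count allocations only; peak memory use depends on when R frees the temporaries.

```r
n_rows <- 5054
n_cols <- 12803
bytes_per_double <- 8

# One n_rows x n_rows matrix (distance matrix or any intermediate of that size)
dist_matrix_mb <- n_rows^2 * bytes_per_double / 1e6

# Five such matrices, as counted in point 2 above
five_matrices_mb <- 5 * dist_matrix_mb

# The temporary m * m has the size of the input matrix
input_copy_mb <- n_rows * n_cols * bytes_per_double / 1e6

round(c(one = dist_matrix_mb, five = five_matrices_mb, m_times_m = input_copy_mb))
```

That is roughly 204 MB per distance-sized matrix, about 1 GB for five of them, and a further 518 MB or so for m*m.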


> It takes less than 50 seconds for my (dense) matrix of 5,054 rows and 12,803 
> columns, while dist.matrix with method="euclidean" takes almost 10 minutes 
> (which is still orders of magnitude faster than dist).

It's a little disappointing that dist.matrix() is still relatively slow despite 
all simplifications and better cache consistency (the function automatically 
transposes the input matrix and computes distances by columns rather than 
rows).  I'm a little surprised about your timing, though.  Testing with a 
random 5000 x 2 matrix, my MacBook computes the full Euclidean distance 
matrix in about 5 minutes.  

If your machine (and version of R) supports OpenMP, you can improve performance 
by allowing multithreading with wordspace.openmp(threads=).  In my test 
case, I get a 2.2x speed-up with 4 threads (2m 15s instead of 5m).


Best wishes,
Stefan

   

Re: [Rd] Problem with a regular expression.

2017-08-17 Thread Moshe Olshansky via R-devel
I tried this on a Linux (Ubuntu) server invoking R from the command line and 
the result was the same, except that I could kill the R session from another 
terminal window.


  From: Rui Barradas 
 To: Chris Triggs ; "r-devel@r-project.org" 
 
Cc: Thomas Lumley 
 Sent: Thursday, 17 August 2017, 17:26
 Subject: Re: [Rd] Problem with a regular expression.
   
Hello,

This seems to be serious.
RGui.exe, fresh session. I've clicked File > New Script and wrote

Oldterm <- c("A", "B", "A", "*", "B")
strsplit(Oldterm, ")" )

I ran the instructions one at a time with Ctrl+R, and at the strsplit call 
the system froze.

Ctrl+Alt+Del didn't work, I had to go for the power switch button.
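
If the intent is to split on a literal closing parenthesis, one way to avoid the regular expression engine altogether is the fixed = TRUE argument of strsplit(). This is a sketch of a possible workaround, not a fix for the freeze, and it is untested on the affected R 3.4.1 builds.

```r
Oldterm <- c("A", "B", "A", "*", "B")

# fixed = TRUE treats ")" as a plain string instead of a regular expression
parts <- strsplit(Oldterm, ")", fixed = TRUE)

# None of the elements contains ")", so each one comes back unchanged
unlist(parts)
```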

sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Portugal.1252  LC_CTYPE=Portuguese_Portugal.1252
[3] LC_MONETARY=Portuguese_Portugal.1252 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Portugal.1252

attached base packages:
[1] stats    graphics  grDevices utils    datasets  methods  base

loaded via a namespace (and not attached):
[1] compiler_3.4.1


Rui Barradas

Em 16-08-2017 23:31, Chris Triggs escreveu:
> Hi...
>
> I have come upon a problem with a regular expression which causes base R to 
> freeze.  I have reproduced the phenomenon on several machines running R under 
> Windows 10, and also under OS X on different Apple Macs.
>
> The minimal example is:-
> Oldterm is a vector of characters, e.g. "A", "B", "A", "*", "B"
> The regular expression is ")"
>
> The call which freezes R is
> strsplit(Oldterm, ")" )
>
> Thomas - after he had reproduced the problem - suggested that I submit it to 
> r-devel.
>
> Best wishes
>            Chris Triggs

Re: [Rd] Duncan's retirement: who's taking over Rtools?

2017-09-28 Thread Moshe Olshansky via R-devel
I think that even though some Microsoft employees may have good intentions, 
Microsoft as a company cannot be trusted. There will always be a danger that 
they will try to create their own version of R which works only on Windows and 
which will become increasingly divergent from "other" R. We witnessed such 
(cursed, in my opinion) attempts in their treatment of Java and Internet 
Explorer. So I think that if we want to keep R as one language, it is very 
important that the person(s) responsible for R on Windows have no 
association with Microsoft.
Best regards,
Moshe.
 

On Friday, 29 September 2017, 1:47:36 am GMT+10, David Smith via R-devel 
 wrote:  
 
 Likewise, a hearty THANK YOU from me and the rest of the team at Microsoft for 
all the work you, Duncan, have put into making R available for Windows users 
around the world over the past 15 years. I know it wasn't easy (Windows is not 
without its quirks), but R users everywhere, ourselves included, are deeply 
appreciative and have benefited greatly.

The Microsoft R team is willing and able to produce builds for R on Windows 
going forward. As Duncan noted, we've been doing this already for some time for 
MRAN. I'd love to hear thoughts from this community on what that might mean, 
and Duncan I'll also reach out to you directly off-list. 

Cheers,
# David  

-----Original Message-----
From: R-devel [mailto:r-devel-boun...@r-project.org] On Behalf Of Joris Meys
Sent: Thursday, September 28, 2017 08:28
To: r-devel@r-project.org
Subject: [Rd] Duncan's retirement: who's taking over Rtools?

Dear dev team,

I was sorry to see the announcement of Duncan about his retirement from 
maintaining the R Windows build and Rtools. Duncan, thank you incredibly much 
for your 15 years of devotion and your impressive contribution to the R 
community as a whole.

Thinking about the future, I wondered whether there were plans for the 
succession of Duncan. Is it the intention to continue providing Rtools and a 
Windows build, or are these tasks left open for anyone (possibly Microsoft 
itself) to take them over? And if so, how will the decision be made on that?

Cheers
Joris
--
---

Biowiskundedagen 2017-2018
http://www.biowiskundedagen.ugent.be/

---

Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics

tel :  +32 (0)9 264 61 79
joris.m...@ugent.be
---
Disclaimer: http://helpdesk.ugent.be/e-maildisclaimer.php
