Re: [R] Need a vectorized way to avoid two nested FOR loops

Rama Ramakrishnan Thu, 08 Oct 2009 10:42:25 -0700

Bert, Jim, Dimitris and Joris,

Thank you all very much for your prompt help and suggestions.

After trying the ideas out, I have decided to go with Bert's approachsince it is by far the fastest of the lot.


Thanks again!

Rama Ramakrishnan


On Oct 8, 2009, at 12:49 PM, Bert Gunter wrote:

If I understand your intent, I believe you can get what you wantmuch faster

(no interpreted loops and linear times) by looking at this slightly
differently.

First of all, the choice of columns is unimportant, as indexing canbe usedto create a data frame containing only the columns of interest. So Ithinkyou can abstract your request to: group the rows of a data frame sothat allrows in a group "match." Now the problem here is exactly what youmean by"match." If the data are numeric, finite precision arithmeticrequires oneto ask whether you mean **exactly equal** or just equal within atolerance.I shall assume the former, but the latter is often what one wants.It is a

little more difficult to handle, but one way to do it with the present

approach is to first round to a few digits that represent thetolerance and

then proceed with the rounded values.

As always (and as recommended by the posting guide !) a smallreproducible

example is helpful:

## Create a data frame with groups of identical rows.

z <- data.frame(matrix(rnorm(60),ncol=3))[sample(20,50,repl=TRUE),]

## now create a factor column of "id's" in which identical columns
## have identical id's (a hash)

id <- factor(do.call(paste,c(z,sep="+")))

## The levels of the factors now "index" groups of rows that "match"
## They can be easily accessed in a variety of way, e.g.

as.numeric(id)
## gives all rows of each group of matching rows
## the same integer index.

etc.
This all requires only linear time.

Hope this helps -- or my apologies if I have misinterpreted what was
requested.


Bert Gunter
Genentech Nonclinical Biostatistics



-----Original Message-----

From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On

Behalf Of Dimitris Rizopoulos
Sent: Thursday, October 08, 2009 6:28 AM
To: joris meys
Cc: r-help@r-project.org; Rama Ramakrishnan
Subject: Re: [R] Need a vectorized way to avoid two nested FOR loops

Another approach is:

n <- 20
set.seed(2)
x <- as.data.frame(matrix(sample(1:2, n*6, TRUE), nrow = n))
x.col <- c(1, 3, 5)

values <- do.call(paste, c(x[x.col], sep = "\r"))
out <- lapply(seq_along(ind), function (i) {
    ind <- which(values == values[i])
    ind[!ind %in% i]
})
out


Best,
Dimitris


joris meys wrote:

Neat piece of code, Jim, but it still uses a nested loop. If youorder

the matrix first, you only need one passage through the whole matrix
to find the information you need.

Off course I don't take into account the ordering. If the ordering

algorithm doesn't work in linear time, then it doesn't reallymatter I

guess. The limiting step would become the ordering algorithm.

Kind regards
Joris

On Thu, Oct 8, 2009 at 2:24 PM, jim holtman <jholt...@gmail.com>wrote:

I answered the wrong question.  Here is the code to find all the
matches for each row:

n <- 20
set.seed(2)
# create test dataframe
x <- as.data.frame(matrix(sample(1:2,n*6, TRUE), nrow=n))
x
x.col <- c(1,3,5)

# match against all the other rows
x.match1 <- apply(x[, x.col], 1, function(a){
  .mat <- which(apply(x[, x.col], 1, function(z){
      all(a == z)
  }))
})

# remove matches to itself
x.match2 <- lapply(seq(length(x.match1)), function(z){
  x.match1[[z]][!(x.match1[[z]] %in% z)]
})
# x.match2 contains which rows indices match

On Wed, Oct 7, 2009 at 3:52 PM, Rama Ramakrishnan<r...@alum.mit.edu>

wrote:

Hi Friends,
I have a data frame d. Let vars be the column indices for asubset of

the

columns in d (e.g., vars <- c(1,3,4,8))

For each row r in d, I want to collect all the other rows in d that

match

the values in row r for just the columns in vars.
The naive way to do this is to have a for loop stepping througheach row

in

d, and within the loop have another loop going through all the rows

again,

checking for equality. This is quadratic in the number of rowsand takes

way

too long. Is there a better, "vectorized" way to do this?

Thanks in advance!

Rama Ramakrishnan

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide

http://www.R-project.org/posting-guide.html

and provide commented, minimal, self-contained, reproducible code.



--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide

http://www.R-project.org/posting-guide.html

and provide commented, minimal, self-contained, reproducible code.


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide

http://www.R-project.org/posting-guide.html

and provide commented, minimal, self-contained, reproducible code.


--
Dimitris Rizopoulos
Assistant Professor
Department of Biostatistics
Erasmus University Medical Center

Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
Tel: +31/(0)10/7043478
Fax: +31/(0)10/7043014

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Need a vectorized way to avoid two nested FOR loops

Reply via email to