[Rd] S3 best practice

2007-03-02 Thread Robin Hankin
Hello everyone

Suppose I have an S3 class "dog" and a function plot.dog() which
looks like this:

plot.dog <- function(x, show.uncertainty, ...){
   <do the simple plot>
   if (show.uncertainty){
      <perform the lengthy uncertainty calculations and
       superimpose the results on the simple plot>
   }
}


I think that it would be better to somehow precalculate the
uncertainty stuff and plot it separately.

How best to do this
in the context of an S3 method for plot()?

What is Best Practice here?



--
Robin Hankin
Uncertainty Analyst
National Oceanography Centre, Southampton
European Way, Southampton SO14 3ZH, UK
  tel  023-8059-7743

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Install.packages() bug in Windows XP (PR#9540)

2007-03-02 Thread juan
Dear Users,

I run R 2.2.0 for Windows (OS: Windows XP). When I call the
install.packages() function, after selecting the mirror, it shows:

--- Please select a CRAN mirror for use in this session ---
Warning: unable to access index for repository
http://cran.br.r-project.org/bin/windows/contrib/2.2
Warning: unable to access index for repository
http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.2
Error in install.packages() : argument "pkgs" is missing, with no default

Thanks,

Eng. Juan S. Ramseyer.




Re: [Rd] Install.packages() bug in Windows XP (PR#9540)

2007-03-02 Thread Uwe Ligges
This is not a bug.
And even if it were a bug, you are asked to report bugs only against recent
versions of R! Please ask questions on R-help!

1. Please check your firewall and proxy settings.
2. Please upgrade to a recent version of R.

Uwe Ligges






Re: [Rd] S3 best practice

2007-03-02 Thread Seth Falcon
Robin Hankin <[EMAIL PROTECTED]> writes:

> Hello everyone
>
> Suppose I have an S3 class "dog" and a function plot.dog() which
> looks like this:
>
> plot.dog <- function(x, show.uncertainty, ...){
>    <do the simple plot>
>    if (show.uncertainty){
>       <perform the lengthy uncertainty calculations and
>        superimpose the results on the simple plot>
>    }
> }

How uncertain is the dog in the window?

> I think that it would be better to somehow precalculate the
> uncertainty stuff and plot it separately.
>
> How best to do this
> in the context of an S3 method for plot()?

Doing long computations within plot functions can be annoying because
often one needs to "tweak" the visual style of a plot and this
requires numerous round trips.  So I like your idea of precomputing
the uncertainty stuff.

uncertainty.dog could return data that could then optionally be passed
into the plot method.  Another possibility is that the dog "class"
could store the uncertainty data and then the plot method would plot
it if it is there (and/or if an option to plot is given).  In this
case, I guess it would be:  x <- addUncertainty(x)
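A minimal sketch of that second idea, with hypothetical names and fields
(addUncertainty, a heights component) that are not from the thread:

```r
# Sketch: precompute and store the uncertainty data on the object;
# the S3 plot method then draws it only if it is present.
addUncertainty <- function(x) {
  x$uncertainty <- sd(x$heights)  # stand-in for the lengthy calculation
  x
}

plot.dog <- function(x, ...) {
  plot(x$heights, ...)            # the simple plot
  if (!is.null(x$uncertainty))    # superimpose, if precomputed
    abline(h = mean(x$heights) + c(-1, 1) * x$uncertainty, lty = 2)
  invisible(x)
}

fido <- structure(list(heights = rnorm(20)), class = "dog")
fido <- addUncertainty(fido)
# plot(fido)  # simple plot plus uncertainty bands
```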

+ seth



[Rd] Wishlist: Make screeplot() a generic (PR#9541)

2007-03-02 Thread gavin . simpson
Full_Name: Gavin Simpson
Version: 2.5.0
OS: Linux (FC5)
Submission from: (NULL) (128.40.33.76)


Screeplots are a common plot-type used to interpret the results of various
ordination methods and other techniques. A number of packages include ordination
techniques not included in a standard R installation. screeplot() works for
princomp and prcomp objects, but not for these other techniques as it was not
designed to do so. The current situation means, for example, that I have called
a function Screeplot() in one of my packages, but it would be easier for users
if they only had to remember to use screeplot() to generate a screeplot.

I would like to request that screeplot be made generic and methods for prcomp
and princomp added to R devel. This way, package authors can provide screeplot
methods for their functions as appropriate.

I have taken a look at the sources for R devel (from the SVN repository) in the
files princomp-add.R and prcomp.R, and it looks like a relatively simple change
to make screeplot generic.
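For reference, the mechanical part of the request is small; a sketch (not the
actual patch, and with a simplified default method) might look like:

```r
# Sketch: screeplot() becomes an S3 generic that dispatches on class,
# with the current prcomp/princomp behaviour as the default method.
screeplot <- function(x, ...) UseMethod("screeplot")

screeplot.default <- function(x, npcs = min(10, length(x$sdev)), ...) {
  # simplified stand-in for the existing princomp/prcomp plotting code
  barplot(x$sdev[seq_len(npcs)]^2, ylab = "Variances", ...)
}

pc <- princomp(USArrests)
screeplot(pc)  # dispatches to screeplot.default
# A package could then supply, e.g., screeplot.myOrdination() for its objects.
```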

I would be happy to provide patches and documentation if R Core were interested
in making this change - I haven't done this yet as I don't want to spend time
doing something that might not be acceptable to R core in general.

Many thanks,

G



[Rd] extracting rows from a data frame by looping over the row names: performance issues

2007-03-02 Thread Herve Pages
Hi,


I have a big data frame:

  > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
  > dat <- as.data.frame(mat)

and I need to do some computation on each row. Currently I'm doing this:

  > for (key in row.names(dat)) { row <- dat[key, ]; ... do some computation on row... }

which could probably be considered a very natural (and R'ish) way of doing it
(but maybe I'm wrong and the real idiom for doing this is something different).

The problem with this "idiomatic form" is that it is _very_ slow. The loop
itself + the simple extraction of the rows (no computation on the rows) takes
10 hours on a powerful server (quad core Linux with 8G of RAM)!

Looping over the first 100 rows takes 12 seconds:

  > system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
 user  system elapsed
   12.637   0.120  12.756

But if, instead of the above, I do this:

  > for (i in nrow(dat)) { row <- sapply(dat, function(col) col[i]) }

then it's 20 times faster!!

  > system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
 user  system elapsed
0.576   0.096   0.673

I hope you will agree that this second form is much less natural.

So I was wondering why the "idiomatic form" is so slow? Shouldn't the idiomatic
form be, not only elegant and easy to read, but also efficient?


Thanks,
H.


> sessionInfo()
R version 2.5.0 Under development (unstable) (2007-01-05 r40386)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C

attached base packages:
[1] "stats" "graphics"  "grDevices" "utils" "datasets"  "methods"
[7] "base"



Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues

2007-03-02 Thread Herve Pages
Herve Pages wrote:
...
> But if, instead of the above, I do this:
> 
>   > for (i in nrow(dat)) { row <- sapply(dat, function(col) col[i]) }

Should have been:

  > for (i in 1:nrow(dat)) { row <- sapply(dat, function(col) col[i]) }

> 
> then it's 20 times faster!!
> 
>   > system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
>  user  system elapsed
> 0.576   0.096   0.673

...

Cheers,
H.



Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues

2007-03-02 Thread Roger D. Peng
Extracting rows from data frames is tricky, since each of the columns could be 
of a different class.  For your toy example, it seems a matrix would be a more 
reasonable option.

R-devel has some improvements to row extraction, if I remember correctly.  You 
might want to try your example there.

-roger


-- 
Roger D. Peng  |  http://www.biostat.jhsph.edu/~rpeng/



Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues

2007-03-02 Thread Greg Snow
Your 2 examples have 2 differences and they are therefore confounded in
their effects.

What are your results for:

system.time(for (i in 1:100) {row <-  dat[i, ] })
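In other words, the choice of character key vs. integer index and the choice
of `[` vs. sapply can be timed separately. A small sketch of the four-way
comparison (timings are machine-dependent; the toy frame here is smaller than
the original):

```r
# Separate the two confounded factors: how the row is addressed
# (character row name vs. integer index) and how it is extracted
# ("[" on the data frame vs. sapply over the columns).
dat <- as.data.frame(matrix("abc", nrow = 1000, ncol = 5),
                     stringsAsFactors = FALSE)

t1 <- system.time(for (key in row.names(dat)[1:100]) row <- dat[key, ])
t2 <- system.time(for (i in 1:100) row <- dat[i, ])
t3 <- system.time(for (i in 1:100) row <- sapply(dat, `[`, i))
```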



-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[EMAIL PROTECTED]
(801) 408-8111
 
 




Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues

2007-03-02 Thread Wolfgang Huber

Hi Hervé

depending on your problem, using "mapply" might help, as in the code 
example below:

a = data.frame(matrix(1:3e4, ncol=3))

print(system.time({
r1 = numeric(nrow(a))
for(i in seq_len(nrow(a))) {
   g = a[i,]
   r1[i] = mean(c(g$X1, g$X2, g$X3))
}}))

print(system.time({
f = function(X1,X2,X3) mean(c(X1, X2, X3))
r2 = do.call("mapply", args=append(f, a))
}))

print(identical(r1, r2))

#   user  system elapsed
   6.049   0.200   6.987
user  system elapsed
   0.508   0.000   0.509
[1] TRUE

  Best wishes
   Wolfgang



-- 

Best wishes
   Wolfgang

--
Wolfgang Huber  EBI/EMBL  Cambridge UK  http://www.ebi.ac.uk/huber



[Rd] Patch for format.pval limitation in format.R

2007-03-02 Thread Charles Dupont

'format.pval' has a major limitation in its implementation. For example,
suppose a person had a vector like 'a' below, with the error being ±0.001.
> a <- c(0.1, 0.3, 0.4, 0.5, 0.3, 0.0001)
> format.pval(a, eps=0.001)

The person wants to have the 'format.pval' output with 2 digits always
showing like this

[1] "0.10"   "0.30"   "0.40"   "0.50"   "0.30"   "<0.001"

However, format.pval can only display this:

[1] "0.1""0.3""0.4""0.5""0.3""<0.001"

If this was the 'format' function this could be corrected by setting the
'nsmall' argument to 2.  But 'format.pval' has no ability to pass
arguments to format.


I think that the best solution would be to give 'format.pval' a '...'
argument that would get passed to all the 'format' function calls in
'format.pval'.
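For illustration, this is the 'nsmall' behaviour of 'format' that the '...'
pass-through would expose (the patched format.pval call shown is hypothetical,
pending the patch):

```r
# The nsmall argument of format() that the proposed "..." would expose:
format(c(0.1, 0.3, 0.4), nsmall = 2)   # "0.10" "0.30" "0.40"

# With the patch applied, the call in question would become:
# format.pval(a, eps = 0.001, nsmall = 2)
# giving the desired "0.10" ... "<0.001" output.
```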

I have attached a patch that does this.  This patch is against svn
r-release-branch, but it also works with r-devel.


Charles Dupont
--
Charles Dupont  Computer System Analyst School of Medicine
Department of Biostatistics Vanderbilt University

Index: src/library/base/R/format.R
===
--- src/library/base/R/format.R	(revision 40768)
+++ src/library/base/R/format.R	(working copy)
@@ -43,7 +43,7 @@
 }
 
 format.pval <- function(pv, digits = max(1, getOption("digits")-2),
-			eps = .Machine$double.eps, na.form = "NA")
+			eps = .Machine$double.eps, na.form = "NA", ...)
 {
 ## Format  P values; auxiliary for print.summary.[g]lm(.)
 
@@ -55,8 +55,8 @@
 	## be smart -- differ for fixp. and expon. display:
 	expo <- floor(log10(ifelse(pv > 0, pv, 1e-50)))
 	fixp <- expo >= -3 | (expo == -4 & digits>1)
-	if(any( fixp)) rr[ fixp] <- format(pv[ fixp], dig=digits)
-	if(any(!fixp)) rr[!fixp] <- format(pv[!fixp], dig=digits)
+	if(any( fixp)) rr[ fixp] <- format(pv[ fixp], dig=digits, ...)
+	if(any(!fixp)) rr[!fixp] <- format(pv[!fixp], dig=digits, ...)
 	r[!is0]<- rr
 }
 if(any(is0)) {
@@ -67,7 +67,7 @@
 		digits <- max(1, nc - 7)
 	sep <- if(digits==1 && nc <= 6) "" else " "
 	} else sep <- if(digits==1) "" else " "
-	r[is0] <- paste("<", format(eps, digits=digits), sep = sep)
+	r[is0] <- paste("<", format(eps, digits=digits, ...), sep = sep)
 }
 if(has.na) { ## rarely
 	rok <- r
Index: src/library/base/man/format.pval.Rd
===
--- src/library/base/man/format.pval.Rd	(revision 40768)
+++ src/library/base/man/format.pval.Rd	(working copy)
@@ -6,13 +6,14 @@
 \alias{format.pval}
 \usage{
 format.pval(pv, digits = max(1, getOption("digits") - 2),
-eps = .Machine$double.eps, na.form = "NA")
+eps = .Machine$double.eps, na.form = "NA", \dots)
 }
 \arguments{
   \item{pv}{a numeric vector.}
   \item{digits}{how many significant digits are to be used.}
   \item{eps}{a numerical tolerance: see Details.}
   \item{na.form}{character representation of \code{NA}s.}
+  \item{\dots}{arguments passed to the \code{\link{format}} function.}
 }
 \value{
   A character vector.



Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues

2007-03-02 Thread Herve Pages
Roger D. Peng wrote:
> Extracting rows from data frames is tricky, since each of the columns
> could be of a different class.  For your toy example, it seems a matrix
> would be a more reasonable option.

There is no doubt about this ;-)

  > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
  > dat <- as.data.frame(mat)

With the matrix:

  > system.time(for (i in 1:100) { row <- mat[i, ] })
 user  system elapsed
0   0   0

With the data frame:

  > system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
 user  system elapsed
   12.565   0.296  12.859


And even with a mixed-type data frame, it's very tempting to convert it
to a matrix before doing any looping on it:

  > dat2 <- as.data.frame(mat, stringsAsFactors=FALSE)
  > dat2 <- cbind(dat2, ii=1:300000)
  > sapply(dat2, typeof)
   V1  V2  V3  V4  V5  ii
  "character" "character" "character" "character" "character"   "integer"

  > system.time(for (key in row.names(dat2)[1:100]) { row <- dat2[key, ] })
 user  system elapsed
   13.201   0.144  13.360

  > system.time({mat2 <- as.matrix(dat2); for (i in 1:100) { row <- mat2[i, ] }})
 user  system elapsed
0.128   0.036   0.163

Big win, isn't it? (Only if you have enough memory for it, though...)

Cheers,
H.






Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues

2007-03-02 Thread Ulf Martin
Here is an even faster one; the general point is to create a properly
vectorized custom function/expression:

mymean <- function(x, y, z) (x+y+z)/3

a = data.frame(matrix(1:3e4, ncol=3))
attach(a)
print(system.time({r3 = mymean(X1,X2,X3)}))
detach(a)

# Yields:
# [1] 0.000 0.010 0.005 0.000 0.000

print(identical(r2, r3))
# [1] TRUE

# My values for versions 1 and 2 resp. were
# time for r1:
[1] 29.420 23.090 60.093  0.000  0.000

# time for r2:
[1] 1.400 0.050 1.505 0.000 0.000

Best wishes
Ulf


P.S. A somewhat more meaningful comparison of version 2 and 3:

a = data.frame(matrix(1:3e5, ncol=3))
# time r2e5:
[1] 12.04  0.15 12.92  0.00  0.00

# time r3e5:
[1] 0.030 0.020 0.051 0.000 0.000




Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues

2007-03-02 Thread Herve Pages
Ulf Martin wrote:
> Here is an even faster one; the general point is to create a properly
> vectorized custom function/expression:
> 
> mymean <- function(x, y, z) (x+y+z)/3
> 
> a = data.frame(matrix(1:3e4, ncol=3))
> attach(a)
> print(system.time({r3 = mymean(X1,X2,X3)}))
> detach(a)
> 
> # Yields:
> # [1] 0.000 0.010 0.005 0.000 0.000
> 

Very fast indeed! And you don't need the attach/detach trick to make your point
since it is (almost) as fast without it:

  a = data.frame(matrix(1:3e4, ncol=3))
  print(system.time({r3 = mymean(a$X1,a$X2,a$X3)}))

However, you are lucky here because in this example (the "mean" example),
you can use vectorized arithmetic, which is of course very fast.
What about the general case? Unfortunately, situations where you can
"properly vectorize" tend to be much more frequent in tutorials and demos
than in the real world. Maybe the "mean" example is a little bit too
specific to answer the general question of "what's the best way to
_efficiently_ step through a data frame row by row".
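One hedged sketch of a generic (non-vectorized) way to step through rows,
building on Wolfgang's mapply tip; the helper name rowwise is made up for
illustration:

```r
# Sketch: hand each "row" to an arbitrary function, which receives the
# column values as separate arguments (works with mixed column types).
a <- data.frame(x = 1:5, y = letters[1:5], stringsAsFactors = FALSE)

rowwise <- function(df, f)
  do.call(mapply, c(list(FUN = f, SIMPLIFY = FALSE), df))

res <- rowwise(a, function(x, y) paste(y, x))
res[[2]]  # "b 2"
```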

Cheers,
H.






Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues

2007-03-02 Thread Herve Pages
Hi Wolfgang,


Wolfgang Huber wrote:
> 
> Hi Hervé
> 
> depending on your problem, using "mapply" might help, as in the code
> example below:
> 
> a = data.frame(matrix(1:3e4, ncol=3))
> 
> print(system.time({
> r1 = numeric(nrow(a))
> for(i in seq_len(nrow(a))) {
>   g = a[i,]
>   r1[i] = mean(c(g$X1, g$X2, g$X3))
> }}))
> 
> print(system.time({
> f = function(X1,X2,X3) mean(c(X1, X2, X3))
> r2 = do.call("mapply", args=append(f, a))
> }))
> 
> print(identical(r1, r2))
> 
> #   user  system elapsed
>   6.049   0.200   6.987
>user  system elapsed
>   0.508   0.000   0.509
> [1] TRUE

Thanks for the tip! It's good to know about the mapply function (which I just
realize is mentioned in the "See Also" section of the lapply man page).
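For Wolfgang's all-numeric toy example specifically, a fully vectorized sketch (not posted in the thread) skips row extraction altogether:

```r
# Sketch: for a purely numeric data frame, rowMeans() computes every
# per-row mean in one vectorized call, with no explicit row extraction.
a <- data.frame(matrix(1:3e4, ncol = 3))
r3 <- rowMeans(a)
# identical(unname(r3), r2) -- should match Wolfgang's mapply result r2
```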

Cheers,
H.


> 
>  Best wishes
>   Wolfgang
> 
> Roger D. Peng wrote:
>> Extracting rows from data frames is tricky, since each of the columns
>> could be of a different class.  For your toy example, it seems a
>> matrix would be a more reasonable option.
>>
>> R-devel has some improvements to row extraction, if I remember
>> correctly.  You might want to try your example there.
>>
>> -roger
>>
>> Herve Pages wrote:
>>> Hi,
>>>
>>>
>>> I have a big data frame:
>>>
>>>   > mat <- matrix(rep(paste(letters, collapse=""), 5*30), ncol=5)
>>>   > dat <- as.data.frame(mat)
>>>
>>> and I need to do some computation on each row. Currently I'm doing this:
>>>
>>>   > for (key in row.names(dat)) { row <- dat[key, ]; ... do some
>>> computation on row... }
>>>
>>> which could probably considered a very natural (and R'ish) way of
>>> doing it
>>> (but maybe I'm wrong and the real idiom for doing this is something
>>> different).
>>>
>>> The problem with this "idiomatic form" is that it is _very_ slow. The
>>> loop
>>> itself + the simple extraction of the rows (no computation on the
>>> rows) takes
>>> 10 hours on a powerful server (quad core Linux with 8G of RAM)!
>>>
>>> Looping over the first 100 rows takes 12 seconds:
>>>
>>>   > system.time(for (key in row.names(dat)[1:100]) { row <- dat[key,
>>> ] })
>>>  user  system elapsed
>>>12.637   0.120  12.756
>>>
>>> But if, instead of the above, I do this:
>>>
>>>   > for (i in 1:nrow(dat)) { row <- sapply(dat, function(col) col[i]) }
>>>
>>> then it's 20 times faster!!
>>>
>>>   > system.time(for (i in 1:100) { row <- sapply(dat, function(col)
>>> col[i]) })
>>>  user  system elapsed
>>> 0.576   0.096   0.673
>>>
>>> I hope you will agree that this second form is much less natural.
>>>
>>> So I was wondering why the "idiomatic form" is so slow? Shouldn't the
>>> idiomatic
>>> form be, not only elegant and easy to read, but also efficient?
>>>
>>>
>>> Thanks,
>>> H.
>>>
>>>
>>>> sessionInfo()
>>> R version 2.5.0 Under development (unstable) (2007-01-05 r40386)
>>> x86_64-unknown-linux-gnu
>>>
>>> locale:
>>> LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C
>>>
>>>
>>> attached base packages:
>>> [1] "stats" "graphics"  "grDevices" "utils" "datasets" 
>>> "methods"
>>> [7] "base"
>>>



Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues

2007-03-02 Thread Herve Pages
Hi Greg,

Greg Snow wrote:
> Your 2 examples have 2 differences and they are therefore confounded in
> their effects.
> 
> What are your results for:
> 
> system.time(for (i in 1:100) {row <-  dat[i, ] })
> 
> 
> 

Right. What you suggest is even faster (and simpler):

  > mat <- matrix(rep(paste(letters, collapse=""), 5*30), ncol=5)
  > dat <- as.data.frame(mat)

  > system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
 user  system elapsed
   13.241   0.460  13.702

  > system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
 user  system elapsed
0.280   0.372   0.650

  > system.time(for (i in 1:100) {row <-  dat[i, ] })
 user  system elapsed
0.044   0.088   0.130

So extracting with dat[i, ] is roughly 300 times faster (comparing user
times) than extracting with dat[key, ]!

> system.time(for (i in 1:100) dat["1", ])
   user  system elapsed
 12.680   0.396  13.075

> system.time(for (i in 1:100) dat[1, ])
   user  system elapsed
  0.060   0.076   0.137

Good to know!
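If lookup by row name is genuinely required, a compromise sketch (not from the thread) is to resolve the character keys to integer positions once, up front, so the loop itself only does fast integer indexing:

```r
# Sketch: match() resolves the character keys to row positions a single
# time; the loop then uses plain integer subscripts.
keys <- row.names(dat)[1:100]
idx  <- match(keys, row.names(dat))
for (i in idx) { row <- dat[i, ] }
```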

Thanks a lot,
H.



Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues

2007-03-02 Thread Seth Falcon
Herve Pages <[EMAIL PROTECTED]> writes:
> So apparently here extracting with dat[i, ] is 300 times faster than
> extracting with dat[key, ] !
>
>> system.time(for (i in 1:100) dat["1", ])
>user  system elapsed
>  12.680   0.396  13.075
>
>> system.time(for (i in 1:100) dat[1, ])
>user  system elapsed
>   0.060   0.076   0.137
>
> Good to know!

I think what you are seeing here has to do with the space-efficient
storage of row.names of a data.frame.  The example data you are
working with has no specified row names and so they get stored in a
compact fashion:

mat <- matrix(rep(paste(letters, collapse=""), 5*30), ncol=5)
dat <- as.data.frame(mat)

> typeof(attr(dat, "row.names"))
[1] "integer"

In the call to [.data.frame, when i is character, the appropriate index
is found using pmatch, and this requires that the row names be
converted to character.  So in a loop, you get to convert the integer
vector to a character vector at each iteration.
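Seth's point can be made visible directly (a sketch, not from his message): the row names start out stored as integers, and a character subscript forces a conversion before matching can happen.

```r
# Sketch: unnamed data frames store row names as an integer vector, so a
# character subscript pays an as.character() conversion on every "[" call.
dat2 <- as.data.frame(matrix(1:10, ncol = 2))
typeof(attr(dat2, "row.names"))              # "integer" -- compact form
rn <- as.character(attr(dat2, "row.names"))  # the per-call conversion cost
match("2", rn)                               # resolves to row position 2
```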

If you assign character row names, things will be a bit faster:

# before
system.time(for (i in 1:25) dat["2", ])
   user  system elapsed 
  9.337   0.404  10.731 

# this looks funny, but has the desired result
rownames(dat) <- rownames(dat)
typeof(attr(dat, "row.names"))
# [1] "character"

# after
system.time(for (i in 1:25) dat["2", ])
   user  system elapsed 
  0.343   0.226   0.608 

And you probably would have seen this if you had looked at the
profiling data:

Rprof()
for (i in 1:25) dat["2", ]
Rprof(NULL)
summaryRprof()


+ seth
