[Rd] execution time of .packages
Hello,

The first time in a session that I call .packages( all.available = T ), it takes a long time (I have many packages installed from CRAN):

> system.time( packs <- .packages( all = T ) )
   user  system elapsed
  0.738   0.276  43.787

When I call it again, the time is much reduced, so there must be some caching somewhere. I would like to reduce the time the first call takes, but I have not been able to identify where the caching takes place, and so how I can remove it to measure the running time of a new version of the function without the caching. Short of that, I would have to restart my computer to clear the cache each time I want to test a new version (this is not going to happen).

Here is the .packages function. I am suspicious about this part: "ans <- c(ans, nam)", which grows the ans vector each time a suitable package is found; this does not sound right.

> .packages
function (all.available = FALSE, lib.loc = NULL)
{
    if (is.null(lib.loc))
        lib.loc <- .libPaths()
    if (all.available) {
        ans <- character(0L)
        lib.loc <- lib.loc[file.exists(lib.loc)]
        valid_package_version_regexp <- .standard_regexps()$valid_package_version
        for (lib in lib.loc) {
            a <- list.files(lib, all.files = FALSE, full.names = FALSE)
            for (nam in a) {
                pfile <- file.path(lib, nam, "Meta", "package.rds")
                if (file.exists(pfile))
                    info <- .readRDS(pfile)$DESCRIPTION[c("Package", "Version")]
                else next
                if ((length(info) != 2L) || any(is.na(info)))
                    next
                if (!grepl(valid_package_version_regexp, info["Version"]))
                    next
                ans <- c(ans, nam)   ## suspicious about this
            }
        }
        return(unique(ans))
    }
    s <- search()
    return(invisible(substring(s[substr(s, 1L, 8L) == "package:"], 9)))
}

> version
               _
platform       i686-pc-linux-gnu
arch           i686
os             linux-gnu
system         i686, linux-gnu
status         Under development (unstable)
major          2
minor          9.0
year           2009
month          02
day            08
svn rev        47879
language       R
version.string R version 2.9.0 Under development (unstable) (2009-02-08 r47879)

--
Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr
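For reference, a minimal sketch of the preallocate-and-fill pattern the poster is alluding to, as opposed to growing a vector with c(). The helper name and its nzchar filter are made up, standing in for the real checks in .packages:

collect_names <- function(a) {
    ans <- character(length(a))   # worst-case size, known up front
    k <- 0L
    for (nam in a) {
        if (nzchar(nam)) {        # stand-in for the package.rds checks
            k <- k + 1L
            ans[k] <- nam
        }
    }
    ans[seq_len(k)]               # trim to the entries actually filled
}

collect_names(c("pkgA", "", "pkgB"))   # "pkgA" "pkgB"

As the discussion below establishes, for ~2000 packages this growth pattern turns out to be negligible next to the disc reads.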
Re: [Rd] execution time of .packages
The caching is in the disc system: you need to find and read the package metadata for every package. AFAIK it is not easy to flush the disc cache, but it is quite easy to overwrite it with later reads. (Google for more info.)

If you are not concerned about the validity of the installed packages, you could skip the tests and hence the reads. Your times are quite a bit slower than mine, so a faster disc system might help. Since my server has just been rebooted (for a new kernel), with all of CRAN and most of BioC I get

> system.time( packs <- .packages( all = T ) )
   user  system elapsed
  0.518   0.262  25.042
> system.time( packs <- .packages( all = T ) )
   user  system elapsed
  0.442   0.080   0.522
> length(packs)
[1] 2096

There's a similar issue when installing packages: the Perl code reads the indices from every visible package to resolve links, and that can be slow the first time.

On Tue, 3 Mar 2009, Romain Francois wrote:

> [...]
> Here is the .packages function, I am suspicious about this part :
> "ans <- c(ans, nam)" which grows the ans vector each time a suitable
> package is found, this does not sound right.

It's OK as there are only going to be ca 2000 packages. Try profiling this: .readRDS and grepl take most of the time.

> [...]

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK, Fax: +44 1865 272595
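For testing purposes, there is one way to get a cold disc cache back without a reboot, assuming a Linux kernel that provides the /proc/sys/vm/drop_caches interface; this is an OS facility, not part of R, and it needs root:

## Linux-only sketch: flush filesystem buffers, then drop the page cache,
## so the next call reads the package metadata from disc again
system("sync")
system("echo 3 | sudo tee /proc/sys/vm/drop_caches")
system.time(packs <- .packages(all.available = TRUE))

Writing 3 drops the page cache plus dentries and inodes; run from a terminal so sudo can prompt for a password.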
Re: [Rd] execution time of .packages
Prof Brian Ripley wrote:
> The caching is in the disc system: you need to find and read the
> package metadata for every package. AFAIK it is not easy to flush the
> disc cache, but it is quite easy to overwrite it with later reads.
> (Google for more info.)

Thanks for the info, I'll try to find my way with these directions.

> [...]
> It's OK as there are only going to be ca 2000 packages. Try profiling
> this: .readRDS and grepl take most of the time.

I usually do not trust the result of the profiler when a for loop is involved, as it tends to miss the point (or maybe I am missing it). Consider the script below: the profiler reports 0.22 seconds when the actual time spent is about 6 seconds, and it would blame rnorm as the bottleneck when the inefficiency is in growing the data structure.
Rprof( )
x <- numeric( )
for( i in 1:1){
  x <- c( x, rnorm(10) )
}
Rprof( NULL )
print( summaryRprof( ) )

$ time Rscript --vanilla profexample.R
$by.self
        self.time self.pct total.time total.pct
"rnorm"      0.22      100       0.22       100

$by.total
        total.time total.pct self.time self.pct
"rnorm"       0.22       100      0.22      100

$sampling.time
[1] 0.22

real    0m6.164s
user    0m5.156s
sys     0m0.737s

$ time Rscript --vanilla -e "rnorm(10)"
 [1]  0.836411851  1.762081444  1.076305644  2.063515383  0.643254750
 [6]  1.698620443 -1.774479062 -0.432886214 -0.007949533  0.284089832

real    0m0.224s
user    0m0.187s
sys     0m0.024s

Now, if I replace the for loop with a similar silly lapply construct, the profiler tells me a rather different story:

Rprof( )
x <- numeric( )
y <- lapply( 1:1, function(i){
  x <<- c( x, rnorm(10) )
  NULL
} )
Rprof( NULL )
print( summaryRprof( ) )

$ time Rscript --vanilla prof2.R
$by.self
         self.time self.pct total.time total.pct
"FUN"         6.48     96.1       6.68      99.1
"rnorm"       0.20      3.0       0.20       3.0
"lapply"      0.06      0.9       6.74     100.0

$by.total
         total.time total.pct self.time self.pct
"lapply"       6.74     100.0      0.06      0.9
"FUN"          6.68      99.1      6.48     96.1
"rnorm"        0.20       3.0      0.20      3.0

$sampling.time
[1] 6.74

real    0m8.352s
user    0m4.762s
sys     0m2.574s

Or let us wrap the for loop of the first example in a function:

Rprof( )
x <- numeric( )
ffor <- function(){
  for( i in 1:1){
    x <- c( x, rnorm(10) )
  }
}
ffor()
Rprof( NULL )
print( summaryRprof( ) )

$ time Rscript --vanilla prof3.R
$by.self
        self.time self.pct total.time total.pct
"ffor"        5.4     96.4        5.6     100.0
"rnorm"       0.2      3.6        0.2       3.6

$by.total
        total.time total.pct self.time self.pct
"ffor"         5.6     100.0       5.4     96.4
"rnorm"        0.2       3.6       0.2      3.6

$sampling.time
[1] 5.6

real    0m6.379s
user    0m5.408s
sys     0m0.717s

Maybe I am getting this all wrong; maybe the global assignment operator is responsible for some of the time in the second example. But how can I analyse the result of the profiler in the first example, when it seems to account only for the 0.22 seconds, and I want to know what is going on in the rest of the time? Is it possible to treat "for" as a function when writing the profiler data, so that I can trust it more?

Romain

--
Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr
Re: [Rd] execution time of .packages
On Tue, 3 Mar 2009, Romain Francois wrote:

> [...]
> I usually do not trust the result of the profiler when a for loop is
> involved, as it tends to miss the point (or maybe I am).

Here are the data for the actual example (repeated for this message):

> Rprof()
> system.time( packs <- .packages( all = T ) )
   user  system elapsed
  0.447   0.078   0.525
> Rprof(NULL)
> summaryRprof()
$by.self
                   self.time self.pct total.time total.pct
"grepl"                 0.18     34.6       0.18      34.6
".readRDS"              0.12     23.1       0.20      38.5
".packages"             0.08     15.4       0.50      96.2
"close.connection"      0.04      7.7       0.04       7.7
"close"                 0.02      3.8       0.06      11.5
"file.exists"           0.02      3.8       0.02       3.8
"gc"                    0.02      3.8       0.02       3.8
"gzfile"                0.02      3.8       0.02       3.8
"list"                  0.02      3.8       0.02       3.8
"system.time"           0.00      0.0       0.52     100.0
"file.path"             0.00      0.0       0.02       3.8

$by.total
                   total.time total.pct self.time self.pct
"system.time"            0.52     100.0      0.00      0.0
".packages"              0.50      96.2      0.08     15.4
".readRDS"               0.20      38.5      0.12     23.1
"grepl"                  0.18      34.6      0.18     34.6
"close"                  0.06      11.5      0.02      3.8
"close.connection"       0.04       7.7      0.04      7.7
"file.exists"            0.02       3.8      0.02      3.8
"gc"                     0.02       3.8      0.02      3.8
"gzfile"                 0.02       3.8      0.02      3.8
"list"                   0.02       3.8      0.02      3.8
"file.path"              0.02       3.8      0.00      0.0

$sampling.time
[1] 0.52

There is little time unaccounted for, and 0.38 sec is going in .readRDS and grepl. Whereas

system.time({
    ans <- character(0)
    for(i in 1:2096) ans <- c(ans, "foo")
})

takes 0.024 secs, negligible here (one profiler tick).

> Consider this script below,

Whether profiling works in other examples is beside the point here.

[...]
--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK, Fax: +44 1865 272595
Re: [Rd] execution time of .packages
Prof Brian Ripley wrote:
> [...]
> There is little time unaccounted for, and 0.38 sec is going in
> .readRDS and grepl. Whereas
>
> system.time({
>     ans <- character(0)
>     for(i in 1:2096) ans <- c(ans, "foo")
> })
>
> takes 0.024 secs, negligible here (one profiler tick).
Here is what happens to me if I restart the computer:

> Rprof( )
> system.time( packs <- .packages( all = T ) )
   user  system elapsed
  0.888   0.342  35.589
> Rprof(NULL)
> summaryRprof()
$by.self
                   self.time self.pct total.time total.pct
".readRDS"              0.34     28.8       0.64      54.2
".packages"             0.14     11.9       1.16      98.3
"file.exists"           0.14     11.9       0.14      11.9
"gzfile"                0.12     10.2       0.16      13.6
"close"                 0.10      8.5       0.14      11.9
"grepl"                 0.08      6.8       0.10       8.5
"$"                     0.08      6.8       0.08       6.8
"file.path"             0.06      5.1       0.06       5.1
"close.connection"      0.04      3.4       0.04       3.4
"getOption"             0.02      1.7       0.04       3.4
"as.cha
[Rd] profiler and loops
Hello,

(This is follow-up from this thread: http://www.nabble.com/execution-time-of-.packages-td22304833.html but with a different focus.)

I am often confused by the result of the profiler when a loop is involved. Consider these two scripts:

script1:

Rprof( )
x <- numeric( )
for( i in 1:1){
  x <- c( x, rnorm(10) )
}
Rprof( NULL )
print( summaryRprof( ) )

script2:

Rprof( )
ffor <- function(){
  x <- numeric( )
  for( i in 1:1){
    x <- c( x, rnorm(10) )
  }
}
ffor()
Rprof( NULL )
print( summaryRprof( ) )

[]$ time Rscript --vanilla script1.R
$by.self
        self.time self.pct total.time total.pct
"rnorm"      0.22      100       0.22       100

$by.total
        total.time total.pct self.time self.pct
"rnorm"       0.22       100      0.22      100

$sampling.time
[1] 0.22

real    0m7.786s
user    0m5.192s
sys     0m0.735s

[]$ time Rscript --vanilla script2.R
$by.self
        self.time self.pct total.time total.pct
"ffor"       4.94     92.5       5.34     100.0
"rnorm"      0.40      7.5       0.40       7.5

$by.total
        total.time total.pct self.time self.pct
"ffor"        5.34     100.0      4.94     92.5
"rnorm"       0.40       7.5      0.40      7.5

$sampling.time
[1] 5.34

real    0m7.841s
user    0m5.152s
sys     0m0.712s

In the first one I call a for loop from the top level; in the second one the loop is wrapped in a function call. This shows the profiler's inability to point at loops as responsible for bottlenecks: the coder of script1 would not know what to do to improve the script.

I have had a quick look at the code, and here are a few thoughts. In the function "doprof" in eval.c, this loop writes the call stack to the profiler file:

for (cptr = R_GlobalContext; cptr; cptr = cptr->nextcontext) {
    if ((cptr->callflag & (CTXT_FUNCTION | CTXT_BUILTIN))
        && TYPEOF(cptr->call) == LANGSXP) {
        SEXP fun = CAR(cptr->call);
        if (!newline) newline = 1;
        fprintf(R_ProfileOutfile, "\"%s\" ",
                TYPEOF(fun) == SYMSXP ? CHAR(PRINTNAME(fun)) : "<Anonymous>");
    }
}

so we can see it only cares about the contexts CTXT_FUNCTION and CTXT_BUILTIN, whereas for loops play with CTXT_LOOP (this is again in eval.c, within the do_for function):

begincontext(&cntxt, CTXT_LOOP, R_NilValue, rho, R_BaseEnv,
             R_NilValue, R_NilValue);

which, as the name implies, begins the context of the for loop. The begincontext function looks like this:

void begincontext(RCNTXT * cptr, int flags, SEXP syscall, SEXP env,
                  SEXP sysp, SEXP promargs, SEXP callfun)
{
    cptr->nextcontext = R_GlobalContext;
    cptr->cstacktop = R_PPStackTop;
    cptr->evaldepth = R_EvalDepth;
    cptr->callflag = flags;
    cptr->call = syscall;
    cptr->cloenv = env;
    cptr->sysparent = sysp;
    cptr->conexit = R_NilValue;
    cptr->cend = NULL;
    cptr->promargs = promargs;
    cptr->callfun = callfun;
    cptr->vmax = vmaxget();
    cptr->intsusp = R_interrupts_suspended;
    cptr->handlerstack = R_HandlerStack;
    cptr->restartstack = R_RestartStack;
    cptr->prstack = R_PendingPromises;
#ifdef BYTECODE
    cptr->nodestack = R_BCNodeStackTop;
# ifdef BC_INT_STACK
    cptr->intstack = R_BCIntStackTop;
# endif
#endif
    R_GlobalContext = cptr;
}

So it could be possible to set the syscall argument of the begincontext call in do_for to the "for" call, and use this code in the doprof function:

for (cptr = R_GlobalContext; cptr; cptr = cptr->nextcontext) {
    if ((cptr->callflag & (CTXT_FUNCTION | CTXT_BUILTIN))
        && TYPEOF(cptr->call) == LANGSXP) {
        SEXP fun = CAR(cptr->call);
        if (!newline) newline = 1;
        fprintf(R_ProfileOutfile, "\"%s\" ",
                TYPEOF(fun) == SYMSXP ? CHAR(PRINTNAME(fun)) : "<Anonymous>");
    } else if (cptr->callflag & CTXT_LOOP) {
        SEXP fun = CAR(cptr->call);
        if (!newline) newline = 1;
        fprintf(R_ProfileOutfile, "\"%s\" ", CHAR(PRINTNAME(fun)));
    }
}

so that "for" appears in the list of "functions" in the profiler file.
Obviously I am taking some shortcuts here, because of the other loop types, but I would like to make a formal patch along these lines. Before I do that, I'd like to know:

- does this have a chance of breaking something else (is the CTXT_LOOP context's call being R_NilValue relied upon elsewhere)?
- would this feature be welcome?
- should I differentiate real functions from loops in the output file? Maybe I can write "[for]" instead of for, to emphasize that this is not a function.

Romain

--
Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr
Re: [Rd] execution time of .packages
Let me repeat: what is happening for me in the equivalent of your 35.589 - 1.18 seconds is that R is waiting for my OS to read its discs (and they can be heard chuntering away). As the R process is not running at those times, the profiler is not running either (on a Unix-alike: on Windows the profiler does measure elapsed time). I expect it will be the same explanation for you.

What I have already suggested is that if you want to save time, do not read and check the package.rds files. As far as I can see they were checked at installation in any recent version of R. Just check their existence.

On Tue, 3 Mar 2009, Romain Francois wrote:

> [...]
> Here is what happens to me if I restart the computer:
>
> Rprof( )
> system.time( packs <- .packages( all = T ) )
>    user  system elapsed
>   0.888   0.342  35.589
> [...]
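A minimal sketch of the existence-only variant suggested above; the function name is made up, and it trusts the install-time validation, skipping the read and version check that .packages performs on each package.rds:

fast_packages <- function(lib.loc = .libPaths()) {
    lib.loc <- lib.loc[file.exists(lib.loc)]
    ans <- character(0)
    for (lib in lib.loc) {
        nams <- list.files(lib, all.files = FALSE, full.names = FALSE)
        ## keep only directories that have installed package metadata
        keep <- file.exists(file.path(lib, nams, "Meta", "package.rds"))
        ans <- c(ans, nams[keep])
    }
    unique(ans)
}

Only directory listings and one stat per candidate touch the disc, so the cold-cache cost should drop to the filesystem lookups rather than the reads.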
Re: [Rd] execution time of .packages
Prof Brian Ripley wrote:
> Let me repeat: what is happening for me in the equivalent of your
> 35.589 - 1.18 seconds is that R is waiting for my OS to read its
> discs (and they can be heard chuntering away). As the R process is
> not running at those times, the profiler is not running either (on a
> Unix-alike: on Windows the profiler does measure elapsed time). I
> expect it will be the same explanation for you.

Thank you. I get it this time.

> What I have already suggested is that if you want to save time, do
> not read and check the package.rds files. As far as I can see they
> were checked at installation in any recent version of R. Just check
> their existence.
>
> [...]
[Rd] R 2.9.0 devel: package installation with configure-args option
Hi,

Trying to install a package containing C code and requiring non-default configure argument settings, the incantation (this has worked for R <= 2.8.1 on the same architectures)

R CMD INSTALL --configure-args="--with-opt1 --with-opt2" packname

always results in a warning

Warning: unknown option '--with-opt2'

and consequently the option is ignored. Reversing the order of the options results in the now-last option being ignored. Alternative quoting has not provided a solution. Using

R CMD INSTALL --configure-args=--with-opt1 --configure-args=--with-opt2 packname

does provide a workaround, though. Is this the (new to me) and only intended way to provide more than one configure argument? I checked ?INSTALL and the referenced R-admin sec. 'Configuration variables' but am still not clear on this.

Regards,
Matthias

R version 2.9.0 Under development (unstable) (2009-03-02 r48041) on Ubuntu 8.04, 8.10

--
Matthias Burger
Project Manager / Biostatistician
Epigenomics AG, Kleine Praesidentenstr. 1, 10178 Berlin, Germany
phone: +49-30-24345-0   fax: +49-30-24345-555
http://www.epigenomics.com   matthias.bur...@epigenomics.com
Re: [Rd] profiler and loops
Hello,

Please find attached a patch against svn implementing this proposal. The part I don't fully understand is the part involving the function loopWithContext, so I've put "[loop]" in there instead of "[for]", "[while]" or "[repeat]", because I don't really know how to extract the information.

With script1 from my previous post, summaryRprof produces this:

[]$ /home/romain/workspace/R-trunk/bin/Rscript script1.R
$by.self
        self.time self.pct total.time total.pct
"[for]"      5.32     98.9       5.38     100.0
"rnorm"      0.06      1.1       0.06       1.1

$by.total
        total.time total.pct self.time self.pct
"[for]"       5.38     100.0      5.32     98.9
"rnorm"       0.06       1.1      0.06      1.1

$sampling.time
[1] 5.38

Romain

Romain Francois wrote:
> [...]
Re: [Rd] S4 data dump or?
Prof Brian Ripley wrote:
> On Mon, 2 Mar 2009, Paul Gilbert wrote:
>> I am trying to dump some data in a file that I will add to a
>> package. The data has an attribute which is an S4 object, and this
>> seems to cause problems. What is the preferred way to write a file
>> with a dataset that has some S4 parts, so that it can be included in
>> a package?
>
> Using save() seems almost always preferable to dump(): usually a
> smaller result, avoids representation error changes for numeric types
> and encoding issues for some character vectors, works for almost all
> objects.

Ok. I thought I was having a problem with save()/load() too, but it seems that problem was something else. I have this working now.

> I am guessing that the note in ?dump about objects of type S4 is the
> issue here.

Yes, the S4 object causes source() of the dump()ed file to fail.

Thanks,
Paul Gilbert
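A minimal sketch of the save()/load() route for a dataset carrying an S4 attribute; the class and slot names are invented for illustration:

## an ordinary vector with an S4 object attached as an attribute
setClass("Meta", representation(source = "character"))
x <- rnorm(5)
attr(x, "meta") <- new("Meta", source = "simulated")

save(x, file = "x.rda")   # serializes the S4 attribute faithfully
rm(x)
load("x.rda")             # x comes back with its S4 attribute intact
attr(x, "meta")

A dump() of the same object would produce text that source() cannot reconstruct, which is the failure described above.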
[Rd] make check reg-tests-1.R error on solaris
R 2.5.1 compiled, passed make check, and has been running successfully for a couple of years on a Sun Fire V490 running Solaris 9. I need a newer version of R, but can't get a newer version to pass make check. I've tried 2.8.1, 2.7.2, 2.6.2 and 2.6.0 (2.5.1 still passes on this server). At this point I thought I'd try to compile it on another Sun server (Solaris 10), but it had the same problem. Configuring with no options didn't help.

I commented out the failed test from the Makefile to see if it would pass the rest of the tests; it passes all of them. Here is the failure from make check:

make[2]: Entering directory `/usr/local/src/R-2.8.1/tests'
running regression tests
make[3]: Entering directory `/usr/local/src/R-2.8.1/tests'
running code in 'reg-tests-1.R' ...make[3]: *** [reg-tests-1.Rout] Error 1
make[3]: Leaving directory `/usr/local/src/R-2.8.1/tests'
make[2]: *** [test-Reg] Error 2
make[2]: Leaving directory `/usr/local/src/R-2.8.1/tests'
make[1]: *** [test-all-basics] Error 1
make[1]: Leaving directory `/usr/local/src/R-2.8.1/tests'
make: *** [check] Error 2
bash-2.05#

Here is the output from reg-tests-1.Rout.fail:

[1] "41c6167e"     "dir1"         "dir2"         "dirs"         "file275c23f2"
[6] "file33f963f2" "moredirs"
> file.create(file.path(dd, "somefile"))
[1] TRUE TRUE TRUE TRUE
> dir(".", recursive=TRUE)
[1] "41c6167e"          "dir1/somefile"     "dir2/somefile"
[4] "dirs/somefile"     "file275c23f2"      "file33f963f2"
[7] "moredirs/somefile"
> stopifnot(unlink("dir?") == 1) # not an error
Error: unlink("dir?") == 1 is not TRUE
Execution halted
rm: Cannot remove any directory in the path of the current working directory
/tmp/RtmprBjF6W

Looking through the archives I found a couple of other people with this error, both running Solaris 10. PR#10501 and PR#11738 have quite a lot of information about it, but I don't see any resolution for them. This looks like it could be enough of a problem that I haven't put 2.8.1 in production. Can you help me with a resolution, or let me know whether it is safe to ignore? I'd appreciate it. Thank you!

Karen
Re: [Rd] profiler and loops
Please ignore the previous patch, which did not take into account the conditional compilation of doprof on Windows. This one does, but was not tested on Windows.

Romain

Romain Francois wrote:
> Hello,
>
> Please find attached a patch against svn implementing this proposal.
> [...]
Re: [Rd] R 2.9.0 devel: package installation with configure-args option
That version of R is 'under development', and the INSTALL file says

## FIXME: this loses quotes, so filepaths with spaces in get broken up

so it is, I think, the same as a known issue. The whole package installation process has been completely reconstructed for R-devel, and the process is not quite finished. And this is a low priority, as there are effective workarounds.

On Tue, 3 Mar 2009, ml-it-r-de...@epigenomics.com wrote:

> [...]

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK, Fax: +44 1865 272595
[Rd] X11.Rd has a dead link
In X11.Rd the Resources section has the following dead link:

http://web.mit.edu/answers/xwindows/xwindows_resources.html

I never saw the target document, but perhaps this is its new URL:

http://kb.mit.edu/confluence/pages/viewpage.action?pageId=3907291

Thank you,
Stephen

--
Rochester, Minn. USA
[Rd] 'anova.gls' in 'nlme' (PR#13567)
There is a bug in 'anova.gls' in the 'nlme' package (3.1-90). The bug is triggered by calling the function with a single 'gls' object and specifying the 'Terms' argument but not the 'L' argument:

> library(nlme)
> fm1Orth.gls <- gls(distance ~ Sex * I(age - 11), Orthodont,
+                    correlation = corSymm(form = ~ 1 | Subject),
+                    weights = varIdent(form = ~ 1 | age))
> anova(fm1Orth.gls)
Denom. DF: 104
                numDF  F-value p-value
(Intercept)         1 4246.041  <.0001
Sex                 1    7.718  0.0065
I(age - 11)         1  116.806  <.0001
Sex:I(age - 11)     1    7.402  0.0076
> anova(fm1Orth.gls, Terms="Sex")
Error in anova.gls(fm1Orth.gls, Terms = "Sex") :
  object "noZeroColL" not found

The bug is in the following lines near the end:

    if (!missing(L)) {
        if (nrow(L) > 1)
            attr(aod, "L") <- L[, noZeroColL, drop = FALSE]
        else attr(aod, "L") <- L[, noZeroColL]
    }

The problem is that when 'Terms' is provided, earlier code sets 'L' (so it is no longer missing) but does not set 'noZeroColL'. In the similar function 'anova.lme' the problem is avoided by the first line

    Lmiss <- missing(L)

and then testing whether 'Lmiss' is TRUE in the rest of the function, rather than 'missing(L)'.

Rich Raubertas
Merck & Co.

> sessionInfo()
R version 2.8.1 (2008-12-22)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] nlme_3.1-90

loaded via a namespace (and not attached):
[1] grid_2.8.1      lattice_0.17-20 tools_2.8.1
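The pitfall is easy to demonstrate in isolation: once a formal argument has been assigned to in the function body, missing() returns FALSE from then on, so the status has to be recorded before any assignment. A self-contained illustration with made-up function names:

f_bad <- function(x, L) {
    if (missing(L)) L <- x * 2   # assigning to L here ...
    missing(L)                   # ... makes this FALSE from now on
}
f_good <- function(x, L) {
    Lmiss <- missing(L)          # record the status first
    if (Lmiss) L <- x * 2
    Lmiss                        # still TRUE when L was not supplied
}
f_bad(1)    # FALSE, even though no L was passed
f_good(1)   # TRUE

This is the same repair 'anova.lme' already uses, and presumably the one-line fix for 'anova.gls'.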
Re: [Rd] S4 helper functions: regular or generic?
Dear Martin,

Thanks a lot for your help; apologies for this very late reply. I decided to go with your suggestion and write a regular function. I guess this avoids doing

obj <- as(foo(as(obj, 'Base')), 'Derived')

and then repopulating the extra slots of the 'Derived' class.

Regards,
gopi.

On Wed, Feb 25, 2009 at 9:36 AM, Martin Morgan wrote:
> Hi Gopi --
>
> Gopi Goswami writes:
>
>> Hi there,
>>
>> I want to write helper functions for a base class, which will be used
>> by its subclasses in the S4 world. This function ___will___ update
>> certain slots of its argument object. Please help me decide which one
>> of the following is a better approach with respect to coding style,
>> memory usage and speed:
>
> My opinion:
>
>> o Write a regular function.
>
> memory and speed
>
>> o Declare a generic and implement it just for the base class.
>
> coding 'style', but style is subjective.
>
> There are other aspects of S4, e.g., type checking, method dispatch,
> programmatically defined and discoverable API, ... (positives),
> cumbersome documentation (negative).
>
> My usual pattern of development is to be seduced by the siren of
> speed, only to regret boxing myself in.
>
> I find that my S4 objects typically serve as containers for
> coordinating other entities. The important methods typically extract
> R 'base' objects from the S4 class, manipulate them, and repackage the
> result as S4. The time and speed issues are in the manipulation, not
> in the extraction / repackaging. This is in contrast to, say, an
> implementation of a tree-like data structure with a collection of
> 'Node' objects, where tree operations would require access to each
> object and would be horribly slow in S4 (and perhaps in R when nodes
> were represented as a list, say, at least compared to a C-level
> representation, or an alternative representation that took advantage
> of R's language characteristics).
>
> Martin
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M2 B169
> Phone: (206) 667-2793
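For concreteness, a minimal sketch of the plain-function approach with made-up class, slot, and helper names: because R copies on modify, the helper returns the updated object, works unchanged on any subclass, and the subclass's extra slots survive without the as()-coercion round trip mentioned above.

setClass("Base", representation(thing = "ANY"))
setClass("Derived", contains = "Base",
         representation(extra = "numeric"))

touch <- function(obj, value) {   # hypothetical slot-updating helper
    obj@thing <- value            # copy-on-modify: returns a changed copy
    obj
}

d <- new("Derived", thing = 0, extra = c(1, 2, 3))
d <- touch(d, "updated")
class(d)    # still "Derived"
d@extra     # 1 2 3, untouched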
[Rd] callNextMethod() doesn't pass down arguments
Hi,

According to its man page, callNextMethod() (called with no argument) should call the next method "with the arguments to the current method passed down to the next method". But for the "[[" and "[" generics (primitives), the argument after the dots doesn't seem to be passed down:

setClass("A", representation(thing="ANY"))
setClass("B", contains="A")

setMethod("[[", "A", function(x, i, j, ..., exact=TRUE) return(exact))
setMethod("[[", "B", function(x, i, j, ..., exact=TRUE) callNextMethod())

b <- new("B")

> b[[3]]
[1] TRUE
> b[[3, exact=FALSE]]
[1] TRUE

setMethod("[", "A", function(x, i, j, ..., drop=TRUE) return(drop))
setMethod("[", "B", function(x, i, j, ..., drop=TRUE) callNextMethod())

> b[3]
[1] TRUE
> b[3, drop=FALSE]
[1] TRUE

I tried this with R 2.8.0 and 2.9.0 (r47727).

Cheers,
H.
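A possible workaround, not verified against these R versions, is to pass the arguments to callNextMethod() explicitly instead of relying on the implicit pass-down:

## explicit pass-down: forward 'exact' by hand
setMethod("[[", "B", function(x, i, j, ..., exact = TRUE)
    callNextMethod(x, i, exact = exact))

b <- new("B")
b[[3, exact = FALSE]]   # should now reach the "A" method as FALSE

If the explicit form also fails, that would narrow the problem to the primitive dispatch itself rather than the argument matching.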