Re: [Rd] [patch] Documentation for list.files when no matches found
Thanks for the report, fixed in the documentation in R-devel. Best, Tomas

On 1/7/19 3:03 AM, Jonathan Carroll wrote:

Apologies in advance if this is already known, but a search of the r-devel archive did not immediately turn up any mentions.

list.files() (and thus dir()) returns character(0) when no files are found in the requested path. This is useful and expected behaviour, as length(dir()) can be tested for success. The Value documentation, however, indicates otherwise:

    A character vector containing the names of the files in the
    specified directories, or "" if there were no files.

which would be less useful and does not match current behaviour. This appears to have been the case for the majority of the lifetime of the software, so I'm not sure it's terribly important, but for the sake of consistency I propose the following simple patch.

Kind regards,

- Jonathan.

--- a/src/library/base/man/list.files.Rd
+++ b/src/library/base/man/list.files.Rd
@@ -45,7 +45,7 @@ list.dirs(path = ".", full.names = TRUE, recursive = TRUE)
 }
 \value{
   A character vector containing the names of the files in the
-  specified directories, or \code{""} if there were no files. If a
+  specified directories, or \code{character(0)} if there were no files. If a
   path does not exist or is not a directory or is unreadable it is
   skipped, with a warning.

__ R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
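For reference, a small sketch of the idiom the corrected documentation describes: character(0) has length zero, so the result can be tested directly (whereas the "" of the old wording would have length 1 and defeat this test).

```r
# Sketch: list.files() on an empty directory returns character(0)
d <- tempfile()
dir.create(d)                      # a fresh, empty directory
files <- list.files(d)
stopifnot(identical(files, character(0)))
if (length(files) == 0) message("no files found")
```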
[Rd] Bug report with patch: `stats:::regularize.values()` always creates full copies of `x` and `y`
This is intended to be a bug report with a proposed patch. I am posting to this mailing list as described in the NOTE in "Bug Reporting in R".

The function `stats:::regularize.values()` is meant to preprocess the `x` and `y` arguments to have "proper" values for later use during interpolation. If the input is already "proper", I would expect it to reuse the same objects without creating new ones. However, this isn't the case, and it is the source of unnecessary extra memory usage in `approx()` and others.

The root cause seems to be an unconditional reordering in lines 37-39 of the file 'approx.R'. If the reordering is done only when `x` is unsorted, no copies are created. Nor does this seem likely to break any existing code. A patch is attached.

Reproducible code:

x <- seq(1, 100, 1)
y <- seq(1, 100, 1)
reg_xy <- stats:::regularize.values(x, y, mean)

# Regularized versions of `x` and `y` are identical to the input but are
# stored at different addresses
identical(x, reg_xy[["x"]])
#> [1] TRUE
.Internal(inspect(x))
#> @15719b0 14 REALSXP g0c7 [NAM(3)] (len=100, tl=0) 1,2,3,4,5,...
.Internal(inspect(reg_xy[["x"]]))
#> @2b84130 14 REALSXP g0c7 [NAM(3)] (len=100, tl=0) 1,2,3,4,5,...

identical(y, reg_xy[["y"]])
#> [1] TRUE
.Internal(inspect(y))
#> @2c91be0 14 REALSXP g0c7 [NAM(3)] (len=100, tl=0) 1,2,3,4,5,...
.Internal(inspect(reg_xy[["y"]]))
#> @2bb4880 14 REALSXP g0c7 [NAM(3)] (len=100, tl=0) 1,2,3,4,5,...
# Differs from the original only by using `if (is.unsorted(x))`
new_regularize.values <- function(x, y, ties) {
    x <- xy.coords(x, y, setLab = FALSE)
    y <- x$y
    x <- x$x
    if (any(na <- is.na(x) | is.na(y))) {
        ok <- !na
        x <- x[ok]
        y <- y[ok]
    }
    nx <- length(x)
    if (!identical(ties, "ordered")) {
        if (is.unsorted(x)) {
            o <- order(x)
            x <- x[o]
            y <- y[o]
        }
        if (length(ux <- unique(x)) < nx) {
            if (missing(ties))
                warning("collapsing to unique 'x' values")
            y <- as.vector(tapply(y, match(x, x), ties))
            x <- ux
            stopifnot(length(y) == length(x))
        }
    }
    list(x = x, y = y)
}

new_reg_xy <- new_regularize.values(x, y, mean)

# Output is still identical to the input and also references the same objects
identical(x, new_reg_xy[["x"]])
#> [1] TRUE
.Internal(inspect(x))
#> @15719b0 14 REALSXP g1c7 [MARK,NAM(3)] (len=100, tl=0) 1,2,3,4,5,...
.Internal(inspect(new_reg_xy[["x"]]))
#> @15719b0 14 REALSXP g1c7 [MARK,NAM(3)] (len=100, tl=0) 1,2,3,4,5,...

identical(y, new_reg_xy[["y"]])
#> [1] TRUE
.Internal(inspect(y))
#> @2c91be0 14 REALSXP g1c7 [MARK,NAM(3)] (len=100, tl=0) 1,2,3,4,5,...
.Internal(inspect(new_reg_xy[["y"]]))
#> @2c91be0 14 REALSXP g1c7 [MARK,NAM(3)] (len=100, tl=0) 1,2,3,4,5,...
# Current R version
R.version
#> platform       x86_64-pc-linux-gnu
#> arch           x86_64
#> os             linux-gnu
#> system         x86_64, linux-gnu
#> status
#> major          3
#> minor          5.2
#> year           2018
#> month          12
#> day            20
#> svn rev        75870
#> language       R
#> version.string R version 3.5.2 (2018-12-20)
#> nickname       Eggshell Igloo

--
Best regards,
Evgeni Chasnovski

Index: src/library/stats/R/approx.R
===
--- src/library/stats/R/approx.R (revision 75926)
+++ src/library/stats/R/approx.R (working copy)
@@ -34,9 +34,11 @@
     }
     nx <- length(x)
     if (!identical(ties, "ordered")) {
-       o <- order(x)
-       x <- x[o]
-       y <- y[o]
+       if (is.unsorted(x)) {
+           o <- order(x)
+           x <- x[o]
+           y <- y[o]
+       }
        if (length(ux <- unique(x)) < nx) {
            if (missing(ties))
                warning("collapsing to unique 'x' values")
[Rd] unsorted - suggestion for performance improvement and ALTREP support for POSIXct
I believe the performance of isUnsorted() in sort.c could be improved by calling REAL() once (outside of the for loop), rather than calling it twice inside the loop. As an aside, it is implemented in the faster way in doSort() (sort.c line 401). The example below shows the performance improvement, for vectors of doubles, of moving REAL() outside the for loop.

# example as implemented in isUnsorted
body = "
    R_xlen_t n, i;
    n = XLENGTH(x);
    for(i = 0; i+1 < n ; i++)
        if(REAL(x)[i] > REAL(x)[i+1])
            return ScalarLogical(TRUE);
    return ScalarLogical(FALSE);";
f1 = inline::cfunction(sig = signature(x='numeric'), body=body)

# example updated with only one call to REAL()
body = "
    R_xlen_t n, i;
    n = XLENGTH(x);
    double* real_x = REAL(x);
    for(i = 0; i+1 < n ; i++)
        if(real_x[i] > real_x[i+1])
            return ScalarLogical(TRUE);
    return ScalarLogical(FALSE);";
f2 = inline::cfunction(sig = signature(x='numeric'), body=body)

# unsorted
x.double = as.double(1:1e7) + 0
x.posixct = Sys.time() + x.double

microbenchmark::microbenchmark(
    f1(x.double),  f2(x.double),   # faster due to one REAL()
    f1(x.posixct), f2(x.posixct),  # faster due to one REAL()
    unit='ms', times=10)

Unit: milliseconds
          expr       min        lq      mean    median        uq      max neval
  f1(x.double) 35.737629 37.991785 43.004432 38.575525 39.198533 80.85625    10
  f2(x.double)  6.053373  6.064323  7.238750  6.092453  8.438550 10.69384    10
 f1(x.posixct) 36.315705 36.542253 42.349745 38.355395 39.378262 81.59857    10
 f2(x.posixct)  6.063946  6.070741  7.579176  6.138518  7.063024 13.94141    10

I would also like to suggest ALTREP support for POSIXct vectors, which are interpreted as type REAL in the C code but do not gain the performance benefits of real vectors. Sorted vectors of timestamps are important for joining time series and in calls to findInterval().
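The effect of hoisting can also be illustrated outside of R with a plain-C sketch (this is an illustration, not R's actual source; get_data() is a hypothetical stand-in for REAL(), which in R's C API is a function call rather than a raw pointer access, so invoking it on every iteration has real cost):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for REAL(x): a per-call lookup of the data pointer. */
static const double *get_data(const double *x) { return x; }

/* Per-iteration lookup, as in the current isUnsorted(). */
int is_unsorted_v1(const double *x, size_t n) {
    for (size_t i = 0; i + 1 < n; i++)
        if (get_data(x)[i] > get_data(x)[i + 1])
            return 1;
    return 0;
}

/* Lookup hoisted out of the loop, as in doSort(). */
int is_unsorted_v2(const double *x, size_t n) {
    const double *p = get_data(x);   /* one lookup, reused in the loop */
    for (size_t i = 0; i + 1 < n; i++)
        if (p[i] > p[i + 1])
            return 1;
    return 0;
}
```

Both versions compute the same answer; the second simply avoids the repeated accessor call in the hot loop, which is where the benchmark's speedup comes from.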
# unsorted vectors
x.double = as.double(1:1e7) + 0
x.posixct = Sys.time() + x.double

# sort for altrep benefit
x.double.sort <- sort(x.double)
x.posixct.sort <- sort(x.posixct)

microbenchmark::microbenchmark(
    is.unsorted(x.double),
    is.unsorted(x.double.sort),   # faster due to altrep
    is.unsorted(x.posixct),
    is.unsorted(x.posixct.sort),  # no altrep benefit
    unit='ms', times=10)

Unit: milliseconds
                         expr       min        lq       mean     median        uq        max neval
        is.unsorted(x.double) 16.987730 17.010008 17.1577173 17.0862785 17.308674  17.474432    10
   is.unsorted(x.double.sort)  0.000378  0.000756  0.0065327  0.0075525  0.010195   0.011706    10
       is.unsorted(x.posixct) 36.925876 37.084837 43.4125593 37.4695915 41.858589  78.742174    10
  is.unsorted(x.posixct.sort) 36.966654 37.031975 51.1228686 37.1235380 37.777319 153.270170    10

Since there do not appear to be any tests for is.unsorted(), here are some tests to be added for some types.

# integer sequence
x <- -10L:10L
stopifnot(!is.unsorted(x, na.rm = F, strictly = T))
stopifnot(!is.unsorted(x, na.rm = F, strictly = F))

# integer not strictly
x <- -10L:10L
x[2] <- x[3]
stopifnot( is.unsorted(x, na.rm = F, strictly = T))
stopifnot(!is.unsorted(x, na.rm = F, strictly = F))

# integer with NA
x <- -10L:10L
x[2] <- NA
stopifnot(!is.unsorted(x, na.rm = T, strictly = F))
stopifnot(is.na(is.unsorted(x, na.rm = F, strictly = F)))

# double
x <- seq(from = -10, to = 10, by = 0.01)
stopifnot(!is.unsorted(x, na.rm = F, strictly = T))
stopifnot(!is.unsorted(x, na.rm = F, strictly = F))

# double not strictly
x <- seq(from = -10, to = 10, by = 0.01)
x[2] <- x[3]
stopifnot( is.unsorted(x, na.rm = F, strictly = T))
stopifnot(!is.unsorted(x, na.rm = F, strictly = F))

# double with NA
x <- seq(from = -10, to = 10, by = 0.01)
x[length(x)] <- NA
stopifnot(!is.unsorted(x, na.rm = T, strictly = F))
stopifnot(is.na(is.unsorted(x, na.rm = F, strictly = F)))

# logical
stopifnot(!is.unsorted( c(F, T, T), strictly = F))
stopifnot( is.unsorted( c(F, T, T), strictly = T))
stopifnot( is.unsorted( c(T, T, F), strictly = F))
stopifnot( is.unsorted( c(T, T, F), strictly = T))

# POSIXct
x <- seq(from = as.POSIXct('2018-1-1'), to = as.POSIXct('2019-1-1'), by = 'day')
stopifnot(!is.unsorted(x, na.rm = T, strictly = F))
stopifnot(!is.unsorted(x, na.rm = F, strictly = F))

# POSIXct not strictly
x <- seq(from = as.POSIXct('2018-1-1'), to = as.POSIXct('2019-1-1'), by = 'day')
x[2] <- x[3]
stopifnot( is.unsorted(x, na.rm = F, strictly = T))
stopifnot(!is.unsorted(x, na.rm = F, strictly = F))

# POSIXct with NA
x <- seq(from = as.POSIXct('2018-1-1'), to = as.POSIXct('2019-1-1'), by = 'day')
x[length(x)] <- NA
stopifnot(!is.unsorted(x, na.rm = T, strictly = F))
stopifnot(is.na(is.unsorted(x, na.rm = F, strictly = F)))

__ R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Runnable R packages
Dear all,

I’m working as a data scientist in a major tech company. I have been using R for almost 20 years now, and there’s one issue that’s been bugging me of late. I apologize in advance if this has been discussed before.

R has traditionally been used for running short scripts or data analysis notebooks, but there has recently been growing interest in developing full applications in the language. Three examples come to mind:

1) The Shiny web application framework, which facilitates the development of rich, interactive web applications
2) The httr package, which provides lower-level facilities than Shiny for writing web services
3) Batch jobs run by data scientists according to, say, a cron schedule

Compared with other languages, R’s support for such applications is rather poor. The Rscript program is generally used to run an R script or an arbitrary R expression, but I feel it suffers from a few problems:

1) It encourages developers of batch jobs to provide their code in a single R file (bad for code structure and unit-testability)
2) It provides no way to deal with dependencies on other packages
3) It provides no way to "run" an application provided as an R package

For example, let’s say I want to run a Shiny application that I provide as an R package (to keep the code modular, to benefit from unit tests, and to declare dependencies properly). I would then need to a) uncompress my R package, b) somehow ensure my dependencies are installed, and c) call runApp(). This can get tedious, fast.

Other languages let the developer package their code in "runnable" artefacts, and let the developer specify the main entry point. The mechanics depend on the language but are remarkably similar, and suggest a way to implement this in R. Through declarations in some file, the developer can often specify dependencies and declare where the program’s "main" function resides.
Consider Java:

  Artefact:          .jar file
  Declarations file: manifest file
  Entry point:       declared as 'Main-Class'
  Executed as:       java -jar

Or Python:

  Artefact:          Python package, typically as a .tar.gz source distribution file
  Declarations file: setup.py (which specifies dependencies)
  Entry point:       special __main__() function
  Executed as:       python -m

R already has much of this machinery:

  Artefact:          R package
  Declarations file: DESCRIPTION
  Entry point:       ?
  Executed as:       ?

I feel that R could benefit from letting the developer specify, possibly in DESCRIPTION, how to "run" the package. The package could then be run through, for example, a new R CMD command:

  R CMD RUN

I’m sure there are plenty of wrinkles in this idea that need to be ironed out, but is this something that has ever been considered, or that is on R’s roadmap?

Thanks for reading so far,

David Lindelöf, Ph.D.
+41 (0)79 415 66 41 or skype:david.lindelof
http://computersandbuildings.com
Follow me on Twitter: http://twitter.com/dlindelof
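For illustration only, such an entry-point declaration might look like the sketch below. Note that the Main field is invented here for the sake of the proposal (no such field exists in R's DESCRIPTION format today), and the package and function names are placeholders:

```
Package: myapp
Version: 0.1.0
Depends: R (>= 3.5.0)
Imports: shiny
Main: myapp::main
```

A hypothetical "R CMD RUN myapp" could then resolve the Imports, install what is missing, and invoke the declared function, much as java -jar reads Main-Class from the manifest.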
Re: [Rd] Runnable R packages
Dear David, sharing some related (subjective) thoughts below.

On Mon, Jan 7, 2019 at 9:53 PM David Lindelof wrote:
>
> Dear all,
>
> I’m working as a data scientist in a major tech company. I have been using
> R for almost 20 years now and there’s one issue that’s been bugging me of
> late. I apologize in advance if this has been discussed before.
>
> R has traditionally been used for running short scripts or data analysis
> notebooks, but there’s recently been a growing interest in developing full
> applications in the language. Three examples come to mind:
>
> 1) The Shiny web application framework, which facilitates the developent of
> rich, interactive web applications
> 2) The httr package, which provides lower-level facilities than Shiny for
> writing web services
> 3) Batch jobs run by data scientists according to, say, a cron schedule
>
> Compared with other languages, R’s support for such applications is rather
> poor. The Rscript program is generally used to run an R script or an
> arbitrary R expression, but I feel it suffers from a few problems:
>
> 1) It encourages developers of batch jobs to provide their code in a single
> R file (bad for code structure and unit-testability)

I think it rather encourages developers to create (internal) R packages and use those from the batch jobs. This way the structure is pretty clean, sharing code between scripts is easy, unit testing can be done within the package, etc.

> 2) It provides no way to deal with dependencies on other packages

See above: create R package(s) and use those from the scripts.

> 3) It provides no way to "run" an application provided as an R package
>
> For example, let’s say I want to run a Shiny application that I provide as
> an R package (to keep the code modular, to benefit from unit tests, and to
> declare dependencies properly). I would then need to a) uncompress my R
> package, b) somehow, ensure my dependencies are installed, and c) call
> runApp(). This can get tedious, fast.
You can provide your app as a Docker image, so that the end-user simply calls a "docker pull" and then "docker run" -- that can be done from a user-friendly script as well. Of course, this requires Docker to be installed, but if that's a problem, it's probably better to "ship" the app as a web application and share a URL with the user, e.g. backed by shinyproxy.io.

> Other languages let the developer package their code in "runnable"
> artefacts, and let the developer specify the main entry point. The
> mechanics depend on the language but are remarkably similar, and suggest a
> way to implement this in R. Through declarations in some file, the
> developer can often specify dependencies and declare where the program’s
> "main" function resides. Consider Java:
>
> Artefact: .jar file
> Declarations file: Manifest file
> Entry point: declared as 'Main-Class'
> Executed as: java -jar
>
> Or Python:
>
> Artefact: Python package, typically as .tar.gz source distribution file
> Declarations file: setup.py (which specifies dependencies)
> Entry point: special __main__() function
> Executed as: python -m
>
> R has already much of this machinery:
>
> Artefact: R package
> Declarations file: DESCRIPTION
> Entry point: ?
> Executed as: ?
>
> I feel that R could benefit from letting the developer specify, possibly in
> DESCRIPTION, how to "run" the package. The package could then be run
> through, for example, a new R CMD command, for example:
>
> R CMD RUN
>
> I’m sure there are plenty of wrinkles in this idea that need to be ironed
> out, but is this something that has ever been considered, or that is on R’s
> roadmap?
>
> Thanks for reading so far,
>
> David Lindelöf, Ph.D.
> +41 (0)79 415 66 41 or skype:david.lindelof
> http://computersandbuildings.com
> Follow me on Twitter:
> http://twitter.com/dlindelof
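As a sketch of the Docker route suggested above (the image, package, and function names are illustrative, though rocker/r-base is a real public base image):

```dockerfile
# Hypothetical Dockerfile shipping an R package as a runnable Shiny app
FROM rocker/r-base
COPY myapp_0.1.0.tar.gz /tmp/myapp.tar.gz
RUN R -e "install.packages('shiny', repos = 'https://cloud.r-project.org')" \
 && R CMD INSTALL /tmp/myapp.tar.gz
EXPOSE 3838
CMD ["R", "-e", "shiny::runApp(system.file('app', package = 'myapp'), host = '0.0.0.0', port = 3838)"]
```

The end-user would then only need something like "docker run --rm -p 3838:3838 myapp", with the dependency installation baked into the image at build time.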
Re: [Rd] Runnable R packages
On 3 January 2019 at 11:43, David Lindelof wrote:
| Dear all,
|
| I’m working as a data scientist in a major tech company. I have been using
| R for almost 20 years now and there’s one issue that’s been bugging me of
| late. I apologize in advance if this has been discussed before.
|
| R has traditionally been used for running short scripts or data analysis
| notebooks, but there’s recently been a growing interest in developing full
| applications in the language. Three examples come to mind:
|
| 1) The Shiny web application framework, which facilitates the developent of
| rich, interactive web applications
| 2) The httr package, which provides lower-level facilities than Shiny for
| writing web services
| 3) Batch jobs run by data scientists according to, say, a cron schedule

That is a bit of a weird classification of "full applications". I have done this about as long as you, but I have also provided (at least as tests and demos) i) GUI apps using tcl/tk (which comes with R) and ii) GUI apps with Qt (or even Wt); see my RInside package. But my main weapon for 3) is littler. See https://cran.r-project.org/package=littler and particularly the many examples at https://github.com/eddelbuettel/littler/tree/master/inst/examples

| Compared with other languages, R’s support for such applications is rather
| poor. The Rscript program is generally used to run an R script or an
| arbitrary R expression, but I feel it suffers from a few problems:
|
| 1) It encourages developers of batch jobs to provide their code in a single
| R file (bad for code structure and unit-testability)
| 2) It provides no way to deal with dependencies on other packages
| 3) It provides no way to "run" an application provided as an R package

Err, no. See the examples/ directory above. About every single one uses packages.
As illustrations, I have long-running and somewhat visible cronjobs that are implemented the same way: CRANberries (since 2007, now running hourly) and CRAN Policy Watch (running once a day). Because both are 'hacks' I never published the code, but there is not that much to it. CRANberries just queries CRAN, compares to what it had last, and writes out variants of the DESCRIPTION file to text, where a static blog engine (like Hugo, but older) makes a feed and html pages out of it. Oh, and we tweet because "why not?".

| For example, let’s say I want to run a Shiny application that I provide as
| an R package (to keep the code modular, to benefit from unit tests, and to
| declare dependencies properly). I would then need to a) uncompress my R
| package, b) somehow, ensure my dependencies are installed, and c) call
| runApp(). This can get tedious, fast.

Disagree here too. At work, I just write my code, organize it in packages, update the packages, and have shiny expose whatever makes sense.

| Other languages let the developer package their code in "runnable"
| artefacts, and let the developer specify the main entry point. The
| mechanics depend on the language but are remarkably similar, and suggest a
| way to implement this in R. Through declarations in some file, the
| developer can often specify dependencies and declare where the program’s
| "main" function resides. Consider Java:
|
| Artefact: .jar file
| Declarations file: Manifest file
| Entry point: declared as 'Main-Class'
| Executed as: java -jar
|
| Or Python:
|
| Artefact: Python package, typically as .tar.gz source distribution file
| Declarations file: setup.py (which specifies dependencies)
| Entry point: special __main__() function
| Executed as: python -m
|
| R has already much of this machinery:
|
| Artefact: R package
| Declarations file: DESCRIPTION
| Entry point: ?
| Executed as: ?
|
| I feel that R could benefit from letting the developer specify, possibly in
| DESCRIPTION, how to "run" the package.
| The package could then be run
| through, for example, a new R CMD command, for example:
|
| R CMD RUN
|
| I’m sure there are plenty of wrinkles in this idea that need to be ironed
| out, but is this something that has ever been considered, or that is on R’s
| roadmap?

Hm. If _you_ have an itch to scratch here, why don't _you_ implement a draft?

Dirk

--
http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
Re: [Rd] Runnable R packages
Some other major tech companies have in the past made wide use of Runnable R Archives (".Rar" files), similar to Python .par files [1], integrated completely into the proprietary R package build system in use there. I thought there were a few systems like this that had made their way to CRAN or the useR! conferences, but I don't have a link.

Building something specific to your organization on top of the Python .par framework to archive up R, your needed packages/shared libraries, and other dependencies, with a runner script to R CMD RUN your entry point in a sandbox, is a pretty straightforward way to have control in a way that makes sense for your environment.

- Murray

[1] https://google.github.io/subpar/subpar.html

On Mon, Jan 7, 2019 at 12:53 PM David Lindelof wrote:

> Dear all,
>
> I’m working as a data scientist in a major tech company. I have been using
> R for almost 20 years now and there’s one issue that’s been bugging me of
> late. I apologize in advance if this has been discussed before.
>
> R has traditionally been used for running short scripts or data analysis
> notebooks, but there’s recently been a growing interest in developing full
> applications in the language. Three examples come to mind:
>
> 1) The Shiny web application framework, which facilitates the developent of
> rich, interactive web applications
> 2) The httr package, which provides lower-level facilities than Shiny for
> writing web services
> 3) Batch jobs run by data scientists according to, say, a cron schedule
>
> Compared with other languages, R’s support for such applications is rather
> poor.
> The Rscript program is generally used to run an R script or an
> arbitrary R expression, but I feel it suffers from a few problems:
>
> 1) It encourages developers of batch jobs to provide their code in a single
> R file (bad for code structure and unit-testability)
> 2) It provides no way to deal with dependencies on other packages
> 3) It provides no way to "run" an application provided as an R package
>
> For example, let’s say I want to run a Shiny application that I provide as
> an R package (to keep the code modular, to benefit from unit tests, and to
> declare dependencies properly). I would then need to a) uncompress my R
> package, b) somehow, ensure my dependencies are installed, and c) call
> runApp(). This can get tedious, fast.
>
> Other languages let the developer package their code in "runnable"
> artefacts, and let the developer specify the main entry point. The
> mechanics depend on the language but are remarkably similar, and suggest a
> way to implement this in R. Through declarations in some file, the
> developer can often specify dependencies and declare where the program’s
> "main" function resides. Consider Java:
>
> Artefact: .jar file
> Declarations file: Manifest file
> Entry point: declared as 'Main-Class'
> Executed as: java -jar
>
> Or Python:
>
> Artefact: Python package, typically as .tar.gz source distribution file
> Declarations file: setup.py (which specifies dependencies)
> Entry point: special __main__() function
> Executed as: python -m
>
> R has already much of this machinery:
>
> Artefact: R package
> Declarations file: DESCRIPTION
> Entry point: ?
> Executed as: ?
>
> I feel that R could benefit from letting the developer specify, possibly in
> DESCRIPTION, how to "run" the package.
> The package could then be run
> through, for example, a new R CMD command, for example:
>
> R CMD RUN
>
> I’m sure there are plenty of wrinkles in this idea that need to be ironed
> out, but is this something that has ever been considered, or that is on R’s
> roadmap?
>
> Thanks for reading so far,
>
> David Lindelöf, Ph.D.
> +41 (0)79 415 66 41 or skype:david.lindelof
> http://computersandbuildings.com
> Follow me on Twitter:
> http://twitter.com/dlindelof
Re: [Rd] Runnable R packages
On 7 January 2019 at 22:09, Gergely Daróczi wrote:
| You can provide your app as a Docker image, so that the end-user
| simply calls a "docker pull" and then "docker run" -- that can be done
| from a user-friendly script as well.
| Of course, this requires Docker to be installed, but if that's a
| problem, probably better to "ship" the app as a web application and
| share a URL with the user, eg backed by shinyproxy.io

Excellent suggestion.

Dirk

--
http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
Re: [Rd] Runnable R packages
On Mon, 7 Jan 2019 at 22:09, Gergely Daróczi wrote:
>
> Dear David, sharing some related (subjective) thoughts below.
>
> You can provide your app as a Docker image, so that the end-user
> simply calls a "docker pull" and then "docker run" -- that can be done
> from a user-friendly script as well.
> Of course, this requires Docker to be installed, but if that's a
> problem, probably better to "ship" the app as a web application and
> share a URL with the user, eg backed by shinyproxy.io

If Docker is a problem, you can also try podman: same usage, compatible with Dockerfiles, and daemon-less, so no admin rights are required. https://podman.io/

Iñaki
Re: [Rd] unsorted - suggestion for performance improvement and ALTREP support for POSIXct
Hi Harvey,

It's exciting to see people thinking about and looking at ALTREP speedups "in the wild" :). You're absolutely right that pulling out the REAL call will give you a significant speedup, but ALTREP does add a little wrinkle (and a solution to it!). Detailed responses and comments inline:

On Mon, Jan 7, 2019 at 11:58 AM Harvey Smith wrote:

> I believe the performance of isUnsorted() in sort.c could be improved by
> calling REAL() once (outside of the for loop), rather than calling it twice
> inside the loop. As an aside, it is implemented in the faster way in
> doSort() (sort.c line 401). The example below shows the performance
> improvement for a vectors of double of moving REAL() outside the for loop.

In light of ALTREP's inclusion in the R internals, it's best to avoid asking things for their full data vector when you don't need to. Instead, you can use the ITERATE_BY_REGION macro that R provides (courtesy of Luke, I believe?) in R_ext/Itermacros.h. This is particularly true of R's internals, which also preferably won't "explode"/invalidate an ALTREP (which asking for a writable pointer does) when they don't need to. Most internal functions haven't been converted to this yet, as you see with is.unsorted (and it's not a high priority to do the conversion until it becomes an issue for any given case), but this is what, e.g., R's own sum function now does.

ITERATE_BY_REGION is based on *_GET_REGION, which was added to the C API as part of ALTREP, but it works on both ALTREP and normal vectors, and won't explode in corner cases where materializing a full ALTREP vector would be problematic. The core concept of ITERATE_BY_REGION is to grab regions of a vector (a quick glance tells me it's 512 elements at a time), copying them into a buffer, and using the same trick your code does by avoiding the pointer lookup inside the inner tight loop.
Do note that, as of now, I had to compile my function with language="C" rather than the default "C++" to avoid an error about initializing a const double * with a const void * value.

On my machine, at least, you actually get *nearly* the same speedup with all the added safety. Eyeballing it, I'm not convinced the difference is statistically significant, to be honest, but even if it is, you get most of the benefit...

body = "
    R_xlen_t n, i;
    n = XLENGTH(x);
    for(i = 0; i+1 < n ; i++)
        if(REAL(x)[i] > REAL(x)[i+1])
            return ScalarLogical(TRUE);
    return ScalarLogical(FALSE);";
f1 = inline::cfunction(sig = signature(x='numeric'), body=body)

body = "
    R_xlen_t n, i;
    n = XLENGTH(x);
    double* real_x = REAL(x);
    for(i = 0; i+1 < n ; i++)
        if(real_x[i] > real_x[i+1])
            return ScalarLogical(TRUE);
    return ScalarLogical(FALSE);";
f2 = inline::cfunction(sig = signature(x='numeric'), body=body)

body = "
    double tmp = -DBL_MAX; // minimum possible double value
    ITERATE_BY_REGION(x, xptr, i, nbatch, double, REAL, {
        if(xptr[0] < tmp) // deal with batch barriers, tmp is end of last batch
            return ScalarLogical(TRUE);
        for(R_xlen_t k = 0; k < nbatch - 1; k++) {
            if(xptr[k] > xptr[k+1])
                return ScalarLogical(TRUE);
        }
        tmp = xptr[nbatch - 1];
    });
    return ScalarLogical(FALSE);";
f3 = inline::cfunction(sig = signature(x='numeric'), body=body,
                       includes = '#include "R_ext/Itermacros.h"',
                       language = "C")

x.double = as.double(1:1e7) + 0
x.posixct = Sys.time() + x.double

microbenchmark::microbenchmark(
    f1(x.double),  f2(x.double),  # one REAL call
    f3(x.double),                 # ITERATE_BY_REGION
    f1(x.posixct), f2(x.posixct), # one REAL call
    f3(x.posixct),                # ITERATE_BY_REGION
    unit='ms', times=100)

Unit: milliseconds
          expr       min        lq      mean    median        uq        max neval
  f1(x.double) 26.377432 27.234192 28.156993 27.774590 28.602643  32.213378    10
  f2(x.double)  4.722712  4.854300  5.011549  4.991388  5.127996   5.523156    10
  f3(x.double)  4.759537  4.788137  5.408925  5.373667  5.713877   6.694330    10
 f1(x.posixct) 77.975030 78.853724 85.867995 82.530822 83.557849 123.546206    10
 f2(x.posixct)  4.637912  4.660033  4.872892  4.750513  4.880569   5.907149    10
 f3(x.posixct)  4.643806  4.665936  5.094212  5.085454  5.384414   5.778274    10

To be extra careful, we can check that we're getting all the edges right, just in case, since the code is admittedly harder to follow and a bit more arcane:

> x.double2 = x.double
> x.double2[512] = x.double[1] # unsorted at end of first batch
> stopifnot(f3(x.double2))
>
> x.double2a = x.double
> x.double2a[513] = x.double[1] # unsorted at beginning of 2nd batch
> stopifnot(f3(x.double2a))
>
> ## check edges
> x.double3 = x.double
> x.double3[length(x.double3)] = x.double3[1] # unsorted at last element
> stopifnot(f3(x.double3))
>
> x.double4 = x.double
> x.double4[1] = x.double[5] # unsorted at first element
> stopifnot(f3(x.double4))
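The batch-boundary handling that the edge tests above exercise can be sketched in plain C, outside of R's macro machinery (this is an illustration of the pattern, not the real ITERATE_BY_REGION, which copies ALTREP regions into a buffer; BATCH and the direct pointer arithmetic are simplifications):

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

#define BATCH 512  /* fixed batch size, mirroring the 512-element regions */

/* Walk the vector batch by batch, carrying the last element of each batch
 * so the comparison across the batch boundary is not lost -- the same role
 * `tmp` plays in f3 above. */
int is_unsorted_batched(const double *x, size_t n) {
    double carry = -INFINITY;            /* nothing precedes the first batch */
    for (size_t start = 0; start < n; start += BATCH) {
        size_t nbatch = (n - start < BATCH) ? (n - start) : BATCH;
        const double *xptr = x + start;
        if (xptr[0] < carry)             /* check across the batch barrier */
            return 1;
        for (size_t k = 0; k + 1 < nbatch; k++)
            if (xptr[k] > xptr[k + 1])
                return 1;
        carry = xptr[nbatch - 1];        /* remember the boundary element */
    }
    return 0;
}
```

Dropping the carry check would silently miss exactly the "unsorted at beginning of 2nd batch" case tested above, which is why the barrier comparison comes first in each batch.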