Re: [Rd] Comments requested on "changedFiles" function
Hi Duncan, I think this functionality would be much easier to use and understand if you split it up the functionality of taking snapshots and comparing them into separate functions. In addition, the 'timestamp' functionality seems both confusing and brittle to me. I think it would be better to store file modification times in the snapshot and use those instead of an external file. Maybe: # Take a snapshot of the files. takeFileSnapshot(directory, file.info = TRUE, md5sum = FALSE, full.names = FALSE, recursive = TRUE, ...) # Take a snapshot using the same options as used for snapshot. retakeFileSnapshot(snapshot, directory = snapshot$directory) { takeFileSnapshot)(directory, file.info = snapshot$file.info, md5sum = snapshot$md5sum, etc) } compareFileSnapshots(snapshot1, snapshot2) - or - getNewFiles(snapshat1, snapshot2) # These names are probably too generic getDeletedFiles(snapshot1, snapshot2) getUpdatedFiles(snapshot1, snapshot2) -or- setdiff(snapshot1, snapshot2) # Unclear how this should treat updated files This approach does have the difficulty that users could attempt to compare snapshots that were taken with different options and that can't be compared, but that should be an easy error to detect. Karl On Wed, Sep 4, 2013 at 10:53 AM, Duncan Murdoch wrote: > In a number of places internal to R, we need to know which files have > changed (e.g. after building a vignette). I've just written a general > purpose function "changedFiles" that I'll probably commit to R-devel. > Comments on the design (or bug reports) would be appreciated. > > The source for the function and the Rd page for it are inline below. > > - changedFiles.R: > changedFiles <- function(snapshot, timestamp = tempfile("timestamp"), > file.info = NULL, > md5sum = FALSE, full.names = FALSE, ...) { > dosnapshot <- function(args) { > fullnames <- do.call(list.files, c(full.names = TRUE, args)) > names <- do.call(list.files, c(full.names = full.names, args)) > if (isTRUE(file.info) || (is.character(file.info) && length( > file.info))) { > info <- file.info(fullnames) > rownames(info) <- names > if (isTRUE(file.info)) > file.info <- c("size", "isdir", "mode", "mtime") > } else > info <- data.frame(row.names=names) > if (md5sum) > info <- data.frame(info, md5sum = tools::md5sum(fullnames)) > list(info = info, timestamp = timestamp, file.info = file.info, > md5sum = md5sum, full.names = full.names, args = args) > } > if (missing(snapshot) || !inherits(snapshot, "changedFilesSnapshot")) { > if (length(timestamp) == 1) > file.create(timestamp) > if (missing(snapshot)) snapshot <- "." > pre <- dosnapshot(list(path = snapshot, ...)) > pre$pre <- pre$info > pre$info <- NULL > pre$wd <- getwd() > class(pre) <- "changedFilesSnapshot" > return(pre) > } > > if (missing(timestamp)) timestamp <- snapshot$timestamp > if (missing(file.info) || isTRUE(file.info)) file.info <- snapshot$ > file.info > if (identical(file.info, FALSE)) file.info <- NULL > if (missing(md5sum))md5sum <- snapshot$md5sum > if (missing(full.names)) full.names <- snapshot$full.names > > pre <- snapshot$pre > savewd <- getwd() > on.exit(setwd(savewd)) > setwd(snapshot$wd) > > args <- snapshot$args > newargs <- list(...) 
> args[names(newargs)] <- newargs > post <- dosnapshot(args)$info > prenames <- rownames(pre) > postnames <- rownames(post) > > added <- setdiff(postnames, prenames) > deleted <- setdiff(prenames, postnames) > common <- intersect(prenames, postnames) > > if (length(file.info)) { > preinfo <- pre[common, file.info] > postinfo <- post[common, file.info] > changes <- preinfo != postinfo > } > else changes <- matrix(logical(0), nrow = length(common), ncol = 0, >dimnames = list(common, character(0))) > if (length(timestamp)) > changes <- cbind(changes, Newer = file_test("-nt", common, > timestamp)) > if (md5sum) { > premd5 <- pre[common, "md5sum"] > postmd5 <- post[common, "md5sum"] > changes <- cbind(changes, md5sum = premd5 != postmd5) > } > changes1 <- changes[rowSums(changes, na.rm = TRUE) > 0, , drop = FALSE] > changed <- rownames(changes1) > structure(list(added = added, deleted = deleted, changed = changed, > unchanged = setdiff(common, changed), changes = changes), class = > "changedFiles") > } > > print.changedFilesSnapshot <- function(x, ...) { > cat("changedFiles snapshot:\n timestamp = \"", x$timestamp, "\"\n > file.info = ", > if (length(x$file.info)) paste(paste0('"', x$file.info, '"'), > collapse=","), > "\n md5sum = ", x$md5sum, "\n args = ", deparse(x$args, control = > NULL), "\n", sep="") >
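A minimal, self-contained sketch of the split take/compare interface proposed above (function names and snapshot fields are illustrative only, not an existing API):

  takeFileSnapshot <- function(directory, md5sum = FALSE, recursive = TRUE, ...) {
    # List once with full names; file.info() and md5sum() are computed on the
    # same vector of names, so the pieces of the snapshot cannot get out of sync.
    files <- list.files(directory, full.names = TRUE, recursive = recursive, ...)
    info <- file.info(files)
    if (md5sum)
      info$md5sum <- tools::md5sum(files)
    structure(list(directory = directory, info = info,
                   md5sum = md5sum, recursive = recursive),
              class = "fileSnapshot")
  }

  retakeFileSnapshot <- function(snapshot, directory = snapshot$directory) {
    takeFileSnapshot(directory, md5sum = snapshot$md5sum,
                     recursive = snapshot$recursive)
  }

  compareFileSnapshots <- function(snapshot1, snapshot2) {
    # Snapshots taken with different options cannot be compared; make it an error.
    stopifnot(identical(snapshot1$md5sum, snapshot2$md5sum))
    pre <- snapshot1$info; post <- snapshot2$info
    common <- intersect(rownames(pre), rownames(post))
    changed <- common[pre[common, "size"] != post[common, "size"] |
                      pre[common, "mtime"] != post[common, "mtime"]]
    list(added   = setdiff(rownames(post), rownames(pre)),
         deleted = setdiff(rownames(pre), rownames(post)),
         changed = changed)
  }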
Re: [Rd] Comments requested on "changedFiles" function
Comments inline: On Wed, Sep 4, 2013 at 6:10 PM, Duncan Murdoch wrote: > > On 13-09-04 8:02 PM, Karl Millar wrote: >> >> Hi Duncan, >> >> I think this functionality would be much easier to use and understand if >> you split it up the functionality of taking snapshots and comparing them >> into separate functions. > > > Yes, that's another possibility. Some more comment below... > > > > In addition, the 'timestamp' functionality >> >> seems both confusing and brittle to me. I think it would be better to >> store file modification times in the snapshot and use those instead of >> an external file. Maybe: > > > You can do that, using file.info = "mtime", but the file.info snapshots are > quite a bit slower than using the timestamp file (when looking at a big > recursive directory of files). Sorry, I completely failed to explain what I was thinking here. There are a number of issues here, but the biggest one is that you're implicitly assuming that files that get modified will have mtimes that come after the timestamp file was created. This isn't always true, with the most notable exception being if you download a package from CRAN and untar it, the mtimes are usually well in the past (at least with GNU tar on a linux system), so this code won't notice that the files have changed. It may be a good idea to store the file sizes as well, which would help prevent false negatives in the (rare IIRC) cases where the contents have changed but the mtime values have not. Since you already need to call file.info() to get the mtime, this shouldn't increase the runtime, and the extra memory needed is fairly modest. >> >> # Take a snapshot of the files. >> takeFileSnapshot(directory, file.info <http://file.info> = TRUE, md5sum >> >> = FALSE, full.names = FALSE, recursive = TRUE, ...) >> >> # Take a snapshot using the same options as used for snapshot. >> retakeFileSnapshot(snapshot, directory = snapshot$directory) { >> takeFileSnapshot)(directory, file.info <http://file.info> = >> snapshot$file.info <http://file.info>, md5sum = snapshot$md5sum, etc) >> >> } >> >> compareFileSnapshots(snapshot1, snapshot2) >> - or - >> getNewFiles(snapshat1, snapshot2) # These names are probably too >> generic >> getDeletedFiles(snapshot1, snapshot2) >> getUpdatedFiles(snapshot1, snapshot2) >> -or- >> setdiff(snapshot1, snapshot2) # Unclear how this should treat updated files >> >> >> This approach does have the difficulty that users could attempt to >> compare snapshots that were taken with different options and that can't >> be compared, but that should be an easy error to detect. > > > I don't want to add too many new functions. The general R style is to have > functions that do a lot, rather than have a lot of different functions to > achieve different parts of related tasks. This is better for interactive use > (fewer functions to remember, a simpler help system to navigate), though it > probably results in less readable code. This is somewhat more nuanced and not particular to interactive use IMHO. Having functions that do a lot is good, _as long as the semantics are always consistent_. For example, lm() does a huge amount and has a wide variety of ways that you can specify your data, but it basically does the same thing no matter how you use it. On the other hand, if you have a function that does different things depending on how you call it (e.g. reshape()) then it's easy to remember the function name, but much harder to remember how to call it correctly, harder to understand the documentation and less readable. 
> > I can see an argument for two functions (a get and a compare), but I don't > think there are many cases where doing two gets and comparing the snapshots > would be worth the extra runtime. (It's extra because file.info is only a > little faster than list.files, and it would be unavoidable to call both twice > in that version. Using the timestamp file avoids one of those calls, and > replaces the other with file_test, which takes a similar amount of time. So > overall it's about 20-25% faster.) It also makes the code a bit more > complicated, i.e. three calls (get, get, compare) instead of two (get, > compare). I think a 'snapshotDirectory' and 'compareDirectoryToSnapshot' combination might work well. Thanks, Karl __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
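A small illustration of the mtime problem raised above (the tarball name and paths are hypothetical):

  timestamp <- tempfile("timestamp")
  file.create(timestamp)
  # Unpacking a source package restores the archived mtimes, which are
  # usually well in the past ...
  untar("foo_1.0.tar.gz", exdir = "pkgdir")
  # ... so a file that is brand new on disk still tests as older than the
  # timestamp file, and the timestamp-based check misses the change:
  file_test("-nt", "pkgdir/foo/DESCRIPTION", timestamp)   # typically FALSE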
Re: [Rd] Comments requested on "changedFiles" function
Hi Duncan, I like the interface of this version a lot better, but there's still a bunch of implementation details that need fixing: * As previously mentioned, there are important cases where the mtime values change in ways that this code doesn't detect. * If the timestamp file (which is usually in the temp directory) gets deleted (which can happen after a moderate amount of time of inactivity on some systems), then the file_test('-nt', ...) will always return false, even if the file has changed. * If files get added or deleted between the two calls to list.files in fileSnapshot, it will fail with an error. * If the path is on a remote file system, tempdir is local, and there's significant clock skew, then you can get incorrect results. Unfortunately, these aren't just theoretical scenarios -- I've had the misfortune to run up against all of them in the past. I've attached code that's loosely based on your implementation that solves these problems AFAICT. Alternatively, Hadley's code handles all of these correctly, with the exception that compare_state doesn't handle the case where safe_digest returns NA very well. Regards, Karl On Fri, Sep 6, 2013 at 5:40 PM, Duncan Murdoch wrote: > On 13-09-06 7:40 PM, Scott Kostyshak wrote: >> >> On Fri, Sep 6, 2013 at 3:46 PM, Duncan Murdoch >> wrote: >>> >>> On 06/09/2013 2:20 PM, Duncan Murdoch wrote: I have now put the code into a temporary package for testing; if anyone is interested, for a few days it will be downloadable from fisher.stats.uwo.ca/faculty/murdoch/temp/testpkg_1.0.tar.gz >>> >>> >>> >>> Sorry, error in the URL. It should be >>> >>> http://www.stats.uwo.ca/faculty/murdoch/temp/testpkg_1.0.tar.gz >> >> >> Works well. A couple of things I noticed: >> >> (1) >> md5sum is being called on directories, which causes warnings. (If this >> is not viewed as undesirable, please ignore the rest of this comment.) >> Should this be the responsibility of the user (by passing arguments to >> list.files)? In the example, changing >> fileSnapshot(dir, file.info=TRUE, md5sum=TRUE) >> to >> fileSnapshot(dir, file.info=TRUE, md5sum=TRUE, include.dirs=FALSE, >> recursive=TRUE") >> >> gets rid of the warnings. But perhaps the user just wants to exclude >> directories for the md5sum calculations. This can't be controlled from >> fileSnapshot. > > > I don't see the warnings, I just get NA values. I'll try to see why there's > a difference. (One possibility is my platform (Windows); another is that > I'm generally testing in R-patched and R-devel rather than the 3.0.1 release > version.) I would rather suppress the warnings than make the user avoid > them. > > >> >> Or, should the "if (md5sum)" chunk subset "fullnames" using file_test >> or file.info to exclude directories (and then fill in the directories >> with NA)? >> >> (2) >> If I run example(changedFiles) several times, sometimes I get: >> >> chngdF> changedFiles(snapshot) >> File changes: >>mtime md5sum >> file2 TRUE TRUE >> >> and other times I get: >> >> chngdF> changedFiles(snapshot) >> File changes: >>md5sum >> file2 TRUE >> >> I wonder why. > > > Sometimes the example runs so quickly that the new version has exactly the > same modification time as the original. That's the risk of the mtime check. > If you put a delay between, you'll get consistent results. 
> > Duncan Murdoch > > >> >> Scott >> >>> sessionInfo() >> >> R Under development (unstable) (2013-08-31 r63780) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] testpkg_1.0 >> >> loaded via a namespace (and not attached): >> [1] tools_3.1.0 >>> >>> >> >> >> -- >> Scott Kostyshak >> Economics PhD Candidate >> Princeton University >> > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
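As a sketch of how the double-listing race mentioned above can be avoided (using the single-directory 'snapshotDirectory' name suggested earlier in the thread): list the directory once and derive both the display names and the full paths from that one listing.

  snapshotDirectory <- function(path, md5sum = FALSE, ...) {
    names <- list.files(path, ...)
    # Build the full paths from the single listing rather than calling
    # list.files() a second time, so files added or deleted in between
    # cannot make the two sets of names disagree.
    fullnames <- file.path(path, names)
    info <- file.info(fullnames)
    rownames(info) <- names
    if (md5sum)
      info$md5sum <- tools::md5sum(fullnames)
    info
  }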
Re: [Rd] Comments requested on "changedFiles" function
On Fri, Sep 6, 2013 at 7:03 PM, Duncan Murdoch wrote: > On 13-09-06 9:21 PM, Karl Millar wrote: >> >> Hi Duncan, >> >> I like the interface of this version a lot better, but there's still a >> bunch of implementation details that need fixing: >> >> * As previously mentioned, there are important cases where the mtime >> values change in ways that this code doesn't detect. >> * If the timestamp file (which is usually in the temp directory) gets >> deleted (which can happen after a moderate amount of time of >> inactivity on some systems), then the file_test('-nt', ...) will >> always return false, even if the file has changed. > > > If that happened without user intervention, I think it would break other > things in R -- the temp directory is supposed to last for the whole session. > But I should be checking anyway. Yes, it does break other things in R -- my experience has been that the help system seems to be the one that is impacted the most by this. FWIW, I've never seen the entire R temp directory deleted, just individual files and subdirectories in it, but even that probably depends on how the machine is configured. I suspect only a few users ever notice this, but my R use is probably somewhat anomalous and I think it only happens to R sessions that I haven't used for a few days. >> * If files get added or deleted between the two calls to list.files in >> fileSnapshot, it will fail with an error. > > > Yours won't work if path contains more than one directory. This is probably > a reasonable restriction, but it's inconsistent with list.files, so I'd like > to avoid it if I can find a way. I'm currently unsure what the behaviour when comparing snapshots with multiple directories should be. Presumably we should have the property that (horribly abusing notation for succinctness): compareSnapshots(c(a1, a2), c(a1, a2)) is the same as concatenating (in some form) compareSnapshots(a1, a1) and compareSnapshots(a2, a2) and there's a bunch of ways we could concatenate -- we could return a list of results, or a single result where each of the 'added, deleted, modified' fields are a list, or where we concatenate the 'added, deleted, modified' fields together into three simple vectors. Concatenating the vectors together like this is appealing, but unless you're using the full names, it doesn't include the information of which directory the changes are in, and using the full names doesn't work in the case where you're comparing different sets of directories, e.g. compareSnapshots(c(a1, a2), c(b1, b2)), where there is no sensible choice for a full name. The list options don't have this problem, but are harder to work with, particularly for the common case where there's only a single directory. You'd also have to be somewhat careful with filenames that occur in both directories. Maybe I'm just being dense, but I don't see a way to do this thats clear, easy to use and wouldn't confuse users at the moment. Karl > Duncan Murdoch > > >> * If the path is on a remote file system, tempdir is local, and >> there's significant clock skew, then you can get incorrect results. >> >> Unfortunately, these aren't just theoretical scenarios -- I've had the >> misfortune to run up against all of them in the past. >> >> I've attached code that's loosely based on your implementation that >> solves these problems AFAICT. Alternatively, Hadley's code handles >> all of these correctly, with the exception that compare_state doesn't >> handle the case where safe_digest returns NA very well. 
>> >> Regards, >> >> Karl >> >> On Fri, Sep 6, 2013 at 5:40 PM, Duncan Murdoch >> wrote: >>> >>> On 13-09-06 7:40 PM, Scott Kostyshak wrote: >>>> >>>> >>>> On Fri, Sep 6, 2013 at 3:46 PM, Duncan Murdoch >>>> >>>> wrote: >>>>> >>>>> >>>>> On 06/09/2013 2:20 PM, Duncan Murdoch wrote: >>>>>> >>>>>> >>>>>> >>>>>> I have now put the code into a temporary package for testing; if >>>>>> anyone >>>>>> is interested, for a few days it will be downloadable from >>>>>> >>>>>> fisher.stats.uwo.ca/faculty/murdoch/temp/testpkg_1.0.tar.gz >>>>> >>>>> >>>>> >>>>> >>>>> Sorry, error in the URL. It should be >>>>> >>>>> http://www.stats.uwo.ca/faculty/murdoch/temp/testpkg_1.0.tar.gz >>>
Re: [Rd] Using long long types in C++
Romain, Can you use int64_t and uint64_t instead? IMHO that would be more useful than long long anyway. Karl On Sep 19, 2013 5:33 PM, "Patrick Welche" wrote: > On Fri, Sep 20, 2013 at 12:51:52AM +0200, rom...@r-enthusiasts.com wrote: > > In Rcpp we'd like to do something useful for types such as long long > > and unsigned long long. > ... > > But apparently this is still not enough and on some versions of gcc > > (e.g. 4.7 something), -pedantic still generates the warnings unless > > we also use -Wno-long-long > > Can you also add -std=c++0x or is that considered as bad as adding > -Wno-long-long? > > (and why not use autoconf's AC_TYPE_LONG_LONG_INT and > AC_TYPE_UNSIGNED_LONG_LONG_INT for the tests?) > > Cheers, > > Patrick > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
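A minimal illustration of the suggestion (assuming a toolchain where <cstdint> or <stdint.h> is available): using the fixed-width typedefs keeps the 'long long' token out of the package's own headers, which typically sidesteps the -pedantic warning, since 'long long' then only appears in a system header.

  #include <cstdint>

  // 64-bit arithmetic without naming 'long long' directly; the underlying
  // typedefs live in a system header.
  std::int64_t  add_i64(std::int64_t a,  std::int64_t b)  { return a + b; }
  std::uint64_t add_u64(std::uint64_t a, std::uint64_t b) { return a + b; }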
[Rd] Status of reserved keywords and builtins
According to http://cran.r-project.org/doc/manuals/r-release/R-lang.html#Reserved-words, the words

if else repeat while function for in next break TRUE FALSE NULL Inf NaN NA NA_integer_ NA_real_ NA_complex_ NA_character_ ... ..1 ..2 etc.

are all reserved keywords. However, in R 3.0.2 you can do things like:

`if` <- function(cond, val1, val2) val2

after which

if(TRUE) 1 else 2

returns 2. Similarly, users can change the implementation of `<-`, `(`, `{`, `||` and `&&`. Two questions:
 - Is this intended behaviour?
 - If so, would it be a good idea to change the language definition to prevent this?

Doing so would have two benefits: users could count on keywords having their normal interpretation, and R implementations could handle these more efficiently, including not having to look up the symbol each time. It'd break any code that assumes the current behaviour is valid, but hopefully there's little or no code that does.

Thanks,
Karl
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [RFC] A case for freezing CRAN
I think what you really want here is the ability to easily identify and sync to CRAN snapshots. The easy way to do this is setup a CRAN mirror, but back it up with version control, so that it's easy to reproduce the exact state of CRAN at any given point in time. CRAN's not particularly large and doesn't churn a whole lot, so most version control systems should be able to handle that without difficulty. Using svn, mod_dav_svn and (maybe) mod_rewrite, you could setup the server so that e.g.: http://my.cran.mirror/repos/2013-01-01/ is a mirror of how CRAN looked at midnight 2013-01-01. Users can then set their repository to that URL, and will have a stable snapshot to work with, and can have all their packages built with that snapshot if they like. For reproducibility purposes, all users need to do is to agree on the same date to use. For publication purposes, the date of the snapshot should be sufficient. We'd need a version of update.packages() that force-syncs all the packages to the version in the repository, even if they're downgrades, but otherwise it ought to be fairly straight-forward. FWIW, we do something similar internally at Google. All the packages that a user has installed come from the same source control revision, where we know that all the package versions are mutually compatible. It saves a lot of headaches, and users can rollback to any previous point in time easily if they run into problems. On Wed, Mar 19, 2014 at 7:45 PM, Jeroen Ooms wrote: > On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt > wrote: >> Reading this thread again, is it a fair summary of your position to say >> "reproducibility by default is more important than giving users access to >> the newest bug fixes and features by default?" It's certainly arguable, but >> I'm not sure I'm convinced: I'd imagine that the ratio of new work being >> done vs reproductions is rather high and the current setup optimizes for >> that already. > > I think that separating development from released branches can give us > both reliability/reproducibility (stable branch) as well as new > features (unstable branch). The user gets to pick (and you can pick > both!). The same is true for r-base: when using a 'released' version > you get 'stable' base packages that are up to 12 months old. If you > want to have the latest stuff you download a nightly build of r-devel. > For regular users and reproducible research it is recommended to use > the stable branch. However if you are a developer (e.g. package > author) you might want to develop/test/check your work with the latest > r-devel. > > I think that extending the R release cycle to CRAN would result both > in more stable released versions of R, as well as more freedom for > package authors to implement rigorous change in the unstable branch. > When writing a script that is part of a production pipeline, or sweave > paper that should be reproducible 10 years from now, or a book on > using R, you use stable version of R, which is guaranteed to behave > the same over time. However when developing packages that should be > compatible with the upcoming release of R, you use r-devel which has > the latest versions of other CRAN and base packages. > > >> What I'm trying to figure out is why the standard "install the following >> list of package versions" isn't good enough in your eyes? > > Almost nobody does this because it is cumbersome and impractical. We > can do so much better than this. 
Note that in order to install old > packages you also need to investigate which versions of dependencies > of those packages were used. On win/osx, users need to manually build > those packages which can be a pain. All in all it makes reproducible > research difficult and expensive and error prone. At the end of the > day most published results obtain with R just won't be reproducible. > > Also I believe that keeping it simple is essential for solutions to be > practical. If every script has to be run inside an environment with > custom libraries, it takes away much of its power. Running a bash or > python script in Linux is so easy and reliable that entire > distributions are based on it. I don't understand why we make our > lives so difficult in R. > > In my estimation, a system where stable versions of R pull packages > from a stable branch of CRAN will naturally resolve the majority of > the reproducibility and reliability problems with R. And in contrast > to what some people here are suggesting it does not introduce any > limitations. If you want to get the latest stuff, you either grab a > copy of r-devel, or just enable the testing branch and off you go. > Debian 'testing' works in a similar way, see > http://www.debian.org/devel/testing. > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/l
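For the dated mirror sketched at the top of this message, the user-facing side could be as simple as pointing the repos option at a snapshot date (the URL is reused from the example above and purely illustrative):

  options(repos = c(CRAN = "http://my.cran.mirror/repos/2013-01-01/"))
  install.packages("ggplot2")   # resolves against the 2013-01-01 state of CRAN
  update.packages(ask = FALSE)  # would need a force-sync variant, as noted above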
Re: [Rd] The case for freezing CRAN
Given the version / dated snapshots of CRAN, and an agreement that reproducibility is the responsibility of the study author, the author simply needs to sync all their packages to a chosen date, run the analysis and publish the chosen date. It is true that this doesn't include compilers, OS, system packages etc, but in my experience those are significantly more stable than CRAN packages. Also, my previous description of how to serve up a dated CRAN was way too complicated. Since most of the files on CRAN never change, they don't need version control. Only the metadata about which versions are current really needs to be tracked, and that's small enough that it could be stored in static files. On Thu, Mar 20, 2014 at 6:32 AM, Dirk Eddelbuettel wrote: > > No attempt to summarize the thread, but a few highlighted points: > > o Karl's suggestion of versioned / dated access to the repo by adding a >layer to webaccess is (as usual) nice. It works on the 'supply' side. > But >Jeroen's problem is on the demand side. Even when we know that an >analysis was done on 20xx-yy-zz, and we reconstruct CRAN that day, it > only >gives us a 'ceiling' estimate of what was on the machine. In production >or lab environments, installations get stale. Maybe packages were > already >a year old? To me, this is an issue that needs to be addressed on the >'demand' side of the user. But just writing out version numbers is not >good enough. > > o Roger correctly notes that R scripts and packages are just one issue. >Compilers, libraries and the OS matter. To me, the natural approach > these >days would be to think of something based on Docker or Vagrant or (if > you >must, VirtualBox). The newer alternatives make snapshotting very cheap >(eg by using Linux LXC). That approach reproduces a full environemnt as >best as we can while still ignoring the hardware layer (and some readers >may recall the infamous Pentium bug of two decades ago). > > o Reproduciblity will probably remain the responsibility of study >authors. If an investigator on a mega-grant wants to (or needs to) > freeze, >they do have the tools now. Requiring the need of a few to push work on >those already overloaded (ie CRAN) and changing the workflow of > everybody >is a non-starter. > > o As Terry noted, Jeroen made some strong claims about exactly how flawed >the existing system is and keeps coming back to the example of 'a JSS >paper that cannot be re-run'. I would really like to see empirics on >this. Studies of reproducibility appear to be publishable these days, > so >maybe some enterprising grad student wants to run with the idea of >actually _testing_ this. We maybe be above Terry's 0/30 and nearer to >Kevin's 'low'/30. But let's bring some data to the debate. > > o Overall, I would tend to think that our CRAN standards of releasing with >tests, examples, and checks on every build and release already do a much >better job of keeping things tidy and workable than in most if not all >other related / similar open source projects. I would of course welcome >contradictory examples. > > Dirk > > -- > Dirk Eddelbuettel | e...@debian.org | http://dirk.eddelbuettel.com > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Patch for R to fix some buffer overruns and add a missing PROTECT().
This patch is against current svn and contains three classes of fix:
 - Ensure the result is properly terminated after calls to strncpy().
 - Replace calls to sprintf() with snprintf().
 - Add a PROTECT() call in do_while, where evaluating the condition could otherwise cause memory errors if it results in a warning.

Thanks,
Karl
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
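The string-handling changes described above follow patterns like these (an illustrative sketch only, not the actual patch hunks):

  #include <stdio.h>
  #include <string.h>

  void copy_label(char *dest, size_t destsize, const char *src)
  {
      /* strncpy() does not NUL-terminate when src fills the buffer,
         so the result has to be terminated explicitly. */
      strncpy(dest, src, destsize - 1);
      dest[destsize - 1] = '\0';
  }

  void format_label(char *dest, size_t destsize, int i)
  {
      /* snprintf() bounds the write where sprintf() could overrun dest. */
      snprintf(dest, destsize, "label_%d", i);
  }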
Re: [Rd] Patch for R to fix some buffer overruns and add a missing PROTECT().
Bug submitted. Thanks. On Tue, Sep 23, 2014 at 12:42 PM, Duncan Murdoch wrote: > On 23/09/2014 3:20 PM, Karl Millar wrote: >> >> This patch is against current svn and contains three classes of fix: >> - Ensure the result is properly terminated after calls to strncpy() >> - Replace calls of sprintf() with snprintf() >> - Added a PROTECT() call in do_while which could cause memory >> errors if evaluating the condition results in a warning. > > > Nothing was attached. > > Generally fixes like this are best sent to bugs.r-project.org, and they > receive highest priority if accompanied by code demonstrating why they are > needed, i.e. crashes or incorrect results in current R. Those will likely > be incorporated as regression tests. > > Duncan Murdoch __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Making parent.env<- an error for package namespaces and package imports
I'd like to propose a change to the R language so that calling 'parent.env<-' on a package namespace or package imports is a runtime error. Currently the documentation warns that it's dangerous behaviour and might go away: The replacement function ‘parent.env<-’ is extremely dangerous as it can be used to destructively change environments in ways that violate assumptions made by the internal C code. It may be removed in the near future. This change would both eliminate some potential dangerous behaviours, and make it significantly easier for runtime compilation systems to optimize symbol lookups for code in packages. The following patch against current svn implements this functionality. It allows calls to 'parent.env<-' only until the namespace is locked, allowing the namespace to be built correctly while preventing user code from subsequently messing with it. I'd also like to make calling parent.env<- on an environment on the call stack an error, for the same reasons, but it's not so obvious to me how to implement that efficiently right now. Could we at least document that as being 'undefined behaviour'? Thanks, Karl Index: src/main/builtin.c === --- src/main/builtin.c (revision 66783) +++ src/main/builtin.c (working copy) @@ -356,6 +356,24 @@ return( ENCLOS(arg) ); } +static Rboolean R_IsImportsEnv(SEXP env) +{ +if (isNull(env) || !isEnvironment(env)) +return FALSE; +if (ENCLOS(env) != R_BaseNamespace) +return FALSE; +SEXP name = getAttrib(env, R_NameSymbol); +if (!isString(name) || length(name) != 1) +return FALSE; + +const char *imports_prefix = "imports:"; +const char *name_string = CHAR(STRING_ELT(name, 0)); +if (!strncmp(name_string, imports_prefix, strlen(imports_prefix))) +return TRUE; +else +return FALSE; +} + SEXP attribute_hidden do_parentenvgets(SEXP call, SEXP op, SEXP args, SEXP rho) { SEXP env, parent; @@ -371,6 +389,10 @@ error(_("argument is not an environment")); if( env == R_EmptyEnv ) error(_("can not set parent of the empty environment")); +if (R_EnvironmentIsLocked(env) && R_IsNamespaceEnv(env)) + error(_("can not set the parent environment of a namespace")); +if (R_EnvironmentIsLocked(env) && R_IsImportsEnv(env)) + error(_("can not set the parent environment of package imports")); parent = CADR(args); if (isNull(parent)) { error(_("use of NULL environment is defunct"));  __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
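For illustration, the user-level behaviour the patch introduces would look like this (the error text is the string added in builtin.c above; current R only documents the call as dangerous):

  ns <- asNamespace("stats")
  parent.env(ns) <- globalenv()
  ## Error: can not set the parent environment of a namespace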
[Rd] [patch] Support many columns in model.matrix
Generating a model matrix with very large numbers of columns overflows the stack and/or runs very slowly, due to the implementation of TrimRepeats(). This patch modifies it to use Rf_duplicated() to find the duplicates. This makes the running time linear in the number of columns and eliminates the recursive function calls. Thanks Index: src/library/stats/src/model.c === --- src/library/stats/src/model.c (revision 70230) +++ src/library/stats/src/model.c (working copy) @@ -1259,11 +1259,12 @@ static int TermZero(SEXP term) { -int i, val; -val = 1; -for (i = 0; i < nwords; i++) - val = val && (INTEGER(term)[i] == 0); -return val; +for (int i = 0; i < nwords; i++) { +if (INTEGER(term)[i] != 0) { +return 0; +} +} +return 1; } @@ -1271,11 +1272,12 @@ static int TermEqual(SEXP term1, SEXP term2) { -int i, val; -val = 1; -for (i = 0; i < nwords; i++) - val = val && (INTEGER(term1)[i] == INTEGER(term2)[i]); -return val; +for (int i = 0; i < nwords; i++) { +if (INTEGER(term1)[i] != INTEGER(term2)[i]) { +return 0; +} +} +return 1; } @@ -1303,18 +1305,37 @@ /* TrimRepeats removes duplicates of (bit string) terms - in a model formula by repeated use of ``StripTerm''. + in a model formula. Also drops zero terms. */ static SEXP TrimRepeats(SEXP list) { -if (list == R_NilValue) - return R_NilValue; -/* Highly recursive */ -R_CheckStack(); -if (TermZero(CAR(list))) - return TrimRepeats(CDR(list)); -SETCDR(list, TrimRepeats(StripTerm(CAR(list), CDR(list; +// Drop zero terms at the start of the list. +while (list != R_NilValue && TermZero(CAR(list))) { + list = CDR(list); +} +if (list == R_NilValue || CDR(list) == R_NilValue) + return list; + +// Find out which terms are duplicates. +SEXP all_terms = PROTECT(Rf_PairToVectorList(list)); +SEXP duplicate_sexp = PROTECT(Rf_duplicated(all_terms, FALSE)); +int* is_duplicate = LOGICAL(duplicate_sexp); +int i = 0; + +// Remove the zero terms and duplicates from the list. +for (SEXP current = list; CDR(current) != R_NilValue; i++) { + SEXP next = CDR(current); + + if (is_duplicate[i + 1] || TermZero(CAR(next))) { + // Remove the node from the list. + SETCDR(current, CDR(next)); + } else { + current = next; + } +} + +UNPROTECT(2); return list; } __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [patch] Support many columns in model.matrix
Thanks. Couldn't you implement model.matrix(..., sparse = TRUE) with a small amount of R code similar to MatrixModels::model.Matrix ? On Mon, Feb 29, 2016 at 10:01 AM, Martin Maechler wrote: >>>>>> Karl Millar via R-devel >>>>>> on Fri, 26 Feb 2016 15:58:20 -0800 writes: > > > Generating a model matrix with very large numbers of > > columns overflows the stack and/or runs very slowly, due > > to the implementation of TrimRepeats(). > > > This patch modifies it to use Rf_duplicated() to find the > > duplicates. This makes the running time linear in the > > number of columns and eliminates the recursive function > > calls. > > Thank you, Karl. > I've committed this (very slightly modified) to R-devel, > > (also after looking for a an example that runs on a non-huge > computer and shows the difference) : > > nF <- 11 ; set.seed(1) > lff <- setNames(replicate(nF, as.factor(rpois(128, 1/4)), simplify=FALSE), > letters[1:nF]) > str(dd <- as.data.frame(lff)); prod(sapply(dd, nlevels)) > ## 'data.frame':128 obs. of 11 variables: > ## $ a: Factor w/ 3 levels "0","1","2": 1 1 1 2 1 2 2 1 1 1 ... > ## $ b: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 2 1 1 1 ... > ## $ c: Factor w/ 3 levels "0","1","2": 1 1 1 2 1 1 1 2 1 1 ... > ## $ d: Factor w/ 3 levels "0","1","2": 1 1 2 2 1 2 1 1 2 1 ... > ## $ e: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 2 1 ... > ## $ f: Factor w/ 2 levels "0","1": 2 1 2 1 2 1 1 2 1 2 ... > ## $ g: Factor w/ 4 levels "0","1","2","3": 2 1 1 2 1 3 1 1 1 1 ... > ## $ h: Factor w/ 4 levels "0","1","2","4": 1 1 1 1 2 1 1 1 1 1 ... > ## $ i: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ... > ## $ j: Factor w/ 3 levels "0","1","2": 1 2 3 1 1 1 1 1 1 1 ... > ## $ k: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... > ## > ## [1] 139968 > > system.time(mff <- model.matrix(~ . ^ 11, dd, contrasts = list(a = > "contr.helmert"))) > ## user system elapsed > ## 0.255 0.033 0.287 --- *with* the patch on my desktop (16 GB) > ## 1.489 0.031 1.522 --- for R-patched (i.e. w/o the patch) > >> dim(mff) > [1]128 139968 >> object.size(mff) > 154791504 bytes > > --- > > BTW: These example would gain tremendously if I finally got > around to provide > >model.matrix(, sparse = TRUE) > > which would then produce a Matrix-package sparse matrix. > > Even for this somewhat small case, a sparse matrix is a factor > of 13.5 x smaller : > >> s1 <- object.size(mff); s2 <- object.size(M <- Matrix::Matrix(mff)); >> as.vector( s1/s2 ) > [1] 13.47043 > > I'm happy to collaborate with you on adding such a (C level) > interface to sparse matrices for this case. > > Martin Maechler __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
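The existing R-level route referred to above, applied to the 'dd' example from the quoted message, would look roughly like this (a sketch; it assumes the MatrixModels package is installed):

  library(MatrixModels)
  sM <- model.Matrix(~ . ^ 11, data = dd, sparse = TRUE)
  dim(sM)
  print(object.size(sM), units = "Mb")  # roughly an order of magnitude below the dense matrix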
[Rd] Undocumented 'use.names' argument to c()
'c' has an undocumented 'use.names' argument. I'm not sure if this is a documentation or implementation bug.

> c(a = 1)
a 
1 
> c(a = 1, use.names = F)
[1] 1

Karl
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Undocumented 'use.names' argument to c()
I'd expect that a lot of the performance overhead could be eliminated by simply improving the underlying code. IMHO, we should ignore it in deciding the API that we want here. On Fri, Sep 23, 2016 at 10:54 AM, Henrik Bengtsson wrote: > I'd vote for it to stay. It could of course suprise someone who'd > expect c(list(a=1), b=2, use.names = FALSE) to generate list(a=1, b=2, > use.names=FALSE). On the upside, is the performance gain from using > use.names=FALSE. Below benchmarks show that the combining of the > names attributes themselves takes ~20-25 times longer than the > combining of the integers themselves. Also, at no surprise, > use.names=FALSE avoids some memory allocations. > >> options(digits = 2) >> >> a <- b <- c <- d <- 1:1e4 >> names(c) <- c >> names(d) <- d >> >> stats <- microbenchmark::microbenchmark( > + c(a, b, use.names=FALSE), > + c(c, d, use.names=FALSE), > + c(a, d, use.names=FALSE), > + c(a, b, use.names=TRUE), > + c(a, d, use.names=TRUE), > + c(c, d, use.names=TRUE), > + unit = "ms" > + ) >> >> stats > Unit: milliseconds >expr minlq mean medianuq max neval > c(a, b, use.names = FALSE) 0.031 0.032 0.049 0.034 0.036 1.474 100 > c(c, d, use.names = FALSE) 0.031 0.031 0.035 0.034 0.035 0.064 100 > c(a, d, use.names = FALSE) 0.031 0.031 0.049 0.034 0.035 1.452 100 > c(a, b, use.names = TRUE) 0.031 0.031 0.055 0.034 0.036 2.094 100 > c(a, d, use.names = TRUE) 0.510 0.526 0.588 0.549 0.617 1.998 100 > c(c, d, use.names = TRUE) 0.780 0.815 0.886 0.841 0.944 1.430 100 > >> profmem::profmem(c(c, d, use.names=FALSE)) > Rprofmem memory profiling of: > c(c, d, use.names = FALSE) > > Memory allocations: > bytes calls > 1 80040 > total 80040 > >> profmem::profmem(c(c, d, use.names=TRUE)) > Rprofmem memory profiling of: > c(c, d, use.names = TRUE) > > Memory allocations: >bytes calls > 1 80040 > 2 160040 > total 240080 > > /Henrik > > On Fri, Sep 23, 2016 at 10:25 AM, William Dunlap via R-devel > wrote: >> In Splus c() and unlist() called the same C code, but with a different >> 'sys_index' code (the last argument to .Internal) and c() did not consider >> an argument named 'use.names' special. >> >>> c >> function(..., recursive = F) >> .Internal(c(..., recursive = recursive), "S_unlist", TRUE, 1) >>> unlist >> function(data, recursive = T, use.names = T) >> .Internal(unlist(data, recursive = recursive, use.names = use.names), >> "S_unlist", TRUE, 2) >>> c(A=1,B=2,use.names=FALSE) >> A B use.names >> 1 2 0 >> >> The C code used sys_index==2 to mean 'the last argument is the 'use.names' >> argument, if sys_index==1 only the recursive argument was considered >> special. >> >> Sys.funs.c: >> 405 S_unlist(vector *ent, vector *arglist, s_evaluator *S_evaluator) >> 406 { >> 407 int which = sys_index; boolean named, recursive, names; >> ... >> 419 args = arglist->value.tree; n = arglist->length; >> ... >> 424 names = which==2 ? logical_value(args[--n], ent, S_evaluator) >> : (which == 1); >> >> Thus there is no historical reason for giving c() the use.names argument. >> >> >> Bill Dunlap >> TIBCO Software >> wdunlap tibco.com >> >> On Fri, Sep 23, 2016 at 9:37 AM, Suharto Anggono Suharto Anggono via >> R-devel wrote: >> >>> In S-PLUS 3.4 help on 'c' (http://www.uni-muenster.de/ >>> ZIV.BennoSueselbeck/s-html/helpfiles/c.html), there is no 'use.names' >>> argument. >>> >>> Because 'c' is a generic function, I don't think that changing formal >>> arguments is good. >>> >>> In R devel r71344, 'use.names' is not an argument of functions 'c.Date', >>> 'c.POSIXct' and 'c.difftime'. 
>>> >>> Could 'use.names' be documented to be accepted by the default method of >>> 'c', but not listed as a formal argument of 'c'? Or, could the code that >>> handles the argument name 'use.names' be removed? >>> >>> >>>>> David Winsemius >>> >>>>> on Tue, 20 Sep 2016 23:46:48 -0700 writes: >>> >>> >> On Sep 20, 2016, at 7:18 PM, Karl Millar via
[Rd] Is importMethodsFrom actually needed?
IIUC, loading a namespace automatically registers all the exported methods as long as the generic can be found when the namespace gets loaded. Generics can be exported and imported as regular functions. In that case, code in a package should be able to simply import the generic and the methods will automatically work correctly without any need for importMethodsFrom. Is there something that I'm missing here? What breaks if you don't explicitly import methods? Thanks, Karl __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Upgrading a package to which other packages are LinkingTo
A couple of points: - rebuilding dependent packages is needed if there is an ABI change, not just an API change. For packages like Rcpp which export inline functions or macros that might have changed, this is potentially any change to existing functions, but for packages like Matrix, it isn't really an issue at all IIUC. - If we're looking into a way to check if package APIs are compatible, then that's something that's relevant for all packages, since they all export an R API. I believe that CRAN only tests package compatibility with the most recent versions of packages on CRAN that import or depend on it. There's no guarantee that a package update won't contain API or behaviour changes that breaks older versions of packages, packages not on CRAN or any scripts that use the package, and these sorts of breakages do happen semi-regularly. - AFAICT, the only difference with packages like Rcpp is that you can potentially have all of your CRAN packages at the latest version, but some of them might have inlined code from an older version of Rcpp even after running update.packages(). While that is an issue, in my experience that's been a lot less trouble than the general case of backwards compatibility. Karl On Fri, Dec 16, 2016 at 8:19 AM, Dirk Eddelbuettel wrote: > > On 16 December 2016 at 11:00, Duncan Murdoch wrote: > | On 16/12/2016 10:40 AM, Dirk Eddelbuettel wrote: > | > On 16 December 2016 at 10:14, Duncan Murdoch wrote: > | > | On 16/12/2016 8:37 AM, Dirk Eddelbuettel wrote: > | > | > > | > | > On 16 December 2016 at 08:20, Duncan Murdoch wrote: > | > | > | Perhaps the solution is to recommend that packages which export > their > | > | > | C-level entry points either guarantee them not to change or offer > | > | > | (require?) version checks by user code. So dplyr should start out > by > | > | > | saying "I'm using Rcpp interface 0.12.8". If Rcpp has a new version > | > | > | with a compatible interface, it replies "that's fine". If Rcpp has > | > | > | changed its interface, it says "Sorry, I don't support that any > more." > | > | > > | > | > We try. But it's hard, and I'd argue, likely impossible. > | > | > > | > | > For example I even added a "frozen" package [1] in the sources / unit > tests > | > | > to test for just this. In practice you just cannot hit every possible > access > | > | > point of the (rich, in our case) API so the tests pass too often. > | > | > > | > | > Which is why we relentlessly test against reverse-depends to _at > least ensure > | > | > buildability_ from our releases. > | > > | > I meant to also add: "... against a large corpus of other packages." > | > The intent is to empirically answer this. > | > > | > | > As for seamless binary upgrade, I don't think in can work in > practice. Ask > | > | > Uwe one day we he rebuilds everything every time on Windows. And for > what it > | > | > is worth, we essentially do the same in Debian. > | > | > > | > | > Sometimes you just need to rebuild. That may be the price of > admission for > | > | > using the convenience of rich C++ interfaces. > | > | > > | > | > | > | Okay, so would you say that Kirill's suggestion is not overkill? Every > | > | time package B uses LinkingTo: A, R should assume it needs to rebuild B > | > | when A is updated? > | > > | > Based on my experience is a "halting problem" -- i.e. cannot know ex ante. > | > > | > So "every time" would be overkill to me. Sometimes you know you must > | > recompile (but try to be very prudent with public-facing API). Many times > | > you do not. 
It is hard to pin down. > | > > | > At work we have a bunch of servers with Rcpp and many packages against > them > | > (installed system-wide for all users). We _very really_ needs rebuild. > > Edit: "We _very rarely_ need rebuilds" is what was meant there. > > | So that comes back to my suggestion: you should provide a way for a > | dependent package to ask if your API has changed. If you say it hasn't, > | the package is fine. If you say it has, the package should abort, > | telling the user they need to reinstall it. (Because it's a hard > | question to answer, you might get it wrong and say it's fine when it's > | not. But that's easy to fix: just make a new release that does require > > Sure. > > We have always increased the higher-order version number when that is needed. > > One problem with your proposal is that the testing code may run after the > package load, and in the case where it matters ... that very code may not get > reached because the package didn't load. > > Dirk > > -- > http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Request: Increasing MAX_NUM_DLLS in Rdynload.c
It's not always clear when it's safe to remove the DLL. The main problem that I'm aware of is that native objects with finalizers might still exist (created by R_RegisterCFinalizer etc). Even if there are no live references to such objects (which would be hard to verify), it still wouldn't be safe to unload the DLL until a full garbage collection has been done. If the DLL is unloaded, then the function pointer that was registered now becomes a pointer into the memory where the DLL was, leading to an almost certain crash when such objects get garbage collected. A better approach would be to just remove the limit on the number of DLLs, dynamically expanding the array if/when needed. On Tue, Dec 20, 2016 at 3:40 AM, Jeroen Ooms wrote: > On Tue, Dec 20, 2016 at 7:04 AM, Henrik Bengtsson > wrote: >> On reason for hitting the MAX_NUM_DLLS (= 100) limit is because some >> packages don't unload their DLLs when they being unloaded themselves. > > I am surprised by this. Why does R not do this automatically? What is > the case for keeping the DLL loaded after the package has been > unloaded? What happens if you reload another version of the same > package from a different library after unloading? > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
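A sketch of the situation described above (hypothetical package code): a native finalizer is registered for an external pointer, so the DLL must stay mapped until after the object has been garbage collected.

  #include <Rinternals.h>
  #include <stdlib.h>

  /* If the DLL containing this function is unloaded while a registered
     object is still alive, R is left holding a function pointer into
     unmapped memory and will crash when the finalizer eventually runs. */
  static void handle_finalizer(SEXP ptr)
  {
      void *p = R_ExternalPtrAddr(ptr);
      if (p) free(p);
      R_ClearExternalPtr(ptr);
  }

  SEXP make_handle(void)
  {
      SEXP ptr = PROTECT(R_MakeExternalPtr(malloc(16), R_NilValue, R_NilValue));
      R_RegisterCFinalizerEx(ptr, handle_finalizer, TRUE);
      UNPROTECT(1);
      return ptr;
  }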
Re: [Rd] Request: Increasing MAX_NUM_DLLS in Rdynload.c
It does, but you'd still be relying on the R code ensuring that all of these objects are dead prior to unloading the DLL, otherwise they'll survive the GC. Maybe if the package counted how many such objects exist, it could work out when it's safe to remove the DLL. I'm not sure that it can be done automatically. What could be done is to to keep the DLL loaded, but remove it from R's table of loaded DLLs. That way, there's no risk of dangling function pointers and a new DLL of the same name could be loaded. You could still run into issues though as some DLLs assume that the associated namespace exists. Currently what I do is to never unload DLLs. If I need to replace one, then I just restart R. It's less convenient, but it's always correct. On Wed, Dec 21, 2016 at 9:10 AM, Henrik Bengtsson wrote: > On Tue, Dec 20, 2016 at 7:39 AM, Karl Millar wrote: >> It's not always clear when it's safe to remove the DLL. >> >> The main problem that I'm aware of is that native objects with >> finalizers might still exist (created by R_RegisterCFinalizer etc). >> Even if there are no live references to such objects (which would be >> hard to verify), it still wouldn't be safe to unload the DLL until a >> full garbage collection has been done. >> >> If the DLL is unloaded, then the function pointer that was registered >> now becomes a pointer into the memory where the DLL was, leading to an >> almost certain crash when such objects get garbage collected. > > Very good point. > > Does base::gc() perform such a *full* garbage collection and thereby > trigger all remaining finalizers to be called? In other words, do you > think an explicit call to base::gc() prior to cleaning out left-over > DLLs (e.g. R.utils::gcDLLs()) would be sufficient? > > /Henrik > >> >> A better approach would be to just remove the limit on the number of >> DLLs, dynamically expanding the array if/when needed. >> >> >> On Tue, Dec 20, 2016 at 3:40 AM, Jeroen Ooms >> wrote: >>> On Tue, Dec 20, 2016 at 7:04 AM, Henrik Bengtsson >>> wrote: >>>> On reason for hitting the MAX_NUM_DLLS (= 100) limit is because some >>>> packages don't unload their DLLs when they being unloaded themselves. >>> >>> I am surprised by this. Why does R not do this automatically? What is >>> the case for keeping the DLL loaded after the package has been >>> unloaded? What happens if you reload another version of the same >>> package from a different library after unloading? >>> >>> __ >>> R-devel@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-devel __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] unlicense
Please don't use 'Unlimited' or 'Unlimited + ...'. Google's lawyers don't recognize 'Unlimited' as being open-source, so our policy doesn't allow us to use such packages due to lack of an acceptable license. To our lawyers, 'Unlimited + file LICENSE' means something very different than it presumably means to Uwe. Thanks, Karl On Sat, Jan 14, 2017 at 12:10 AM, Uwe Ligges wrote: > Dear all, > > from "Writing R Extensions": > > The string ‘Unlimited’, meaning that there are no restrictions on > distribution or use other than those imposed by relevant laws (including > copyright laws). > > If a package license restricts a base license (where permitted, e.g., using > GPL-3 or AGPL-3 with an attribution clause), the additional terms should be > placed in file LICENSE (or LICENCE), and the string ‘+ file LICENSE’ (or ‘+ > file LICENCE’, respectively) should be appended to the > corresponding individual license specification. > ... > Please note in particular that “Public domain” is not a valid license, since > it is not recognized in some jurisdictions." > > So perhaps you aim for > License: Unlimited > > Best, > Uwe Ligges > > > > > > On 14.01.2017 07:53, Deepayan Sarkar wrote: >> >> On Sat, Jan 14, 2017 at 5:49 AM, Duncan Murdoch >> wrote: >>> >>> On 13/01/2017 3:21 PM, Charles Geyer wrote: I would like the unlicense (http://unlicense.org/) added to R licenses. Does anyone else think that worthwhile? >>> >>> That's a question for you to answer, not to ask. Who besides you thinks >>> that it's a good license for open source software? >>> >>> If it is recognized by the OSF or FSF or some other authority as a FOSS >>> license, then CRAN would probably also recognize it. If not, then CRAN >>> doesn't have the resources to evaluate it and so is unlikely to recognize >>> it. >> >> >> Unlicense is listed in https://spdx.org/licenses/ >> >> Debian does include software "licensed" like this, and seems to think >> this is one way (not the only one) of declaring something to be >> "public domain". The first two examples I found: >> >> https://tracker.debian.org/media/packages/r/rasqal/copyright-0.9.29-1 >> >> https://tracker.debian.org/media/packages/w/wiredtiger/copyright-2.6.1%2Bds-1 >> >> This follows the format explained in >> >> https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/#license-specification, >> which does not explicitly include Unlicense, but does include CC0, >> which AFAICT is meant to formally license something so that it is >> equivalent to being in the public domain. R does include CC0 as a >> shorthand (e.g., geoknife). >> >> https://www.debian.org/legal/licenses/ says that >> >> >> >> Licenses currently found in Debian main include: >> >> - ... >> - ... >> - public domain (not a license, strictly speaking) >> >> >> >> The equivalent for CRAN would probably be something like "License: >> public-domain + file LICENSE". >> >> -Deepayan >> >>> Duncan Murdoch >>> >>> >>> __ >>> R-devel@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-devel >> >> >> __ >> R-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel >> > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] unlicense
Unfortunately, our lawyers say that they can't give legal advice in this context. My question would be, what are people looking for that the MIT or 2-clause BSD license don't provide? They're short, clear, widely accepted and very permissive. Another possibility might be to dual-license packages with both an OSI-approved license and whatever-else-you-like, e.g. 'MIT | ', but IIUC there's a bunch more complexity there than just using an OSI-approved license. Karl On Tue, Jan 17, 2017 at 3:35 PM, Uwe Ligges wrote: > > > On 18.01.2017 00:13, Karl Millar wrote: >> >> Please don't use 'Unlimited' or 'Unlimited + ...'. >> >> Google's lawyers don't recognize 'Unlimited' as being open-source, so >> our policy doesn't allow us to use such packages due to lack of an >> acceptable license. To our lawyers, 'Unlimited + file LICENSE' means >> something very different than it presumably means to Uwe. > > > > Karl, > > thanks for this comment. What we like to hear now is a suggestion what the > maintainer is supposed to do to get what he aims at, as we already know that > "freeware" does not work at all and was hard enough to get to the > "Unlimited" options. > > We have many CRAN requests asking for what they should write for "freeware". > Can we get an opinion from your layers which standard license comes closest > to what these maintainers probably aim at and will work more or less > globally, i.e. not only in the US? > > Best, > Uwe > > > > >> Thanks, >> >> Karl >> >> On Sat, Jan 14, 2017 at 12:10 AM, Uwe Ligges >> wrote: >>> >>> Dear all, >>> >>> from "Writing R Extensions": >>> >>> The string ‘Unlimited’, meaning that there are no restrictions on >>> distribution or use other than those imposed by relevant laws (including >>> copyright laws). >>> >>> If a package license restricts a base license (where permitted, e.g., >>> using >>> GPL-3 or AGPL-3 with an attribution clause), the additional terms should >>> be >>> placed in file LICENSE (or LICENCE), and the string ‘+ file LICENSE’ (or >>> ‘+ >>> file LICENCE’, respectively) should be appended to the >>> corresponding individual license specification. >>> ... >>> Please note in particular that “Public domain” is not a valid license, >>> since >>> it is not recognized in some jurisdictions." >>> >>> So perhaps you aim for >>> License: Unlimited >>> >>> Best, >>> Uwe Ligges >>> >>> >>> >>> >>> >>> On 14.01.2017 07:53, Deepayan Sarkar wrote: >>>> >>>> >>>> On Sat, Jan 14, 2017 at 5:49 AM, Duncan Murdoch >>>> wrote: >>>>> >>>>> >>>>> On 13/01/2017 3:21 PM, Charles Geyer wrote: >>>>>> >>>>>> >>>>>> >>>>>> I would like the unlicense (http://unlicense.org/) added to R >>>>>> licenses. Does anyone else think that worthwhile? >>>>>> >>>>> >>>>> That's a question for you to answer, not to ask. Who besides you >>>>> thinks >>>>> that it's a good license for open source software? >>>>> >>>>> If it is recognized by the OSF or FSF or some other authority as a FOSS >>>>> license, then CRAN would probably also recognize it. If not, then CRAN >>>>> doesn't have the resources to evaluate it and so is unlikely to >>>>> recognize >>>>> it. >>>> >>>> >>>> >>>> Unlicense is listed in https://spdx.org/licenses/ >>>> >>>> Debian does include software "licensed" like this, and seems to think >>>> this is one way (not the only one) of declaring something to be >>>> "public domain". 
The first two examples I found: >>>> >>>> https://tracker.debian.org/media/packages/r/rasqal/copyright-0.9.29-1 >>>> >>>> >>>> https://tracker.debian.org/media/packages/w/wiredtiger/copyright-2.6.1%2Bds-1 >>>> >>>> This follows the format explained in >>>> >>>> >>>> https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/#license-specification, >>>> which does not explicitly include Unlicense, but does include CC0, >>>> which AFAICT is meant to formally license something so that it is >>>> equivalent to being in the public domain. R does include CC0 as a >>>> shorthand (e.g., geoknife). >>>> >>>> https://www.debian.org/legal/licenses/ says that >>>> >>>> >>>> >>>> Licenses currently found in Debian main include: >>>> >>>> - ... >>>> - ... >>>> - public domain (not a license, strictly speaking) >>>> >>>> >>>> >>>> The equivalent for CRAN would probably be something like "License: >>>> public-domain + file LICENSE". >>>> >>>> -Deepayan >>>> >>>>> Duncan Murdoch >>>>> >>>>> >>>>> __ >>>>> R-devel@r-project.org mailing list >>>>> https://stat.ethz.ch/mailman/listinfo/r-devel >>>> >>>> >>>> >>>> __ >>>> R-devel@r-project.org mailing list >>>> https://stat.ethz.ch/mailman/listinfo/r-devel >>>> >>> >>> __ >>> R-devel@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-devel __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
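To make Karl's suggestion concrete: a maintainer who simply wants a maximally permissive, widely recognized license would typically write something like the following in the package DESCRIPTION (the year and holder below are placeholders):

    License: MIT + file LICENSE

with a LICENSE file following the CRAN template:

    YEAR: 2017
    COPYRIGHT HOLDER: Jane Doe

For a public-domain-style dedication, 'License: CC0' is, as noted in the thread above, already accepted as a shorthand.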
Re: [Rd] Control statements with condition of length greater than one should give error (not just warning) [PATCH]
Is there anything that actually requires R core members to manually do significant amounts of work here? IIUC, you can do a CRAN run to detect the broken packages, and a simple script can collect the emails of the affected maintainers, so you can send a single email to them all. If authors don't respond by fixing their packages, then those packages should be archived, since there's high probability of those packages being buggy anyway. If you expect a non-trivial amount of questions regarding this change from the affected package maintainers, then you can create a FAQ page for it, which you can fill in as questions arrive, so you don't get too many duplicated questions. Karl On Mon, Mar 6, 2017 at 4:51 AM, Martin Maechler wrote: > > Michael Lawrence > > on Sat, 4 Mar 2017 12:20:45 -0800 writes: > > > Is there really a need for these complications? Packages > > emitting this warning are broken by definition and should be fixed. > > I agree and probably Henrik, too. > > (Others may disagree to some extent .. and find it convenient > that R does translate 'if(x)' to 'if(x[1])' for them albeit > with a warning .. ) > > > Perhaps we could "flip the switch" in a test > > environment and see how much havoc is wreaked and whether > > authors are sufficiently responsive? > > > Michael > > As we have > 10'000 packages on CRAN alonce, and people have > started (mis)using suppressWarnings(.) in many places, there > may be considerably more packages affected than we optimistically assume... > > As R core member who would "flip the switch" I'd typically then > have to be the one sending an e-mail to all package maintainers > affected and in this case I'm very reluctant to volunteer > for that and so, I'd prefer the environment variable where R > core and others can decide how to use it .. for a while .. until > the flip is switched for all. > > or have I overlooked an issue? > > Martin > > > On Sat, Mar 4, 2017 at 12:04 PM, Martin Maechler > > >> wrote: > > >> > Henrik Bengtsson > > >> on Fri, 3 Mar 2017 10:10:53 -0800 writes: > >> > >> > On Fri, Mar 3, 2017 at 9:55 AM, Hadley Wickham > > >> wrote: >>> But, how you propose a > >> warning-to-error transition >>> should be made without > >> wreaking havoc? Just flip the >>> switch in R-devel and > >> see CRAN and Bioconductor packages >>> break overnight? > >> Particularly Bioconductor devel might >>> become > >> non-functional (since at times it requires >>> R-devel). > >> For my own code / packages, I would be able >>> to handle > >> such a change, but I'm completely out of >>> control if > >> one of the package I'm depending on does not >>> provide > >> a quick fix (with the only option to remove >>> package > >> tests for those dependencies). > >> >> > >> >> Generally, a package can not be on CRAN if it has any > >> >> warnings, so I don't think this change would have any > >> >> impact on CRAN packages. Isn't this also true for >> > >> bioconductor? > >> > >> > Having a tests/warn.R file with: > >> > >> > warning("boom") > >> > >> > passes through R CMD check --as-cran unnoticed. > >> > >> Yes, indeed.. you are right Henrik that many/most R > >> warning()s would not produce R CMD check 'WARNING's .. > >> > >> I think Hadley and I fell into the same mental pit of > >> concluding that such warning()s from > >> if() ... would not currently happen > >> in CRAN / Bioc packages and hence turning them to errors > >> would not have a direct effect. 
> >> > >> With your 2nd e-mail of saying that you'd propose such an > >> option only for a few releases of R you've indeed > >> clarified your intent to me. OTOH, I would prefer using > >> an environment variable (as you've proposed as an > >> alternative) which is turned "active" at the beginning > >> only manually or for the "CRAN incoming" checks of the > >> CRAN team (and bioconductor submission checks?) and > >> later for '--as-cran' etc until it eventually becomes the > >> unconditional behavior of R (and the env.variable is no > >> longer used). > >> > >> Martin > >> > >> __ > >> R-devel@r-project.org mailing list > >> https://stat.ethz.ch/mailman/listinfo/r-devel > >> > > > [[alternative HTML version deleted]] > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
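A minimal R sketch of the behaviour discussed in this thread (the exact message wording varies by R version):

    x <- c(TRUE, FALSE)
    if (x) "yes"
    ## currently: a warning, "the condition has length > 1 and only the
    ## first element will be used", and the first element decides the branch.
    ## Under the proposed change (or with the opt-in switch enabled), the
    ## same statement would instead stop with an error such as
    ## "the condition has length > 1".

Code like this currently passes checks with only a run-time warning, which is why the transition strategy discussed above matters.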
Re: [Rd] segfault when trying to allocate a large vector
Hi Pierrick,

You're storing largevec on the stack, which is probably causing a stack overflow. Allocate largevec on the heap with malloc or one of the R memory allocation routines instead and it should work fine.

Karl

On Thu, Dec 18, 2014 at 12:00 AM, Pierrick Bruneau wrote:
>
> Dear R contributors,
>
> I'm running into trouble when trying to allocate some large (but in
> theory viable) vector in the context of C code bound to R through
> .Call(). Here is some sample code summarizing the problem:
>
> SEXP test() {
>     int size = 10000000;
>     double largevec[size];
>     memset(largevec, 0, size*sizeof(double));
>     return(R_NilValue);
> }
>
> If size is small enough (up to 10^6), everything is fine. When it
> reaches 10^7 as above, I get a segfault. As far as I know, a double
> value is represented with 8 bytes, which would make largevec above
> approx. 80 MB -> this is certainly large for a single variable, but
> should remain well below the limits of my machine... Also, doing a
> calloc for the same vector size leads to the same outcome.
>
> In my package, I would use large vectors that cannot be assumed to be
> sparse - so utilities for sparse matrices may not be considered.
>
> I run R on Ubuntu 64-bit, with 8 GB RAM, and a 64-bit R build (3.1.2).
> As my problem looks close to that seen in
> http://r.789695.n4.nabble.com/allocMatrix-limits-td864864.html,
> following what I have seen in ?"Memory-limits" I checked that ulimit
> -v returns "unlimited".
>
> I guess I must be missing something, like contiguity issues, or other.
> Does anyone have a clue for me?
>
> Thanks in advance,
> Pierrick
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
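For reference, a minimal heap-based version of the example above, along the lines Karl suggests, using R's transient allocator (memory obtained from R_alloc is reclaimed by R when the .Call returns):

    #include <string.h>
    #include <R.h>
    #include <Rinternals.h>

    SEXP test(void)
    {
        R_xlen_t size = 10000000;
        /* allocate on the heap instead of the C stack */
        double *largevec = (double *) R_alloc(size, sizeof(double));
        memset(largevec, 0, size * sizeof(double));
        return R_NilValue;
    }

Plain malloc/free would work just as well here; the point is only that an 80 MB local array does not fit in the default C stack.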
Re: [Rd] [PATCH] Makefile: add support for git svn clones
Fellipe,

CXXR development has moved to GitHub, and we haven't fixed up the build for using git yet. Could you send a pull request with your change to the repo at https://github.com/cxxr-devel/cxxr/?

Also, this patch may be useful for pqR too. https://github.com/radfordneal/pqR

Thanks

On Mon, Jan 19, 2015 at 2:35 PM, Dirk Eddelbuettel wrote:
>
> On 19 January 2015 at 17:11, Duncan Murdoch wrote:
> | The people who would have to maintain the patch can't test it.
>
> I don't understand this.
>
> The patch, as we may want to recall, was all of
>
>+GIT := $(shell if [ -d "$(top_builddir)/.git" ]; then \
>+echo "git"; fi)
>+
>
> and
>
>-   (cd $(srcdir); LC_ALL=C TZ=GMT svn info || $(ECHO) "Revision: -99") 2> /dev/null \
>+   (cd $(srcdir); LC_ALL=C TZ=GMT $(GIT) svn info || $(ECHO) "Revision: -99") 2> /dev/null \
>
> I believe you can test that the build works before applying the patch, and
> afterwards---even when you do not have git, or in this case a git checkout.
> The idiom of expanding a variable to "nothing" if not set is used all over
> the R sources and can be assumed common. And if (hypothetically speaking)
> the build failed when a .git directory was present? None of R Core's concern
> either, as git was never supported.
>
> I really do not understand the excitement over this. The patch is short,
> clean, simple, and removes an entirely unnecessary element of friction.
>
> Dirk
>
> --
> http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Recycling memory with a small free list
If you link to tcmalloc instead of the default malloc on your system, the performance of large allocations should improve. On Unix machines you don't even need to recompile -- you can do this with LD_PRELOAD. The downside is that you'll almost certainly end up with higher average memory usage, as tcmalloc never returns memory to the OS. It would also be worth checking what jemalloc does with large allocations.

It may well be worth tweaking the way that large allocations are handled in R -- most allocation libraries assume that large allocations are infrequent and that you won't be frequently requesting the same sized memory block. Those assumptions don't hold in R. On the other hand, I don't see much benefit to R having its own logic for handling small allocations, as most malloc implementations handle those extremely efficiently.

Karl

On Thu, Feb 19, 2015 at 10:15 AM, wrote: > On Wed, 18 Feb 2015, Nathan Kurz wrote: > > On Wed, Feb 18, 2015 at 7:19 AM, Radford Neal >> wrote: >>> ... with assignments inside of loops like this: reweight = function(iter, w, Q) { for (i in 1:iter) { wT = w * Q } } ... before the RHS is executed, the LHS allocation would be added to a small fixed length list of available space which is checked before future allocations. If the same size is requested before the next garbage collection, the allocation is short-circuited and the allocation is reused. This list could be very small, possibly even only a single entry. Entries would only be put on the list if they have no other references. >>> >> Here's an article about the benefits of this approach in Go that might >> explain better than I was able: >> https://blog.cloudflare.com/recycling-memory-buffers-in-go/ >> Their charts explain the goal very clearly: stabilize at a smaller >> amount of memory to reduce churn, which improves performance in a >> myriad of ways. >> > > Thanks -- will have a look. > > > Reusing the LHS storage immediately isn't possible in general, because >>> evaluation of the RHS might produce an error, in which case the LHS >>> variable is supposed to be unchanged. >>> >> >> What's the guarantee R actually makes? What's an example of the use >> case where this behaviour would be required? More generally, can one >> not assume "a = NULL; a = func()" is equivalent to "a = func()" unless >> func() references 'a' or has it as an argument? Or is the difficulty >> that there is no way to know in advance if it will be referenced? >> >> Detecting special cases where >>> there is guaranteed to be no error, or at least no error after the >>> first modification to newly allocated memory, might be too >>> complicated. >>> >> >> Yes, if required, the complexity of guaranteeing this might well rule >> out the approach I suggested. >> >> Putting the LHS storage on a small free list for later reuse (only >>> after the old value of the variable will definitely be replaced) seems >>> more promising (then one would need only two copies for examples such >>> as above, with them being used in alternate iterations). >>> >> >> OK, let's consider that potentially easier option instead: do nothing >> immediately, but add a small queue for recycling from which the >> temporary might be drawn. It has slightly worse cache behavior, but >> should handle most of the issues with memory churn. >> >> However, >>> there's a danger of getting carried away and essentially rewriting >>> malloc.
To avoid this, one might try just calling "free" on the >>> no-longer-needed object, letting "malloc" then figure out when it can >>> be re-used. >>> >> >> Yes, I think that's what I was anticipating: add a free() equivalent >> that does nothing if the object has multiple references/names, but >> adds the object to small fixed size "free list" if it does not. >> Perhaps this is only for certain types or for objects above a certain >> size. >> >> When requesting memory, allocvector() or perhaps R_alloc() does a >> quick check of that "free list" to see if it has anything of the exact >> requested size. If it does, it short circuits and recycles it. If it >> doesn't, normal allocation takes place. >> >> The "free list" is stored as two small fixed size arrays containing >> size/address pairs. Searching is done linearly using code that >> optimizes to SIMD comparisons. For 4/8/16 slots overhead of the >> search should be unmeasurably fast. >> >> The key to the approach would be keeping it simple, and realizing that >> the goal is only to get the lowest hanging fruit: repeated >> assignments of large arrays used in a loop. If it's complex, skip it >> --- the behavior will be no worse than current. >> >> By the way, what's happening with Luke's refcnt patches? From the >> outside, they seem like a great improvement. >> http://homepage.stat.uiowa.edu/~luke/talks/dsc2014.pdf >> http://developer.r-project.org/Refcnt.html >> Are they slated to beco
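Returning to the tcmalloc suggestion at the top of this message: on Linux it can be tried without rebuilding R, along these lines (the library name and path vary by system and are shown only as an illustration):

    # run R against tcmalloc instead of the system malloc
    LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4 R --vanilla

Comparing timings and resident memory of an allocation-heavy loop (such as the reweight example quoted above) with and without the preload is a quick way to see how much of the cost is in the allocator itself.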