[Rd] [patch] Support many columns in model.matrix

2016-02-26 Thread Karl Millar via R-devel
Generating a model matrix with very large numbers of columns overflows
the stack and/or runs very slowly, due to the implementation of
TrimRepeats().

This patch modifies it to use Rf_duplicated() to find the duplicates.
This makes the running time linear in the number of columns and
eliminates the recursive function calls.

Thanks
Index: src/library/stats/src/model.c
===================================================================
--- src/library/stats/src/model.c	(revision 70230)
+++ src/library/stats/src/model.c	(working copy)
@@ -1259,11 +1259,12 @@
 
 static int TermZero(SEXP term)
 {
-    int i, val;
-    val = 1;
-    for (i = 0; i < nwords; i++)
-        val = val && (INTEGER(term)[i] == 0);
-    return val;
+    for (int i = 0; i < nwords; i++) {
+        if (INTEGER(term)[i] != 0) {
+            return 0;
+        }
+    }
+    return 1;
 }
 
 
@@ -1271,11 +1272,12 @@
 
 static int TermEqual(SEXP term1, SEXP term2)
 {
-    int i, val;
-    val = 1;
-    for (i = 0; i < nwords; i++)
-        val = val && (INTEGER(term1)[i] == INTEGER(term2)[i]);
-    return val;
+    for (int i = 0; i < nwords; i++) {
+        if (INTEGER(term1)[i] != INTEGER(term2)[i]) {
+            return 0;
+        }
+    }
+    return 1;
 }
 
 
@@ -1303,18 +1305,37 @@
 
 
 /* TrimRepeats removes duplicates of (bit string) terms
-   in a model formula by repeated use of ``StripTerm''.
+   in a model formula.
    Also drops zero terms. */
 
 static SEXP TrimRepeats(SEXP list)
 {
-    if (list == R_NilValue)
-        return R_NilValue;
-    /* Highly recursive */
-    R_CheckStack();
-    if (TermZero(CAR(list)))
-        return TrimRepeats(CDR(list));
-    SETCDR(list, TrimRepeats(StripTerm(CAR(list), CDR(list))));
+    // Drop zero terms at the start of the list.
+    while (list != R_NilValue && TermZero(CAR(list))) {
+        list = CDR(list);
+    }
+    if (list == R_NilValue || CDR(list) == R_NilValue)
+        return list;
+
+    // Find out which terms are duplicates.
+    SEXP all_terms = PROTECT(Rf_PairToVectorList(list));
+    SEXP duplicate_sexp = PROTECT(Rf_duplicated(all_terms, FALSE));
+    int* is_duplicate = LOGICAL(duplicate_sexp);
+    int i = 0;
+
+    // Remove the zero terms and duplicates from the list.
+    for (SEXP current = list; CDR(current) != R_NilValue; i++) {
+        SEXP next = CDR(current);
+
+        if (is_duplicate[i + 1] || TermZero(CAR(next))) {
+            // Remove the node from the list.
+            SETCDR(current, CDR(next));
+        } else {
+            current = next;
+        }
+    }
+
+    UNPROTECT(2);
     return list;
 }
 

Re: [Rd] [patch] Support many columns in model.matrix

2016-02-29 Thread Karl Millar via R-devel
Thanks.

Couldn't you implement model.matrix(..., sparse = TRUE) with a small
amount of R code similar to MatrixModels::model.Matrix?
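
For instance, something along these lines (a rough sketch only -- it
assumes the Matrix package, whose sparse.model.matrix() already does
most of the work, and reuses the 'dd' from your example below):

M <- Matrix::sparse.model.matrix(~ . ^ 11, dd,
                                 contrasts.arg = list(a = "contr.helmert"))
object.size(M)  # far smaller than the ~150 MB dense result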

On Mon, Feb 29, 2016 at 10:01 AM, Martin Maechler wrote:
>>>>>> Karl Millar via R-devel 
>>>>>> on Fri, 26 Feb 2016 15:58:20 -0800 writes:
>
> > Generating a model matrix with very large numbers of
> > columns overflows the stack and/or runs very slowly, due
> > to the implementation of TrimRepeats().
>
> > This patch modifies it to use Rf_duplicated() to find the
> > duplicates.  This makes the running time linear in the
> > number of columns and eliminates the recursive function
> > calls.
>
> Thank you, Karl.
> I've committed this (very slightly modified) to R-devel,
>
> (also after looking for an example that runs on a non-huge
>  computer and shows the difference) :
>
> nF <- 11 ; set.seed(1)
> lff <- setNames(replicate(nF, as.factor(rpois(128, 1/4)), simplify=FALSE), 
> letters[1:nF])
> str(dd <- as.data.frame(lff)); prod(sapply(dd, nlevels))
> ## 'data.frame':   128 obs. of  11 variables:
> ##  $ a: Factor w/ 3 levels "0","1","2": 1 1 1 2 1 2 2 1 1 1 ...
> ##  $ b: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 2 1 1 1 ...
> ##  $ c: Factor w/ 3 levels "0","1","2": 1 1 1 2 1 1 1 2 1 1 ...
> ##  $ d: Factor w/ 3 levels "0","1","2": 1 1 2 2 1 2 1 1 2 1 ...
> ##  $ e: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 2 1 ...
> ##  $ f: Factor w/ 2 levels "0","1": 2 1 2 1 2 1 1 2 1 2 ...
> ##  $ g: Factor w/ 4 levels "0","1","2","3": 2 1 1 2 1 3 1 1 1 1 ...
> ##  $ h: Factor w/ 4 levels "0","1","2","4": 1 1 1 1 2 1 1 1 1 1 ...
> ##  $ i: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
> ##  $ j: Factor w/ 3 levels "0","1","2": 1 2 3 1 1 1 1 1 1 1 ...
> ##  $ k: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
> ##
> ## [1] 139968
>
> system.time(mff <- model.matrix(~ . ^ 11, dd, contrasts = list(a = 
> "contr.helmert")))
> ##  user  system elapsed
> ## 0.255   0.033   0.287  --- *with* the patch on my desktop (16 GB)
> ## 1.489   0.031   1.522  --- for R-patched (i.e. w/o the patch)
>
>> dim(mff)
> [1]    128 139968
>> object.size(mff)
> 154791504 bytes
>
> ---
>
> BTW: This example would gain tremendously if I finally got
>  around to providing
>
>model.matrix(, sparse = TRUE)
>
> which would then produce a Matrix-package sparse matrix.
>
> Even for this somewhat small case, a sparse matrix is a factor
> of 13.5 x smaller :
>
>> s1 <- object.size(mff); s2 <- object.size(M <- Matrix::Matrix(mff)); 
>> as.vector( s1/s2 )
> [1] 13.47043
>
> I'm happy to collaborate with you on adding such a (C level)
> interface to sparse matrices for this case.
>
> Martin Maechler



[Rd] Undocumented 'use.names' argument to c()

2016-09-20 Thread Karl Millar via R-devel
'c' has an undocumented 'use.names' argument.  I'm not sure if this is
a documentation or implementation bug.

> c(a = 1)
a
1
> c(a = 1, use.names = F)
[1] 1

Karl



Re: [Rd] Undocumented 'use.names' argument to c()

2016-09-23 Thread Karl Millar via R-devel
I'd expect that a lot of the performance overhead could be eliminated
by simply improving the underlying code.  IMHO, we should ignore it in
deciding the API that we want here.

On Fri, Sep 23, 2016 at 10:54 AM, Henrik Bengtsson wrote:
> I'd vote for it to stay.  It could of course surprise someone who'd
> expect c(list(a=1), b=2, use.names = FALSE) to generate list(a=1, b=2,
> use.names=FALSE).   On the upside is the performance gain from using
> use.names=FALSE.  Below benchmarks show that the combining of the
> names attributes themselves takes ~20-25 times longer than the
> combining of the integers themselves.  Also, not surprisingly,
> use.names=FALSE avoids some memory allocations.
>
>> options(digits = 2)
>>
>> a <- b <- c <- d <- 1:1e4
>> names(c) <- c
>> names(d) <- d
>>
>> stats <- microbenchmark::microbenchmark(
> +   c(a, b, use.names=FALSE),
> +   c(c, d, use.names=FALSE),
> +   c(a, d, use.names=FALSE),
> +   c(a, b, use.names=TRUE),
> +   c(a, d, use.names=TRUE),
> +   c(c, d, use.names=TRUE),
> +   unit = "ms"
> + )
>>
>> stats
> Unit: milliseconds
>expr   minlq  mean medianuq   max neval
>  c(a, b, use.names = FALSE) 0.031 0.032 0.049  0.034 0.036 1.474   100
>  c(c, d, use.names = FALSE) 0.031 0.031 0.035  0.034 0.035 0.064   100
>  c(a, d, use.names = FALSE) 0.031 0.031 0.049  0.034 0.035 1.452   100
>   c(a, b, use.names = TRUE) 0.031 0.031 0.055  0.034 0.036 2.094   100
>   c(a, d, use.names = TRUE) 0.510 0.526 0.588  0.549 0.617 1.998   100
>   c(c, d, use.names = TRUE) 0.780 0.815 0.886  0.841 0.944 1.430   100
>
>> profmem::profmem(c(c, d, use.names=FALSE))
> Rprofmem memory profiling of:
> c(c, d, use.names = FALSE)
>
> Memory allocations:
>   bytes  calls
> 1 80040 
> total 80040
>
>> profmem::profmem(c(c, d, use.names=TRUE))
> Rprofmem memory profiling of:
> c(c, d, use.names = TRUE)
>
> Memory allocations:
>bytes  calls
> 1  80040 
> 2 160040 
> total 240080
>
> /Henrik
>
> On Fri, Sep 23, 2016 at 10:25 AM, William Dunlap via R-devel wrote:
>> In Splus c() and unlist() called the same C code, but with a different
>> 'sys_index'  code (the last argument to .Internal) and c() did not consider
>> an argument named 'use.names' special.
>>
>>> c
>> function(..., recursive = F)
>> .Internal(c(..., recursive = recursive), "S_unlist", TRUE, 1)
>>> unlist
>> function(data, recursive = T, use.names = T)
>> .Internal(unlist(data, recursive = recursive, use.names = use.names),
>> "S_unlist", TRUE, 2)
>>> c(A=1,B=2,use.names=FALSE)
>>  A B use.names
>>  1 2 0
>>
>> The C code used sys_index==2 to mean 'the last  argument is the 'use.names'
>> argument, if sys_index==1 only the recursive argument was considered
>> special.
>>
>> Sys.funs.c:
>>  405 S_unlist(vector *ent, vector *arglist, s_evaluator *S_evaluator)
>>  406 {
>>  407 int which = sys_index; boolean named, recursive, names;
>>  ...
>>  419 args = arglist->value.tree; n = arglist->length;
>>  ...
>>  424 names = which==2 ? logical_value(args[--n], ent, S_evaluator)
>> : (which == 1);
>>
>> Thus there is no historical reason for giving c() the use.names argument.
>>
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com
>>
>> On Fri, Sep 23, 2016 at 9:37 AM, Suharto Anggono Suharto Anggono via
>> R-devel  wrote:
>>
>>> In S-PLUS 3.4 help on 'c' (http://www.uni-muenster.de/
>>> ZIV.BennoSueselbeck/s-html/helpfiles/c.html), there is no 'use.names'
>>> argument.
>>>
>>> Because 'c' is a generic function, I don't think that changing formal
>>> arguments is good.
>>>
>>> In R devel r71344, 'use.names' is not an argument of functions 'c.Date',
>>> 'c.POSIXct' and 'c.difftime'.
>>>
>>> Could 'use.names' be documented to be accepted by the default method of
>>> 'c', but not listed as a formal argument of 'c'? Or, could the code that
>>> handles the argument name 'use.names' be removed?
>>> 
>>> >>>>> David Winsemius 
>>> >>>>> on Tue, 20 Sep 2016 23:46:48 -0700 writes:
>>>
>>> >> On Sep 20, 2016, at 7:18 PM, Karl Millar via

[Rd] Is importMethodsFrom actually needed?

2016-11-02 Thread Karl Millar via R-devel
IIUC, loading a namespace automatically registers all the exported
methods as long as the generic can be found when the namespace gets
loaded.  Generics can be exported and imported as regular functions.

In that case, code in a package should be able to simply import the
generic and the methods will automatically work correctly without any
need for importMethodsFrom.
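
For concreteness, here are the two NAMESPACE styles in question
(pkgA and fancyPrint are hypothetical):

## import only the generic; its methods register when pkgA is loaded
importFrom(pkgA, fancyPrint)

## versus the conventional, explicit form
importMethodsFrom(pkgA, fancyPrint)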

Is there something that I'm missing here?  What breaks if you don't
explicitly import methods?

Thanks,

Karl



Re: [Rd] Upgrading a package to which other packages are LinkingTo

2016-12-16 Thread Karl Millar via R-devel
A couple of points:
  - rebuilding dependent packages is needed if there is an ABI change,
not just an API change.  For packages like Rcpp which export inline
functions or macros that might have changed, this is potentially any
change to existing functions, but for packages like Matrix, it isn't
really an issue at all IIUC.

  - If we're looking into a way to check if package APIs are
compatible, then that's something that's relevant for all packages,
since they all export an R API.  I believe that CRAN only tests
package compatibility with the most recent versions of packages on
CRAN that import or depend on it.  There's no guarantee that a package
update won't contain API or behaviour changes that breaks older
versions of packages, packages not on CRAN or any scripts that use the
package, and these sorts of breakages do happen semi-regularly.

 - AFAICT, the only difference with packages like Rcpp is that you can
potentially have all of your CRAN packages at the latest version, but
some of them might have inlined code from an older version of Rcpp
even after running update.packages().  While that is an issue, in my
experience that's been a lot less trouble than the general case of
backwards compatibility.

Karl

On Fri, Dec 16, 2016 at 8:19 AM, Dirk Eddelbuettel  wrote:
>
> On 16 December 2016 at 11:00, Duncan Murdoch wrote:
> | On 16/12/2016 10:40 AM, Dirk Eddelbuettel wrote:
> | > On 16 December 2016 at 10:14, Duncan Murdoch wrote:
> | > | On 16/12/2016 8:37 AM, Dirk Eddelbuettel wrote:
> | > | >
> | > | > On 16 December 2016 at 08:20, Duncan Murdoch wrote:
> | > | > | Perhaps the solution is to recommend that packages which export their
> | > | > | C-level entry points either guarantee them not to change or offer
> | > | > | (require?) version checks by user code.  So dplyr should start out by
> | > | > | saying "I'm using Rcpp interface 0.12.8".  If Rcpp has a new version
> | > | > | with a compatible interface, it replies "that's fine".  If Rcpp has
> | > | > | changed its interface, it says "Sorry, I don't support that any more."
> | > | >
> | > | > We try. But it's hard, and I'd argue, likely impossible.
> | > | >
> | > | > For example I even added a "frozen" package [1] in the sources / unit
> | > | > tests to test for just this. In practice you just cannot hit every
> | > | > possible access point of the (rich, in our case) API so the tests pass
> | > | > too often.
> | > | >
> | > | > Which is why we relentlessly test against reverse-depends to _at least
> | > | > ensure buildability_ from our releases.
> | >
> | > I meant to also add:  "... against a large corpus of other packages."
> | > The intent is to empirically answer this.
> | >
> | > | > As for seamless binary upgrade, I don't think it can work in practice.
> | > | > Ask Uwe one day why he rebuilds everything every time on Windows. And
> | > | > for what it is worth, we essentially do the same in Debian.
> | > | >
> | > | > Sometimes you just need to rebuild.  That may be the price of admission
> | > | > for using the convenience of rich C++ interfaces.
> | > | >
> | > |
> | > | Okay, so would you say that Kirill's suggestion is not overkill?  Every
> | > | time package B uses LinkingTo: A, R should assume it needs to rebuild B
> | > | when A is updated?
> | >
> | > Based on my experience it is a "halting problem" -- i.e. cannot know ex ante.
> | >
> | > So "every time" would be overkill to me.  Sometimes you know you must
> | > recompile (but try to be very prudent with public-facing API).  Many times
> | > you do not. It is hard to pin down.
> | >
> | > At work we have a bunch of servers with Rcpp and many packages against
> | > them (installed system-wide for all users). We _very really_ needs rebuild.
>
> Edit:  "We _very rarely_ need rebuilds" is what was meant there.
>
> | So that comes back to my suggestion:  you should provide a way for a
> | dependent package to ask if your API has changed.  If you say it hasn't,
> | the package is fine.  If you say it has, the package should abort,
> | telling the user they need to reinstall it.  (Because it's a hard
> | question to answer, you might get it wrong and say it's fine when it's
> | not.  But that's easy to fix:  just make a new release that does require
> | a reinstall.)
>
> Sure.
>
> We have always increased the higher-order version number when that is needed.
>
> One problem with your proposal is that the testing code may run after the
> package load, and in the case where it matters ... that very code may not get
> reached because the package didn't load.
>
> Dirk
>
> --
> http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
>


Re: [Rd] Request: Increasing MAX_NUM_DLLS in Rdynload.c

2016-12-20 Thread Karl Millar via R-devel
It's not always clear when it's safe to remove the DLL.

The main problem that I'm aware of is that native objects with
finalizers might still exist (created by R_RegisterCFinalizer etc).
Even if there are no live references to such objects (which would be
hard to verify), it still wouldn't be safe to unload the DLL until a
full garbage collection has been done.

If the DLL is unloaded, then the function pointer that was registered
now becomes a pointer into the memory where the DLL was, leading to an
almost certain crash when such objects get garbage collected.
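
At the R level, the failure mode looks something like this (package,
constructor and unload behaviour are all hypothetical):

library(somepkg)            # loads somepkg.so
h <- somepkg::new_handle()  # external pointer with a C finalizer
rm(h)                       # unreachable now, but the finalizer is still pending
detach("package:somepkg", unload = TRUE)  # suppose this also unloads somepkg.so
gc()                        # finalizer fires -> call into unmapped memory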

A better approach would be to just remove the limit on the number of
DLLs, dynamically expanding the array if/when needed.


On Tue, Dec 20, 2016 at 3:40 AM, Jeroen Ooms  wrote:
> On Tue, Dec 20, 2016 at 7:04 AM, Henrik Bengtsson
>  wrote:
>> One reason for hitting the MAX_NUM_DLLS (= 100) limit is because some
>> packages don't unload their DLLs when they are being unloaded themselves.
>
> I am surprised by this. Why does R not do this automatically? What is
> the case for keeping the DLL loaded after the package has been
> unloaded? What happens if you reload another version of the same
> package from a different library after unloading?
>


Re: [Rd] Request: Increasing MAX_NUM_DLLS in Rdynload.c

2016-12-21 Thread Karl Millar via R-devel
It does, but you'd still be relying on the R code ensuring that all of
these objects are dead prior to unloading the DLL, otherwise they'll
survive the GC.  Maybe if the package counted how many such objects
exist, it could work out when it's safe to remove the DLL.  I'm not
sure that it can be done automatically.
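
Roughly like this, say (a sketch only; C_new_handle is a hypothetical
native constructor returning an external pointer):

.live <- new.env()
.live$n <- 0L

new_handle <- function() {
    h <- .Call(C_new_handle)
    .live$n <- .live$n + 1L
    reg.finalizer(h, function(e) .live$n <- .live$n - 1L)
    h
}

## only consider unloading the DLL once this returns TRUE
dll_unload_ok <- function() .live$n == 0L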

What could be done is to keep the DLL loaded, but remove it from
R's table of loaded DLLs.  That way, there's no risk of dangling
function pointers and a new DLL of the same name could be loaded.  You
could still run into issues though as some DLLs assume that the
associated namespace exists.

Currently what I do is to never unload DLLs.  If I need to replace
one, then I just restart R.  It's less convenient, but it's always
correct.


On Wed, Dec 21, 2016 at 9:10 AM, Henrik Bengtsson wrote:
> On Tue, Dec 20, 2016 at 7:39 AM, Karl Millar  wrote:
>> It's not always clear when it's safe to remove the DLL.
>>
>> The main problem that I'm aware of is that native objects with
>> finalizers might still exist (created by R_RegisterCFinalizer etc).
>> Even if there are no live references to such objects (which would be
>> hard to verify), it still wouldn't be safe to unload the DLL until a
>> full garbage collection has been done.
>>
>> If the DLL is unloaded, then the function pointer that was registered
>> now becomes a pointer into the memory where the DLL was, leading to an
>> almost certain crash when such objects get garbage collected.
>
> Very good point.
>
> Does base::gc() perform such a *full* garbage collection and thereby
> trigger all remaining finalizers to be called?  In other words, do you
> think an explicit call to base::gc() prior to cleaning out left-over
> DLLs (e.g. R.utils::gcDLLs()) would be sufficient?
>
> /Henrik
>
>>
>> A better approach would be to just remove the limit on the number of
>> DLLs, dynamically expanding the array if/when needed.
>>
>>
>> On Tue, Dec 20, 2016 at 3:40 AM, Jeroen Ooms  
>> wrote:
>>> On Tue, Dec 20, 2016 at 7:04 AM, Henrik Bengtsson
>>>  wrote:
>>>> One reason for hitting the MAX_NUM_DLLS (= 100) limit is because some
>>>> packages don't unload their DLLs when they are being unloaded themselves.
>>>
>>> I am surprised by this. Why does R not do this automatically? What is
>>> the case for keeping the DLL loaded after the package has been
>>> unloaded? What happens if you reload another version of the same
>>> package from a different library after unloading?
>>>


Re: [Rd] unlicense

2017-01-17 Thread Karl Millar via R-devel
Please don't use 'Unlimited' or 'Unlimited + ...'.

Google's lawyers don't recognize 'Unlimited' as being open-source, so
our policy doesn't allow us to use such packages due to lack of an
acceptable license.  To our lawyers, 'Unlimited + file LICENSE' means
something very different than it presumably means to Uwe.

Thanks,

Karl

On Sat, Jan 14, 2017 at 12:10 AM, Uwe Ligges wrote:
> Dear all,
>
> from "Writing R Extensions":
>
> The string ‘Unlimited’, meaning that there are no restrictions on
> distribution or use other than those imposed by relevant laws (including
> copyright laws).
>
> If a package license restricts a base license (where permitted, e.g., using
> GPL-3 or AGPL-3 with an attribution clause), the additional terms should be
> placed in file LICENSE (or LICENCE), and the string ‘+ file LICENSE’ (or ‘+
> file LICENCE’, respectively) should be appended to the
> corresponding individual license specification.
> ...
> Please note in particular that “Public domain” is not a valid license, since
> it is not recognized in some jurisdictions."
>
> So perhaps you aim for
> License: Unlimited
>
> Best,
> Uwe Ligges
>
>
>
>
>
> On 14.01.2017 07:53, Deepayan Sarkar wrote:
>>
>> On Sat, Jan 14, 2017 at 5:49 AM, Duncan Murdoch
>>  wrote:
>>>
>>> On 13/01/2017 3:21 PM, Charles Geyer wrote:


 I would like the unlicense (http://unlicense.org/) added to R
 licenses.  Does anyone else think that worthwhile?

>>>
>>> That's a question for you to answer, not to ask.  Who besides you thinks
>>> that it's a good license for open source software?
>>>
>>> If it is recognized by the OSF or FSF or some other authority as a FOSS
>>> license, then CRAN would probably also recognize it.  If not, then CRAN
>>> doesn't have the resources to evaluate it and so is unlikely to recognize
>>> it.
>>
>>
>> Unlicense is listed in https://spdx.org/licenses/
>>
>> Debian does include software "licensed" like this, and seems to think
>> this is one way (not the only one) of declaring something to be
>> "public domain".  The first two examples I found:
>>
>> https://tracker.debian.org/media/packages/r/rasqal/copyright-0.9.29-1
>>
>> https://tracker.debian.org/media/packages/w/wiredtiger/copyright-2.6.1%2Bds-1
>>
>> This follows the format explained in
>>
>> https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/#license-specification,
>> which does not explicitly include Unlicense, but does include CC0,
>> which AFAICT is meant to formally license something so that it is
>> equivalent to being in the public domain. R does include CC0 as a
>> shorthand (e.g., geoknife).
>>
>> https://www.debian.org/legal/licenses/ says that
>>
>> 
>>
>> Licenses currently found in Debian main include:
>>
>> - ...
>> - ...
>> - public domain (not a license, strictly speaking)
>>
>> 
>>
>> The equivalent for CRAN would probably be something like "License:
>> public-domain + file LICENSE".
>>
>> -Deepayan
>>
>>> Duncan Murdoch
>>>
>>>

Re: [Rd] unlicense

2017-01-17 Thread Karl Millar via R-devel
Unfortunately, our lawyers say that they can't give legal advice in
this context.

My question would be, what are people looking for that the MIT or
2-clause BSD license don't provide?  They're short, clear, widely
accepted and very permissive.  Another possibility might be to
dual-license packages with both an OSI-approved license and
whatever-else-you-like, e.g.  'MIT | ', but IIUC
there's a bunch more complexity there than just using an OSI-approved
license.

Karl


On Tue, Jan 17, 2017 at 3:35 PM, Uwe Ligges wrote:
>
>
> On 18.01.2017 00:13, Karl Millar wrote:
>>
>> Please don't use 'Unlimited' or 'Unlimited + ...'.
>>
>> Google's lawyers don't recognize 'Unlimited' as being open-source, so
>> our policy doesn't allow us to use such packages due to lack of an
>> acceptable license.  To our lawyers, 'Unlimited + file LICENSE' means
>> something very different than it presumably means to Uwe.
>
>
>
> Karl,
>
> thanks for this comment. What we like to hear now is a suggestion what the
> maintainer is supposed to do to get what he aims at, as we already know that
> "freeware" does not work at all and was hard enough to get to the
> "Unlimited" options.
>
> We have many CRAN requests asking for what they should write for "freeware".
> Can we get an opinion from your lawyers which standard license comes closest
> to what these maintainers probably aim at and will work more or less
> globally, i.e. not only in the US?
>
> Best,
> Uwe
>
>
>
>
>> Thanks,
>>
>> Karl
>>
>> On Sat, Jan 14, 2017 at 12:10 AM, Uwe Ligges
>>  wrote:
>>>
>>> Dear all,
>>>
>>> from "Writing R Extensions":
>>>
>>> The string ‘Unlimited’, meaning that there are no restrictions on
>>> distribution or use other than those imposed by relevant laws (including
>>> copyright laws).
>>>
>>> If a package license restricts a base license (where permitted, e.g.,
>>> using
>>> GPL-3 or AGPL-3 with an attribution clause), the additional terms should
>>> be
>>> placed in file LICENSE (or LICENCE), and the string ‘+ file LICENSE’ (or
>>> ‘+
>>> file LICENCE’, respectively) should be appended to the
>>> corresponding individual license specification.
>>> ...
>>> Please note in particular that “Public domain” is not a valid license,
>>> since
>>> it is not recognized in some jurisdictions."
>>>
>>> So perhaps you aim for
>>> License: Unlimited
>>>
>>> Best,
>>> Uwe Ligges
>>>
>>>
>>>
>>>
>>>
>>> On 14.01.2017 07:53, Deepayan Sarkar wrote:


 On Sat, Jan 14, 2017 at 5:49 AM, Duncan Murdoch
  wrote:
>
>
> On 13/01/2017 3:21 PM, Charles Geyer wrote:
>>
>>
>>
>> I would like the unlicense (http://unlicense.org/) added to R
>> licenses.  Does anyone else think that worthwhile?
>>
>
> That's a question for you to answer, not to ask.  Who besides you
> thinks
> that it's a good license for open source software?
>
> If it is recognized by the OSF or FSF or some other authority as a FOSS
> license, then CRAN would probably also recognize it.  If not, then CRAN
> doesn't have the resources to evaluate it and so is unlikely to
> recognize
> it.



 Unlicense is listed in https://spdx.org/licenses/

 Debian does include software "licensed" like this, and seems to think
 this is one way (not the only one) of declaring something to be
 "public domain".  The first two examples I found:

 https://tracker.debian.org/media/packages/r/rasqal/copyright-0.9.29-1


 https://tracker.debian.org/media/packages/w/wiredtiger/copyright-2.6.1%2Bds-1

 This follows the format explained in


 https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/#license-specification,
 which does not explicitly include Unlicense, but does include CC0,
 which AFAICT is meant to formally license something so that it is
 equivalent to being in the public domain. R does include CC0 as a
 shorthand (e.g., geoknife).

 https://www.debian.org/legal/licenses/ says that

 

 Licenses currently found in Debian main include:

 - ...
 - ...
 - public domain (not a license, strictly speaking)

 

 The equivalent for CRAN would probably be something like "License:
 public-domain + file LICENSE".

 -Deepayan

> Duncan Murdoch
>
>

Re: [Rd] Control statements with condition of length greater than one should give error (not just warning) [PATCH]

2017-03-07 Thread Karl Millar via R-devel
Is there anything that actually requires R core members to manually do
significant amounts of work here?

IIUC, you can do a CRAN run to detect the broken packages, and a simple
script can collect the emails of the affected maintainers, so you can send
a single email to them all.  If authors don't respond by fixing their
packages, then those packages should be archived, since there's high
probability of those packages being buggy anyway.
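
For instance, roughly (a sketch assuming tools::CRAN_package_db() and
a vector of affected package names from the check run):

db <- tools::CRAN_package_db()
affected <- c("pkgA", "pkgB")  # hypothetical: from the check results
writeLines(unique(db$Maintainer[db$Package %in% affected]))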

If you expect a non-trivial amount of questions regarding this change from
the affected package maintainers, then you can create a FAQ page for it,
which you can fill in as questions arrive, so you don't get too many
duplicated questions.

Karl

On Mon, Mar 6, 2017 at 4:51 AM, Martin Maechler wrote:

> > Michael Lawrence 
> > on Sat, 4 Mar 2017 12:20:45 -0800 writes:
>
> > Is there really a need for these complications? Packages
> > emitting this warning are broken by definition and should be fixed.
>
> I agree and probably Henrik, too.
>
> (Others may disagree to some extent .. and find it convenient
>  that R does translate 'if(x)'  to  'if(x[1])'  for them albeit
>  with a warning .. )
>
> > Perhaps we could "flip the switch" in a test
> > environment and see how much havoc is wreaked and whether
> > authors are sufficiently responsive?
>
> > Michael
>
> As we have > 10'000 packages on CRAN alone,  and people have
> started (mis)using suppressWarnings(.) in many places,  there
> may be considerably more packages affected than we optimistically assume...
>
> As R core member who would  "flip the switch"  I'd typically then
> have to be the one sending an e-mail to all package maintainers
> affected and in this case I'm very reluctant to volunteer
> for that and so, I'd prefer the environment variable where R
> core and others can decide how to use it .. for a while .. until
> the flip is switched for all.
>
> or have I overlooked an issue?
>
> Martin
>
> > On Sat, Mar 4, 2017 at 12:04 PM, Martin Maechler wrote:
>
> >> > Henrik Bengtsson
> >> >     on Fri, 3 Mar 2017 10:10:53 -0800 writes:
> >>
> >> > On Fri, Mar 3, 2017 at 9:55 AM, Hadley Wickham wrote:
> >> >>> But, how you propose a warning-to-error transition
> >> >>> should be made without wreaking havoc?  Just flip the
> >> >>> switch in R-devel and see CRAN and Bioconductor packages
> >> >>> break overnight?  Particularly Bioconductor devel might
> >> >>> become non-functional (since at times it requires
> >> >>> R-devel).  For my own code / packages, I would be able
> >> >>> to handle such a change, but I'm completely out of
> >> >>> control if one of the packages I'm depending on does not
> >> >>> provide a quick fix (with the only option to remove
> >> >>> package tests for those dependencies).
> >> >>
> >> >> Generally, a package can not be on CRAN if it has any
> >> >> warnings, so I don't think this change would have any
> >> >> impact on CRAN packages.  Isn't this also true for
> >> >> bioconductor?
> >>
> >> > Having a tests/warn.R file with:
> >>
> >> > warning("boom")
> >>
> >> > passes through R CMD check --as-cran unnoticed.
> >>
> >> Yes, indeed.. you are right Henrik that many/most R
> >> warning()s would not produce R CMD check 'WARNING's ..
> >>
> >> I think Hadley and I fell into the same mental pit of
> >> concluding that such warning()s from
> >> if() ...  would not currently happen
> >> in CRAN / Bioc packages and hence turning them to errors
> >> would not have a direct effect.
> >>
> >> With your 2nd e-mail of saying that you'd propose such an
> >> option only for a few releases of R you've indeed
> >> clarified your intent to me.  OTOH, I would prefer using
> >> an environment variable (as you've proposed as an
> >> alternative) which is turned "active" at the beginning
> >> only manually or for the "CRAN incoming" checks of the
> >> CRAN team (and bioconductor submission checks?)  and
> >> later for '--as-cran' etc until it eventually becomes the
> >> unconditional behavior of R (and the env.variable is no
> >> longer used).
> >>
> >> Martin
> >>


Re: [Rd] segfault when trying to allocate a large vector

2014-12-18 Thread Karl Millar via R-devel
Hi Pierrick,

You're storing largevec on the stack, which is probably causing a stack
overflow.  Allocate largevec on the heap with malloc (or one of the R
memory allocation routines, e.g. R_alloc) instead and it should work fine.

Karl

On Thu, Dec 18, 2014 at 12:00 AM, Pierrick Bruneau wrote:
>
> Dear R contributors,
>
> I'm running into trouble when trying to allocate some large (but in
> theory viable) vector in the context of C code bound to R through
> .Call(). Here is some sample code summarizing the problem:
>
> SEXP test() {
>
> int size = 1000;
> double largevec[size];
> memset(largevec, 0, size*sizeof(double));
> return(R_NilValue);
>
> }
>
> If size if small enough (up to 10^6), everything is fine. When it
> reaches 10^7 as above, I get a segfault. As far as I know, a double
> value is represented with 8 bytes, which would make largevec above
> approx. 80 MB -> this is certainly large for a single variable, but
> should remain well below the limits of my machine... Also, doing a
> calloc for the same vector size leads to the same outcome.
>
> In my package, I would use large vectors that cannot be assumed to be
> sparse - so utilities for sparse matrices may not be considered.
>
> I run R on ubuntu 64-bit, with 8G RAM, and a 64-bit R build (3.1.2).
> As my problem looks close to that seen in
> http://r.789695.n4.nabble.com/allocMatrix-limits-td864864.html,
> following what I have seen in ?"Memory-limits" I checked that ulimit
> -v returns "unlimited".
>
> I guess I must miss something, like contiguity issues, or other. Does
> anyone have a clue for me?
>
> Thanks by advance,
> Pierrick
>


Re: [Rd] [PATCH] Makefile: add support for git svn clones

2015-01-19 Thread Karl Millar via R-devel
Fellipe,

CXXR development has moved to github, and we haven't fixed up the build for
using git yet.  Could you send a pull request with your change to the repo
at https://github.com/cxxr-devel/cxxr/?

Also, this patch may be useful for pqR too.
https://github.com/radfordneal/pqR

Thanks

On Mon, Jan 19, 2015 at 2:35 PM, Dirk Eddelbuettel  wrote:

>
> On 19 January 2015 at 17:11, Duncan Murdoch wrote:
> | The people who would have to maintain the patch can't test it.
>
> I don't understand this.
>
> The patch, as we may want to recall, was all of
>
>    +GIT := $(shell if [ -d "$(top_builddir)/.git" ]; then \
>    +             echo "git"; fi)
>    +
>
> and
>
>    -  (cd $(srcdir); LC_ALL=C TZ=GMT svn info || $(ECHO) "Revision: -99") 2> /dev/null \
>    +  (cd $(srcdir); LC_ALL=C TZ=GMT $(GIT) svn info || $(ECHO) "Revision: -99") 2> /dev/null \
>
> I believe you can test that builds works before applying the patch, and
> afterwards---even when you do not have git, or in this case a git checkout.
> The idiom of expanding a variable to "nothing" if not set is used all over
> the R sources and can be assumed common.  And if (hypothetically speaking)
> the build failed when a .git directory was present?  None of R Core's
> concern
> either as git was never supported.
>
> I really do not understand the excitement over this.  The patch is short,
> clean, simple, and removes an entirely unnecessary element of friction.
>
> Dirk
>
> --
> http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
>


Re: [Rd] Recycling memory with a small free list

2015-02-19 Thread Karl Millar via R-devel
If you link to tcmalloc instead of the default malloc on your system, the
performance of large allocations should improve.  On unix machines you
don't even need to recompile -- you can do this with LD_PRELOAD.  The
downside is that you'll almost certainly end up with higher average memory
usage, as tcmalloc never returns memory to the OS.
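
On Linux you can verify from inside R that the preload took effect,
e.g. (library path illustrative):

## started from a shell as:  LD_PRELOAD=/usr/lib/libtcmalloc.so.4 R
any(grepl("tcmalloc", readLines("/proc/self/maps")))  # TRUE if active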

It would also be worth checking what jemalloc does with large allocations.


It may well be worth tweaking the way that large allocations are handled in
R -- most allocation libraries assume that large allocations are infrequent
and that you won't be frequently requesting the same sized memory block.
Those assumptions don't hold in R.  On the other hand, I don't see much
benefit to R having its own logic for handling small allocations, as most
malloc implementations handle those extremely efficiently.

Karl

On Thu, Feb 19, 2015 at 10:15 AM,  wrote:

> On Wed, 18 Feb 2015, Nathan Kurz wrote:
>
>> On Wed, Feb 18, 2015 at 7:19 AM, Radford Neal wrote:
>>
>>> ... with assignments inside of loops like this:

>>>> reweight = function(iter, w, Q) {
>>>>   for (i in 1:iter) {
>>>>     wT = w * Q
>>>>   }
>>>> }
>>>> ... before the RHS is executed, the LHS allocation would be added
>>>> to a small fixed length list of available space which is checked
>>>> before future allocations.   If the same size is requested before the
>>>> next garbage collection, the allocation is short-circuited and the
>>>> allocation is reused.   This list could be very small, possibly even
>>>> only a single entry.  Entries would only be put on the list if they
>>>> have no other references.

>>>
>> Here's an article about the benefits of this approach in Go that might
>> explain better than I was able:
>> https://blog.cloudflare.com/recycling-memory-buffers-in-go/
>> Their charts explain the goal very clearly: stabilize at a smaller
>> amount of memory to reduce churn, which improves performance in a
>> myriad of ways.
>>
>
> Thanks -- will have a look.
>
>
>>> Reusing the LHS storage immediately isn't possible in general, because
>>> evaluation of the RHS might produce an error, in which case the LHS
>>> variable is supposed to be unchanged.
>>>
>>
>> What's the guarantee R actually makes?  What's an example of the use
>> case where this behaviour would be required? More generally, can one
>> not assume "a = NULL; a = func()" is equivalent to "a = func()" unless
>> func() references 'a' or has it as an argument?  Or is the difficulty
>> that there is no way to know in advance it if will be referenced?
>>
>>> Detecting special cases where
>>> there is guaranteed to be no error, or at least no error after the
>>> first modification to newly allocated memory, might be too
>>> complicated.
>>>
>>
>> Yes, if required, the complexity of guaranteeing this might  well rule
>> out the approach I suggested.
>>
>>> Putting the LHS storage on a small free list for later reuse (only
>>> after the old value of the variable will definitely be replaced) seems
>>> more promising (then one would need only two copies for examples such
>>> as above, with them being used in alternate iterations).
>>>
>>
>> OK, let's consider that potentially easier option instead:  do nothing
>> immediately, but add a small queue for recycling from which the
>> temporary might be drawn.   It has slightly worse cache behavior, but
>> should handle most of the issues with memory churn.
>>
>>> However,
>>> there's a danger of getting carried away and essentially rewriting
>>> malloc.  To avoid this, one might try just calling "free" on the
>>> no-longer-needed object, letting "malloc" then figure out when it can
>>> be re-used.
>>>
>>
>> Yes, I think that's what I was anticipating:  add a free() equivalent
>> that does nothing if the object has multiple references/names, but
>> adds the object to small fixed size "free list" if it does not.
>> Perhaps this is only for certain types or for objects above a certain
>> size.
>>
>> When requesting memory, allocvector() or perhaps R_alloc() does a
>> quick check of that "free list" to see if it has anything of the exact
>> requested size.  If it does, it short circuits and recycles it.  If it
>> doesn't, normal allocation takes place.
>>
>> The "free list" is stored as two small fixed size arrays containing
>> size/address pairs.   Searching is done linearly using code that
>> optimizes to SIMD comparisons.   For 4/8/16 slots overhead of the
>> search should be unmeasurably fast.
>>
>> The key to the approach would be keeping it simple, and realizing that
>> the goal is only to get the lowest hanging fruit:  repeated
>> assignments of large arrays used in a loop.  If it's complex, skip it
>> --- the behavior will be no worse than current.
>>
>> By the way, what's happening with Luke's refcnt patches?  From the
>> outside, they seem like a great improvement.
>> http://homepage.stat.uiowa.edu/~luke/talks/dsc2014.pdf
>> http://developer.r-project.org/Refcnt.html
>> Are they slated to beco