Re: [Rd] source bug ? (PR#7929)

2005-06-10 Thread Thomas Lumley
On Fri, 10 Jun 2005 [EMAIL PROTECTED] wrote:

> hello bug fixers
>
> i think this bug has probably been mentioned before but i couldn't find the
> answer to it on google.
> i have to build R from the source files ( on a mac os x system )
> because i need the python R interface to work. for this to happen R
> needs to be configured in the following way
>
> ./configure --enable-R-shlib
>

The R for Mac OS X FAQ tells you how to configure.  You do need the
   --with-blas='-framework vecLib' --with-lapack 
flags that it lists.
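
Combined with the flag you need for the Python interface, that would be 
something like (check the FAQ for the exact current flags):

    ./configure --enable-R-shlib --with-blas='-framework vecLib' --with-lapack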

-thomas

__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Citation for R

2005-06-13 Thread Thomas Lumley
On Mon, 13 Jun 2005, Gordon K Smyth wrote:

> This is just a note that R would get a lot more citations if the 
> recommended citation was an article in a recognised journal or from a 
> recognised publisher.
>

This is unfortunately true, but R is *not* an article or a book, it is a 
piece of software.  I don't think I'm the only person who thinks it is 
counterproductive in the long run to encourage users to cite an article 
that they probably haven't read instead of citing the software they 
actually used.

Jan's suggestion of the Journal of Statistical Software might provide a 
solution, since JSS *does* publish software.

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] (no subject)

2005-06-13 Thread Thomas Lumley

Online registration for DSC 2005 in Seattle, August 13-14 is now open. See 
the conference web page at
http://depts.washington.edu//dsc2005/

-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] DSC 2005

2005-06-13 Thread Thomas Lumley

Persons conversant with W3C standards would have noticed too many // in the URL 
quoted in my previous message.  The correct conference page URL is

   http://depts.washington.edu/dsc2005

   -thomas

On Mon, 13 Jun 2005, Thomas Lumley wrote:

>
> Online registration for DSC 2005 in Seattle, August 13-14 is now open. See
> the conference web page at
> http://depts.washington.edu//dsc2005/
>

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Open device -> glibc 2.3.4 bug for Redhat Enterprise 4?

2005-06-21 Thread Thomas Lumley

This was supposed to be fixed in 2.1.1 -- which version are you using?

-thomas

On Tue, 21 Jun 2005, Martin Maechler wrote:

> We have been using Redhat Enterprise 4, on some of our Linux
> clients for a while,
> and Christoph has just found that opening an R device for a file
> without write permission gives a bad glibc error and subsequent
> seg.fault:
>
>> postscript("/blabla.ps")
> *** glibc detected *** double free or corruption (!prev): 0x01505f10 
> ***
>
> or
>
>> xfig("/blabla.fig")
> *** glibc detected *** double free or corruption (!prev): 0x01505f10 
> ***
>
> and similar for pdf();
> does not happen for jpeg() {which runs via x11},
> nor e.g. for
>
>> sink("/bla.txt")
>
> ---
>
> Happens both on 32-bit (Pentium) and 64-bit (AMD Athlon)
> machines with the following libc :
>
> 32-bit:
>  -rwxr-xr-x  1 root root 1451681 May 13 00:17 /lib/tls/libc-2.3.4.so*
> 64-bit:
>  -rwxr-xr-x  1 root root 1490956 May 12 23:26 /lib64/tls/libc-2.3.4.so*
>
> ---
>
> Can anyone reproduce this problem?
>
> Regards,
> Martin
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Open device -> glibc 2.3.4 bug for Redhat Enterprise 4?

2005-06-21 Thread Thomas Lumley
On Tue, 21 Jun 2005, Martin Maechler wrote:

>>>>>> "TL" == Thomas Lumley <[EMAIL PROTECTED]>
>>>>>> on Tue, 21 Jun 2005 09:59:31 -0700 (PDT) writes:
>
>TL> This was supposed to be fixed in 2.1.1 -- which version are you using?
>
> 2.1.1 -- and 2.1.0 and 2.0.0 all showed the problem.
>
> But thanks, Thomas, looking in "NEWS" of R-devel showed that
> there was a fix for this in R-devel only --- too bad it didn't
> make it for R 2.1.1.
>

It's a double free(), so it produces undefined behaviour and anything can 
happen.  On the Mac (where it was first reported), "anything" was a 
warning message from malloc, and on many systems "anything" is nothing.

I thought I had added it to R-patched as well, but obviously not.

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] segmentation fault after max level of parenthesis (PR#7990)

2005-07-04 Thread Thomas Lumley

Already fixed, I believe.

-thomas

On Mon, 4 Jul 2005 [EMAIL PROTECTED] wrote:

> Full_Name: Matthias Laabs
> Version: 2.1.0 (source compiled)
> OS: debian linux (sarge)
> Submission from: (NULL) (195.143.236.2)
>
>
> Hi!
> R crashes with a segmentation fault when I use more than 85 parentheses (it
> actually happened by accidentally hitting a wrong shortcut in emacs ... )
>
> code:
> sum(()
>
>
> cheers from Berlin,
>
> Matthias
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] R_alloc problems in R v1.11

2005-07-06 Thread Thomas Lumley

On Wed, 6 Jul 2005, [iso-8859-1] Marie-Hélène Ouellette wrote:


> Dear Dr. Ripley,


Or possibly other people on the list.


> I'm using the R v1.11 on Macintosh


Unlikely. There is no R v1.11, and R 1.1.1 (following the most common 
misspelling) wasn't available for the Mac.



> and I seem to have a problem with the
> function R_alloc. It crashes when using the following .C function (only an
> example):
>
> ///
> #include <R.h>
> void Hello(int *n)
> {
> int i,x;
> for(i=1;1< *n ; i++)


This is an infinite loop.  You probably mean i<*n as the second 
expression, or perhaps i<=*n (did you want n or n-1 repeats?)



> {
> Rprintf('salut!!!\n');


This is invalid C, as your compiler should have told you. You need double 
quotes.



> }
> x = (int *) R_alloc(5,sizeof(int));


x was declared as int, not int *, so again this is invalid.


> }
>
> ///
>
> I call it in R with this line:
>
> .C('Hello',as.integer(5))
>
> Any idea why and how I can resolve this problem?


After fixing the C errors and in a version of R that exists (2.0.1) I get

dyn.load("hello.so")
.C('Hello',as.integer(5))

salut!!!
salut!!!
salut!!!
salut!!!
[[1]]
[1] 5

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Problem with dyn.load...or else...

2005-07-06 Thread Thomas Lumley

On Wed, 6 Jul 2005, [iso-8859-1] Marie-Hélène Ouellette wrote:


> And try to use the function:
>> K_MEANSR(tab,centers=c(2,4))
> [1] "AA"
> [1] "AAA"
> [1] "A"
> [1] "B"
> Error in .C("K_MEANSC", xrows = as.integer(xrows), xcols =
> as.integer(xcols),  :
> "C" function name not in load table



Hmm. Strange.  R doesn't think your C function is called K_MEANSC (I 
assume that K_MEANSC *is* actually the name of your C function.)


In a Terminal window you can use
  nm K_MEANSC.so
to see all the names of externally visible objects in your DLL. This would 
narrow down whether the change of names is happening in R or in 
compilation.


-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] R_AllocatePtr

2005-07-19 Thread Thomas Lumley
On Tue, 19 Jul 2005, Paul Roebuck wrote:

> Had been looking into Luke Tierney's R_AllocatePtr() and
> was left with a question about exactly when does R reclaim
> heap memory. Implication of 'simpleref.nw' is that one can
> allocate C data on the R heap, and as long as pointer object
> is alive, the data and pointer will remain valid. But it
> calls allocString() which is implemented using R_alloc().

Um, no.  allocString is

SEXP allocString(int length)
{
    return allocVector(CHARSXP, length);
}

In fact the reverse is true, R_alloc is implemented using allocString.

-thomas


Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Follow-Up: R on FC4

2005-08-01 Thread Thomas Lumley
On Mon, 1 Aug 2005, Gavin Simpson wrote:
> R-devel compiles without error (same set of flags as above) but fails
> make check-all in p-r-random-tests.R - the relevant section of p-r-
> random-tests.Rout.fail is:
>
> ...
>> dkwtest("norm")
> norm() PASSED
> [1] TRUE
>> dkwtest("norm",mean = 5,sd = 3)
> norm(mean = 5, sd = 3) PASSED
> [1] TRUE
>>
>> dkwtest("gamma",shape =  0.1)
> gamma(shape = 0.1) PASSED
> [1] TRUE
>> dkwtest("gamma",shape =  0.2)
> gamma(shape = 0.2) PASSED
> [1] TRUE
>> dkwtest("gamma",shape = 10)
> gamma(shape = 10) FAILED
> Error in dkwtest("gamma", shape = 10) : dkwtest failed
> Execution halted
>
> Is this a tolerance setting in R's tests not being met or is this
> indicative of a bad compilation of R-Devel on FC4 with gfortran?
>

Try this again -- it is a random test.  If it fails again then something 
is wrong.  The tolerance on these tests is not very tight, certainly 
nowhere near rounding error.

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] bug? (PR#8074)

2005-08-16 Thread Thomas Lumley
On Wed, 17 Aug 2005 [EMAIL PROTECTED] wrote:
> I just don't understand this:
>
>> (2*2)==4
> [1] TRUE
>> .2*.2
> [1] 0.04
>> (.2*.2)==.04
> [1] FALSE

It's a FAQ, not a bug. Consider:

> (.2*.2) - .04
[1] 6.938894e-18

and read the FAQ
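
The comparison the FAQ recommends is equality up to numerical tolerance:

> isTRUE(all.equal(.2*.2, .04))
[1] TRUE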

-thomas


> or
>
>> x=.04
>> x
> [1] 0.04
>> y=.2*.2
>> y
> [1] 0.04
>> y==x
> [1] FALSE
>
> ______
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] bug? (PR#8074)

2005-08-16 Thread Thomas Lumley
On Tue, 16 Aug 2005, Paul Mosquin wrote:
> I guess that I expect R to act pretty much as C or C++ would do if I were to 
> program the same code.  It's a bit of a surprise that assignment of 
> rationals, well within precision, followed by multiplication leading to a 
> result well within precision picks up those extra bits along the way. 
> Something to watch out for, to be sure.

But those rationals are *not* well within precision. 0.2 is an infinite 
repeating binary fraction (in base 16 it is 0.3333...) so it is not stored 
precisely. 0.04 is also not stored precisely, and it so happens that the 
error in representing 0.04 is not the same as the error in representing 
0.2*0.2.

Of course this will still happen in C: R is written in C.
For example, on my computer the following C program
---
#include <stdio.h>

int main(){
   double d=0.2;
   double dd;

   dd=d*d;
   if(dd==0.04)
      printf("Equal\n");
   else
      printf("Difference=%20.18f\n",0.04-dd);
   return 0;
}
--
prints
[al:~] thomas% ./a.out
Difference=-0.000000000000000007

which happens to agree with the result R gives, though this isn't 
guaranteed.  You simply cannot rely on floating point equality unless you 
know how the last bit rounding errors are handled.

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Memory leakage/violation?

2005-08-26 Thread Thomas Lumley

I can't reproduce this on R2.2.0dev on Windows XP (in a few hundred 
tries), or running under Valgrind on AMD64 Linux (in four or five tries).

-thomas


On Fri, 26 Aug 2005, Henrik Bengtsson wrote:

> Hi,
>
> I've spotted a possible memory leakage/violation in the latest R v2.1.1
> patched and R v2.2.0dev on Windows XP Pro SP2 Eng.
>
> I first caught it deep down in a nested svd algorithm when subtracting a
> double 'c' from an integer vector 'a' where both had finite values, but
> where assigning 'a <- a - c' would report NaNs whereas (a-c) alone would
> not.  Different runs with identical data would introduce NaNs at
> random positions, but not all the time.
>
> Troubleshooting is after a couple of hours still at v0.5, but here is a
> script that generates the strange behavior on the above R setups.  I let
> the script speak for itself.  Note that both the script 'strange.R' and
> the data 'strange.RData' are online too; see code below.
>
> People on other systems (but also on Windows), could you please try it
> and see if you can reproduce what I get.
>
> Cheers
>
> Henrik
>
>
> # The following was tested on: Windows XP Pro SP2 Eng with
> #   i) R Version 2.1.1 Patched (2005-08-25)
> #  ii) R 2.2.0 Under development (unstable) (2005-08-25 r35394M)
>
> # Start 'R --vanilla' and source() this script, i.e.
> #  source("http://www.maths.lth.se/help/R/strange.R";)
> # If you do not get any errors, retry a few times.
>
> foo <- function(x) {
>   print(list(
>     name=as.character(substitute(x)),
>     storage.mode=storage.mode(x),
>     na=any(is.na(x)),
>     nan=any(is.nan(x)),
>     inf=any(is.infinite(x)),
>     ok=all(is.finite(a))
>   ))
>   print(length(x))
>   print(summary(x))
> }
>
> # Load data from a complicated "non-reproducible" algorithm.
> # The below errors occur also when data is not
> # saved and then reloaded from file.  Data was generated in
> # R v2.1.1 patched (see above).
> if (file.exists("strange.RData")) {
>   load("strange.RData")
> } else {
>   load(url("http://www.maths.lth.se/help/R/strange.RData";))
> }
>
> # First glance at data...
> foo(a)
> foo(c)
>
> ## $name
> ## [1] "a"
> ##
> ## $storage.mode
> ## [1] "integer"
> ##
> ## $na
> ## [1] FALSE
> ##
> ## $nan
> ## [1] FALSE
> ##
> ## $inf
> ## [1] FALSE
> ##
> ## $ok
> ## [1] TRUE
> ##
> ## [1] 15000
> ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> ##    41.0    51.0    63.0   292.2   111.0 65170.0
> ## $name
> ## [1] "c"
> ##
> ## $storage.mode
> ## [1] "double"
> ##
> ## $na
> ## [1] FALSE
> ##
> ## $nan
> ## [1] FALSE
> ##
> ## $inf
> ## [1] FALSE
> ##
> ## $ok
> ## [1] TRUE
> ##
> ## [1] 1
> ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> ##   53.43   53.43   53.43   53.43   53.43   53.43
> ##
>
> # But, trying the following, will result in
> # no-reproducible error messages. Sometimes
> # it errors at kk==1, sometimes at kk >> 1.
> # Also, look at the different output for
> # different kk:s.
> for (kk in 1:100) {
>   cat("kk=",kk, "\n")
>   print(summary(a-c))
> }
>
> ## kk= 1
> ##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max.
> ## -7.741e+307  -2.431e+00   9.569e+00   5.757e+01
> ## kk= 2
> ##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
> ##   -12.430    -2.431     9.569   238.700    57.570 65120.000
> ## kk= 3
> ##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
> ##   -12.430    -2.431     9.569             57.570 65120.000
> ## kk= 4
> ##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
> ##   -12.430    -2.431     9.569   238.700    57.570 65120.000
> ## kk= 5
> ##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
> ##   -12.430    -2.431     9.569   238.700    57.570 65120.000
> ## kk= 6
> ## Error in quantile.default(object) : missing values and NaN's
> ## not allowed if 'na.rm' is FALSE
>
>
> ## Comments: If you shorten down 'a', the bug occurs less frequently.
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Compile warning in unique.c

2005-08-29 Thread Thomas Lumley
On Mon, 29 Aug 2005, Harris, Michael (NIH/NCI) [E] wrote:
>
> I am getting a compile warning when building R from source.  I am building
> on an AMD64 Opteron system with gcc (GCC) 3.3.3 (SuSE Linux)
>
> The warning is:
>
>unique.c: In function `cshash':
>
> unique.c:1146: warning: cast from pointer to integer of different size
>

The comment immediately above this suggests that it is deliberate

  /* Use hashing to improve object.size. Here we want equal CHARSXPs,
 not equal contents.  This only uses the bottom 32 bits of the pointer,
 but for now that's almost certainly OK */

The warning is presumably because casting this int back to a pointer would 
fail (a common 32-to-64-bit conversion error), but that's not what 
is happening here.

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] .Call and Segmentation Fault

2005-08-30 Thread Thomas Lumley
On Tue, 31 Aug 2005, Peter Dalgaard wrote:

> Well, did you try running under a debugger?
>
>  R -d gdb
>
> then "bt" after the segfault. (Make sure that things are compiled with
> -g option)

It would also be worth renaming the function -- while I don't see 
exactly how it could be causing the problem, I would be nervous about 
having a C function called main() that wasn't the real main().

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 64 bit R for Windows

2005-09-02 Thread Thomas Lumley
On Fri, 2 Sep 2005, Martin Maechler wrote:

>> "PD" == Peter Dalgaard <[EMAIL PROTECTED]>
>> on 02 Sep 2005 18:48:24 +0200 writes:
>
>PD> "Milton Lopez" <[EMAIL PROTECTED]> writes:
>
>>> I appreciate the update. We will consider using Linux,
>>> which leads me to one more question: what is the maximum
>>> RAM that R can use on each platform (Linux and Windows)?
>>>
>>> Thanks again for your prompt responses.
>
>PD> On Win32, something like 3GB. Maybe a little more on
>PD> Linux32, but there's a physical limit at 4GB.
>
> for a *single* object, yes.  However (and Peter knows this
> probably better than me ..), R's workspace can be very much
> larger which makes it realistically possible to start *using* R
> functions on objects of around 4GB.

No, no.  On *Windows* there is an address space limit of about 3Gb (as on 
other 32-bit systems).

On a 64-bit system the limit is that a vector can't have length greater 
than 2^31, but this would be 8Gb for integers or 16Gb for doubles and so 
represents larger objects than you would want to handle on most current 
64-bit systems.
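
(R's vector lengths are C ints, so the precise cap is

> .Machine$integer.max
[1] 2147483647

elements per vector.)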

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] .Call with C and Fortran together (PR#8122)

2005-09-05 Thread Thomas Lumley

>
> On some machines I don't get the segmentation fault problem, but I don't
> get the message "Just a simple test" either (when using "cg" as the
> subroutine's name).  I believe this is a bug in R because if I change my C
> interface again to return a 0 instead of an R_NilValue, and then use it
> with another C program which loads the dynamic library and calls the
> function simple_program(), everything works perfectly.
>

I don't think it is an R bug.  I think it is because there is already a 
Fortran function called cg in R. The fact that changing the name matters 
suggests that you have a linking problem, and this turns out to be the 
case.

When I try running your code under gdb in R as Peter Dalgaard suggested 
(after editing it to use R's macros for calling fortran from C instead of 
"cfortran.h" which I don't have), I get

> .Call("simple_program")
  Calling the function...

Program received signal SIGSEGV, Segmentation fault.
0x081604e5 in cg_ (nm=0x9e5dda4, n=0xbfefccfc, ar=0xbfefcce8, ai=0x89a826,
 wr=0x9e5dda4, wi=0x9790cc0, matz=0x56090a0, zr=0x80992d4, zi=0x0, 
fv1=0x0,
 fv2=0x9e745f8, fv3=0x89a810, ierr=0x706d6973) at eigen.f:3416
3416  IERR = 10 * N
Current language:  auto; currently fortran


That is, your program is calling the Fortran subroutine CG in eigen.f, 
rather than your CG.

There should be some set of linker flags that makes sure your definition 
of CG is used, but I don't know what it would be (and it's probably very 
platform dependent)

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Issue tracking in packages [was: Re: [R] change in read.spss, package foreign?]

2005-09-09 Thread Thomas Lumley
On Fri, 9 Sep 2005, Gabor Grothendieck wrote:
>
> I personally put NEWS, WISHLIST and THANKS files in the 'inst'
> directory of all my source packages.  This has the effect of copying them to 
> the
> top level of the built version so that they are accessible from R via:
>

I'm not sure that WISHLIST and THANKS need to be available to people who 
haven't installed the package.   NEWS, on the other hand, really does.

One option (if it doesn't turn out to be too much work for the CRAN 
maintainers) would be to have an optional Changelog field in the 
DESCRIPTION file giving the relative path to the file. This would mean 
that maintainers would not all have to switch to the same format.
eg for foreign
   Changelog: ChangeLog
and for survey
   Changelog: inst/NEWS

This might be enough to make it easy for CRAN to display these when the 
maintainer provides them.

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Issue tracking in packages [was: Re: [R] change in read.spss, package foreign?]

2005-09-09 Thread Thomas Lumley
On Fri, 9 Sep 2005, Gabor Grothendieck wrote:
> How about if there were just a standard location and name such as inst/NEWS,
> inst/WISHLIST, inst/THANKS (which has the advantage that they are 
> automatically
> made available in the built package under the current way packages are
> built)

The problem is that there *isn't* a standard location. As Robert Gentleman 
has pointed out, if you only maintain two or three packages it isn't too 
bad to change them to some new layout, but if you are the bioconductor 
project it gets painful quite quickly.

Also, there are good reasons for having NEWS in the top level directory. 
Nearly everything that isn't an R package does this, because it's a useful 
standard.

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Issue tracking in packages [was: Re: [R] change in read.spss, package foreign?]

2005-09-10 Thread Thomas Lumley
>
> Standard location or a mechanism like the one you describe are both
> a similar amount of work (and not much at all); the HTML pages are
> generated by perl and I have the parsed DESCRIPTION file there, i.e.,
> using a fixed name or the value of the Changelog field is basically
> the same.
>

In which case a Changelog entry in DESCRIPTION would be a very nice 
addition, and would have the advantage of not requiring changes to 
packages.

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Issue tracking in packages [was: Re: [R] change in read.spss, package foreign?]

2005-09-10 Thread Thomas Lumley
On Sat, 10 Sep 2005, Gabor Grothendieck wrote:
>
> And one more comment.   The DESCRIPTION file does not record the
> location or existence of the various subdirectories such as R, man,
> exec, etc. If NEWS is to be recorded as a meta data line item in
> DESCRIPTION then surely all of these should be too so its symmetric
> and they are all on an equal footing (or else none of them
> should be, which in fact I think is preferable).
>

I don't see any advantage in symmetry.  The locations of these 
subdirectories are fixed and I can't see why someone trying to decide 
whether to install an upgrade needs to know if it has an exec 
subdirectory before they download the package.

I also don't see why THANKS and WISHLIST should need to be visible before 
you download the package.  CRAN does display a URL if one is given, and if 
these are important they could be at that URL.

The changelog, on the other hand, is one piece of information that is 
really valuable in deciding whether or not to update a package, so it 
would be worth having it visible on CRAN.  Since other coding standards 
suggest different things for the name and location of this file, a path in 
DESCRIPTION seems a minimal change.

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Issue tracking in packages [was: Re: [R] change in, read.spss, package foreign?]

2005-09-10 Thread Thomas Lumley
On Sat, 10 Sep 2005, Frank E Harrell Jr wrote:

> I would vote for allowing a URL or external file name in in DESCRIPTION,
> whose contents could be automatically displayed for the user when
> needed.  Our changelogs are automatically generated by CVS and are on
> the web.

Yes, this would be nice.

However, a URL facility is already present (and you already use it, and 
link changelogs to the URL, as do I).

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Issue tracking in packages

2005-09-10 Thread Thomas Lumley
On Sat, 10 Sep 2005, Seth Falcon wrote:
> For what it's worth, I don't like this idea of adding a ChangeLog field
> to the DESCRIPTION file.
>
> Agreeing upon a standard location for NEWS or CHANGES or some such
> seems a more simple solution.  As long as the presence of such a file
> is *optional*.  And if the location really needs to be at the top,
> then the build tools could grab it from there as they do the
> DESCRIPTION file.

We're certainly agreed on its being optional.

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Issue tracking in packages [was: Re: [R] change in read.spss, package foreign?]

2005-09-10 Thread Thomas Lumley
On Sat, 10 Sep 2005, Gabor Grothendieck wrote:

> On 9/10/05, Thomas Lumley <[EMAIL PROTECTED]> wrote:
>> On Sat, 10 Sep 2005, Gabor Grothendieck wrote:
>>>
>>> And one more comment.   The DESCRIPTION file does not record the
>>> location or existence of the various subdirectories such as R, man,
>>> exec, etc. If NEWS is to be recorded as a meta data line item in
>>> DESCRIPTION then surely all of these should be too so its symmetric
>>> and they are all on an equal footing (or else none of them
>>> should be, which in fact I think is preferable).
>>>
>>
>> I don't see any advantage in symmetry.  The locations of these
>
> The present discussion is where the change information may be located,
> but that is also true of the source and other information.  We could
> just as easily have a field in the DESCRIPTION that tells the build
> where to find the R source.  It's really the same issue.
>
>

There are two important differences

1/ No existing package has its source anywhere other than in the R 
subdirectory. Existing packages have their change logs in different places 
and different formats.

2/ Having source code where it will not be found must be an error -- 
making the source code available to R *cannot* be optional.  Making a 
change log available *must* be optional.


-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Typo [Was: Rd and guillemots]

2005-09-16 Thread Thomas Lumley

On Fri, 16 Sep 2005, [EMAIL PROTECTED] wrote:

> The name of the "continental" quotation mark « is "guillemet".

For anyone who is still confused:

Left pointing guillemet « (U+00AB)
http://www.mathmlcentral.com/characters/glyphs/LeftGuillemet.html

Left pointing guillemot (Uria aalge)
http://www.rspb.org.uk/scotland/action/disaster/index.asp

Right pointing guillemet » (U+00BB)
http://www.mathmlcentral.com/characters/glyphs/RightGuillemet.html

Right pointing guillemot (Uria aalge)
http://www.yptenc.org.uk/docs/factsheets/animal_facts/guillemot.html


-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Typo [Was: Rd and guillemots]

2005-09-16 Thread Thomas Lumley

On Fri, 16 Sep 2005, Thomas Lumley wrote:

> On Fri, 16 Sep 2005, [EMAIL PROTECTED] wrote:
>
>> The name of the "continental" quotation mark « is "guillemet".
>
> For anyone who is still confused:

It should perhaps be noted that the PostScript name for the Unicode "Left 
pointing guillemet" is guillemotleft, which explains some of the 
confusion.  There does not seem to be a PostScript name for "Left pointing 
guillemot".


-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] as.data.frame segfaults on large lists (PR#8141)

2005-09-20 Thread Thomas Lumley


Under Valgrind on x86_64 I get
==27405==  Access not within mapped region at address 0x33FFEFD8
==27405==at 0x447045: Rf_substituteList (coerce.c:2003)
==27405== Stack overflow in thread 1: can't grow stack to 0x33FFEF98


-thomas

On Sun, 18 Sep 2005, Peter Dalgaard wrote:

> [EMAIL PROTECTED] writes:
>
>> Full_Name: Ulrich Poetter
>> Version: 2.1.1
>> OS: i686-pc-linux-gnu FC2
>> Submission from: (NULL) (134.147.95.187)
>>
>>
>> as.data.frame() segfaults on lists with very many elements:
>>
>> dfn <- rep(list(rep(0,2)),198000)
>> test <- as.data.frame.list(dfn)
>>
>> Process R segmentation fault at Sun Sep 18 17:06:02 2005
>
> Not for me on FC4. The process size grows to about 180M and the system
> thrashes badly, but the calculation runs to completion.
>
> It's not unlikely that we are ignoring a failed allocation somewhere,
> but there's not much hope of finding it from the available
> information. You could try running under gdb and see where things go
> wrong for you.
>
> --
>   O__   Peter Dalgaard Øster Farimagsgade 5, Entr.B
>  c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
> (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45) 35327918
> ~~ - ([EMAIL PROTECTED])  FAX: (+45) 35327907
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Subscripting fails if name of element is "" (PR#8161)

2005-09-30 Thread Thomas Lumley

On Fri, 30 Sep 2005, "Jens Oehlschlägel" wrote:

> Dear all,
>
> The following shows cases where accessing elements via their name fails
> (if the name is a string of length zero).

This looks deliberate (there is a function NonNullStringMatch that does 
the matching).  I assume this is because there is no other way to 
indicate that an element has no name.

If so, it is a documentation bug -- help(names) and FAQ 7.14 should 
specify this behaviour.  Too late for 2.2.0, unfortunately.

-thomas

> Best regards
>
> Jens Oehlschlägel
>
>
>> p <- 1:3
>> names(p) <- c("a","", as.character(NA))
>> p
>    a      <NA>
>    1    2    3
>
>> for (i in names(p))
> + print(p[[i]])
> [1] 1
> [1] 2
> [1] 3
>
>> # error 1: vector subscripting with "" fails in second element
>> for (i in names(p))
> + print(p[i])
> a
> 1
> <NA>
>  NA
> <NA>
>   3
>
>> # error 2: print method for list shows no name for second element
>> p <- as.list(p)
>
>> for (i in names(p))
> + print(p[[i]])
> [1] 1
> [1] 2
> [1] 3
>
>> # error 3: list subscripting with "" fails in second element
>> for (i in names(p))
> + print(p[i])
> $a
> [1] 1
>
> $"NA"
> NULL
>
> $"NA"
> [1] 3
>
>> version
>          _
> platform i386-pc-mingw32
> arch     i386
> os       mingw32
> system   i386, mingw32
> status
> major    2
> minor    1.1
> year     2005
> month    06
> day      20
> language R
>
>
> # -- replication code --
>
> p <- 1:3
> names(p) <- c("a","", as.character(NA))
> p
>
> for (i in names(p))
>  print(p[[i]])
>
> # error 1: vector subscripting with "" fails in second element
> for (i in names(p))
>  print(p[i])
>
> # error 2: print method for list shows no name for second element
> p <- as.list(p)
>
> for (i in names(p))
>  print(p[[i]])
>
> # error 3: list subscripting with "" fails in second element
> for (i in names(p))
>  print(p[i])
>
> --
>
> __________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] access to R parse tree for Lisp-style macros?

2005-10-03 Thread Thomas Lumley
On Mon, 3 Oct 2005, Duncan Murdoch wrote:

> On 10/3/2005 3:25 AM, Andrew Piskorski wrote:
>> R folks, I'm curious about possible support for Lisp-style macros in
>> R.  I'm aware of the "defmacro" support for S-Plus and R discussed
>> here:
>>
>>   http://www.biostat.wustl.edu/archives/html/s-news/2002-10/msg00064.html
>>
>> but that's really just a syntactic short-cut to the run-time use of
>> substitute() and eval(), which you could manually put into a function
>> yourself if you cared too.  (AKA, not at all equivalent to Lisp
>> macros.)

Well, yes and no.  It is a syntactic shortcut using functions, but what it 
does is manipulate and then evaluate pieces of parse tree.  It doesn't 
have the efficiency under compilation that real macros would, but we don't 
have compilation.  It doesn't have gensyms, but again, R fails to support 
these in a fairly fundamental way, so they have to be faked using 
variables with weird random names.
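
For concreteness, the substitute()/eval() trick looks roughly like

   swap <- function(x, y)
       eval.parent(substitute({tmp <- x; x <- y; y <- tmp}))

and the hard-coded tmp is exactly the missing-gensym problem: it silently 
clobbers any tmp the caller already has.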

I have a long-term plan to add real macros, but not until after Luke 
Tierney's byte-code compiler is finished.

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 8 char labels in read.spss

2005-10-11 Thread Thomas Lumley
On Tue, 11 Oct 2005, Knut Krueger wrote:

> I was wondering why it is possible to read long labels from the CVS
> files but not from the SPSS files.

The SPSS file formats are not documented, and so we rely on the code from 
PSPP.  At the time, PSPP did not read long variable names.  It now does, 
so it would be possible for someone to update the SPSS-reading code to 
handle long variable names.  This is much more complicated than just 
changing a #define; the long variable names are stored in a different 
part of the file.

I don't expect anyone on R-core to get around to this any time soon. If 
you want to try, the current PSPP code is at
http://savannah.gnu.org/projects/pspp

-thomas


> I did not have much time to search for the code but I found:
>
> in foreign_0.8-10 source file var.h.in
>
>> /* Definition of the max length of a short string value, generally
>>eight characters.  */
>> #define MAX_SHORT_STRING ((SIZEOF_DOUBLE)>=8 ? ((SIZEOF_DOUBLE)+1)/2*2
>> : 8)
>> #define MIN_LONG_STRING (MAX_SHORT_STRING+1)
>>
>> /* FYI: It is a bad situation if sizeof(R_flt64) < MAX_SHORT_STRING:
>>then short string missing values can be truncated in system files
>>because there's only room for as many characters as can fit in a
>>R_flt64. */
>> #if MAX_SHORT_STRING > 8
>> #error MAX_SHORT_STRING must be less than 8.
>> #endif
>
> If I am right then there was a restriction in the year 2000, because the
> files are from the year 2000.
>
> Now there are some questions:
> Did I find the right code?
> Is it possible that anybody could recompile this with long value names,
> or where is the best manual for a quick start in compiling packages?
>
>
> A couple of weeks ago I found a thread where somebody wrote up a complete
> way of building packages.
> He described all of his problems and there were a lot of hints for the
> first steps, but I am not able to find it again - I don't know the
> search terms which I used before :-(
>
>
> with regards
> Knut
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 8 char labels in read.spss

2005-10-11 Thread Thomas Lumley
On Tue, 11 Oct 2005, Knut Krueger wrote:
>
> I found a definition of the SPSS files.
> http://www.wotsit.org/download.asp?f=spssdata
> but they recommend to use the spss input/output dll to ensure upward
> compatbility
>

"Well, they would say that, wouldn't they" (Rice-Davis 1963)

Unfortunately, that document is an old file format. It doesn't describe 
record 7 subtype 13, which is where the long variable names live.

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] SPSS I/O dll was: 8 char labels in read.spss

2005-10-12 Thread Thomas Lumley
On Wed, 12 Oct 2005, Knut Krueger wrote:
> Thomas Lumley schrieb:
>> On Tue, 11 Oct 2005, Knut Krueger wrote:
>>> I found a definition of the SPSS files.
>>> http://www.wotsit.org/download.asp?f=spssdata
>>> but they recommend to use the spss input/output dll to ensure upward
>>> compatbility
>>>
>> "Well, they would say that, wouldn't they" (Rice-Davis 1963)
>>
>> Unfortunately, that document is an old file format. It doesn't describe
>> record 7 subtype 13, which is where the long variable names live.
>>
> What about using the SPSS input dll for those R users who would like to use 
> their old SPSS files? Most universities here have SPSS licences and therefore 
> the dll is available.
> I have not yet found any copyright notice on the developer part of the SPSS 
> CD. Maybe the DLL is usable with one's own program without the SPSS licence. I 
> will look into whether using the I/O dll is possible with R.
> All necessary C codes are delivered with the developer kit on the SPSS CD.
>

Yes, but it can't be incorporated in the  "foreign" package, which has a 
GPL license, and in any case wouldn't work off Windows.  It's not clear 
that it would be much easier than using the PSPP code in any case.

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Why no .Machine$sizeof.double?

2005-10-18 Thread Thomas Lumley
On Tue, 18 Oct 2005, Earl F. Glynn wrote:

> Whis is there a .Machine$sizeof.longdouble but no .Machine$sizeof.double?
>

sizeof(double) is always 8 and sizeof(int) is always 4, because R requires 
the IEEE/IEC standard arithmetic types.  R will not compile with any other 
sizes.

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Why no .Machine$sizeof.double?

2005-10-18 Thread Thomas Lumley
On Tue, 18 Oct 2005, Earl F. Glynn wrote:

> "Thomas Lumley" <[EMAIL PROTECTED]> wrote in message
> news:[EMAIL PROTECTED]
>> On Tue, 18 Oct 2005, Earl F. Glynn wrote:
>>
>>> Whis is there a .Machine$sizeof.longdouble but no
> .Machine$sizeof.double?
>>>
>>
>> sizeof(double) is always 8 and sizeof(int) is always 4, because R requires
>> the IEEE/IEC standard arithmetic types.  R will not compile with any other
>> sizes.
>
> But it's a common, recommended software engineering practice to define
> mnemonic, named constants.
>
> If I were to see code like .Machine$sizeof.double * N.COL * N.ROW, I know
> that's the number of bytes in a matrix of doubles. If I see code that is 8 *
> N.COL * N.ROW, I can guess what "8" means, but I could guess wrong.  I wrote
> code that looks just like this today because I couldn't find the defined
> constant.  Will someone else reading my code automatically know what the "8"
> means?
>

But why would you ever want to write either .Machine$sizeof.double * N.COL 
* N.ROW or 8 * N.COL * N.ROW?

If you are doing memory allocation in R then numeric() automatically 
allocates things of the correct size. If you are doing memory allocation 
in C then you should use sizeof(double) or, even better, sizeof(*yourpointer).
In both cases, the language has the facility to let you work in the 
correct units without explicit constants, named or unnamed.
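
For instance, the size bookkeeping can be done with no 8 anywhere:

x <- matrix(0, N.ROW, N.COL)   # numeric storage sized for you
object.size(x)                 # roughly 8 * N.ROW * N.COL bytes plus a header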

You would have a stronger case for arguing that there should be typedefs 
in C so that you didn't need to remember that R's numeric type is double 
(rather than, er, float?) and its integer type is int (rather than long) 
-- and in fact we do provide
  typedef int Sint;
in R.h.

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [R] unvectorized option for outer()

2005-10-30 Thread Thomas Lumley
On Sun, 30 Oct 2005, Jonathan Rougier wrote:

> I'm not sure about this.  Perhaps I am a dinosaur, but my feeling is
> that if people are writing functions in R that might be subject to
> simple operations like outer products, then they ought to be writing
> vectorised functions!

I would agree.  How about an oapply() function that does multiway (rather 
than just two-way) outer products.  Basing the name on "apply" would 
emphasize the similarity to other flexible, not particularly optimized 
second-order functions.

-thomas



>   Maybe it's not possible to hold this line, and
> maybe "outer" is not the right place to draw it, but I think we ought to
> respect the "x is a vector" mindset as much as possible in the base
> package.  As Tony notes, the documentation does try to be clear about
> what outer actually does, and how it can be used.
>
> So I am not a fan of the VECTORIZED argument, and definitely not a fan
> of the VECTORIZED=FALSE default.
>
> Jonathan.
>
> Gabor Grothendieck wrote:
>> If the default were changed to VECTORIZED=FALSE then it would
>> still be functionally compatible with what we have now so all existing
>> software would continue to run correctly yet would not cause
>> problems for the unwary.  Existing software would not have to be changed
>> to add VECTORIZED=TRUE except for those, presumably few, cases
>> where outer performance is critical.   One optimization might be to
>> have the default be TRUE if the function is * or perhaps if its specified
>> as a single character and FALSE otherwise.
>>
>> Having used APL I always regarded the original design of outer in R as
>> premature performance optimization and this would be a chance to get
>> it right.
>>
>> On 10/28/05, Tony Plate <[EMAIL PROTECTED]> wrote:
>>
>>> [following on from a thread on R-help, but my post here seems more
>>> appropriate to R-devel]
>>>
>>> Would a patch to make outer() work with non-vectorized functions be
>>> considered?  It seems to come up moderately often on the list, which
>>> probably indicates that many many people get bitten by the same
>>> incorrect expectation, despite the documentation and the FAQ entry.  It
>>> looks pretty simple to modify outer() appropriately: one extra function
>>> argument and an if-then-else clause to call mapply(FUN, ...) instead of
>>> calling FUN directly.
>>>
>>> Here's a function demonstrating this:
>>>
>>> outer2 <- function (X, Y, FUN = "*", ..., VECTORIZED=TRUE)
>>> {
>>>     no.nx <- is.null(nx <- dimnames(X <- as.array(X)))
>>>     dX <- dim(X)
>>>     no.ny <- is.null(ny <- dimnames(Y <- as.array(Y)))
>>>     dY <- dim(Y)
>>>     if (is.character(FUN) && FUN == "*") {
>>>         robj <- as.vector(X) %*% t(as.vector(Y))
>>>         dim(robj) <- c(dX, dY)
>>>     }
>>>     else {
>>>         FUN <- match.fun(FUN)
>>>         Y <- rep(Y, rep.int(length(X), length(Y)))
>>>         if (length(X) > 0)
>>>             X <- rep(X, times = ceiling(length(Y)/length(X)))
>>>         if (VECTORIZED)
>>>             robj <- FUN(X, Y, ...)
>>>         else
>>>             robj <- mapply(FUN, X, Y, MoreArgs=list(...))
>>>         dim(robj) <- c(dX, dY)
>>>     }
>>>     if (no.nx)
>>>         nx <- vector("list", length(dX))
>>>     else if (no.ny)
>>>         ny <- vector("list", length(dY))
>>>     if (!(no.nx && no.ny))
>>>         dimnames(robj) <- c(nx, ny)
>>>     robj
>>> }
>>> # Some examples
>>> f <- function(x, y, p=1) {cat("in f\n"); (x*y)^p}
>>> outer2(1:2, 3:5, f, 2)
>>> outer2(numeric(0), 3:5, f, 2)
>>> outer2(1:2, numeric(0), f, 2)
>>> outer2(1:2, 3:5, f, 2, VECTORIZED=F)
>>> outer2(numeric(0), 3:5, f, 2, VECTORIZED=F)
>>> outer2(1:2, numeric(0), f, 2, VECTORIZED=F)
>>>
>>> # Output on examples
>>>
>>>> f <- function(x, y, p=1) {cat("in f\n"); (x*y)^p}
>>>> outer2(1:2, 3:5, f, 2)
>>>
>>> in f
>>>      [,1] [,2] [,3]
>>> [1,]    9   16   25
>>> [2,]   36   64  100
>>>
>>>> outer2(numeric(0), 3:5, f, 2)
>>>
>>> in f
>>> [,1] [,2] [,3]
>>>
>>>> outer2(1:2, numeric(0), f, 2)
>>>
>>> in f
>>>

Re: [Rd] [R] unvectorized option for outer()

2005-10-31 Thread Thomas Lumley
On Mon, 31 Oct 2005, Liaw, Andy wrote:

>> From: Thomas Lumley
>>
>> On Sun, 30 Oct 2005, Jonathan Rougier wrote:
>>
>>> I'm not sure about this.  Perhaps I am a dinosaur, but my feeling is
>>> that if people are writing functions in R that might be subject to
>>> simple operations like outer products, then they ought to be writing
>>> vectorised functions!
>>
>> I would agree.  How about an oapply() function that does
>> multiway (rather
>> than just two-way) outer products.  Basing the name on "apply" would
>> emphasize the similarity to other flexible, not particularly
>> optimized
>> second-order functions.
>>
>>  -thomas
>
> I'll toss in my $0.02:  The following is my attempt at creating a "general
> outer" that works with more than two arguments.
>
> gouter <- function (x, FUN, ...) {
>     xgrid <- as.list(do.call("expand.grid", x))
>     names(xgrid) <- NULL
>     array(do.call(deparse(substitute(FUN)), c(xgrid, list(...))),
>           dim = sapply(x, length), dimnames = x)
> }
>

Yes, that's the sort of thing I had in mind.  The name "gouter" isn't 
exactly to my taste -- the point was that outer() is the fast but 
restricted function and that the general but slow function should have a 
different name (eg xapply or oapply).


-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [R] unvectorized option for outer()

2005-11-01 Thread Thomas Lumley
On Tue, 1 Nov 2005, Duncan Murdoch wrote:

> The version I posted yesterday did indeed mess up when some arguments were 
> unspecified.  Here's a revision that seems to work in all the tests I can 
> think of.  I also added the SIMPLIFY and USE.NAMES args from mapply to it, 
> and a sanity check to the args.
>
> I did notice and work around one buglet in mapply:  if you choose not to 
> vectorize any arguments, you don't get a call to the original function, 
> mapply returns "list()".
>
> For example,
>
>> mapply(function(x) x^2, MoreArgs = list(x=2))
> list()
>
> whereas I would think 4 is a more logical answer.
>

I don't agree at all.  The answer should be the length of the longest 
vectorised argument, and it is.

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] A problem with glm() and possibly model-using functions in general?

2005-11-18 Thread Thomas Lumley
On Fri, 18 Nov 2005, Byron Ellis wrote:

> So, consider the following:
>
> > example(glm)
> > g = function(model) { w = runif(9);glm(model,weights=w); }
> > g(counts ~ outcome + treatment)
> Error in eval(expr, envir, enclos) : object "w" not found
>
> Huh?! I suspect that somebody is lazily evaluating arguments in the
> wrong environment (probably GlobalEnv in this case). I'm willing to
> accept the fact that there's some mysterious reason you'd actually
> want this behavior, but this looks like it should be filed as a bug
> to me.

Yes, there is a reason you'd actually want this behaviour, and 
it is documented. In help(model.frame) it says

  All the variables in 'formula', 'subset' and in '...' are looked
  for first in 'data' and then in the environment of 'formula' (see
  the help for 'formula()' for further details) and collected into a
  data frame.

In your example the environment of 'formula' is the global environment, 
since that's where it was created.

There isn't a set of scoping rules for formulas that will make everyone 
happy, but this lexical scope is what R has done for quite some time.
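
One workaround, sketched here, is to re-point the formula's environment at 
the frame where w lives, so that model.frame() can find it:

g <- function(model) {
    w <- runif(9)
    environment(model) <- environment()
    glm(model, weights = w)
}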

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [R] computing the variance

2005-12-05 Thread Thomas Lumley
>
> Using the biased variance just because it is the MLE (if that is the
> argument) seems confused to me. However, there's another point:
>
>> var(sample(1:3, 10, replace=TRUE))
> [1] 0.6680556
>
> i.e. if we are considering x as the entire population, then the
> variance when sampling from it is indeed 1/N*E(X-EX)^2, which is why
> some presentations distinguish between the "population" and "sample"
> variances. We might want to support this distinction somehow.
>

We might also consider that the purpose of computing the variance is often 
to take the square root, and that using 1/(n-1) as the divisor does not 
give any particular optimality as an estimator of the standard deviation.
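
For reference, the 1/N "population" version is in any case a one-liner:

   pvar <- function(x) mean((x - mean(x))^2)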

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 0/1 vector for indexing leads to funny behaviour (PR#8389) (maybe a documentation deficiency?)

2005-12-14 Thread Thomas Lumley
On Wed, 14 Dec 2005, Tony Plate wrote:
>
> That's what I was trying to say: the whole truth is that numeric index
> vectors that contain positive integral quantities can also contain
> zeros.  Upon rereading this passage yet again, I think it is more
> misleading than merely incomplete: the phrasings "positive integral
> quantities", and "*must* lie in the set ..." rule out the possibility of
> the vector containing zeros.
>

"Someone told me that you can't run without bouncing the ball in 
basketball. I got a basketball and tried it and it worked fine. He must be 
wrong"  -- a comp.lang.c standard

It doesn't rule out the possibility of the vector containing zeros; it 
tells you that you should not put zeros in the vector.
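
In practice a zero index is simply dropped:

> x <- c(10, 20, 30)
> x[c(1, 0, 3)]
[1] 10 30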

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 2 x 2 chisq.test (PR#8415)

2005-12-20 Thread Thomas Lumley

This is the same as PR#8265, which was reported two months ago by someone 
else from syd.odn.ne.jp.  It still isn't a bug. According to Brian 
Ripley's response at that time, "almost all" the sources he checked gave 
the correction that R uses.

-thomas

On Tue, 20 Dec 2005, [EMAIL PROTECTED] wrote:

> Full_Name: nobody
> Version: 2.2.0
> OS: any
> Submission from: (NULL) (219.66.34.183)
>
>
> 2 x 2 table, such as
>
>> x
> [,1] [,2]
> [1,]   10   12
> [2,]   11   13
>
>> chisq.test(x)
>
>   Pearson's Chi-squared test with Yates'
>   continuity correction
>
> data:  x
> X-squared = 0.0732, df = 1, p-value = 0.7868
>
> but, X-squared = 0.0732 is over corrected.
>
> when abs(a*d-b*c) <= sum(a,b,c,d), chisq.value must be 0!, and P-value must be
> 1!
>
> code of chisq.test must be as follows
>
> #   if (correct && nrow(x) == 2 && ncol(x) == 2) {
> #   YATES <- 0.5
> #   METHOD <- paste(METHOD, "with Yates' continuity correction")
> #   }
> #   else YATES <- 0
> #   STATISTIC <- sum((abs(x - E) - YATES)^2/E)
> ## replace begin
> if (correct && nrow(x) == 2 && ncol(x) == 2) {
> STATISTIC <- if (abs(x[1,1]*x[2,2]-x[1,2]*x[2,1]) < sum(x)/2) 0
>  else sum((abs(x - E) - 0.5)^2/E)
> METHOD <- paste(METHOD, "with Yates' continuity correction")
> }
> else STATISTIC <- sum((abs(x - E))^2/E)
> ## replace end
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Multiplication (PR#8466)

2006-01-06 Thread Thomas Lumley
On Fri, 6 Jan 2006, [EMAIL PROTECTED] wrote:

> hi - in version 2.1 the command
>
> >-2^2
>
> gives
>
> -4
>
> as the answer.  (-2)^2 is evaluated correctly.

So is -2^2.  The precedence of ^ is higher than that of unary minus. It 
may be surprising, but it *is* documented and has been in S for a long 
time.


-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] prod(numeric(0)) surprise

2006-01-09 Thread Thomas Lumley
On Mon, 9 Jan 2006, Martin Morgan wrote:

> I guess I have to say yes, I'd expect
>
> x <- 1:10
> sum(x[x>10]) ==> numeric(0)
>
> this would be reinforced by recognizing that numeric(0) is not zero,
> but nothing. I guess the summation over an empty set is an empty set,
> rather than a set containing the number 0. Certainly these
>
> exp(x[x>10]) ==> numeric(0)
> numeric(0) + 1 ==> numeric(0)
>

There are some fairly simple rules in how R does it.  You do need to 
distinguish between functions (binary operators) that map two vectors of 
length n to a vector of length n and functions such as prod and sum that 
map a vector of length n to a vector of length 1.

The output of sum and prod is always of length 1, so sum(numeric(0)) and 
prod(numeric(0)) should be of length 1 (or give an error).  It is 
convenient that sum(c(x,y)) is equal to sum(x)+sum(y) and that 
prod(c(x,y)) is equal to prod(x)*prod(y), which motivates making 
sum(numeric(0)) give 0 and prod(numeric(0)) give 1.
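
That is,

> sum(numeric(0))
[1] 0
> prod(numeric(0))
[1] 1

the identity elements of + and *.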

Single argument functions such as exp(numeric(0)) seem fairly obvious: you 
have no numbers and you exponentiate them so you still have no numbers. 
You could also argue based on c() and exp() commuting.

The rules for binary operators are a little less tidy [my fault]. They 
come from the idea that x+1 should always add 1 to each element of x.  If 
you add 1 to each element of numeric(0) you get numeric(0).  The usual 
recycling rule says that the shorter vector should be repeated to make it 
the same length as the longer vector, so this is a wart.  On the other 
hand, you can't recycle a vector of length 0 to have length 1, so the 
usual recycling rule can't be applied here. This also makes matrix 
operations work, at least in the sense of getting 
matrices of the right dimension.

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Issue with c++ .C call

2006-01-10 Thread Thomas Lumley
On Tue, 10 Jan 2006, Dominick Samperi wrote:

> Sean,
>
> prm in your function calcStepgram is NOT a vector of doubles, it is of type
> SEXP, and you need to use R macros to fetch the value(s). This is done
> automatically in the Rcpp package, and if you want to see how this is
> done look at the definition of the class RcppVector in Rcpp.cpp

Not at all.  He is using .C, which passes a double * to the C function. 
You may be thinking of .Call

-thomas


> Dominick
>
> Sean Davis wrote:
>> I am still having some difficulties with connecting R to a C++ function.  I
>> am able to call the function as expected after compiling the shared library
>> and such.  However, the call to the function is via .C; parameters from the
>> .C call are not being passed correctly to the function.  As an example, I
>> have attached a GDB run of the code.  I set a breakpoint on entry to the
>> function I am calling from R.  What is bothering me (and probably causing
>> the segmentation faults I am seeing) is that the parameter
>> prm=as.double(c(3.2,1.1)) is not 3.2,1.1 IMMEDIATELY after the call to .C.
>> I am sure I am missing something very basic.
>>
>> Thanks,
>> Sean
>>
>>
>>
>>> sessionInfo()
>>>
>> R version 2.2.0, 2005-08-11, powerpc-apple-darwin7.9.0
>>
>> attached base packages:
>> [1] "methods"   "stats" "graphics"  "grDevices" "utils" "datasets"
>> [7] "base"
>>
>>
>>
>>
>> holmes:~/Mercury/projects/R/StepGram sdavis$ R -d gdb
>> GNU gdb 6.1-20040303 (Apple version gdb-413) (Wed May 18 10:17:02 GMT 2005)
>> Copyright 2004 Free Software Foundation, Inc.
>> GDB is free software, covered by the GNU General Public License, and you are
>> welcome to change it and/or distribute copies of it under certain
>> conditions.
>> Type "show copying" to see the conditions.
>> There is absolutely no warranty for GDB.  Type "show warranty" for details.
>> This GDB was configured as "powerpc-apple-darwin"...Reading symbols for
>> shared libraries ... done
>>
>> (gdb) r
>> Starting program:
>> /Users/sdavis/R-devel2/R.framework/Versions/2.2.0/Resources/bin/exec/R
>> Reading symbols for shared libraries ...+ done
>>
>> R : Copyright 2005, The R Foundation for Statistical Computing
>> Version 2.2.0 Under development (unstable) (2005-08-11 r35256)
>> ISBN 3-900051-07-0
>>
>> R is free software and comes with ABSOLUTELY NO WARRANTY.
>> You are welcome to redistribute it under certain conditions.
>> Type 'license()' or 'licence()' for distribution details.
>>
>> R is a collaborative project with many contributors.
>> Type 'contributors()' for more information and
>> 'citation()' on how to cite R or R packages in publications.
>>
>> Type 'demo()' for some demos, 'help()' for on-line help, or
>> 'help.start()' for a HTML browser interface to help.
>> Type 'q()' to quit R.
>>
>> Reading symbols for shared libraries
>> . done
>> Reading symbols for shared libraries . done
>> Reading symbols for shared libraries . done
>>
>>> dyn.load('StepGram/src/Stepgram.so')
>>>
>> Reading symbols for shared libraries .. done
>>
>>> mySG <- function(dat1,thresh,noise) {
>>>
>>   vec <- c(thresh,noise)
>>   .C('calcStepgram',
>>  data=as.double(dat1),
>>  prm=as.double(vec),
>>  intervals=double(1*3+1),
>>  max=as.integer(1),
>>  n=as.integer(length(dat1)),
>>  plot=double(length(dat1)))}
>>
>>> dat <- c(0.01,0.1,-0.2, 0.1,-0.1,-1000,3.2,3.5,-1.3,3.1,
>>>
>> 3.2,3.1,-0.1,0.2,0.15,-0.05,-0.1,0.2,0.1,-0.1)
>>
>> Program received signal SIGINT, Interrupt.
>> 0x9001f208 in select ()
>> (gdb) break calcStepgram
>> Breakpoint 1 at 0x10bb418: file Stepgram.cpp, line 22.
>> (gdb) c
>> Continuing.
>>
>>> mySG(dat1=dat,thresh=3.2,noise=1.1)
>>>
>>
>> Breakpoint 1, calcStepgram (data=0x1137048, prm=0x1c81eb0,
>> intervals=0xbfffd7e0, max=0x2954840, n=0xbfffd6c0, plot=0x180d574) at
>> Stepgram.cpp:22
>> (gdb) print prm[0]
>> $1 = 1.7716149411915527e-303<<<<<--This should be 3.2!
>> Current language:  auto; currently c++
>>
>> __
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Issue with c++ .C call

2006-01-10 Thread Thomas Lumley
On Tue, 10 Jan 2006, Sean Davis wrote:
> and such.  However, the call to the function is via .C; parameters from the
> .C call are not being passed correctly to the function.  As an example, I
> have attached a GDB run of the code.  I set a breakpoint on entry to the
> function I am calling from R.  What is bothering me (and probably causing
> the segmentation faults I am seeing) is that the parameter
> prm=as.double(c(3.2,1.1)) is not 3.2,1.1 IMMEDIATELY after the call to .C.

Is this compiled with optimization? If so, you can't conclude much from 
the gdb info as the code can be executed in a different order from how 
it's written.

When I use this example

extern "C" void calcStepgram(double *data,  double *prm, double *intervals,
int *max, int *n,double *plot) {

  prm[0]=data[0];
  return;
}

if I compile with -g -O2 (the default) it looks as though there are 
problems with initialization like the ones you report, but in fact the 
function works correctly.  If I compile without optimization the 
initialization looks fine.

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Issue with c++ .C call

2006-01-10 Thread Thomas Lumley
On Tue, 10 Jan 2006, Sean Davis wrote:

>
> Thanks, Thomas.  That did fix the initialization issue (or apparent one).
> Unfortunately, the reason that I started debugging was for segmentation
> faults, which have not gone away.  However, it now looks like the problem is
> internal to the C++ code and not with the way the arguments were being
> passed.

If you can get access to a Linux machine then it's worth trying Valgrind, 
which is very helpful for this sort of thing.

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] [OT] GPL3 draft

2006-01-17 Thread Thomas Lumley


Since someone is bound to point this out soon I will note that

a) A discussion draft of the proposed GPL version 3 is up at 
http://gplv3.fsf.org/


b) If you have comments on the draft, send them to the FSF rather than to 
r-devel


  -thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] (PR#8500) standard install on OS 10.3.9 crashes on start without useful diagnostics (PR#8500)

2006-01-18 Thread Thomas Lumley

This won't actually help you at all, but I used the standard install on OS 
10.3.9 (Powerbook G4) just last week without any problems.

On the other hand, my install was on a machine that had previously had 
other versions of R.

-thomas

On Wed, 18 Jan 2006, [EMAIL PROTECTED] wrote:

> Full_Name: Bob Donahue
> Version: 2.2
> OS: Mac OS 10.3.9
> Submission from: (NULL) (204.152.13.26)
>
>
> That's pretty much it.  I did the most basic install possible, tried running 
> the
> package through the GUI and on the command line, it crashed hard, immediately,
> with absolutely NO useful information as to WHY it crashed.
>
> To reproduce:
> 1) get the installer for OS X
> 2) install in the default places
> 3) run
> 4) watch it crash with no useful diagnostics
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Minumum memory requirements to run R.

2006-01-23 Thread Thomas Lumley
On Mon, 23 Jan 2006, Hin-Tak Leung wrote:

> Prof Brian Ripley wrote:
>> We know: we even document it in the appropriate places.
>
> I went and have a look - it is the last section of R-admin (and of
> course, for those who "read the source", R/include/Rinternals.h). It
> would be good to mention this in the FAQ (which doesn't mention it, or
> maybe I didn't look hard enough), or the beginning of R-admin?
>

It's not in the FAQ because it isn't a FAQ (yet).

If you use the PDF manual it is in the table of contents on page i.

In the HTML manual it is admittedly less clear: there isn't a table of 
contents and there is nothing obvious in the index. To some extent this is 
a problem with all the manuals. The structure in the .texi file isn't 
translated well to HTML form by the makeinfo tools.

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Minumum memory requirements to run R.

2006-01-23 Thread Thomas Lumley
On Mon, 23 Jan 2006, Hin-Tak Leung wrote:
>
> The 32-bit/64-bit issue affects purchasing or upgrading decisions
> - whether one wants to spend the money on buying cheaper
> 32-bit machines, versus more expensive 64-bit machines. That
> decision would be based on information available while *not* having
> an operational R installation...
>

Not necessarily. It's perfectly feasible to use a 32-bit build on a 64-bit 
machine, as it says in the manual, which is available from 
http://www.r-project.org whether or not you have an R installation.

-thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] R support for 64 bit integers

2010-08-10 Thread Thomas Lumley

On Tue, 10 Aug 2010, Martin Maechler wrote:


{Hijacking the thread from R-help to R-devel -- as I am
consciously shifting the focus away from the original question
...
}


David Winsemius 
on Tue, 10 Aug 2010 08:42:12 -0400 writes:


   > On Aug 9, 2010, at 2:45 PM, Theo Tannen wrote:

   >> Are integers strictly a signed 32 bit number on R even if
   >> I am running a 64 bit version of R on a x86_64 bit
   >> machine?
   >>
   >> I ask because I have integers stored in a hdf5 file where
   >> some of the data is 64 bit integers. When I read that
   >> into R using the hdf5 library it seems any integer
   >> greater than 2**31 returns NA.

   > That's the limit. It's hard coded and not affected by the
   > memory pointer size.

   >>
   >> Any solutions?

   > I have heard of packages that handle "big numbers". A bit
   > of searching produces suggestions to look at gmp on CRAN
   > and Rmpfr on R-Forge.

Note that Rmpfr has been on CRAN, too, for a while now.
If you only need large integers (and rationals), 'gmp' is enough
though.

*However* note that the gmp or Rmpfr (or any other arbitray
precision) implementation will be considerably slower in usage
than if there was native 64-bit integer support.

Introducing 64-bit integers natively into "base R" is an
"interesting" project, notably if we also allowed using them for
indices, and changed the internal structures to use them instead
of 32-bit.
This would allow to free ourselves from the increasingly
relevant  maximum-atomic-object-length = 2^31 problem.
The latter is something we have planned to address, possibly for
R 3.0.
However, for that, using 64-bit integers is just one
possibility, another being to use "double precision integers".
Personally, I'd prefer the "long long" (64-bit) integers quite
a bit, but there are other considerations, e.g.,
one big challenge will be to go there in a way such that not
all R packages using compiled code will have to be patched
extensively...
another aspect is how the BLAS / Lapack team will address the
problem.


At the moment, all the following are the same type:
 length of an R vector
 R integer type
 C int type
 Fortran INTEGER type

The last two are fixed at 32 bits (in practice for C, by standard for Fortran), 
and we would like the first and perhaps the second to become 64bit.

If both the R length type and the R integer type become the same 64bit type and 
replace the current integer type then every compiled package has to change to 
declare the arguments as int64 (or long, on most 64bit systems) and INTEGER*8. 
That should be all that is needed for most code, since C compilers nowadays 
already complain if you do unclean things like stuffing an int into a pointer.

If the R length type changes to something /different/ from the integer type 
then any compiled code has to be checked to see if  C int arguments are lengths 
or integers, which is more work and more error-prone.

On the other hand, changing the integer type to 64bit will presumably make 
integer code run noticeably more slowly on 32bit systems.

In both cases, the changes could be postponed by having an option to .C/.Call 
forcing lengths and integers to be passed as 32-bit. This would mean that the 
code couldn't use large integers or large vectors, but it would keep working 
indefinitely.

-thomas

Thomas Lumley
Professor of Biostatistics
University of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Non identical numerical results from R code vs C/C++ code?

2010-09-10 Thread Thomas Lumley

On Fri, 10 Sep 2010, Duncan Murdoch wrote:


On 10/09/2010 7:07 AM, Renaud Gaujoux wrote:

Thank you Duncan for your reply.

Currently I am using 'double' for the computations.
What type should I use for extended real in my intermediate computations?


I think it depends on compiler details.  On some compilers "long double" will 
get it, but I don't think there's a standard type that works on all 
compilers.  (In fact, the 80 bit type on Intel chips isn't necessarily 
supported on other hardware.)  R defines LDOUBLE in its header files and it 
is probably best to use that if you want to duplicate R results.


As a little more detail, 'long double' is in the C99 standard and seems to be 
fairly widely implemented, so code using it is likely to compile.   The 
Standard, as usual, doesn't define exactly what type it is, and permits it to 
be a synonym for 'double', so you may not get any extra precision.

On Intel chips it is likely to be the 80-bit type, but the Sparc architecture 
doesn't have any larger hardware type.  Radford Neal has recently reported much 
slower results on Solaris with long double, consistent with Wikipedia's 
statement that long double is sometimes a software-implemented 128-bit type on 
these systems.



The result will still be 'double' anyway right?


Yes, you do need to return type double.

Duncan Murdoch





On 10/09/2010 13:00, Duncan Murdoch wrote:

On 10/09/2010 6:46 AM, Renaud Gaujoux wrote:

Hi,

suppose you have two versions of the same algorithm: one in pure R, the 
other one in C/C++ called via .Call().
Assuming there is no bug in the implementations (i.e. they both do the 
same thing), is there any well known reason why the C/C++ implementation 
could return numerical results non identical to the one obtained from the 
pure R code? (e.g. could it be rounding errors? please explain.)

Has anybody had a similar experience?
R often uses extended reals (80 bit floating point values on Intel chips) 
for intermediate values.  C compilers may or may not do that.
By not identical, I mean very small differences (< 2.4e-14), but enough 
to have identical() returning FALSE. Maybe I should not bother, but I 
want to be sure where the differences come from, if only out of 
curiosity.


Briefly, the R code performs multiple matrix products; the C code is an 
optimization of those specific products via custom for loops, where 
entries are not computed in the same order, etc... which improves both 
memory usage and speed. The result is theoretically the same.
Changing the order of operations will often affect rounding.  For example, 
suppose epsilon is the smallest number such that 1 + epsilon is not equal 
to 1.  Then 1 + (epsilon/2) + (epsilon/2) will evaluate to either 1 or 1 + 
epsilon, depending on the order of computing the additions.
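
Concretely, in R:

eps <- .Machine$double.eps        # smallest x such that 1 + x != 1
(1 + eps/2) + eps/2 == 1          # TRUE: each addition rounds back to 1
1 + (eps/2 + eps/2) == 1          # FALSE: the halves sum to exactly eps first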


Duncan Murdoch


Thank you,
Renaud



 


______
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



Thomas Lumley
Professor of Biostatistics
University of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Assignment to a slot in an S4 object in a list seems to violate copy rules?

2010-09-30 Thread Thomas Lumley
On Thu, Sep 30, 2010 at 8:15 AM, peter dalgaard  wrote:
>
> On Sep 30, 2010, at 16:19 , Niels Richard Hansen wrote:
>
>> setClass("A", representation(a = "numeric"))
>> B <- list()
>> myA <- new("A", a = 1)
>> B$otherA <- myA
>> B$otherA@a <- 2
>> myA@a
>
> R version 2.12.0 Under development (unstable) (2010-09-13 r52905)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
> --- not anymore, it seems: ---
>> setClass("A", representation(a = "numeric"))
> [1] "A"
>> B <- list()
>> myA <- new("A", a = 1)
>> B$otherA <- myA
>> B$otherA@a <- 2
>> myA@a
> [1] 1
>> sessionInfo()
> R version 2.12.0 alpha (2010-09-29 r53067)
> Platform: x86_64-apple-darwin10.4.0 (64-bit)
>
> So somewhere in the last 162 commits, this got caught. Probably r52914, but 
> it looks like it hasn't been recorded in NEWS (and it should be as this was 
> apparently a live bug, not an obscure corner case):
>
> r52914 | luke | 2010-09-15 19:06:13 +0200 (Wed, 15 Sep 2010) | 4 lines
>
> Modified applydefine to duplicate if necessary to ensure that the
> assignment target in calls to assignment functions via the complex
> assignment mechanism always has NAMED == 1.
>

Yes, that was the one.  It was reported as a bug back then too, and
there was quite a bit of discussion that ended up with Luke's fix.

  -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] DESCRIPTION file and Rd examples

2011-04-17 Thread Thomas Lumley
On Mon, Apr 18, 2011 at 2:00 PM, Dario Strbenac
 wrote:
> It turns out what I needed was IRanges in the Imports: field. I assumed that 
> require(GenomicRanges) at the top of my function would, as a result of 
> loading GenomicRanges, implicitly already know about all the IRanges classes, 
> because GenomicRanges itself depends on them.

If GenomicRanges just imports IRanges (rather than using
library()/require()) then the IRanges functions won't be visible in
your package.  This is a Good Thing -- you only get the function
redefinitions you ask for, rather than everything in the dependency
tree.
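
Concretely, a sketch of what the two files need (assuming a plain import of
the whole namespace is enough for your package):

## DESCRIPTION
Imports: IRanges

## NAMESPACE
import(IRanges)        # or, more selectively, importFrom(IRanges, ...)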

-thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Wishlist: write R's bin path to the PATH variable and remove the version string in the installation dir under Windows

2011-05-03 Thread Thomas Lumley
On Wed, May 4, 2011 at 3:25 PM, Yihui Xie  wrote:
> 1. "Few Windows users use these commands" does not imply they are not
> useful, and I have no idea how many Windows users really use them. How
> do you run "R CMD build" when you build R packages under Windows? You
> don't write "C:/Program Files/R/R-2.13.0/bin/i386/R.exe CMD build", do
> you?
>
> I think the reason we have to mess with the PATH variable for each
> single software package is that Windows is Not Unix, so you may hate
> Windows instead of a package that modifies your PATH variable.
>
> For the choice of i386 and x64, you can let the user decide which bin
> path to use. I believe the number of users who frequently switch back
> and forth is fairly small.
>
> 2. Under most circumstances I just keep the latest version of R. To
> maintain R code with old R versions will be more and more difficult
> with new features and changes coming in. Disk space is cheap, but time
> is not.
>

I keep old versions for basically the same reasons you don't -- that
is, I have analyses that ran under the old versions, and I can be sure
they will give the same answer a year later if I keep the old
versions. This isn't so much because of changes in R as because of
changes in packages.

   -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] RProfmem output format

2011-05-14 Thread Thomas Lumley
On Sat, May 14, 2011 at 5:00 AM, Hadley Wickham  wrote:
> Hi all,
>
> When I run the example in RProfmem, I get:
>
>
>     Rprofmem("Rprofmem.out", threshold=1000)
>     example(glm)
>     Rprofmem(NULL)
>     noquote(readLines("Rprofmem.out", n=5))
>
> ...
>
> [1] 1384 :5416 :5416 :1064 :1064 :"readRDS" "index.search" "example"
> [2] 1064 :"readRDS" "index.search" "example"
> [3] 4712 :"readRDS" "index.search" "example"
> [4] 4712 :"readRDS" "index.search" "example"
> [5] 1064 :"readRDS" "index.search" "example"
>
> What do 5 the numbers in the first line mean?
>
> In the subsequence lines I'm assuming the structure is bytes allocated : call.

I think the first four of those numbers come from memory allocations made
before example() is called.  Looking at the code in src/main/memory.c, it
prints newline only when the call stack is not empty.

I don't see why this is done, and I may well be the person who did it
(I don't have svn on this computer to check), but it is clearly
deliberate.

   -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] RProfmem output format

2011-05-15 Thread Thomas Lumley
On Mon, May 16, 2011 at 1:02 AM, Hadley Wickham  wrote:
> So what causes allocations when the call stack is empty?  Something
> internal?  Does the garbage collector trigger allocations (i.e. could
> it be caused by moving data to contiguous memory)?

The garbage collector doesn't move anything, it just swaps pointers in
a linked list.

The lexer, parser, and evaluator all have to do some work before a
function context is set up for the top-level function, so I assume
that's where it is happening.

> Any ideas what the correct thing to do with these memory allocations?
> Ignore them because they're not really related to the function they're
> attributed to?  Sum them up?
>
>> I don't see why this is done, and I may well be the person who did it
>> (I don't have svn on this computer to check), but it is clearly
>> deliberate.
>
> It seems like it would be more consistent to always print a newline,
> and then it would obvious those allocations occurred when the call
> stack was empty.  This would make parsing the file a little bit
> easier.

Yes. It's so obviously better to always print a newline, and so clearly
deliberate not to, that I suspect there may have been a good reason.
If I can't work it out (after my grant deadline this week) I will just
assume it's wrong.


   -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Possible bug in termplot function (stats package) ?

2011-06-06 Thread Thomas Lumley
On Tue, Jun 7, 2011 at 7:30 AM, Joris Meys  wrote:
> On Mon, Jun 6, 2011 at 9:15 PM, peter dalgaard  wrote:
>>
>> I'd say that the burden of proof is really on your side, but how hard can it 
>> be:
>>
>>> x <- 1:10
>>> y <- rnorm(10)
>>> m <- lm(y~x)
>>> m$call
>> lm(formula = y ~ x)
>>> m$call$data
>> NULL
>>
> I see... indeed, thx for the answer and sorry for my stupidity. Should
> have thought about that case. Still, I wonder why it is necessary to
> go look for the data in a calling environment if it should be
> contained in the model frame of the fitted object. Or am I again wrong
> in assuming that's always the case?

You are again wrong.   Life would be much simpler if the data were
always available like that, but there are at least two problems.

1) There need not be a model frame in the fitted object. (it's optional)

2) More importantly, if you have y~sin(x), the model frame will
contain sin(x), not x. For what termplot() does, it has to be able to
reconstruct 'x', which isn't possible without the original data.

It's quite possible that termplot() could be rewritten to be simpler
and more general, but I don't think minor editing will do it.
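
A small illustration of the second point:

x <- 1:10
y <- rnorm(10)
m <- lm(y ~ sin(x))
names(model.frame(m))     # "y" and "sin(x)" -- the raw x is not stored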

-thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] .C and .Call: convolve example not working

2011-08-11 Thread Thomas Lumley
On Fri, Aug 12, 2011 at 1:06 AM, Lars Wißler  wrote:
> Dear R users,
>
> I want to call C code via the .C or .Call interface. This works fine
> with integer values but using doubles the array received in C will be
> set to zeros.
>
> I have tried the convolve examples (Writing R extensions, chapter 5.2)
> and still the resulting array consists of zeros.
>
> My code (shortened for my purposes. Original did not work either):
> 
> convolve.r
>
> a=array(1, c(4))
> b=array(3, c(4))
>
> print(a)
> print(b)
>
> .C("convolve",
>    as.double(a),
>    as.double(b))
>
> 
> convolve.c
>
> void convolve(double* a, double* b){
>     int i;
>
>     for(i=0;i<4;i++) Rprintf("a: %d", a[i]);
>     for(i=0;i<4;i++) Rprintf("b: %d", b[i]);
> }


The C code here is wrong for two reasons.  Firstly, there's no
declaration for Rprintf(), because you don't include the header file.

Secondly, you're using %d to print, which means you are telling the
compiler you're passing ints to Rprintf(), but you are actually
passing doubles.

When I fix these problems the code works for me.
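
With both problems fixed, the C side looks like this (sketch):

#include <R.h>    /* declares Rprintf() */

void convolve(double *a, double *b)
{
    int i;
    for (i = 0; i < 4; i++) Rprintf("a: %f ", a[i]);   /* %f for double */
    for (i = 0; i < 4; i++) Rprintf("b: %f ", b[i]);
}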

-thomas

Thomas Lumley
Professor of Biostatistics
University of Auckland

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] readChar and blocking connections

2013-08-14 Thread Thomas Lumley
If readChar() reads from a connection and the specified number of bytes are
not all available, it returns with what it can find. That's as documented
(further down the help page) and it's what the low-level recv() does.

However, for a blocking connection, wouldn't it be more natural to retry
until the requested data are available?

   -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] internal copying in R (soon to be released R-3.1.0

2014-03-18 Thread Thomas Lumley
On Mon, Mar 3, 2014 at 2:50 PM, Simon Urbanek
wrote:

> Jens,
>
> On Mar 3, 2014, at 3:35 PM, Jens Oehlschlägel <
> jens.oehlschlae...@truecluster.com> wrote:
>
> >
> > Well, I did read, for example "Writing R Extensions" (Version 3.1.0
> Under development (2014-02-28)) chapter "5.9.10 Named objects and copying"
> which says "Currently all arguments to a .Call call will have NAMED set to
> 2, and so users must assume that they need to be duplicated before
> alteration."
>
> Matthew pointed out that line and I cannot shed more light on it, since
> it's not true - at least not currently.
>
>
> > This is consistent with the observation of my test code: that NAMED() in
> .Call always returns 2.
>
> It is not - you're not testing .Call() - you're testing the assignments in
> frames which cause additional bumps of NAMED. If you actually test .Call()
> you'll see what I have reported - .Call() itself does NOT affect NAMED.
>
>
I think that's what R-ext is saying. The context is assignment functions,
and the section is explaining that while you might think you can write
assignment functions with .Call() that modify their arguments, by the time
the object gets into .Call() it will have NAMED==2 because there's the
object you're trying to modify and the reference to it inside the
assignment function.

If you write assignment functions that way, you need to make potentially
unsafe assumptions. For example, that you can't get an error halfway up a
chain of nested complex assignments when it's too late to back out of the
expression.
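
For reference, a complex assignment such as x$a <- 2 is evaluated roughly as

`*tmp*` <- x
x <- `$<-`(`*tmp*`, "a", 2)
rm(`*tmp*`)

so while the assignment function runs the target is referenced both by x
and by `*tmp*`, which is what pushes NAMED to 2.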

   -thomas



-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Why did R 3.0's resolveNativeRoutine remove full-search ability?

2014-04-23 Thread Thomas Lumley
On Sat, Apr 19, 2014 at 2:29 AM, Simon Urbanek
 wrote:
> Andrew,
>
> On Apr 18, 2014, at 9:55 AM, Andrew Piskorski  wrote:
>
>> In versions of R prior to 3.0, by default .C and .Call would find the
>> requested C function regardless of which shared library it was located
>> in.  You could use the PACKAGE argument to restrict the search to a
>> specific library, but doing so was not necessary for it to work.
>>
>> R 3.0 introduced a significant change to that behavior; from the NEWS
>> file:
>>
>>  CHANGES IN R 3.0.0:
>>  PERFORMANCE IMPROVEMENTS:
>>* A foreign function call (.C() etc) in a package without a PACKAGE
>>  argument will only look in the first DLL specified in the
>>  NAMESPACE file of the package rather than searching all loaded
>>  DLLs.  A few packages needed PACKAGE arguments added.
>>
>> That is not merely a performance improvement, it is a significant
>> change in functionality.  Now, when R code in my package foo tries to
>> call C code located in bar.so, it fails with a "not resolved from
>> current namespace (foo)" error.  It works if I change all my uses of
>> .C and .Call to pass a PACKAGE="bar" argument.  Ok, I can make that
>> change in my code, no big deal.
>>
>> What surprises me though, is that there appears to be no way to invoke
>> the old (and very conventional Unix-style), "I don't want to specify
>> where the function is located, just keep searching until you find it"
>> behavior.  Is there really no way to do that, and if so, why not?
>>
>> Comparing the R sources on the 3.1 vs. 2.15 branches, it looks as if
>> this is due to some simple changes to resolveNativeRoutine in
>> "src/main/dotcode.c".  Specifically, the newer code adds this:
>>
>>   errorcall(call, "\"%s\" not resolved from current namespace (%s)",
>> buf, ns);
>>
>> And removes these lines:
>>
>>   /* need to continue if the namespace search failed */
>>   *fun = R_FindSymbol(buf, dll.DLLname, symbol);
>>   if (*fun) return args;
>>
>> Is that extra call to R_FindSymbol really all that's necessary to
>> invoke the old "keep searching" behavior?  Would it be a good idea to
>> provide an optional way of finding a native routine regardless of
>> where it's located, perhaps via an optional PACKAGE=NA argument to .C,
>> .Call, etc.?
>>
>> And now I see that help(".Call") says:
>>
>>   'PACKAGE = ""' used to be accepted (but was undocumented): it is
>>now an error.
>>
>> I assume passing PACKAGE="" used to invoke the same "keep searching"
>> behavior as not passing any PACKAGE argument at all.  So apparently
>> the removal of functionality was intentional.  I'd like to better
>> understand why.  Why should that be an error?  Or said another way,
>> why has traditional Unix-style symbol resolution been banned from use
>> with .C and .Call ?
>>
>
> I cannot speak for the author, but a very strong argument is to prevent 
> (symbol) namespace issues. If you cannot even say where the symbol comes 
> from, you have absolutely no way of knowing that the symbol you get has 
> anything to do with the symbol you intended to get, because you could get any 
> random symbol in any shared object that may or may not have anything to do 
> with your code. Note that even you as the author of the code have no control 
> over the namespace so although you intended this to work, loading some other 
> package can break your code - and in a fatal manner since this will typically 
> lead to a segfault. Do you have any strong use case for allowing this given 
> how dangerous it is? Ever since symbol registration has been made easy, it's 
> much more efficient and safe to use symbols directly instead.
>


As a follow-up to this, note that with traditional Unix symbol
resolution it was forbidden to have two different routines with the
same name linked into an object. That just isn't an option for R
because of the package system.  This isn't theoretical: the PACKAGE=
argument was introduced when finding the wrong symbol resolution
became a real problem late last century
(http://marc.info/?l=r-devel&m=107151103308418&w=2), but there wasn't
a good cross-package calling mechanism for quite a while. Now there is
a cross-package mechanism that works, where the Unix-style approach
cannot be made to work safely with packages from multiple authors.
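
For reference, the registration Simon mentions looks roughly like this (a
sketch for a hypothetical package 'mypkg' with one .Call entry point):

#include <R.h>
#include <Rinternals.h>
#include <R_ext/Rdynload.h>

SEXP convolve2(SEXP a, SEXP b);    /* defined elsewhere in the package */

static const R_CallMethodDef callMethods[] = {
    {"convolve2", (DL_FUNC) &convolve2, 2},
    {NULL, NULL, 0}
};

void R_init_mypkg(DllInfo *info)
{
    R_registerRoutines(info, NULL, callMethods, NULL, NULL);
    R_useDynamicSymbols(info, FALSE);   /* no fallback to symbol search */
}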

   -thomas


-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] capture.output on S4 slot

2014-08-19 Thread Thomas Lumley
On Fri, Jul 18, 2014 at 4:00 PM, Dario Strbenac
 wrote:
> Hello,
>
> capture.output produces a different result if the S4 object was created with 
> a constructor than if the body of the constructor is copied and pasted.
>
> setClass("TransformParams", representation(
>   transform = "function",
>   otherParams = "list")
> )
>
> setGeneric("TransformParams", function(transform, ...)
> {standardGeneric("TransformParams")})
> setMethod("TransformParams", character(0), function()
> {
>   new("TransformParams", transform = function(){}, otherParams = list())
> })
>
>> capture.output(TransformParams()@transform)
> [1] "function () " "{"
> [3] "}"""
>> capture.output(new("TransformParams", transform = function(){}, otherParams 
>> = list())@transform)
> [1] "function(){}"
>
> Why is the function split into three parts if a constructor is used ?
>
> --
> Dario Strbenac
> PhD Student
> University of Sydney
> Camperdown NSW 2050
> Australia


Dario,

When you use the constructor, the environment of the function is the
environment inside the constructor; when you use new() it is
R_GlobalEnv

When a function prints, its environment is shown as an extra line unless
the environment is something special like R_GlobalEnv. Since capture.output
captures the printed output, that's what it sees.

The function isn't split up; the printed output is.
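
The same effect is visible without any S4 machinery:

g <- function() function() {}
h <- g()
h
## function() {}
## <environment: 0x...>     (the extra line; the address will vary)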

   -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Speed difference between df$a[1] and df[1,"a"]

2011-10-20 Thread Thomas Lumley
On Wed, Oct 19, 2011 at 2:34 PM, Stavros Macrakis  wrote:
> I was surprised to find that df$a[1] is an order of magnitude faster than
> df[1,"a"]:

Yes.  This treats a data frame as a list, and is much faster.

> I thought this might be because df[,] builds a data frame before simplifying
> it to a vector, but with drop=F, it is even slower, so that doesn't seem to
> be the problem:

drop=FALSE creates a data frame first, and then simplifies it to a
vector, so this test isn't showing what you think it is.

> I then wondered if it might be because '[' allows multiple columns and
> handles rownames. Sure enough, '[[,]]', which allows only one column, and
> does not handle rownames, is almost 3x faster:

That's part of it, but if you look at [.data.frame you see there is
also quite a bit of copying that could be avoided in simple cases but
is hard to avoid in full generality.

-thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] C function is wrong under Windows 7

2011-10-24 Thread Thomas Lumley
On Mon, Oct 24, 2011 at 6:04 AM, Evarist Planet
 wrote:
> Dear mailing list,
>
> I have a C function that gives me a wrong result when I run it under Windows
> 7.

The fact that you extract pointers to the contents of fchr and sign
before coercing them to REALSXP is deeply suspicious to me, though I
admit I don't see how it causes the problem, since you don't use the
new versions, you don't overwrite them, and the old versions should
still be protected as arguments to the function.
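
The usual pattern is to coerce (and PROTECT) first, only then take the
pointers, and to coerce to the type the C code actually uses (sketch):

PROTECT(fchr = coerceVector(fchr, REALSXP));
PROTECT(sign = coerceVector(sign, INTSXP));   /* rsign is used as int * */
double *rfchr = REAL(fchr);
int    *rsign = INTEGER(sign);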

> SEXP getEs(SEXP fchr, SEXP sign)
> {
>  int i, nfchr, nsign;
>  double *rfchr = REAL(fchr), *res;
>  int *rsign = INTEGER(sign);
>
>  PROTECT(fchr = coerceVector(fchr, REALSXP));
>  PROTECT(sign = coerceVector(sign, REALSXP));
>
>  nfchr = length(fchr);
>  nsign = length(sign);
>
>  SEXP es;
>  PROTECT(es = allocVector(REALSXP, nfchr));
>  res = REAL(es);
>
>  double nr = getNr(rfchr, rsign, nsign);
>
>  SEXP phit;
>  PROTECT(phit = allocVector(REALSXP, nfchr));
>  double *rphit;
>  rphit = REAL(phit);
>  for(i = 0; i < nfchr; i++) rphit[i] = 0.0;
>  getPhit(rfchr, rsign, nsign, nr, rphit);
>  cumsum(rphit, nfchr);
>
>  SEXP pmiss;
>  PROTECT(pmiss = allocVector(REALSXP, nfchr));
>  double *rpmiss;
>  rpmiss = REAL(pmiss);
>  for(i = 0; i < nfchr; i++) rpmiss[i] = 0.0;
>  getPmiss(rsign, nfchr, nsign, rpmiss);
>  cumsum(rpmiss, nfchr);
>
>  for (i = 0; i < nfchr; ++i) {
>   res[i] = rphit[i] - rpmiss[i];
>   }
>
>  UNPROTECT(5);
>  return es;
> }
>
> Could you please help me to find out what I am doing wrong?
>
> Many thanks in advance,
>
> --
> Evarist Planet
> Research officer, Bioinformatics and Biostatistics unit
> IRB Barcelona
> Tel (+34) 93 402 0553
> Fax (+34) 93 402 0257
>
> evarist.pla...@irbbarcelona.org
> http://www.irbbarcelona.org/bioinformatics
>
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>



-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] droplevels: drops contrasts as well

2011-10-24 Thread Thomas Lumley
On Fri, Oct 21, 2011 at 5:57 AM, Thaler, Thorn, LAUSANNE, Applied
Mathematics  wrote:
> Dear all,
>
> Today I figured out that there is a neat function called droplevels,
> which, well, drops unused levels in a data frame. I tried the function
> with some of my data sets and it turned out that not only the unused
> levels were dropped but also the contrasts I set via "C". I had a look
> into the code, and this behaviour arises from the fact that droplevels
> uses simply factor to drop the unused levels, which uses the default
> contrasts as set by options("contrasts").
>
> I think this behaviour is annoying, because if one does not look
> carefully enough, one loses the contrasts silently. Hence may I suggest
> to change the code of droplevels to something like the following:

This silently changes the contrasts -- eg, if the first level of the
factor is one of the empty levels, the reference level used by
contr.treatment() will change.  Also, if the contrasts are a matrix
rather than specifying a contrast function, the matrix will be invalid
for the new factor.

I think just having a warning would be better -- in general it's not
clear what (if anything) it means to have the same contrasts on
factors with different numbers of levels.
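
For example, with the default treatment contrasts:

f <- factor(c("b", "c", "c"), levels = c("a", "b", "c"))
contrasts(f)              # reference level is "a"
contrasts(droplevels(f))  # reference level silently becomes "b"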

   -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Moderating consequences of garbage collection when in C

2011-11-14 Thread Thomas Lumley
On Tue, Nov 15, 2011 at 8:47 AM,   wrote:
> dhi...@sonic.net wrote:
>> Martin Morgan  wrote:
>> > Allocating many small objects triggers numerous garbage collections as R
>> > grows its memory, seriously degrading performance. The specific use case
>> > is in creating a STRSXP of several 1,000,000's of elements of 60-100
>> > characters each; a simplified illustration understating the effects
>> > (because there is initially little to garbage collect, in contrast to an
>> > R session with several packages loaded) is below.
>
>> What a coincidence -- I was just going to post a question about why it
>> is so slow to create a STRSXP of ~10,000,000 unique elements, each ~10
>> characters long.  I had noticed that this seemed to show much worse
>> than linear scaling.  I had not thought of garbage collection as the
>> culprit -- but indeed it is.  By manipulating the GC trigger, I can
>> make this operation take as little as 3 seconds (with no GC) or as
>> long as 76 seconds (with 31 garbage collections).
>
> I had done some google searches on this issue, since it seemed like it
> should not be too uncommon, but the only other hit I could come up
> with was a thread from 2006:
>
> https://stat.ethz.ch/pipermail/r-devel/2006-November/043446.html
>
> In any case, one issue with your suggested workaround is that it
> requires knowing how much additional storage is needed, which may be
> an expensive operation to determine.  I've just tried implementing a
> different approach, which is to define two new functions to either
> disable or enable GC.  The function to disable GC first invokes
> R_gc_full() to shrink the heap as much as possible, then sets a flag.
> Then in R_gc_internal(), I first check that flag, and if it is set, I
> call AdjustHeapSize(size_needed) and exit immediately.
>
> These calls could be used to bracket any code section that expects to
> make lots of calls to R's memory allocator.  The down side is that
> this approach requires that all paths out of such a code section
> (including error handling) need to take care to unset the GC-disabled
> flag.  I think I would want to hear from someone on the R team about
> whether they think this is a good idea.

If .Call and .C re-enabled the GC on return from compiled code (and
threw some sort of error) that would help contain the potential
damage.

You might also want to re-enable GC if malloc() returned NULL,
rather than giving an out-of-memory error.

  -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] rowsum

2011-12-18 Thread Thomas Lumley
On Wed, Mar 30, 2011 at 1:56 AM, Terry Therneau  wrote:
>    > with the entirely different rowSums, but it has been around
>    > for a long time.)
>    A lot longer than rowSums ...
>    > Bill Dunlap
>    > Spotfire, TIBCO Software
>    ---
>      This made me smile.  The rowsums function was originally an internal
>    part of the survival package, used for fast computation of certain sums
>    when there is a cluster() statement.  It was Statistical Sciences
>    (S-Plus) who moved it to global status.  That is, they used it in enough
>    other places that they decided to speed it up, took over the
>    maintenance and ownership of the function (with my blessing), and
>    ceased to label it as part of "survival" in the manual.
>      This "metabug" can't be laid at R's feet.

The same process happened independently in R.  I ported the 'survival'
version of rowsum() to R, added it to base R in version 0.63.1, and
later made it faster using hashing.

So perhaps it isn't entirely StatSci's fault, although it's likely
that R would eventually have added a rowsum() function for
compatibility.

  -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] names<- appears to copy 3 times?

2012-01-17 Thread Thomas Lumley
On Tue, Jan 17, 2012 at 9:11 PM, Matthew Dowle  wrote:
> Hi,
>
> $ R --vanilla
> R version 2.14.1 (2011-12-22)
> Platform: i686-pc-linux-gnu (32-bit)
>> DF = data.frame(a=1:3,b=4:6)
>> DF
>  a b
> 1 1 4
> 2 2 5
> 3 3 6
>> tracemem(DF)
> [1] "<0x8898098>"
>> names(DF)[2]="B"
> tracemem[0x8898098 -> 0x8763e18]:
> tracemem[0x8763e18 -> 0x8766be8]:
> tracemem[0x8766be8 -> 0x8766b68]:
>> DF
>  a B
> 1 1 4
> 2 2 5
> 3 3 6
>>
>
> Are those 3 copies really taking place?
>

tracemem() isn't likely to give false positives.  Since you're on
Linux, you could check by running under gdb and setting a breakpoint
on memtrace_report, which is the function that prints the message.
That would show where the duplicates are happening.
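
Something like this (sketch):

$ R -d gdb
(gdb) break memtrace_report
(gdb) run
> DF <- data.frame(a = 1:3, b = 4:6)
> tracemem(DF)
> names(DF)[2] <- "B"      # gdb now stops at each duplication
(gdb) backtrace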

  - thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] read.spss issues

2012-02-15 Thread Thomas Lumley
On Wed, Feb 15, 2012 at 7:05 PM, Jeroen Ooms  wrote:

> The second problem is that the spss dataformat allows to specify
> 'duplicate labels', whereas this is not allowed for factors. read.spss
> does not deal with this and creates a bad factor
>
> x <- read.spss("http://www.stat.ucla.edu/~jeroen/spss/duplicate_labels.sav",
> use.value.labels=T);
> levels(x$opinion);
>
> which causes issues downstream. I am not sure if this is an issue in
> read.spss() or as.factor(), but I guess it might be wise to try to
> detect duplicate levels and assign them all with one and the same
> integer value when converting to a factor.

I think this one would be better dealt with by giving an error.

SPSS value labels are just labels, so they don't map very well onto R
factors, which are enumerated types.  Rather than force them and lose
data, I would prefer to make the user decide what to do.

-thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] portable parallel seeds project: request for critiques

2012-02-19 Thread Thomas Lumley
On Sat, Feb 18, 2012 at 4:33 PM, Paul Johnson  wrote:
> On Fri, Feb 17, 2012 at 5:06 PM, Petr Savicky  wrote:
>> On Fri, Feb 17, 2012 at 02:57:26PM -0600, Paul Johnson wrote:
>> Hi.
>>
>> Some of the random number generators allow as a seed a vector,
>> not only a single number. This can simplify generating the seeds.
>> There can be one seed for each of the 1000 runs and then,
>> the rows of the seed matrix can be
>>
>>  c(seed1, 1), c(seed1, 2), ...
>>  c(seed2, 1), c(seed2, 2), ...
>>  c(seed3, 1), c(seed3, 2), ...
>>  ...
>>
> Yes, I understand.
>
> The seed things I'm using are the 6 integer values from the L'Ecuyer.
> If you run the example script, the verbose option causes some to print
> out.  The first 3 seeds in a saved project seeds file looks like:
>
>> projSeeds[[1]]
> [[1]]
> [1]         407   376488316  1939487821  1433925148 -1040698333   579503880
> [7] -624878918
>
> [[2]]
> [1]         407 -1107332181   854177397  1773099324  1774170776  -266687360
> [7] 816955059
>
> [[3]]
> [1]         407   936506900 -1924631332 -1380363206  2109234517  1585239833
> [7] -1559304513
>
> The 407 in the first position is an integer R uses to note the type of
> stream for which the seed is intended, in this case R'Lecuyer.
>
>
>
>
>
>> There could be even only one seed and the matrix can be generated as
>>
>>  c(seed, 1, 1), c(seed, 1, 2), ...
>>  c(seed, 2, 1), c(seed, 2, 2), ...
>>  c(seed, 3, 1), c(seed, 3, 2), ...
>>
>> If the initialization using the vector c(seed, i, j) is done
>> with a good quality hash function, the runs will be independent.
>>
> I don't have any formal proof that a "good quality hash function"
> would truly create seeds from which independent streams will be drawn.

That is essentially the *definition* of a good quality hash function,
at least in the cryptographic sense.  It maps the inputs into numbers
that are indistinguishable from uniform random except that the same
input always gives the same output.

What's harder is to prove that you *have* a good quality hash
function, but for these (non-adversarial) purposes even something like
MD4 would be fine, and certainly the SHA family.
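
A sketch of the idea, using the digest package (an assumption -- any decent
cryptographic hash would do) to reduce c(seed, i, j) to a 31-bit seed:

library(digest)
seed_for <- function(seed, i, j) {
    h <- digest(c(seed, i, j), algo = "sha256")
    strtoi(substr(h, 1, 7), base = 16L)    # 7 hex digits stays below 2^31
}
set.seed(seed_for(407, 3, 1))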

-thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] CRAN policies

2012-03-28 Thread Thomas Lumley
On Thu, Mar 29, 2012 at 3:30 AM, Gabor Grothendieck
 wrote:
> 2012/3/28 Uwe Ligges :
>>
>>
>> On 27.03.2012 20:33, Jeffrey Ryan wrote:
>>>
>>> Thanks Uwe for the clarification on what goes and what stays.
>>>
>>> Still fuzzy on the notion of "significant" though.  Do you have an example
>>> or two for the list?
>>
>>
>>
>> We have to look at those notes again and again in order to find if something
>> important is noted, hence please always try to avoid all notes unless the
>> effect is really intended!
>>
>>
>> Consider the Note "No visible binding for global variable"
>> We cannot know if your code intends to use such a global variable (which is
>> undesirable in most cases), hence would let is pass if it seems to be
>> sensible.
>>
>> Another Note such as "empty section" or "partial argument match" can quickly
>> be fixed, hence just do it and don't waste our time.
>>
>> Best,
>> Uwe Ligges
>
> What is the point of notes vs warnings if you have to get rid of both
> of them?  Furthermore, if there are notes that you don't have to get
> rid of its not fair that package developers should have to waste their
> time on things that are actually acceptable.  Finally, it makes the
> whole system arbitrary since packages can be rejected based on
> undefined rules.
>

The "notes" are precisely the things for which clear rules can't be
written.  They are reported by CMD check because they are usually
signs of coding errors, but are not warnings because their use is
sometimes justified.

The "no visible binding for global variable" note is a good example.  This
found some bugs in my 'survey' package, which I removed. There is
still one note of this type, which arises when I have to handle two
different versions of the hexbin package with different internal
structures.  The note is a false positive because the use is guarded
by an if(), but CMD check can't tell this.  So, it's a good idea to
remove all Notes that can be removed without introducing other code
problems, which is nearly all of them, but occasionally there may be a
good reason for code that produces a Note.

But if you want a simple, unambiguous, mechanical rule for *your*
packages, just eliminate all Notes.

   -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] The constant part of the log-likelihood in StructTS

2012-05-03 Thread Thomas Lumley
On Thu, May 3, 2012 at 3:36 AM, Mark Leeds  wrote:
> Hi Ravi: As far as I know ( well , really read ) and Bert et al can say
> more , the AIC is not dependent on the models being nested as long as the
> sample sizes used are the same when comparing. In some cases, say comparing
> MA(2), AR(1), you have to be careful with sample size usage but there is no
> nesting requirement for AIC at least, I'm pretty sure.

This is only partly true.  The expected value of the AIC will behave
correctly even if models are non-nested, but there is no general
guarantee that the standard deviation is small,  so AIC need not even
asymptotically lead to optimal model choice for prediction in
arbitrary non-nested models.

Having said that, 'nearly' nested models like these are probably ok.
I believe it's sufficient that all your models are nested in a common
model, with a bound on the degree of freedom difference, but my copy
of Claeskens & Hjort's book on model selection and model averaging is
currently with a student so I can't be definitive.


   -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] confusion over S3/S4 importing

2012-08-01 Thread Thomas Lumley
On Mon, Jul 30, 2012 at 7:12 AM, Ben Bolker  wrote:
>
>   Can anyone help me figure out the right way to import a method that is
> defined as S3 in one package and S4 in another?
>
>   Specifically:
>
>  profile() is defined as an S3 method in the stats package:
>
> function (fitted, ...)
> UseMethod("profile")
> 
> 
>
>  In stats4 it is defined as an S4 method:
>
>  stats4:::profile
> standardGeneric for "profile" defined from package "stats"
>
> function (fitted, ...)
> standardGeneric("profile")
> 
>
>In the bbmle package I want to define it as an S4 method:
>
>   setMethod("profile", "mle2", function (...) {...})
>
>   In the NAMESPACE file for bbmle I have
>
>  importFrom(stats4,profile)
>
>   However, in R 2.14.2 (which is I am currently trying to fix: I would
> like to try to allow for backward compatibility), I get
>
> library(bbmle)
> example(mle2) ## to create some fitted objects
> profile(fit)
> Error in UseMethod("profile") :
>   no applicable method for 'profile' applied to an object of class "mle2"
>
>  whereas stats4::profile(fit) works fine.
>
>   -- a similar issue occurs for coef.
>
>   I would have thought that importFrom() would allow package code to
> find the 'stats4' version of profile (and coef) before it found the
> 'stats' versions , but that doesn't seem to be the case.
>
>   Just for kicks I tried
>
> import(stats4)
>
>   which leads to warning messages but no improvements.
>
> If I add an explicit Depends: on stats4 that fixes the problem. I may be
> forced to do that, but I thought it was supposed to be a last resort and
> that I was *supposed* to be able to fix my problems by proper use of
> imports.

'Imports' won't be enough -- the whole point of a generic is that it's
visible to the user, which doesn't happen with imports.
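
One pattern that makes the generic visible to users again (a sketch, not
necessarily what bbmle ended up doing):

## NAMESPACE
importFrom(stats4, profile)   # use the S4 generic from stats4
exportMethods(profile)        # register and re-export it with your methods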

   -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] error on uneven recycling?

2012-09-24 Thread Thomas Lumley
Is there some reason why

> (1:2)+(1:3)
[1] 2 4 4
Warning message:
In (1:2) + (1:3) :
  longer object length is not a multiple of shorter object length

can't be made into an error?  I realise it was there in S-PLUS, but
since it produces a warning there can't be many examples on CRAN or
Bioconductor using it, and I can't think of any situation where it
would be used deliberately.

-thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] How to build a list with missing values? What is missing, anyway?

2012-10-07 Thread Thomas Lumley
On Fri, Oct 5, 2012 at 4:07 PM, Bert Gunter  wrote:
> On Thu, Oct 4, 2012 at 6:31 PM, Peter Meilstrup

>> explain some things, such as under what circumstances one would get a
>> "`...` used in incorrect context" error.
>
> How could it possibly know that?
>
> -- Bert

By looking at the code that throws the error.

The code is in src/main/eval.c, and the error is thrown in several
places when the symbol ... appears in a call being evaluated, and the
value of the symbol is neither the special DOTSXP type nor NULL.  That
is, if you use ... in code and there isn't a ... formal argument in
scope.
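
For example:

f <- function() sum(...)    # no '...' formal argument in scope
f()
## Error in f() : '...' used in an incorrect context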

   -thomas


-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] setting option in function

2012-10-19 Thread Thomas Lumley
old_options <- options(na.action=na.fail)
on.exit(options(old_options))

You can also use this to define a wrapper that executes an expression
using special options

withOptions <- function(optlist, expr) {
    oldopt <- options(optlist)
    on.exit(options(oldopt))
    expr <- substitute(expr)
    eval.parent(expr)
}
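
Usage, for instance:

d <- data.frame(x = 1:5, y = c(2, 4, NA, 8, 10))
withOptions(list(na.action = na.fail), lm(y ~ x, data = d))
## Error: missing values in object
lm(y ~ x, data = d)     # outside the wrapper the usual default applies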


-thomas



On Sat, Oct 20, 2012 at 10:35 AM, Charles Geyer  wrote:
> is it possible to set an option inside a function ((I want to set
> na.action = na.fail) and have the previous state restored if there is
> an error so that the function doesn't change the option behind the
> user's back?
>
> Sorry if this has been answered before, but this subject is hard to Google.
>
> --
> Charles Geyer
> Professor, School of Statistics
> University of Minnesota
> char...@stat.umn.edu
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Crash - cause 'memory not mapped'

2012-11-11 Thread Thomas Lumley
On Sat, Nov 10, 2012 at 12:33 PM, Peter Langfelder <
peter.langfel...@gmail.com> wrote:

> On Fri, Nov 9, 2012 at 3:05 PM, gianluca mastrantonio
>  wrote:
> > the error "memory non mapped" happen also if i use
> > void FiltroGaus(SEXP size1_r, SEXP size2_r, SEXP sigma_r)
> > instead
> > SEXP FiltroGaus(SEXP size1_r, SEXP size2_r, SEXP sigma_r) {
> >
>
> If you use a .Call on a function, the function must return a SEXP, it
> cannot have a void return value.
>
>
And if one uses a reasonable warning level on the compiler, there should be
a warning, such as gcc's "control reaches end of a non-void function"

Also .C() doesn't pass SEXPs, so the declaration of the function is still
wrong for .C(), though this one the compiler won't notice.
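
For reference (a sketch; FiltroGaus_C is a hypothetical name for the .C
variant):

/* .Call(): arguments and return value are SEXPs; return R_NilValue
   if there is nothing useful to hand back */
SEXP FiltroGaus(SEXP size1_r, SEXP size2_r, SEXP sigma_r)
{
    /* ... */
    return R_NilValue;
}

/* .C(): void return, and arguments are pointers to basic C types */
void FiltroGaus_C(int *size1, int *size2, double *sigma);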

--thomas


-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] What is the INDEX file for?

2012-11-11 Thread Thomas Lumley
On Fri, Nov 9, 2012 at 10:02 PM, Christophe Genolini
wrote:

> Hi the list,
>
> In WRE (or in Rdindex), we can find how the INDEX is make, how to change
> it, but I do not manage to find the purpose of this file. So what is the
> INDEX file for?
>
>
It's an index, or more precisely a table of contents for the package.

The INDEX is part of what is displayed by help(package="pkgname").  It was
more important before package help pages and vignettes, especially for
non-GUI systems when it could be quite hard to work out which function help
pages to look at in an unfamiliar package.

-thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] as.matrix.Surv -- R core question/opinions

2013-01-01 Thread Thomas Lumley
I agree that it's cleaner to remove the extra attributes -- that's the
point of as.matrix.Surv(), to produce a matrix that doesn't include any
extra details of how Surv objects are implemented.

   -thomas


On Fri, Dec 7, 2012 at 3:48 AM, Terry Therneau  wrote:

> 1. A Surv object is a matrix with some extra attributes.  The
> as.matrix.Surv function removes the extras but otherwise leaves it as is.
>
> 2. The last several versions of the survival library were accidentally
> missing the S3method('as.matrix', 'Surv') line from their NAMESPACE file.
>  (Instead its position is held by a duplicate of the line just above it in
> the NAMESPACE file, suggesting a copy/paste error).  As a consequence the
> as.matrix.Surv function was effectively ignored, and the default method was
> being used.
>The as.matrix.default function leaves anything with a "dim" attribute
> alone.
>
> 3. In my current about-to-submit-to-CRAN  version of survival the missing
> NAMESPACE line was restored.  This breaks one function in one package (rms)
> which calls "as.matrix(y)" on a Surv object but then later looks at the
> "type" attribute of y.
>
>  So now to the design question: should the as.matrix.Surv function
> "sanitize" the result by removing the extra attributes, or should it leave
> them alone?  The first seems cleaner; my accidental multi-year test of
> leaving them in, however, clearly shows that it does no harm.
>
> Terry T.
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>



-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Understanding svd usage and its necessity in generalized inverse calculation

2013-01-01 Thread Thomas Lumley
On Thu, Dec 6, 2012 at 12:58 PM, Paul Johnson  wrote:

> Dear R-devel:
>
> I could use some advice about matrix calculations and steps that might
> make for faster computation of generalized inverses. It appears in
> some projects there is a bottleneck at the use of svd in calculation
> of generalized inverses.
>
> Here's some Rprof output I need to understand.
>
> >   summaryRprof("Amelia.out")
> $by.self
>  self.time self.pct total.time total.pct
> "La.svd"150.3427.66 164.82 30.32
>
>
[snip]


> I *Think* this means that a bottlleneck here is svd, which is being
>

A bottleneck to a limited extent -- it's still only 30% of computation
time, so speeding it up can't save more than that.


> called by this function that calculates generalized inverses:
>
> ## Moore-Penrose Inverse function (aka Generalized Inverse)
> ##   X:symmetric matrix
> ##   tol:  convergence requirement
> mpinv <- function(X, tol = sqrt(.Machine$double.eps)) {
>   s <- svd(X)
>   e <- s$d
>   e[e > tol] <- 1/e[e > tol]
>   s$v %*% diag(e,nrow=length(e)) %*% t(s$u)
> }
>
> That is from the Amerlia package, which we like to use very much.
>
> Basically, I wonder if I should use a customized generalized inverse
> or svd calculator to make this faster.
>
>

If your matrix is produced as a covariance matrix, so that it really is
symmetric positive semidefinite up to rounding error, my understanding is
that the Cholesky approach is stable in the sense that it will reliably
produce an accurate inverse or a zero-to-within-tolerance pivot and error
message.

Now, if most of the matrices you are trying to invert are actually
invertible (as I would hope), it may be quicker to use the Cholesky
approach with a fallback to the SVD for semidefinite matrices. That is,
something like

tryCatch(
    chol2inv(chol(xx)),
    error = function(e) ginv(xx)
)

Most of the time you will get the Cholesky branch, which is much faster
(about five-fold for 10x10 matrices on my system).  On my system, with a
10x10 matrix, the overhead in the tryCatch() is much smaller than the time
taken by either set of linear algebra, so there is a net gain as long as
even a reasonable minority of the matrices are actually invertible.

You can probably do slightly better by replacing the chol2inv() with
backsolve(): solving just the systems of linear equations you need is
usually preferable to constructing a matrix inverse.
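
For instance, when the end goal is solving xx %*% beta == b rather than
forming an inverse explicitly -- a sketch, with ginv() from MASS and xx
assumed symmetric:

solve_spd <- function(xx, b) {
    tryCatch({
        R <- chol(xx)                        # xx = t(R) %*% R, R upper triangular
        backsolve(R, forwardsolve(t(R), b))  # two triangular solves
    }, error = function(e) MASS::ginv(xx) %*% b)  # SVD fallback if not positive definite
}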


Note that this approach will give wrong answers without warning if the
matrices are not symmetric.

   -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] R-forge, package dependencies

2013-01-15 Thread Thomas Lumley
I have a project on R-forge (sqlsurvey.r-forge.r-project.org) with two
packages, RMonetDB and sqlsurvey.

At the moment, sqlsurvey is listed as failing to build.  The error is on
the Linux package check, which says that RMonetDB is not available:


* checking package dependencies ... ERROR
Package required but not available: ‘RMonetDB’

RMonetDB has built successfully: r-forge lists its status as 'current',
with Linux, Windows, and Mac packages available for download.  The package
check for sqlsurvey on Windows and Mac finds RMonetDB without any problems;
it's just on Linux that it appears to be unavailable.

Any suggestions for how to fix this? I've tried uploading a new version of
RMonetDB, but the situation didn't change: it built successfully, but the
Linux check of sqlsurvey still couldn't find it.

    -thomas


-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] stopping finalizers

2013-02-12 Thread Thomas Lumley
Is there some way to prevent finalizers running during a section of code?

I have a package that includes R objects linked to database tables.  To
maintain the call-by-value semantics, tables are copied rather than
modified, and the extra tables are removed by finalizers during garbage
collection.

However, if the garbage collection occurs in the middle of processing
another SQL query (which is relatively likely, since that's where the
memory allocations are) there are problems with the database interface.

Since the guarantees for the finalizer are "at most once, not before the
object is out of scope" it seems harmless to be able to prevent finalizers
from running during a particular code block, but I can't see any way to do
it.

Suggestions?

-thomas


-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] stopping finalizers

2013-02-14 Thread Thomas Lumley
Luke,

We're actually adopting the first of your generic approaches.

As a more concrete description:

There are R objects representing survey data sets, with the data stored in
a database table.  The subset() method, when applied to these objects,
creates a new table indicating which rows of the data table are in the
subset -- we don't modify the original table, because that breaks the
call-by-value semantics. When the subset object in R goes out of scope, we
need to delete the extra database table.

 I have been doing this with a finalizer on an environment that's part of
the subset object in R.   This all worked fine with JDBC, but the native
database interface requires that all communications with the database come
in send/receive pairs. Since R is single-threaded, this would normally not
be any issue. However, since garbage collection can happen at any time, it
is possible that the send part of the finalizer query "drop table
_sbs_whatever" comes between the send and receive of some other query, and
the database connection then falls over.   So, I'm happy for the finalizer
to run at any time except during a small critical section of R code.

In this particular case the finalizer only issues "drop table" queries, and
it doesn't need to know if they succeed, so we can keep a lock in the
database connection and just store any "drop table" queries that arrive
during a database operation for later execution.   More generally, though,
the fact that no R operation is atomic with respect to garbage collection
seems to make it a bit difficult to use finalizers -- if you need a
finalizer, it will often be in order to access and free some external
resource, which is when the race conditions can matter.
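
In outline, the workaround looks something like this (a sketch --
send_query() stands in for the real low-level interface):

.db <- new.env()
.db$busy <- FALSE
.db$pending <- character(0)

dropTableLater <- function(name) {
    sql <- paste("DROP TABLE", name)
    if (.db$busy)
        .db$pending <- c(.db$pending, sql)  # a query is in flight: defer
    else
        send_query(sql)                     # safe to send immediately
}

withDbLock <- function(expr) {
    .db$busy <- TRUE
    on.exit({
        .db$busy <- FALSE
        for (sql in .db$pending) send_query(sql)  # flush deferred drops
        .db$pending <- character(0)
    })
    expr
}

The finalizer calls dropTableLater(), and every send/receive pair runs
inside withDbLock().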

What I was envisaging was something like

without_gc(expr)

to evaluate expr with the memory manager set to allocate memory (or attempt
to do so) without garbage collection.  Even better would be if gc could
run, but weak references were temporarily treated as strong so that garbage
without finalizers would be collected but finalizers didn't get triggered.
 Using this facility would be inefficient, because it would allocate more
memory than necessary and would also mess with the tuning of the garbage
collector,  but when communicating with other programs it seems it would be
very useful to have some way of running an R code block and knowing that no
other R code block would run during it (user interrupts are another issue,
but they can be caught, and in any case I'm happy to fail when the user
presses CTRL-C).
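
Concretely, the intended usage would be something like this (hypothetical
-- no such function exists today, and send_query()/receive_result() are
made-up names):

ans <- without_gc({
    send_query(conn, sql)     # no finalizer can fire between the send...
    receive_result(conn)      # ...and the receive
})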

 -thomas




On Fri, Feb 15, 2013 at 12:53 AM,  wrote:

> It might help if you could be more specific about what the issue is --
> if they are out of scope why does it matter whether the finalizers
> run?
>
> Generically two approaches I can think of:
>
> you keep track of when it is safe to fully run your finalizers and have
> your finalizers put the objects on a linked list if it isn't safe to
> run the finalizer now, and clear the list each time you make a new one
>
> keep track of your objects with a weak list and turn them into strong
> references before your calls, then drop the list after.
>
> I'm pretty sure we don't have a mechanism for temporarily suspending
> running the finalizers but it is probably fairly easy to add if that
> is the only option.
>
> I might be able to think of other options with more details on the
> issue.
>
> Best,
>
> luke
>
>
> On Tue, 12 Feb 2013, Thomas Lumley wrote:
>
>  Is there some way to prevent finalizers running during a section of code?
>>
>> I have a package that includes R objects linked to database tables.  To
>> maintain the call-by-value semantics, tables are copied rather than
>> modified, and the extra tables are removed by finalizers during garbage
>> collection.
>>
>> However, if the garbage collection occurs in the middle of processing
>> another SQL query (which is relatively likely, since that's where the
>> memory allocations are) there are problems with the database interface.
>>
>> Since the guarantees for the finalizer are "at most once, not before the
>> object is out of scope" it seems harmless to be able to prevent finalizers
>> from running during a particular code block, but I can't see any way to do
>> it.
>>
>> Suggestions?
>>
>>-thomas
>>
>>
>>
>>
> --
> Luke Tierney
> Chair, Statistics and Actuarial Science
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa  Phone: 319-335-3386
> Department of Statistics and   Fax:   319-335-3017
>   Actuarial Science
> 241 Schaeffer Hall  email:   luke-tier...@uiowa.edu
> Iowa City, IA 52242 WWW:  http://www.stat.uiowa.edu
>



-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] stopping finalizers

2013-02-15 Thread Thomas Lumley
On Fri, Feb 15, 2013 at 3:35 PM, Hadley Wickham  wrote:

> > There are R objects representing survey data sets, with the data stored
> in
> > a database table.  The subset() method, when applied to these objects,
> > creates a new table indicating which rows of the data table are in the
> > subset -- we don't modify the original table, because that breaks the
> > call-by-value semantics. When the subset object in R goes out of scope,
> we
> > need to delete the extra database table.
>
> Isn't subset slightly too early to do this?  It would be slightly more
> efficient for subset to return an object that creates the table when
> you first attempt to modify it.
>

The subset table isn't a copy of the subset, it contains the unique key and
an indicator column showing whether the element is in the subset.  I need
this even if the subset is never modified, so that I can join it to the
main table and use it in SQL 'where' conditions to get computations for the
right subset of the data.
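
The kind of query this enables looks like the following (table and column
names invented for illustration; dbGetQuery() is from DBI):

dbGetQuery(conn, "
    SELECT AVG(d.bpxsar)
    FROM data d
    JOIN _sbs_1234 s ON d.row_id = s.row_id
    WHERE s.in_subset = 1
")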

 The whole point of this new sqlsurvey package is that most of the
aggregation operations happen in the database rather than in R, which is
faster for very large data tables.  The use case is things like the
American Community Survey and the Nationwide Emergency Department
Subsample, with millions or tens of millions of records and quite a lot of
variables.  At this scale, loading stuff into memory isn't feasible on
commodity desktops and laptops, and even on computers with enough memory,
the database (MonetDB) is faster.

   -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] stopping finalizers

2013-02-15 Thread Thomas Lumley
On Sat, Feb 16, 2013 at 2:32 PM, Hadley Wickham  wrote:

> > The subset table isn't a copy of the subset, it contains the unique key
> and
> > an indicator column showing whether the element is in the subset.  I need
> > this even if the subset is never modified, so that I can join it to the
> main
> > table and use it in SQL 'where' conditions to get computations for the
> right
> > subset of the data.
>
> Cool - Is that faster than storing a column that just contains the
> include indices?
>

I haven't tried -- I'm writing this so it doesn't modify database tables,
partly for safety and partly to reduce the privileges it needs to run.


> >  The whole point of this new sqlsurvey package is that most of the
> > aggregation operations happen in the database rather than in R, which is
> > faster for very large data tables.  The use case is things like the
> American
> > Community Survey and the Nationwide Emergency Department Subsample, with
> > millions or tens of millions of records and quite a lot of variables.  At
> > this scale, loading stuff into memory isn't feasible on commodity
> desktops
> > and laptops, and even on computers with enough memory, the database
> > (MonetDB) is faster.
>
> Have you done any comparisons of monetdb vs sqlite - I'm interested to
> know how much faster it is. I'm working on a package
> (https://github.com/hadley/dplyr) that compiles R data manipulation
> expressions into (e.g. SQL), and have been wondering if it's worth
> considering a column-store like monetdb.


It's enormously faster than SQLite for databases slightly larger than
physical memory.  I don't have measurements, but I started this project
using SQLite and it just wasn't fast enough to be worthwhile.  My guess is
that for the JOIN and SELECT SUM() ... GROUP BY operations I'm using it's
perhaps ten times faster.

For moderate-sized databases it's competitive with analysis in memory even
if you ignore the data loading time.

For example, using a data set already in memory, with 18000 records and 96
variables:

> system.time(svymean(~BPXSAR+BPXDAR,subset(dhanes,RIAGENDR==2)))

   user  system elapsed

   0.090.010.10

Using MonetDB

> system.time(svymean(~bpxsar+bpxdar,se=TRUE,subset(sqhanes,riagendr==2)))

   user  system elapsed

  0.020   0.001   0.108

   -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] S3 generics in packages: default method

2013-02-28 Thread Thomas Lumley
For a package providing 'virtual' data.frames backed by MonetDB database
tables, we want to make some functions generic, to implement versions that
work on single or multiple database columns.

My approach was

sd <- function(x, na.rm=TRUE,...) UseMethod("sd")

sd.default <- base::sd

I've done that before to make dotchart() generic in the 'survey' package.

With dotchart() (and even sd()) there still doesn't seem to be a problem
(based on the CRAN check results), but with var() we get a note

Found .Internal call in the following function:
  ‘var.default’
with calls to .Internal functions
  ‘cov’

because base::var contains calls to .Internal.  This seems harmless to me
-- the package is only calling .Internal() through base functions in the
usual way.

We could use
    var.default <- function(x, y = NULL, na.rm = TRUE, ...)
        base::var(x, y, na.rm, ...)

but it seems that could have different lazy-evaluation behaviour.
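
One way to keep the missing-argument behaviour is to forward only the
arguments that were actually supplied -- a sketch, matching base::var's
argument list:

var.default <- function(x, y = NULL, na.rm = FALSE, use, ...) {
    if (missing(use))
        base::var(x, y, na.rm)         # leave 'use' missing downstream
    else
        base::var(x, y, na.rm, use)
}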

-thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] nobs() with glm(family="poisson")

2013-02-28 Thread Thomas Lumley
On Thu, Feb 28, 2013 at 11:56 AM, Ravi Varadhan wrote:

> This is getting further away from typical R-devel issues, but let me add
> another perspective:  the `n' in BIC reflects the rate at which the
> information in the log-likelihood grows.
>

But in the derivation of BIC, the log(n) term is kept and various terms of
order 1 are discarded.  What we're arguing about is one of the O(1) terms.
If it makes an important difference then presumably we should also worry
about the other O(1) terms that got discarded.
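
For reference, the Laplace approximation that BIC comes from is

  log p(y | M) = log L(thetahat) - (k/2) log n
                 + (k/2) log(2 pi) - (1/2) log det J(thetahat)
                 + log prior(thetahat) + O(n^{-1/2})

where J is the average per-observation observed information.  BIC keeps
only the first two terms, so the choice of 'n' is competing with prior and
information terms that were already thrown away.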

   -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Registering method for "t.test"

2013-03-14 Thread Thomas Lumley
On Thu, Mar 14, 2013 at 3:31 AM, Lukas Lehnert
wrote:

> Dear R-developers,
>
> in a new package, I'm currently working on, I tried to add a new method to
> function "t.test". The new class, on which the method is based, is called
> "speclib", so I defined the funcion for the new method as "t.test.speclib".
>


You shouldn't have a problem if you tell R it is a method for t.test.  In
the NAMESPACE, use

S3method(t.test, speclib)

rather than exporting the function by name, and in the usage section of the
help page use

\method{t.test}{speclib}
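
Putting the pieces together, a minimal sketch (the body is only a
placeholder):

## in R/:
t.test.speclib <- function(x, y, paired = FALSE, ...) {
    ## method implementation for speclib objects goes here
}

## in NAMESPACE:
S3method(t.test, speclib)

With the S3method() directive the function is registered as a method of
the t.test generic (it need not be exported at all), so the consistency
check compares it against t.test rather than t.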



> R CMD check now always gives a warning that the S3 methods are not
> consistent, because R does not recognize that the base function name is
> "t.test" instead of "t":
>
> t:
>   function(x)
> t.test.speclib:
>   function(x, y, paired, ...)
>
>
> Is there any workaround or do I have to rename the t.test.speclib function
> to
> something like t_test.speclib?
>
> Thank you in advance
>
> Lukas
> [[alternative HTML version deleted]]
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>



-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Why change data type when dropping to one-dimension?

2009-05-29 Thread Thomas Lumley

On Fri, 29 May 2009, Jason Vertrees wrote:


My question is: why does the paradigm of changing the type of a 1D
return value to an unlisted array exist?  This introduces boundary
conditions where none need exist, thus making the coding harder and
confusing.

For example, consider:
 > d = data.frame(a=rnorm(10), b=rnorm(10));
 > typeof(d);            # OK;
 > typeof(d[,1]);        # Unexpected;
 > typeof(d[,1,drop=F]); # Oh, now I see.


It does make it harder for programmers, but it makes it easier for 
non-programmers.  In particular, it is convenient to be able to do d[1,1] to 
extract a number from a matrix, rather than having to explicitly coerce the 
result to stop it being a matrix.

At least the last two times this was discussed, there ended up being a
reasonable level of agreement that if someone's life had to be made harder,
the programmers were better able to cope, and that dropping dimensions was
preferable.
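
For example:

> m <- matrix(1:4, 2, 2)
> m[1, ]                  # dimensions dropped: a plain vector
[1] 1 3
> m[1, , drop = FALSE]    # kept: still a 1 x 2 matrix
     [,1] [,2]
[1,]    1    3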

    -thomas

Thomas Lumley   Assoc. Professor, Biostatistics
tlum...@u.washington.eduUniversity of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Why change data type when dropping to one-dimension?

2009-05-29 Thread Thomas Lumley

On Fri, 29 May 2009, Stavros Macrakis wrote:


This is another example of the general preference of the designers of R for
convenience over consistency.

In my opinion, this is a design flaw even for non-programmers, because I
find that inconsistencies make the system harder to learn.  Yes, the naive
user may stumble over the difference between m[[1,1]] and m[1,1] a few times
before getting it, but once he or she understands the principle, it is
general.


I was on your side of this argument the first time it came up, but ended up 
being convinced the other way.

In contrast to sample(n) or the non-standard evaluation of weights= and subset= 
arguments to modelling functions, or various other conveniences that I think we 
are stuck with despite them being a bad idea, I think dropping dimensions is 
useful.

 -thomas





-s

On Fri, May 29, 2009 at 5:33 PM, Jason Vertrees  wrote:


Hello,

First, let me say I'm an avid fan of R--it's incredibly powerful and I
use it all the time.  I appreciate all the hard work that the many
developers have undergone.

My question is: why does the paradigm of changing the type of a 1D
return value to an unlisted array exist?  This introduces boundary
conditions where none need exist, thus making the coding harder and
confusing.

For example, consider:
> d = data.frame(a=rnorm(10), b=rnorm(10));
> typeof(d);  # OK;
> typeof(d[,1]);  # Unexpected;
> typeof(d[,1,drop=F]);   # Oh, now I see.

This is indeed documented in the R Language specification, but why is it
there in the first place?  It doesn't make sense to the average
programmer to change the return type based on dimension.

Here it is again in 'sapply':
> sapply
> function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
> {
> [...snip...]
>     if (common.len == 1)
>         unlist(answer, recursive = FALSE)
>     else if (common.len > 1)
>         array(unlist(answer, recursive = FALSE), dim = c(common.len,
>             length(X)), dimnames = if (!(is.null(n1 <- names(answer[[1]])) &
>             is.null(n2 <- names(answer))))
>             list(n1, n2))
> [...snip...]
> }

So, in 'sapply', if your return value is one-dimensional be careful,
because the return type will not the be same as if it were otherwise.

Is this legacy or a valid, rational design decision which I'm not yet a
sophisticated enough R coder to enjoy?

Thanks,

-- Jason

--

Jason Vertrees, PhD

Dartmouth College : j...@cs.dartmouth.edu
Boston University : jas...@bu.edu

PyMOLWiki : http://www.pymolwiki.org/

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel




__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



Thomas Lumley   Assoc. Professor, Biostatistics
tlum...@u.washington.eduUniversity of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Wishlist: install.packages to look for the newest version (PR#13852)

2009-07-24 Thread Thomas Lumley


Uwe,

I think Ulrike is making a different suggestion, that install.packages() should 
fetch the binary that has been built for the current version of R.

This would be a bad idea for a different reason -- in general it is not 
possible to be sure that the package works with an older version of R. The R 
version dependency isn't enough for two reasons. The first is that the author 
may well not know that the package fails with an older version of R and so 
would not have listed a dependency. The second is that the binary versions may 
be incompatible even if the source versions are compatible.

You can always download a binary package from CRAN in a browser and use the 
option to install from a local zip file. Or, as Uwe suggests, get a new version 
of R.

What I think might be useful, if it's not too difficult, is a warning from
install.packages() that a newer version of the package you were installing
is available for the current version of R.

 -thomas


On Fri, 24 Jul 2009 lig...@statistik.tu-dortmund.de wrote:


Ulrike,

if you install from source, you always get the most recent version of
the package, provided it does not depend on a newer version of R.

If you want a binary package, you also get the newest version - that is,
the version that was newest at the time we stopped building binaries for
that version of R. We (or rather I, if we only talk about Windows, but it
is similar for all other platforms) cannot build for each R version any
more; in that case we'd have to build as many as 11 binary versions for
Windows just for the R-2.x.y series now. Binary repositories are fixed at
some point (for Windows, once the first patchlevel release of the next R
version is out; e.g. at the time of the R-2.9.1 release the binary builds
for R-2.8.x had been stopped).

So please upgrade your version of R or compile yourself from sources for
the R version you need the particular package for.

Best wishes,
Uwe Ligges





groemp...@bht-berlin.de wrote:

Full_Name: Ulrike Groemping
Version: 2.9.0 (and older)
OS: Windows
Submission from: (NULL) (84.190.173.190)


When using an older version of R, packages are not found although they are
available for newer versions of R and do work when installed with the old
version. For example, installing DoE.base on R 2.8.1 installs version 0.2, while
CRAN is at version 0.4-1 currently. It would be nice if the install process
would per default look for the newest version of the package and install this
one if its R-version request allows this. (I recently found a help list entry by
Uwe Ligges that explains how to manually install from a repository for a newer
CRAN version, but I did not bookmark it and cannot find it any more. The
documentation does not enlighten me there.)

Regards, Ulrike

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



Thomas Lumley   Assoc. Professor, Biostatistics
tlum...@u.washington.eduUniversity of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Basic Question regarding PROTECT

2009-08-24 Thread Thomas Lumley
On Mon, 24 Aug 2009, Saptarshi Guha wrote:

> Thank you. So the reason I wouldn't need to protect y had I returned to
> R is that, had I done something like
> 
> h <- .Call("boo", a)
> where "boo" contains y = foo()
> 
> the assignment "<-" to h would have a PROTECT somewhere, i.e. R's
> assignment is doing the protection for me.
> Had I not returned to R, I would have to do it myself.
> 
> Yes, PROTECT is quite cheap; I had thought it to be costly, but 1MM
> PROTECT/UNPROTECT calls take <1 second (on a MacBook 2.16GHz)

That's not where the cost would be.  PROTECT/UNPROTECT calls themselves are
very cheap, since they just push and pop pointers from a stack.  Any real
impact of different strategies in using PROTECT would be seen in the
garbage collector.

 -thomas

Thomas Lumley   Assoc. Professor, Biostatistics
tlum...@u.washington.eduUniversity of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] optional package dependency

2010-01-15 Thread Thomas Lumley

On Fri, 15 Jan 2010, Seth Falcon wrote:



There is a real need (of some kind) here.  Not all packages work on all
platforms.  For example, the multicore package provides a mechanism for
running parallel computations on a multi-cpu box, but it is not
available on Windows.  A package that _is_ available on all platforms
should be able to optionally make use of multicore on non-Windows.  I
don't think there is a way to do that now and pass check without
resorting to "tricks" as above.  These tricks are bad as they make it
harder to programmatically determine the true "suggests".

And NAMESPACE brings up another issue in that being able to do
conditional imports would be very useful for these cases, otherwise you
simply can't make proper use of name spaces for any optional functionality.

I'm willing to help work on and test a solution if we can arrive at some
consensus as to what the solution looks like.



Seth,

In the case of multicore it seems to work to put it in 'Suggests' and to use 
require() to load it. That's what I did with the survey package, and it didn't 
cause problems on CRAN.  I didn't run CMD check on Windows myself, only on Mac 
and Linux.
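
The pattern is just this (a sketch -- mclapply() is multicore's parallel
lapply; the task list and worker function are made up):

if (require(multicore)) {
    results <- mclapply(tasks, doOneTask)   # parallel, Mac/Linux only
} else {
    results <- lapply(tasks, doOneTask)     # serial fallback elsewhere
}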

A more difficult issue is providing methods for a generic in another package 
that might not be available.  I wanted to provide methods on survey objects for 
generics in odfWeave, and I couldn't find out how to do that without making it 
required.   I ended up creating a new odfWeave.survey package that depends on 
odfWeave and survey, but this seems like the sort of thing that should be able 
to be done with Enhances or Suggests.

-thomas


Thomas Lumley   Assoc. Professor, Biostatistics
tlum...@u.washington.eduUniversity of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Copyright versus Licenses

2010-01-19 Thread Thomas Lumley

On Mon, 18 Jan 2010, Bryan McLellan wrote:


My company recently started using an R library from CRAN that is
licensed under the LGPL version 2 or greater per the DESCRIPTION file,
but contains no copy of the LGPL notice, or any copyright notice. I've
grown accustomed to paying attention to copyright and licensing as a
Debian package maintainer, and sent the author of the package an email
expressing my concern. The author believed that assigning themselves
copyright was incompatible with licensing the library under the terms
of the LGPL. I disagree, and further contend that without copyright
notice, the [copyright] license loses a certain degree of
applicability, as it becomes inconclusive as to who is licensing the
software under the terms of the LGPL. Not knowing who I was, the
library author asked me to start a discussion of the subject on this
list, presumably so they could see the opinions of others that they
trust.


Like Simon, I'm speaking only for myself here.

Not having an explicit copyright statement certainly makes it harder for people 
downstream who have questions, so explicit statements seem like a good thing.


The LGPL itself [1], in the final section entitled "How to Apply These
Terms to Your New Libraries", the license provides a template for
adding to the top of each source code file that contains a copyright
line, a general notice regarding the lack of warranty, and information
on where to obtain a full copy of the license. The GPL HOWTO [2]
expresses similar instructions for the inclusion of a copyright line
with the license. I know that R distributes copies of common licenses
under 'share/licenses' in the R source. Debian does as well in
'/usr/share/common-licenses/' for the purpose of not having to include
the full LICENSE and/or COPYING file with every package that uses a
common open source license, allowing the use of verbage such as "The
Debian packaging is © 2010 [author] and is licensed under the Apache
License version 2.0. On debian and derived systems, see
`/usr/share/common-licenses/Apache-2.0' for the complete text." The R
manual for writing extensions suggests a similar approach to avoiding
duplication in Section 1.1 [3].

The R manual for writing extensions also mentions [4] in Section 1.1.1
the optional use of a Copyright field in the DESCRIPTION file,
separate from the License field. As this section implies that the
DESCRIPTION file format is based on the debian control file format, I
assume the goal is to keep these lines simple, generally under 80
characters due to average terminal width. As such, I don't assume this
field is recommended for complete copyright information for a library
with multiple contributors. The aforementioned article does not
specify where a developer should alternately put copyright
information, perhaps assuming one would add it to each source code
file as is recommended by GNU.


Yes, in the source code is good.

In addition, if you look at the source code for R, the approach there has
been to have a file COPYRIGHTS that lists the original copyright holders of
bits of the code.  This seems like a good idea if there are multiple
sources.  It's especially useful if there are areas of doubt and
uncertainty (e.g. code published on ftp sites before people worried about
licenses), since it at least gives you rigidly defined areas of doubt and
uncertainty [5].



In closing, do the R developers believe that including a Copyright
notice is imperative with a Copyright License?

If so, what advice do they have for those writing and contributing
open source R libraries as to where this notice should go?

Should that information perhaps be added to the R manual for extensions?

Bryan McLellan

[1] http://www.gnu.org/licenses/lgpl-2.1.txt
[2] http://www.gnu.org/licenses/gpl-howto.html
[3] http://cran.r-project.org/doc/manuals/R-exts.html#Package-structure
[4] http://cran.r-project.org/doc/manuals/R-exts.html#The-DESCRIPTION-file

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
tlum...@u.washington.eduUniversity of Washington, Seattle



[5] Adams D, (1978) Hitchhiker's Guide. BBC Radio.
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] AD: advanced R course - June 28-30, Seattle.

2010-01-26 Thread Thomas Lumley


The Summer Institute in Statistical Genetics at the University of Washington 
will include a 2.5 day course in Advanced R Programming, on June 28-30. 

The course will cover object-oriented programming, SQL and netCDF data access, 
calling compiled code, creating R packages, and using some of the software 
infrastructure from the Bioconductor project.

The instructors will be Thomas Lumley and Ken Rice. 

Further details on this and the other 21 modules at the Summer Institute are 
available from http://sisg.biostat.washington.edu/.  I will just note that
this is the University of Washington in Seattle (not DC, or St Louis), and
that June 28-30 is at the beginning of Seattle's non-rainy season.

 -thomas

Thomas Lumley   Assoc. Professor, Biostatistics
tlum...@u.washington.eduUniversity of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Package Directory Hierarchy: Recursive inclusion of *.R possible?

2010-02-04 Thread Thomas Lumley

On Thu, 4 Feb 2010, Barry Rowlingson wrote:


On Wed, Feb 3, 2010 at 10:40 AM, Johannes Graumann
 wrote:

Hello,

I would like to organize the "R" directory in my home-grown package into
sub-directories, but "R CMD --build" doesn't seem to find *.R files below
the actual source directory. Is there any way around that?




Not really.  You could always manage your files in a separate directory tree 
and then use a Makefile to put them into the package format.

    -thomas

Thomas Lumley   Assoc. Professor, Biostatistics
tlum...@u.washington.eduUniversity of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Why is there no c.factor?

2010-02-04 Thread Thomas Lumley

On Thu, 4 Feb 2010, Hadley Wickham wrote:


Hi all,

Is there are reason that there is no c.factor method?  Analogous to
c.Date, I'd expect something like the following to be useful:

c.factor <- function(...) {
 factors <- list(...)
 levels <- unique(unlist(lapply(factors, levels)))
 char <- unlist(lapply(factors, as.character))

 factor(char, levels = levels)
}

c(factor("a"), factor("b"), factor(c("c", "b","a")), factor("d"))
# [1] a b c b a d
# Levels: a b c d



It's well established that different people have different views on what 
factors should do, but this doesn't match mine.   I think of factors as 
enumerated data types where the factor levels already specify all the valid 
values for the factor, so I wouldn't want to be able to combine two factors 
with different sets of levels.

For example:
  A <- factor("orange",levels=c("orange","yellow","red","purple"))
  B <- factor("orange", levels=c("orange","apple","mango", "banananana"))

On the other hand, I think the current behaviour, which reduces them to 
numbers, is just wrong.
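
A version matching the enumerated-type view would insist on identical
level sets -- a sketch:

c.factor <- function(...) {
    factors <- list(...)
    lev <- levels(factors[[1]])
    same <- vapply(factors, function(f) identical(levels(f), lev), logical(1))
    if (!all(same))
        stop("cannot combine factors with different level sets")
    factor(unlist(lapply(factors, as.character)), levels = lev)
}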


  -thomas

Thomas Lumley   Assoc. Professor, Biostatistics
tlum...@u.washington.eduUniversity of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Is it valid to do x == Inf?

2010-04-01 Thread Thomas Lumley

On Thu, 1 Apr 2010, Henrik Bengtsson wrote:


Hi,

I found in a bit of code the following test for infinity:

 if (x == Inf) ...

Is that valid


Yes, if you don't want to also include -Inf


, or should it be (as I always thought):

 if (is.infinite(x)) ...?


If you don't want to distinguish Inf and -Inf


Does it depend on whether 'x' is float or integer?


There isn't an integer infinity.  Integer values larger than the maximum
representable value give NA, e.g.
> .Machine$integer.max + 1L
[1] NA


My question is related to testing for missing values where is.na(x) is required.


NA is different, because NA by its nature can't compare equal to anything: x==NA asks: "Is x 
equal to some number I don't know?", to which the answer is "Don't know".

x==Inf asks "Is x positive infinite?", which is a perfectly well-defined 
question.
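
For example:

> x <- c(1, Inf, -Inf, NA)
> x == Inf
[1] FALSE  TRUE FALSE    NA
> is.infinite(x)
[1] FALSE  TRUE  TRUE FALSE
> is.na(x)
[1] FALSE FALSE FALSE  TRUE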

-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
tlum...@u.washington.eduUniversity of Washington, Seattle

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

