[Rd] Rmpi on Fedora 8

2008-08-01 Thread Prof Brian Ripley
A yum update to lam 7.1.4 (from 7.1.2) broke Rmpi for me this last week 
and quite a few changes were needed to repair this, so I'm reporting here 
in case it helps others.  This was an x86_64 system - adjust 'lib64' 
suitably for 32-bit systems.


There seem to me to be major organizational changes for a 'patchlevel' 
update to a setup that previously worked out of the box.  They almost 
certainly apply to Fedora 9 too.


- yum left some lam 7.1.2 RPMs behind, and I have been unable to remove
  them via yum.  This causes some confusion.

- The lam libs are in /usr/lib64/lam/lib, and ldconfig needs to be told
  about this, so

  cat > /etc/ld.so.conf.d/lam.ld.conf
  /usr/lib64/lam/lib
  ^D
  /sbin/ldconfig

  (AFAIR, the previous version was in /usr/lib64/lam, and installed an
  ld.so.conf.d file.  Make sure /usr/lib64/lam is not in the ldconfig
  path.)

- At this point Rmpi may load and then immediately terminate R because the
  lam helpfile is not found (which is not nice of the lam libs).  You may
  need to export LAMHOME=/usr/lib64/lam (see also the R-side sketch after
  this list).  Even if the helpfile is found, lam still terminates R if
  lamd is not running.  (As I recall, previous RPM installations ran
  lamboot at system boot.)

- The final step is to start a lam configuration.  I was only able to do
  this by setting -prefix, e.g.

   /usr/lib64/lam/bin/lamboot -prefix /usr/lib64/lam
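
  For completeness, here is a hedged R-side sketch of the LAMHOME and lamd
  points above (untested on this exact setup; the shell-level export and
  lamboot remain the safer route):

   ## set LAMHOME before Rmpi (and hence the lam libs) gets loaded
   Sys.setenv(LAMHOME = "/usr/lib64/lam")
   library(Rmpi)
   mpi.universe.size()   # size of the booted LAM universe, if lamd is up
   mpi.quit()            # note: this detaches Rmpi and quits R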

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] importing explicitly declared missing values in read.spss (foreign)

2008-08-01 Thread Jeroen Ooms

There is a problem when importing an SPSS file containing explicitly
declared missing values into R using the read.spss function from the
foreign package.  I'm not sure these problems are the same in every
version of SPSS; I am using the latest version, 16.0.2.

I included missingdata.sav
(http://www.nabble.com/file/p18776776/missingdata.sav) and frequencies.jpg
(http://www.nabble.com/file/p18776776/frequencies.jpg) as an example.  The
data contain three types of missing data: two are explicitly declared as
missing values ('8' = NA and '9' = NAP); the third type is the system
missings.  When this file is imported into R, only the system missings are
recognized as missing values; the others are imported as levels in the
nominal case, and as (labeled) real values 8 and 9 in the continuous case.
There are also no attributes in the object returned by read.spss that
record which values/levels are the missing ones; their missingness seems
to be completely ignored by the function.

Is there some way, or another function, to import SPSS files with an
option that replaces all missing values with NA's in R?  Of course this
comes with the trade-off of losing the meaning of the missingness when
there are multiple types of missingness, but I think this is far less
harmful than treating all missing values as normal values.  (A recoding
workaround is sketched after the output below.)

[code]
> mydata <- read.spss("c:/users/jeroen/desktop/missingdata.sav", to.data.frame=T)
Warning messages:
1: In read.spss("c:/users/jeroen/desktop/missingdata.sav", to.data.frame = T) :
  c:/users/jeroen/desktop/missingdata.sav: File-indicated character
  representation code (1252) looks like a Windows codepage
2: In read.spss("c:/users/jeroen/desktop/missingdata.sav", to.data.frame = T) :
  c:/users/jeroen/desktop/missingdata.sav: Unrecognized record type 7,
  subtype 16 encountered in system file
3: In read.spss("c:/users/jeroen/desktop/missingdata.sav", to.data.frame = T) :
  c:/users/jeroen/desktop/missingdata.sav: Unrecognized record type 7,
  subtype 20 encountered in system file

> mydata
   SUBJECT CATEGORI CONTINUO
1        1      yes     3.11
2        2      yes     2.10
3        3      yes     5.34
4        4      yes     1.54
5        5      yes     3.89
6        6       no     2.98
7        7       no     4.53
8        8       no     1.98
9        9       no     3.68
10      10       no     2.94
11      11       NA     8.00
12      12       NA     8.00
13      13       NA     8.00
14      14       NA     8.00
15      15       NA     8.00
16      16      NAP     9.00
17      17      NAP     9.00
18      18      NAP     9.00
19      19      NAP     9.00
20      20      NAP     9.00
21      21     <NA>       NA
22      22     <NA>       NA
23      23     <NA>       NA
24      24     <NA>       NA
25      25     <NA>       NA

> is.na(mydata$CONTINUO)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
[25]  TRUE

> is.na(mydata$CATEGORI)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
[25]  TRUE

> summary(mydata)
    SUBJECT    CATEGORI     CONTINUO    
 Min.   : 1   yes :5     Min.   :1.540  
 1st Qu.: 7   no  :5     1st Qu.:3.078  
 Median :13   NA  :5     Median :6.670  
 Mean   :13   NAP :5     Mean   :5.854  
 3rd Qu.:19   NA's:5     3rd Qu.:8.250  
 Max.   :25              Max.   :9.000  
                         NA's   :5.000  
[/code]
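
For what it's worth, here is a minimal recoding sketch for the workaround
asked about above, assuming the declared missing codes are known in advance
(the values 8/9 and the 'NA'/'NAP' labels are specific to this file):

[code]
## turn the explicitly declared missing values into real NAs after import
mydata$CONTINUO[mydata$CONTINUO %in% c(8, 9)] <- NA
mydata$CATEGORI[mydata$CATEGORI %in% c("NA", "NAP")] <- NA
mydata$CATEGORI <- factor(mydata$CATEGORI)  # drop the now-unused levels
[/code]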





[Rd] 4-int indexing limit of R {Re: [R] allocMatrix limits}

2008-08-01 Thread Martin Maechler
[[Topic diverted from R-help]]

> "VK" == Vadim Kutsyy <[EMAIL PROTECTED]>
> on Fri, 01 Aug 2008 07:35:01 -0700 writes:

VK> Martin Maechler wrote:
>> 
  VK> The problem is in array.c, where allocMatrix checks for
  VK> "if ((double)nrow * (double)ncol > INT_MAX)".  But why is
  VK> int used for indexing, and not long int? (max int is
  VK> 2147483647; max long int is 9223372036854775807)

>> Well, Brian gave you all info:
>>  (  ?Memory-limits )

VK> exactly, and given that most modern systems used for
VK> computations (i.e. 64-bit systems) have a long int which is
VK> much larger than int, I am wondering why long int is not
VK> used for indexing (I don't think that 4-byte vs 8-byte
VK> storage is an issue).

Well, fortunately, reasonable compilers have indeed kept 
'long' == 'long int'  to mean 32-bit integers
((less reasonable compiler writers have not, AFAIK: which leads
  of course to code that no longer compiles correctly when
  originally it did))
But of course you are right that  64-bit integers
(typically == 'long long', and really == 'int64') are very
natural on 64-bit architectures.
But see below.


>> Did you really carefully read ?Memory-limits ??

VK> Yes, it specifies that a 4-byte int is used for indexing
VK> in all versions of R, but why? I think 2147483647
VK> elements for a single vector is OK, but not as the total
VK> number of elements for a matrix.  I am running out of
VK> indexing at a mere 10% memory consumption.

If you have too large a numeric matrix, it would be larger than
2^31 * 8 bytes = 2^34 bytes ~= 16'000 Megabytes.
If that is only 10% for you, you'd have around 160 GB of
RAM.  That's quite impressive.
I agree that it is at least in the "ball park" of what is
available today.
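
The arithmetic, spelled out in R:

   2^31 * 8 / 2^20   # size of a 2^31-element numeric matrix in Megabytes: 16384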


[...]

VK> PS: I have no problem going in and modifying the C code, but
VK> I am just wondering what the reasons are for having such a
VK> limitation.

Compatibility, for one:

Note that R objects are (pointers to) C structs that are
"well-defined" platform-independently, and I'd say that this
should remain so.
Consequently 64-bit ints (or another "longer int") would have to be
there "in R" on 32-bit platforms too.  That may well be feasible,
but it would double the size of quite a few objects.

I think what you are implicitly proposing is that
we'd want a 64-bit integer as an R-level type, and that R
would use it (and/or coerce to it from 'int32') for indexing
everywhere.
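
The current ceiling is easy to see at R level (a quick illustration; the
exact warning text may vary across versions):

   .Machine$integer.max                         # 2147483647
   65536L * 65536L                              # NA, integer overflow
   as.double(65536L)^2 > .Machine$integer.max   # TRUE -- which is why
                                                # allocMatrix() does its
                                                # size check in double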

But more importantly, all (or very much) of the currently
existing C and Fortran code (called via .Call(), .C(), .Fortran())
would also have to be able to deal with the "longer ints".

One of the last times this topic came up (within R-core),
we found that for all the matrix/vector operations,
we would really need versions of BLAS / LAPACK that also
work with these "big" matrices, i.e. such a BLAS/LAPACK would
also have to internally use "longer int" for indexing.
At that point in time, we had decided we would at least wait to
hear about the development of such BLAS/LAPACK libraries.

Interested to hear other opinions / get more info on this topic.
I do agree that it would be nice to get over this limit within a
few years.

Martin



Re: [Rd] 4-int indexing limit of R {Re: [R] allocMatrix limits}

2008-08-01 Thread Vadim Kutsyy

Martin Maechler wrote:

[[Topic diverted from R-help]]

Well, fortunately, reasonable compilers have indeed kept 
'long' == 'long int'  to mean 32-bit integers

((less reasonable compiler writers have not, AFAIK: which leads
  of course to code that no longer compiles correctly when
  originally it did))
But of course you are right that  64-bit integers
(typically == 'long long', and really == 'int64') are very
natural on 64-bit architectures.
But see below.
  

Well, in 64-bit Ubuntu, /usr/include/limits.h defines:

/* Minimum and maximum values a `signed long int' can hold.  */
#  if __WORDSIZE == 64
#   define LONG_MAX 9223372036854775807L
#  else
#   define LONG_MAX 2147483647L
#  endif
#  define LONG_MIN  (-LONG_MAX - 1L)

and using simple test code
(http://home.att.net/~jackklein/c/inttypes.html#int), my desktop, which
is a standard Intel machine, does show:


Signed long min: -9223372036854775808 max: 9223372036854775807


If you have too large a numeric matrix, it would be larger than
2^31 * 8 bytes = 2^34 bytes ~= 16'000 Megabytes.
If that is only 10% for you, you'd have around 160 GB of
RAM.  That's quite impressive.
  

>  cat /proc/meminfo | grep MemTotal
MemTotal: 145169248 kB
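
(In R:  145169248 / 2^20  ~= 138 GB, so in the same ballpark as the
160 GB estimated above.)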

We have a "smaller" SGI NUMAflex to play with, where the memory can be
increased to 512 GB (the "larger" version doesn't have this "limitation").
But even with commodity hardware you can easily get 128 GB for a
reasonable price (e.g. a Dell PowerEdge R900).



Note that R objects are (pointers to) C structs that are
"well-defined" platform-independently, and I'd say that this
should remain so.

  
I forgot that R stores a two-dimensional array in a single-dimensional C
array.  Now I understand why there is a limitation on the total number of
elements.  But this is a big limitation.

One of the last times this topic came up (within R-core),
we found that for all the matrix/vector operations,
we would really need versions of BLAS / LAPACK that also
work with these "big" matrices, i.e. such a BLAS/LAPACK would
also have to internally use "longer int" for indexing.
At that point in time, we had decided we would at least wait to
hear about the development of such BLAS/LAPACK libraries.

BLAS supports a two-dimensional matrix definition, so if we stored the
matrix as a two-dimensional object, we would be fine.  But then all R code
as well as all packages would have to be modified.




Re: [Rd] 4-int indexing limit of R {Re: [R] allocMatrix limits}

2008-08-01 Thread Martin Maechler
> "VK" == Vadim Kutsyy <[EMAIL PROTECTED]>
> on Fri, 01 Aug 2008 10:22:43 -0700 writes:

VK> Martin Maechler wrote:
>> [[Topic diverted from R-help]]
>> 
>> Well, fortunately, reasonable compilers have indeed kept
>> 'long' == 'long int' to mean 32-bit integers ((less
>> reasonable compiler writers have not, AFAIK: which leads
>> of course to code that no longer compiles correctly when
>> originally it did)) But of course you are right that
>> 64-bit integers (typically == 'long long', and really ==
>> 'int64') are very natural on 64-bit architectures.  But
>> see below.

... I wrote complete rubbish, 
and I am embarrassed ...

>> 
VK> Well, in 64-bit Ubuntu, /usr/include/limits.h defines:

VK> /* Minimum and maximum values a `signed long int' can hold.  */
VK> #  if __WORDSIZE == 64
VK> #   define LONG_MAX 9223372036854775807L
VK> #  else
VK> #   define LONG_MAX 2147483647L
VK> #  endif
VK> #  define LONG_MIN  (-LONG_MAX - 1L)

VK> and using simple test code
VK> (http://home.att.net/~jackklein/c/inttypes.html#int), my desktop, which
VK> is a standard Intel machine, does show:

VK> Signed long min: -9223372036854775808 max: 9223372036854775807

yes.  I am really embarrassed.

What I was trying to say was that
the definition of int / long / ... should not change when going
from a 32-bit architecture to a 64-bit one,
and that the R internal structures consequently should also be
the same on 32-bit and 64-bit platforms.

>> If you have too large a numeric matrix, it would be larger than
>> 2^31 * 8 bytes = 2^34 bytes ~= 16'000 Megabytes.
>> If that is only 10% for you, you'd have around 160 GB of
>> RAM.  That's quite impressive.
>> 
>> cat /proc/meminfo | grep MemTotal
VK> MemTotal: 145169248 kB

VK> We have a "smaller" SGI NUMAflex to play with, where the memory can be
VK> increased to 512 GB (the "larger" version doesn't have this
VK> "limitation").  But even with commodity hardware you can easily get
VK> 128 GB for a reasonable price (e.g. a Dell PowerEdge R900)

>> Note that R objects are (pointers to) C structs that are
>> "well-defined" platform-independently, and I'd say that this
>> should remain so.

>> 
VK> I forgot that R stores a two-dimensional array in a single-dimensional
VK> C array.  Now I understand why there is a limitation on the total
VK> number of elements.  But this is a big limitation.

Yes, maybe

>> One of the last times this topic came up (within R-core),
>> we found that for all the matrix/vector operations,
>> we would really need versions of BLAS / LAPACK that also
>> work with these "big" matrices, i.e. such a BLAS/LAPACK would
>> also have to internally use "longer int" for indexing.
>> At that point in time, we had decided we would at least wait to
>> hear about the development of such BLAS/LAPACK libraries

VK> BLAS supports a two-dimensional matrix definition, so if we stored
VK> the matrix as a two-dimensional object, we would be fine.  But then
VK> all R code as well as all packages would have to be modified.

exactly.  And that was what I meant when I said "Compatibility".

But rather than changing the
 "matrix = columnwise stored as long vector" paradigm, we should
rather change from 32-bit indexing to a longer one.

The hope is that we can eventually come up with a scheme
which would basically allow one to just recompile all packages:

In src/include/Rinternals.h,
we have had the following three lines for several years now:

/* type for length of vectors etc */
typedef int R_len_t; /* will be long later, LONG64 or ssize_t on Win64 */
#define R_LEN_T_MAX INT_MAX


and you are right that it may be time to experiment a bit more
with replacing 'int' there with long (and also adjusting the
corresponding _MAX setting),
and indeed, the array.c code you cited should replace
INT_MAX by R_LEN_T_MAX.

This still does not solve the problem that we'd have to get to
a BLAS / LAPACK version that correctly works with "long indices"...
which may (or may not) be easier than I had thought.

Martin



Re: [Rd] data.matrix (was sapply(Date, is.numeric))

2008-08-01 Thread Prof Brian Ripley
I've committed a more liberal version to R-devel.  (It even handles S4 
classes with an as() method.)
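
A small illustration with a made-up data frame (a sketch of the intended
behaviour; details of the committed code may differ):

  d <- data.frame(n = 1:2, f = factor(c("a", "b")), dt = Sys.Date() + 0:1)
  data.matrix(d)
  ## the factor column becomes its internal integer codes, and the Date
  ## column its underlying numeric value (days since 1970-01-01)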


On Thu, 31 Jul 2008, Martin Maechler wrote:


"PBR" == Prof Brian Ripley <[EMAIL PROTECTED]>
on Thu, 31 Jul 2008 08:36:22 +0100 (BST) writes:


   PBR> I've now committed fixes in R-patched and R-devel.
   PBR> There is one consequence: data.matrix() was testing for numeric
   PBR> columns by unlist(lapply(x, is.numeric)) and so incorrectly
   PBR> treating Date and POSIXct columns as numeric (which we had decided
   PBR> they were not).  This affects package gvlma.

   PBR> data.matrix() is now working as documented, but as we have an
   PBR> exception for factors, do we also want exceptions for Date and
   PBR> POSIXct?

Yes, that's a good idea, and much in the spirit of data.matrix()
as I have understood it.

Note the following from help(data.matrix),
where I think the 'Title' and 'Description' are more liberal
(rightly so) than the 'Details':

>> Convert a Data Frame to a Numeric Matrix
>>
>> Description:
>>
>>  Return the matrix obtained by converting all the variables in a
>>  data frame to numeric mode and then binding them together as the
>>  columns of a matrix.  Factors and ordered factors are replaced by
>>  their internal codes.

[...]

>> Details:
>>
>>  Supplying a data frame with columns which are not numeric, factor
>>  or logical is an error.  A warning is given if any non-factor
>>  column has a class, as then information can be lost.


Do we really have good reasons to give an error if a column is
not numeric (nor of the "exception class")?

Couldn't we just  lapply(., as.numeric)
and, if that doesn't give errors,
just "be happy"?
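
Something like this hypothetical one-liner (just a sketch of the idea,
not the actual implementation):

   data_matrix_liberal <- function(x) sapply(x, as.numeric)
   ## factors become their codes and Dates their day counts; columns
   ## that cannot be coerced give NAs with a warning rather than an error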

Martin


   PBR> On Wed, 30 Jul 2008, Martin Maechler wrote:

   >>> "BDR" == Prof Brian Ripley <[EMAIL PROTECTED]>
   >>> on Wed, 30 Jul 2008 13:29:38 +0100 (BST) writes:
   >>
   BDR> On Wed, 30 Jul 2008, Martin Maechler wrote:
   >> >>> "RobMcG" == McGehee, Robert <[EMAIL PROTECTED]>
   >> >>> on Tue, 29 Jul 2008 15:40:37 -0400 writes:
   >> >>
   RobMcG> FYI,
   RobMcG> I've tried posting the below message twice to the bug tracking system,
   >> >>
   >> >> [... r-bugs problems discussed in a separate thread ]
   >> >>
   >> >>
   >> >>
   RobMcG> R-developers,
   RobMcG> The results below are inconsistent. From the documentation for
   RobMcG> is.numeric, I expect FALSE in both cases.
   >> >>
   >> >> >> x <- data.frame(dt=Sys.Date())
   >> >> >> is.numeric(x$dt)
   RobMcG> [1] FALSE
   >> >> >> sapply(x, is.numeric)
   RobMcG> dt
   RobMcG> TRUE
   >> >>
   RobMcG> ## Yet, sapply seems aware of the Date class
   >> >> >> sapply(x, class)
   RobMcG> dt
   RobMcG> "Date"
   >> >>
   >> >> Yes, thanks a lot, Robert, for the report.
   >> >>
   >> >> That *is* a bug somewhere in the .Internal(lapply(...)) C code,
   >> >> when S3 dispatch of primitive functions should happen.
   >>
   BDR> The bug is in do_is, which uses CHAR(PRINTNAME(CAR(call))), and when
   BDR> called from lapply that gives "FUN" not "is.numeric".  The root cause is
   BDR> the following comment
   >>
   BDR> FUN = CADR(args);  /* must be unevaluated for use in e.g. bquote */
   >>
   BDR> and hence that the function in the *call* passed to do_is can be
   BDR> unevaluated.
   >>
   >> aah!  I see.
   >>
   >> >> Here's an R scriptlet exposing a 2nd example
   >> >>
   >> >> ### lapply(list, FUN)
   >> >> ### -- seems to sometimes fail for
   >> >> ### .Primitive S3-generic functions
   >> >>
   >> >> (ds <- seq(from=Sys.Date(), by=1, length=4))
   >> >> ##[1] "2008-07-30" "2008-07-31" "2008-08-01" "2008-08-02"
   >> >> ll <- list(d=ds)
   >> >> lapply(list(d=ds), round)
   >> >> ## -> Error in lapply(list(d = ds), round) : dispatch error
   >>
   >>
   BDR> And that's a separate issue, in DispatchGroup, which states that
   BDR> arguments have been evaluated (true) but the 'call' from lapply
   BDR> gives the unevaluated arguments and so there is a mismatch.
   >>
   >> yes, I too found that this was a separate issue, the latter
   >> one being new since version 2.7.0
   >>
   BDR> I'm testing fixes for both.
   >>
   >> Excellent!
   >> Martin
   >>
   >>
   >> >> ## or -- related to bug report by Robert McGehee on R-devel, on 2008-07-29:
   >> >> sapply(list(d=ds), is.numeric)
   >> >> ## TRUE
   >> >>
   >> >> ## in spite of
   >> >> is.numeric(`[[`(ll,1)) ## FALSE, because of
   >> >> is.numeric.Date
   >> >>
   >> >> ## or
   >> >> round(`[[`(ll,1))
   >> >> ## [1] "2008-07-30" "2008-07-31" "2008-08-01" "2008-08-02"
   >> >>
   >> >> ##-
   >> >>
   >> >> But I'm currently too much tied up with other duties,
   >> >> to find and test bug-fix.
   >> >>
   >> >> Martin Maechler, ETH Zurich and R-Core Team
   >>
   >>

   PBR> --