[Rd] Recent and upcoming changes to R-devel

2011-07-04 Thread Prof Brian Ripley
There was an R-core meeting the week before last, and various planned 
changes will appear in R-devel over the next few weeks.


These are changes planned for R 2.14.0 scheduled for Oct 31.  As we 
are sick of people referring to R-devel as '2.14' or '2.14.0', that 
version number will not be used until we reach 2.14.0 alpha.  You will 
be able to have a package depend on an svn version number when 
referring to R-devel rather than using R (>= 2.14.0).
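
As a rough illustration of what a revision-based check amounts to, the running R records its svn revision in R.version. The sketch below compares that value with a threshold; min_rev is a made-up placeholder, not a real cut-off from the message above.

```r
# Minimal sketch: compare the svn revision of the running R with a threshold.
# 'min_rev' is a hypothetical value chosen only for illustration.
min_rev <- 56000L

# R.version[["svn rev"]] is a character string; it may not be numeric for
# builds not made from an svn checkout, hence the suppressWarnings().
this_rev <- suppressWarnings(as.integer(R.version[["svn rev"]]))

if (is.na(this_rev)) {
  message("svn revision not recorded for this build")
} else if (this_rev >= min_rev) {
  message("revision r", this_rev, " is at least r", min_rev)
} else {
  message("revision r", this_rev, " is older than r", min_rev)
}
```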


All packages are installed with lazy-loading (there were 72 CRAN 
packages and 8 BioC packages which opted out).  This means that the 
code is always parsed at install time which inter alia simplifies the 
descriptions.  R 2.13.1 RC warns on installation about packages which 
ask not to be lazy-loaded, and R-devel ignores such requests (with a 
warning).


In the near future all packages will have a name space.  If the 
sources do not contain one, a default NAMESPACE file will be added. 
This again will simplify the descriptions and also a lot of internal 
code.  Maintainers of packages without name spaces (currently 42% of 
CRAN) are encouraged to add one themselves.
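
For maintainers who want to add a name space themselves, a minimal sketch follows. The message does not say what the generated default will contain, so the export pattern below is only one common choice, and pkg_dir and the imported package are assumptions made for the example.

```r
# Minimal sketch: create a NAMESPACE file for a package source tree.
# 'pkg_dir' is a hypothetical source directory (a temporary one here).
pkg_dir <- file.path(tempdir(), "mypkg")
dir.create(pkg_dir, showWarnings = FALSE, recursive = TRUE)

# Export every object whose name starts with a letter; names beginning with
# a dot stay internal. import() lists packages whose exports are needed.
ns_lines <- c(
  'exportPattern("^[[:alpha:]]+")',
  'import(stats)'
)
ns_file <- file.path(pkg_dir, "NAMESPACE")
writeLines(ns_lines, ns_file)

# NAMESPACE directives use R syntax, so parse() is a quick sanity check.
directives <- parse(ns_file)
print(directives)
```

Listing exports explicitly with export() gives tighter control than a pattern, at the cost of maintaining the list by hand.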


R-devel is installed with the base and recommended packages 
byte-compiled (the equivalent of 'make bytecode' in R 2.13.x, but done 
less inefficiently).  There is a new option

    R CMD INSTALL --byte-compile

to byte-compile contributed packages, but that remains optional. 
Byte-compilation is quite expensive (so you definitely want to do it 
at install time, which requires lazy-loading), and relatively few 
packages benefit appreciably from byte-compilation.  A larger number 
of packages benefit from byte-compilation of R itself: for example AER 
runs its checks 10% faster.  The byte-compiler technology is thanks to 
Luke Tierney.
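
To get a feel for what byte-compilation does to a single function, the compiler package can be used directly. This is a minimal sketch with a toy function (sum_squares is hypothetical); whether any speed difference shows up depends on the machine and on the R version.

```r
library(compiler)  # ships with R since 2.13.0

# A deliberately loop-heavy toy function (hypothetical example).
sum_squares <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i * i
  total
}

# cmpfun() returns a byte-compiled version of the function.
sum_squares_bc <- cmpfun(sum_squares)

# Both versions must give the same answer; only the speed may differ.
stopifnot(identical(sum_squares(1000), sum_squares_bc(1000)))

# Timings vary by machine and R version; recent R versions also compile
# functions just-in-time, which can hide the difference.
print(system.time(for (k in 1:50) sum_squares(10000)))
print(system.time(for (k in 1:50) sum_squares_bc(10000)))
```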


There is support for figures in Rd files: currently with a first-pass
implementation (thanks to Duncan Murdoch).

--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Recent and upcoming changes to R-devel

2011-07-04 Thread Martin Morgan

On 07/04/2011 05:08 AM, Prof Brian Ripley wrote:

> There was an R-core meeting the week before last, and various planned
> changes will appear in R-devel over the next few weeks.
>
> These are changes planned for R 2.14.0 scheduled for Oct 31. As we are
> sick of people referring to R-devel as '2.14' or '2.14.0', that version
> number will not be used until we reach 2.14.0 alpha. You will be able to
> have a package depend on an svn version number when referring to R-devel
> rather than using R (>= 2.14.0).
>
> All packages are installed with lazy-loading (there were 72 CRAN
> packages and 8 BioC packages which opted out). This means that the code
> is always parsed at install time which inter alia simplifies the
> descriptions. R 2.13.1 RC warns on installation about packages which ask
> not to be lazy-loaded, and R-devel ignores such requests (with a warning).
>
> In the near future all packages will have a name space. If the sources
> do not contain one, a default NAMESPACE file will be added. This again
> will simplify the descriptions and also a lot of internal code.
> Maintainers of packages without name spaces (currently 42% of CRAN) are
> encouraged to add one themselves.
>
> R-devel is installed with the base and recommended packages
> byte-compiled (the equivalent of 'make bytecode' in R 2.13.x, but done
> less inefficiently). There is a new option
> R CMD INSTALL --byte-compile
> to byte-compile contributed packages, but that remains optional.


Anticipating the future, contributed package byte-compilation will have 
large effects on CRAN and especially Bioconductor build systems. For 
instance, a moderate-sized package like Biobase built without vignettes 
installs in about 19s with byte compilation, 9s with, while a more 
complicated package IRanges is 1m25s, vs. 29s.


For Bioconductor this will certainly require new hardware across all 
supported platforms, and almost certainly significant effort to improve 
build system efficiencies.


Martin


--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



Re: [Rd] Recent and upcoming changes to R-devel

2011-07-04 Thread Prof Brian Ripley

On Mon, 4 Jul 2011, Martin Morgan wrote:



> Anticipating the future, contributed package byte-compilation will have large
> effects on CRAN and especially Bioconductor build systems. For instance, a
> moderate-sized package like Biobase built without vignettes installs in about
> 19s with byte compilation, 9s with, while a more complicated package IRanges
> is 1m25s, vs. 29s.


I presume the first is 'with' the second 'without'.  Yes, as I did say
'byte compilation is quite expensive', and it is not clear if it will 
ever become the default for contributed packages.
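
Under that reading of the quoted timings (first figure with byte-compilation, second without), the relative install-time cost can be worked out as below. This is only arithmetic on the numbers quoted in the thread; the column names are made up for the example.

```r
# Install times in seconds as quoted in the thread, read as
# "with byte-compilation" first and "without" second (the presumption above).
install_times <- data.frame(
  package    = c("Biobase", "IRanges"),
  with_bc    = c(19, 60 + 25),   # 1m25s = 85s
  without_bc = c(9, 29)
)

# Ratio of install time with byte-compilation to install time without.
install_times$slowdown <-
  round(install_times$with_bc / install_times$without_bc, 2)
print(install_times)
```

On these two packages the install step takes roughly two to three times as long with byte-compilation.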


> For Bioconductor this will certainly require new hardware across all
> supported platforms, and almost certainly significant effort to improve build
> system efficiencies.
>
> Martin




--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



Re: [Rd] speeding up perception

2011-07-04 Thread Timothée Carayol
Hi --

It's my first post on this list; as a relatively new user with little
knowledge of R internals, I am a bit intimidated by the depth of some
of the discussions here, so please spare me if I say something
incredibly silly.

I feel that someone at this point should mention Matthew Dowle's
excellent data.table package
(http://cran.r-project.org/web/packages/data.table/index.html) which
seems to me to address many of the inefficiencies of data.frame.
data.tables have no row names; and operations that only need data from
one or two columns are (I believe) just as quick whether the total
number of columns is 5 or 1000. This results in very quick operations
(and, often, elegant code as well).

Regards
Timothee

On Mon, Jul 4, 2011 at 6:19 AM, ivo welch  wrote:
> thank you, simon.  this was very interesting indeed.  I also now
> understand how far out of my depth I am here.
>
> fortunately, as an end user, obviously, *I* now know how to avoid the
> problem.  I particularly like the as.list() transformation and back to
> as.data.frame() to speed things up without loss of (much)
> functionality.
>
>
> more broadly, I view the avoidance of individual access through the
> use of apply and vector operations as a mixed "IQ test" and "knowledge
> test" (which I often fail).  However, even for the most clever, there
> are also situations where the KISS programming principle makes
> explicit loops still preferable.  Personally, I would have preferred
> it if R had, in its standard "statistical data set" data structure,
> foregone the row names feature in exchange for retaining fast direct
> access.  R could have reserved its current implementation "with row
> names but slow access" for a less common (possibly pseudo-inheriting)
> data structure.
>
>
> If end users commonly do iterations over a data frame, which I would
> guess to be the case, then the impression of R by (novice) end users
> could be greatly enhanced if the extreme penalties could be eliminated
> or at least flagged.  For example, I wonder if modest special internal
> code could store data frames internally and transparently as lists of
> vectors UNTIL a row name is assigned to.  Easier and uglier, a simple
> but specific warning message could be issued with a suggestion if
> there is an individual read/write into a data frame ("Warning: data
> frames are much slower than lists of vectors for individual element
> access").
>
>
> I would also suggest changing the "Introduction to R" 6.3  from "A
> data frame may for many purposes be regarded as a matrix with columns
> possibly of differing modes and attributes. It may be displayed in
> matrix form, and its rows and columns extracted using matrix indexing
> conventions." to "A data frame may for many purposes be regarded as a
> matrix with columns possibly of differing modes and attributes. It may
> be displayed in matrix form, and its rows and columns extracted using
> matrix indexing conventions.  However, data frames can be much slower
> than matrices or even lists of vectors (which, like data frames, can
> contain different types of columns) when individual elements need to
> be accessed."  Reading about it immediately upon introduction could
> flag the problem in a more visible manner.
>
>
> regards,
>
> /iaw
>
>
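
The as.list() round trip mentioned in the quoted message can be sketched as follows. The data and sizes are invented for illustration, and the elapsed times depend heavily on the machine and the R version, so no particular speed-up is claimed here.

```r
# Toy data; the size is arbitrary and chosen only for illustration.
n  <- 2000
d0 <- data.frame(x = numeric(n), y = seq_len(n))

# Element-wise assignment directly into the data frame.
d <- d0
t_df <- system.time(for (i in seq_len(n)) d$x[i] <- d$y[i] * 2)

# The same loop on a plain list of vectors, converting back at the end.
l <- as.list(d0)
t_list <- system.time(for (i in seq_len(n)) l$x[i] <- l$y[i] * 2)
d2 <- as.data.frame(l)

# Both routes give the same columns; only the elapsed time may differ.
stopifnot(identical(d$x, d2$x), identical(d$y, d2$y))
print(t_df["elapsed"])
print(t_list["elapsed"])
```

A vectorised assignment such as d$x <- d$y * 2 avoids the loop altogether where the problem allows it.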



Re: [Rd] speeding up perception

2011-07-04 Thread Simon Urbanek
Timothée,

On Jul 4, 2011, at 2:47 AM, Timothée Carayol wrote:

> Hi --
> 
> It's my first post on this list; as a relatively new user with little
> knowledge of R internals, I am a bit intimidated by the depth of some
> of the discussions here, so please spare me if I say something
> incredibly silly.
> 
> I feel that someone at this point should mention Matthew Dowle's
> excellent data.table package
> (http://cran.r-project.org/web/packages/data.table/index.html) which
> seems to me to address many of the inefficiencies of data.frame.
> data.tables have no row names; and operations that only need data from
> one or two columns are (I believe) just as quick whether the total
> number of columns is 5 or 1000. This results in very quick operations
> (and, often, elegant code as well).
> 

I agree that data.table is a very good alternative (for other reasons) that 
should be promoted more. The only slight snag is that it doesn't help with the 
issue at hand, since it simply passes subassignments through to data.frame's 
methods and thus suffers from the same problems (in fact there is a 
rather stark asymmetry in how it handles subsetting vs. subassignment, which is 
a bit surprising [if I read the code correctly, you can't use the same indexing 
in both]). I would propose that it should not do that, but instead handle the 
simple cases itself, more efficiently and without unneeded copies. That would 
indeed make it a very interesting alternative.

Cheers,
Simon




Re: [Rd] speeding up perception

2011-07-04 Thread Tim Hesterberg
I've written a "dataframe" package that replaces existing methods for
data frame creation and subscripting with versions that use less
memory.  For example, as.data.frame(a vector) makes 4 copies of the
data in R 2.9.2, and 1 copy with the package.  There is a small speed
gain.

I and others have been using it at Google for some years, and it is time
to either put it on CRAN, or move it into R.

R core folks - would you prefer that this be released to CRAN, or
would you like to consider merging it directly into R?

I took existing functions, and did some hacks to reduce the number of
times R copies objects.  Some of it is ugly.  This could be done more
efficiently, and with cleaner code, with some changes or hooks in R
internal code, but I'm not prepared to do that.

I often use lists instead of data frames.  In another package I have a
'subscriptRows' function that subscripts a list as if it were
a data frame.  I could merge that into the dataframe package.

Memory use - number of copies made
#                              R 2.9.2    library(dataframe)
#   as.data.frame(y)           4          1
#   data.frame(y)              8          3
#   data.frame(y, z)           8          3
#   as.data.frame(l)           10         3
#   data.frame(l)              15         5
#   d$z <- z                   3,2        1,1
#   d[["z"]] <- z              4,3        2,1
#   d[, "z"] <- z              6,4,2      2,2,1
#   d["z"] <- z                6,5,2      2,2,1
#   d["z"] <- list(z=z)        6,3,2      2,2,1
#   d["z"] <- Z #list(z=z)     6,2,2      2,1,1
#   a <- d["y"]                2          1
#   a <- d[, "y", drop=F]      2          1
# y and z are vectors, Z and l are lists, and d a data frame.
# Where two numbers are given, they refer to:
#   (copies of the old data frame),
#   (copies of the new column)
# A third number refers to numbers of
#   (copies made of an integer vector of row names)

#  ---  seconds (multiple repetitions) ---
#                        creation/column subscripting    row subscripting
# R 2.9.2:               34.2  43.9  43.3                10.6  13.0
# library(dataframe):    22.5  21.8  21.8                 9.7   9.5   9.5

I reported one of the simpler hacks to this list earlier, and it
was included in some version of R after 2.9.2, so the current version
of R isn't as bad as 2.9.2.
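
The post does not say how the copy counts were obtained. One base R tool for watching duplications is tracemem(); the sketch below is an assumption about how such a check might look, not the method used for the table, and it needs an R build with memory profiling enabled.

```r
# Sketch: watch for copies made when adding a column to a data frame.
y <- runif(1e5)
z <- runif(1e5)
d <- data.frame(y = y)

if (capabilities("profmem")) {
  tracemem(d)      # report each duplication of 'd' from now on
  d$z <- z         # any "tracemem[...]" lines printed here are copies
  untracemem(d)
} else {
  d$z <- z         # memory profiling unavailable; just do the assignment
}

# Whatever copying happened, the new column holds the expected values.
stopifnot(identical(d$z, z))
```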



Re: [Rd] Recent and upcoming changes to R-devel

2011-07-04 Thread Mark.Bravington
I may have misunderstood, but: 

Please could we have an optional installation that does *not* byte-compile 
base and recommended?

Reason: it's not possible to debug byte-compiled code-- at least not with the 
'debug' package, which is quite widely used. I quite often end up using 
'mtrace' on functions in base/recommended packages to figure out what they are 
doing. And sometimes I (and others) experiment with changing functions in 
base/recommended to improve functionality. That seems to be harder with BC 
versions, and might even be impossible (as best I can tell from hints in the 
documentation of 'compile').

Personally, if I had to choose only one, I'd rather live with the speed penalty 
from not byte-compiling. But of course, if both are available, I could install 
both.

Thanks

Mark

-- 
Mark Bravington
CSIRO Mathematical & Information Sciences
Marine Laboratory
Castray Esplanade
Hobart 7001
TAS

ph (+61) 3 6232 5118
fax (+61) 3 6232 5012
mob (+61) 438 315 623



[Rd] Circumventing code/documentation mismatches ('R CMD check')

2011-07-04 Thread Johannes Graumann
Hello,

As prompted by B. Ripley (see below), I am transferring this over from R-User 
...

For a package I am writing a function that looks like

test <- function(Argument1=NA){
  # Prerequisite testing
  if(!(is.na(Argument1))){
    if(!(is.character(Argument1))){
      stop("Wrong class.")
    }
  }
  # Function Body
  cat("Hello World\n")
}

Documentation of this is straightforward:

...
\usage{test(Argument1=NA)}
...

However writing the function could be made more concise like so:

test2 <- function(Argument1=NA_character_){
  # Prerequisite testing
  if(!(is.character(Argument1))){
    stop("Wrong class.")
  }
  # Function Body
  cat("Hello World\n")
}

To prevent confusion I do not want to use 'NA_character_' in the user-
exposed documentation and using 

...
\usage{test2(Argument1=NA)}
...

leads to a warning regarding a code/documentation mismatch.

Is there any way to prevent that?
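
One reason the two defaults are not interchangeable in the usage entry is that they differ in type. A minimal sketch, reusing test2 from above in condensed form:

```r
# Plain NA is logical, while NA_character_ is a character missing value.
stopifnot(!is.character(NA))
stopifnot(is.character(NA_character_))

# Condensed copy of test2 from the message above.
test2 <- function(Argument1 = NA_character_) {
  if (!is.character(Argument1)) stop("Wrong class.")
  cat("Hello World\n")
}

test2()  # the default passes the class check

# An explicit NA, as the shortened usage entry would suggest, is rejected.
res <- tryCatch(test2(NA), error = function(e) conditionMessage(e))
print(res)
```

So a reader who copies the documented default NA into a call to test2 gets an error, which is the kind of discrepancy the check is flagging.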

Sincerely, Joh

Prof Brian Ripley wrote:

> On Mon, 4 Jul 2011, Johannes Graumann wrote:
> 
>> Hello,
>>
>> I'm writing a package and am running 'R CMD check' on it.
>>
>> Is there any way to make 'R CMD check' not warn about a mismatch between
>> 'NA_character_' (in the function definition) and 'NA' (in the
>> documentation)?
> 
> Be consistent.  Why do you want incorrect documentation of your
> package?  (The circumstances here are not clear: normally 1 vs 1L
> and similar are not reported if they are the only errors.)
> 
> And please do note the posting guide
> 
> - this is not really the correct list
> - you were asked to give an actual example with output.
