[Rd] vignettes and papers

2007-11-02 Thread Robin Hankin
Hello everyone

Lots of my packages have been the subject of
journal articles either in JSS or Rnews or (in one
case) elsewhere.

I would like to add these articles
to my packages as vignettes.

Reproducing the papers exactly requires a number
of files [such as style files or PDFs] to be included in
the inst/doc directory to pass R CMD check.

A vanilla .Rnw file seems to be a good idea,
but loses some of the nice JSS typesetting.

What is Best Practice here?

And are there ethical or other issues that I should
be aware of before including copies of Rnews
or JSS papers verbatim in an R package?



--
Robin Hankin
Uncertainty Analyst
National Oceanography Centre, Southampton
European Way, Southampton SO14 3ZH, UK
  tel  023-8059-7743

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Cell or PS3 Port

2007-11-02 Thread Luke Tierney
I have been experimenting with ways of parallelizing many of the
functions in the math library.  There are two experimental packages
available in http://www.stat.uiowa.edu/~luke/R/experimental: pnmath,
based on OpenMP, and pnmath0, based on basic pthreads.  I'm not sure
to what degree the approach there would carry over to GPUs or Cell
where the additional processors are different from the main processor
and may not share memory (I forget how that works on Cell).

The first issue is that you need some modifications to some of the
functions to ensure they are thread-safe.  For the most part these are
minor; a few functions would require major changes and I have not
tackled them for now (Bessel functions, wilcox, signrank I believe).
RNG functions are also not suitable for parallelization given the
dependence on the sequential underlying RNG.

It is not too hard to get parallel versions to use all available
processor cores. The challenge is to make sure that the parallel
versions don't run slower than the serial versions. They may if the
amount of data is too small.  What is too small for each function
depends on the OS and the processor/memory architecture; if memory is
not shared this gets more complicated still.  For some very simple
functions (floor, ceiling, sign) I could not see any reliable benefit
of parallelization for reasonable data sizes on the systems I was
using so I left those alone for now.
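The size cutoff Luke describes can be sketched at the R level; this is
only an illustration of the dispatch idea, not pnmath's actual
mechanism, and 'dispatch_parallel', 'serial_fun', 'parallel_fun' and
the threshold value are all made up:

```r
# Sketch: choose the serial implementation when the input is too small
# for thread startup and synchronization overhead to pay off.  The
# threshold would have to be calibrated per function, OS and
# processor/memory architecture, as described above.
dispatch_parallel <- function(x, serial_fun, parallel_fun, threshold = 1e5) {
  if (length(x) < threshold) {
    serial_fun(x)      # small input: parallel version would likely run slower
  } else {
    parallel_fun(x)    # large input: fan the work out across cores
  }
}

# With the same function in both branches the result is unchanged:
identical(dispatch_parallel(1:10, sqrt, sqrt), sqrt(1:10))
```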

luke

On Sat, 27 Oct 2007, Ed Knutson wrote:

> Hello,
>
> I am interested in optimizing some of R's vector math functions to
> utilize the SPE units of the Cell processor (commonly found in the
> Playstation 3) and I am wondering if anyone has already done any work in
> that area.  I can't find anything using the search page or Google.
> (Admittedly it is difficult to search for information on a
> one-letter-named programming language whose contributed documentation
> intrinsically refers to "cells" frequently. :)  I'm assuming it will be
> possible to compile R under a PS3 version of Linux, since it has a ppc64
> architecture and R already runs on OS X.  Are there any known caveats to
> compiling R for a distro like Ubuntu with X11 support?
>
> I'm just going through the Cell SDK documentation at this point so it
> will be a few days before I really get into the guts of it.  Any
> information would be greatly appreciated.
>
> -Ed
>

-- 
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa  Phone: 319-335-3386
Department of Statistics and    Fax:   319-335-3017
Actuarial Science
241 Schaeffer Hall  email:  [EMAIL PROTECTED]
Iowa City, IA 52242 WWW:  http://www.stat.uiowa.edu



Re: [Rd] vignettes and papers

2007-11-02 Thread Achim Zeileis
Robin:

> Lots of my packages have been the subject of
> journal articles either in JSS or Rnews or (in one
> case) elsewhere.
>
> I would like to add these articles
> to my packages as vignettes.
>
> Reproducing the papers exactly requires a number
> of files [such as style files or PDFs] to be included in
>   the inst/doc directory to pass R CMD check.
>
> A vanilla .Rnw file seems to be a good idea,
> but loses some of the nice JSS typesetting.
>
> What is Best Practice here?
>
> And are there ethical or other issues that I should
> be aware of  before including copies of Rnews
> or JSS papers verbatim in an R package?

There are two separate issues here:
  1. Using jss.cls
 This is ok, as long as the vignette is completely
 identical with the published paper. But as some details of a
 vignette might be modified/extended/corrected/enhanced, I typically
 remove the JSS header and instead include a comment like
A previous version of this introduction to the
R package zoo has been published as Zeileis
and Grothendieck (2005) in the Journal of Statistical
Software.
 in the vignette. See
   vignette("zoo", package = "zoo")
 If you want a .cls file that provides the same commands as jss.cls
 but without the JSS header and footer, see Z.cls at
   http://statmath.wu-wien.ac.at/~zeileis/tex/
  2. Including style files in the package
     Checking whether the LaTeX sources can be compiled wasn't done
     until recently and now leads to a notification. Because I use the
     same infrastructure (typically Z.cls) in all of my packages, I
     don't want to maintain a separate copy in each of them, and
     currently don't ship the file within the R packages. My personal
     opinion is that this is not much of a problem because people want
     to work with the R code and not the LaTeX code...but, obviously,
     there is some tension here. Just my EUR 0.02.
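As a concrete (made-up) example, the header of such a de-JSS-ed
vignette might look like this; the package and author names are
placeholders, and the %\VignetteIndexEntry line is the usual Sweave
metadata:

```latex
% A previous version of this introduction to the R package foo has been
% published as Author (2007) in the Journal of Statistical Software.
%\VignetteIndexEntry{An Introduction to the foo Package}
\documentclass{Z}  % Z.cls: jss.cls commands without the JSS header/footer
\begin{document}
% ... paper body with <<>>= Sweave code chunks ...
\end{document}
```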

hth,
Z



[Rd] applying duplicated, unique and match to lists?

2007-11-02 Thread Jens Oehlschlägel
Dear R developers,

While improving duplicated.array() and friends and developing equivalents for 
the new ff package for large datasets I came across two questions:

1) is it safe to use duplicated.default(), unique.default() and match() on 
arbitrary lists? If so, we can speed up duplicated.array and friends 
considerably by using list() instead of paste(collapse="\r")

2) while duplicated.default() is very fast even on lists, match() is very slow 
on lists. Why is the internal conversion to character necessary? If the 
hash tables behind duplicated() in unique.c work for lists, why can't we use 
them for match()? If conversion to character is unavoidable, a better scaling 
alternative could be serializing and compressing to md5: even with a final 
identity check against unlikely collisions this is much faster in many cases 
(break-even seems to occur for quite small list elements, around 2 doubles).

1) the new versions should also work for lists with a dim attribute (old 
versions use as.vector(), which does not work for lists)
Factor 10 speedup for row duplicates (here atomic matrices)
>   system.time(duplicated(x, hashFUN=function(x)paste(x, collapse="\r")))
   user  system elapsed 
   2.37    0.02    2.45 
>   system.time(duplicated(x, hashFUN=md5))
   user  system elapsed 
   0.51    0.00    0.51 
>   system.time(duplicated(x, hashFUN=list))
   user  system elapsed 
   0.17    0.00    0.17

2) Speedup potential for list matching (md5 results below)
> x <- as.list(runif(10))
> system.time(duplicated(x))
   user  system elapsed 
   0.01    0.00    0.02 
> system.time(match(x,x))
   user  system elapsed 
   2.01    0.00    2.03
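The md5 route for list matching can be sketched with the regular
interface of the 'digest' package (slower than the direct .Call
shortcuts below, and 'match_by_hash' is a made-up name; a production
version would also verify candidate matches against hash collisions):

```r
library(digest)  # provides digest(); used here with algo = "md5"

# Project every list element to an md5 string, then match the atomic
# character keys, which match() handles quickly.
match_by_hash <- function(x, table) {
  hx <- vapply(x,     digest, character(1), algo = "md5")
  ht <- vapply(table, digest, character(1), algo = "md5")
  match(hx, ht)
}

x <- as.list(1:3)
match_by_hash(x, rev(x))   # same result as matching the elements directly
```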

Please find below more comments and tests, new code for duplicated.array() and 
friends, suggestions for new classes 'hash' (requiring digest) and 'id' (and if 
you are curious: first code drafts for the respective ff methods).

Best regards



Jens Oehlschlägel





# Hashing of large objects in ff
# (c) 2007 Jens Oehlschlägel
# Licence: GPL2
# Created: 2007-10-30
# Last changed: 2007-10-30

require(digest) # digest maintainer: Dirk Eddelbuettel <[EMAIL PROTECTED]>

# { --- available hash functions ---

# perfect projection: list
# NOTE that the 'easiest hash function' is 'list'
# it is faster than everything else when calculating duplicated or unique,
# but it is extremely slow for 'match' (currently, R-2.6.0)
# thus for matching list elements, it is faster to convert the list
# elements with md5

# no projection for vectors only
none <- function(x)x

# concatenation of as.character as currently (R-2.6.1) in duplicated.array,
# match.array (pairs of projections may erroneously appear as identical when
# the vectors are very similar; RAM expensive)
pasteid <- function(x)paste(x, collapse="\r")

# perfectly identity preserving projection (but even more RAM expensive)
id1 <- function(x) paste(.Call("R_serialize", x, NULL, FALSE, NULL,
  PACKAGE = "base")[-(1:14)], collapse="")

# 32 byte projection
md5 <- function(x) .Call("digest", .Call("R_serialize", x, NULL, FALSE, NULL,
  PACKAGE = "base"), 1L, -1L, 14L, PACKAGE = "digest")

# 40 byte projection
sha1 <- function(x) .Call("digest", .Call("R_serialize", x, NULL, FALSE, NULL,
  PACKAGE = "base"), 2L, -1L, 14L, PACKAGE = "digest")

# 8 byte projection: more collisions
crc32 <- function(x) .Call("digest", .Call("R_serialize", x, NULL, FALSE, NULL,
  PACKAGE = "base"), 3L, -1L, 14L, PACKAGE = "digest")


#! \name{md5}
#! \alias{md5}
#! \title{ faster shortcut functions for in-memory digest }
#! \description{
#!   These functions project (serialize or hash) their input object and
#!   return a string. Because they avoid any R overhead they are better
#!   suited for sapply() than the more general function 'digest'.
#! }
#! \usage{
#! md5(x)
#! sha1(x)
#! crc32(x)
#! id1(x)
#! }
#! %- maybe also 'usage' for other objects documented here.
#! \arguments{
#!   \item{x}{ a fully serializable R object }
#! }
#! \value{
#!   character scalar
#! }
#! \seealso{ \code{\link{digest}}, \code{\link[base]{serialize}} }
#! \examples{
#!   md5(pi)
#!   sha1(pi)
#!   crc32(pi)
#!   id1(pi)
#!
#!   \dontshow{
#! if (!identical(paste(serialize(list(str="a string", double=pi),
#!       connection=NULL)[-(1:14)], collapse=""),
#!     id1(list(str="a string", double=pi))))
#!   stop("something has changed in serialization, please fix the internal .Calls in functions 'id1', 'md5', 'sha1', 'crc32'")
#!
#! if (!identical(digest(list(str="a string", double=pi), algo="md5"),
#!     md5(list(str="a string", double=pi))))
#!   stop("something has changed in package 'digest' or in serialization, please fix the internal .Call in function 'md5'")
#!
#! if (!identical(digest(list(str="a string", double=pi), algo="sha1"),
#!     sha1(list(str="a string", double=pi))))
#!   stop("something has changed in package 'digest' or in serialization, please fix the internal .Call in function 'sha1'")
#!
#! if (!identical(digest(list(str="a string", double=pi), algo="crc32"),
#!     crc32(list(str="a string", double=pi))))
#!   stop("something has changed in package 'digest' or in serialization, please fix the internal .Call in function 'crc32'")
#!   }
#! }

Re: [Rd] Cell or PS3 Port

2007-11-02 Thread Ed Knutson
The main core of the Cell (the PPE) uses IBM's version of hyperthreading 
to expose two logical main CPUs to the OS, so code that is "simply" 
multi-threaded should still see an advantage.  In addition, IBM provides 
an SDK which includes workflow management as well as libraries to 
support common linear algebra and other math functions on the 
sub-processors (called SPEs).  They also provide an interface to a 
hardware RNG as well as 3 software types (2 pseudo, 1 quasi) that are 
coded for the SPE.

Each SPE has its own small, local memory store and communicates with 
main memory using a DMA queue.  It seems to be a question of breaking up 
each task into units that are small enough to offload to an SPE.  My 
initial direction will be to set up a rudimentary workflow manager.  As 
an optimized function is encountered, a sufficient number of SPE threads 
will be spawned and execution of the main thread will wait for all 
results.  As for the optimized functions, I intend to start with the 
ones that already have an analogous implementation in the IBM math libraries.

MPI has been employed by some Cell developers to allow multiple SPEs 
working on sections of the same task to communicate with each other.  I 
like the idea of this approach, since it lays the groundwork to allow 
multiple Cell (or really any) processors to be clustered.


Luke Tierney wrote:
> I have been experimenting with ways of parallelizing many of the
> functions in the math library.  There are two experimental packages
> available in http://www.stat.uiowa.edu/~luke/R/experimental: pnmath,
> based on OpenMP, and pnmath0, based on basic pthreads.  I'm not sure
> to what degree the approach there would carry over to GPUs or Cell
> where the additional processors are different from the main processor
> and may not share memory (I forget how that works on Cell).
> 
> The first issue is that you need some modifications to some of the
> functions to ensure they are thread-safe.  For the most part these are
> minor; a few functions would require major changes and I have not
> tackled them for now (Bessel functions, wilcox, signrank I believe).
> RNG functions are also not suitable for parallelization given the
> dependence on the sequential underlying RNG.
> 
> It is not too hard to get parallel versions to use all available
> processor cores. The challenge is to make sure that the parallel
> versions don't run slower than the serial versions. They may if the
> amount of data is too small.  What is too small for each function
> depends on the OS and the processor/memory architecture; if memory is
> not shared this gets more complicated still.  For some very simple
> functions (floor, ceiling, sign) I could not see any reliable benefit
> of parallelization for reasonable data sizes on the systems I was
> using so I left those alone for now.
