[Rd] vignettes and papers
Hello everyone,

Lots of my packages have been the subject of journal articles, either in JSS or Rnews or (in one case) elsewhere. I would like to add these articles to my packages as vignettes.

Reproducing the papers exactly requires a number of files (such as style files or PDFs) to be included in the inst/doc directory to pass R CMD check. A vanilla .Rnw file seems to be a good idea, but loses some of the nice JSS typesetting.

What is best practice here? And are there ethical or other issues that I should be aware of before including copies of Rnews or JSS papers verbatim in an R package?

--
Robin Hankin
Uncertainty Analyst
National Oceanography Centre, Southampton
European Way, Southampton SO14 3ZH, UK
tel 023-8059-7743

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Cell or PS3 Port
I have been experimenting with ways of parallelizing many of the functions in the math library. There are two experimental packages available at http://www.stat.uiowa.edu/~luke/R/experimental: pnmath, based on OpenMP, and pnmath0, based on basic pthreads. I'm not sure to what degree the approach there would carry over to GPUs or Cell, where the additional processors are different from the main processor and may not share memory (I forget how that works on Cell).

The first issue is that you need some modifications to some functions to ensure they are thread-safe. For the most part these are minor; a few functions would require major changes and I have not tackled them for now (Bessel functions, wilcox, signrank I believe). RNG functions are also not suitable for parallelization given the dependence on the sequential underlying RNG.

It is not too hard to get parallel versions to use all available processor cores. The challenge is to make sure that the parallel versions don't run slower than the serial versions. They may if the amount of data is too small. What is too small for each function depends on the OS and the processor/memory architecture; if memory is not shared this gets more complicated still. For some very simple functions (floor, ceiling, sign) I could not see any reliable benefit of parallelization for reasonable data sizes on the systems I was using, so I left those alone for now.

luke

On Sat, 27 Oct 2007, Ed Knutson wrote:

> Hello,
>
> I am interested in optimizing some of R's vector math functions to
> utilize the SPE units of the Cell processor (commonly found in the
> Playstation 3) and I am wondering if anyone has already done any work in
> that area. I can't find anything using the search page or Google.
> (Admittedly it is difficult to search for information on a
> one-letter-named programming language whose contributed documentation
> intrinsically refers to "cells" frequently. :)
>
> I'm assuming it will be possible to compile R under a PS3 version of
> Linux, since it has a ppc64 architecture and R already runs on OS X. Are
> there any known caveats to compiling R for a distro like Ubuntu with X11
> support?
>
> I'm just going through the Cell SDK documentation at this point so it
> will be a few days before I really get into the guts of it. Any
> information would be greatly appreciated.
>
> -Ed

--
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                    Phone: 319-335-3386
Department of Statistics and          Fax:   319-335-3017
   Actuarial Science
241 Schaeffer Hall                    email: [EMAIL PROTECTED]
Iowa City, IA 52242                   WWW:   http://www.stat.uiowa.edu
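[Editor's illustration] The serial-fallback idea above (parallel versions must not run slower than serial ones on small inputs) can be sketched at the R level. This is not how pnmath works (pnmath parallelizes inside the C math library); it is a minimal sketch using the later `parallel` package, which postdates this 2007 thread, and the threshold value is an invented placeholder.

```r
library(parallel)  # note: the parallel package postdates this thread

# Hypothetical wrapper: pay the parallel dispatch cost only when the
# input is large enough to amortize it; otherwise take the serial path.
pvec_math <- function(x, fun, threshold = 1e5,
                      cores = if (.Platform$OS.type == "windows") 1L else 2L) {
  if (length(x) < threshold) return(fun(x))              # serial fast path
  # contiguous chunks, one per core, reassembled in order
  chunks <- split(x, cut(seq_along(x), cores, labels = FALSE))
  unlist(mclapply(chunks, fun, mc.cores = cores), use.names = FALSE)
}

x <- runif(2e5)
stopifnot(all.equal(pvec_math(x, sqrt, threshold = 1e3), sqrt(x)))
```

The real break-even threshold would have to be measured per function and per machine, as the post explains.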
Re: [Rd] vignettes and papers
Robin:

> Lots of my packages have been the subject of journal articles either in
> JSS or Rnews or (in one case) elsewhere. I would like to add these
> articles to my packages as vignettes.
>
> Reproducing the papers exactly requires a number of files [such as style
> files or PDFs] to be included in the inst/doc directory to pass R CMD
> check. A vanilla .Rnw file seems to be a good idea, but loses some of
> the nice JSS typesetting.
>
> What is Best Practice here? And are there ethical or other issues that I
> should be aware of before including copies of Rnews or JSS papers
> verbatim in an R package?

There are two separate issues here:

1. Using jss.cls

This is OK, as long as the vignette is completely identical to the published paper. But as some details of a vignette might be modified/extended/corrected/enhanced, I typically remove the JSS header and instead include a comment like

  A previous version of this introduction to the R package zoo has been
  published as Zeileis and Grothendieck (2005) in the Journal of
  Statistical Software.

in the vignette. See vignette("zoo", package = "zoo"). If you want a .cls file that provides the same commands as jss.cls but without the JSS header and footer, see Z.cls at http://statmath.wu-wien.ac.at/~zeileis/tex/

2. Including style files in the package

Checking whether the LaTeX sources can be compiled wasn't done until recently and now leads to a notification. Because I'm using the same infrastructure (typically Z.cls) in all of my packages, I don't want to maintain it in each package separately, and currently don't supply it within the R package. My personal opinion is that this is not so much of a problem, because people want to work with the R code and not the LaTeX code... but, obviously, there is some tension here.

Just my EUR 0.02.

hth,
Z
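[Editor's illustration] A vignette following the pattern above might start roughly like this .Rnw preamble. This is a sketch, not copied from the zoo vignette; the index entry, package name, and note text are placeholders, and Z.cls is the class file from the URL above.

```latex
\documentclass{Z}  % Z.cls: jss.cls commands without the JSS header/footer
%\VignetteIndexEntry{An Introduction to the foo Package}
%\VignetteDepends{foo}

\begin{document}

%% Note replacing the JSS header:
%% A previous version of this introduction has been published
%% in the Journal of Statistical Software.

...

\end{document}
```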
[Rd] applying duplicated, unique and match to lists?
Dear R developers,

While improving duplicated.array() and friends and developing equivalents for the new ff package for large datasets, I came across two questions:

1) Is it safe to use duplicated.default(), unique.default() and match() on arbitrary lists? If so, we can speed up duplicated.array() and friends considerably by using list() instead of paste(collapse="\r").

2) While duplicated.default() is very fast even on lists, match() is very slow on lists. Why is the internal conversion to character necessary? If the hash tables behind duplicated() in unique.c work for lists, why can't we use them for match()? If conversion to character is unavoidable, a better-scaling alternative could be serializing and compressing to md5: even with a final identity check against unlikely collisions, this is much faster in many cases (the break-even point seems to be at quite small list elements, around 2 doubles).

1) The new versions should also work for lists with a dim attribute (the old versions use as.vector(), which does not work for lists). Factor 10 speedup for row duplicates (here atomic matrices):

> system.time(duplicated(x, hashFUN=function(x)paste(x, collapse="\r")))
   user  system elapsed
   2.37    0.02    2.45
> system.time(duplicated(x, hashFUN=md5))
   user  system elapsed
   0.51    0.00    0.51
> system.time(duplicated(x, hashFUN=list))
   user  system elapsed
   0.17    0.00    0.17

2) Speedup potential for list matching (md5 results below):

> x <- as.list(runif(10))
> system.time(duplicated(x))
   user  system elapsed
   0.01    0.00    0.02
> system.time(match(x,x))
   user  system elapsed
   2.01    0.00    2.03

Please find below more comments and tests, new code for duplicated.array() and friends, suggestions for new classes 'hash' (requiring digest) and 'id' (and, if you are curious, first code drafts for the respective ff methods).
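[Editor's illustration] The character-key workaround discussed above can be shown in plain R. This is a sketch, not the code from the post: each list element is reduced to a single character key via serialize(), so match() then operates on an ordinary character vector, where hashing is fast.

```r
# Match list elements by serializing each one to a character key.
# Matching the keys is equivalent to matching the elements (up to the
# extremely unlikely event of two distinct elements colliding, which is
# why the post mentions a final identity check).
list_match <- function(x, table) {
  key <- function(l) vapply(l, function(el)
    paste(serialize(el, connection = NULL), collapse = ""), character(1))
  match(key(x), key(table))
}

x <- list(1:3, "a", list(pi, 2L))
stopifnot(identical(list_match(x, x), seq_along(x)))
stopifnot(is.na(list_match(list("not there"), x)))
```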
Best regards
Jens Oehlschlägel


# Hashing of large objects in ff
# (c) 2007 Jens Oehlschlägel
# Licence: GPL2
# Created: 2007-10-30
# Last changed: 2007-10-30

require(digest)  # digest maintainer: Dirk Eddelbuettel <[EMAIL PROTECTED]>

# { --- available hash functions ---

# perfect projection: list
# NOTE that the 'easiest hash function' is 'list':
# it is faster than everything else when calculating duplicated or unique,
# but it is extremely slow for 'match' (currently, R-2.6.0).
# Thus for matching list elements, it is faster to convert the list
# elements with md5.

# no projection, for vectors only
none <- function(x) x

# concatenation of as.character, as currently (R-2.6.1) in duplicated.array,
# match.array (pairs of projections may erroneously appear as identical
# when the vectors are very similar; RAM expensive)
pasteid <- function(x) paste(x, collapse="\r")

# perfectly identity-preserving projection (but even more RAM expensive)
id1 <- function(x) paste(.Call("R_serialize", x, NULL, FALSE, NULL, PACKAGE = "base")[-(1:14)], collapse="")

# 32 byte projection
md5 <- function(x) .Call("digest", .Call("R_serialize", x, NULL, FALSE, NULL, PACKAGE = "base"), 1L, -1L, 14L, PACKAGE = "digest")

# 40 byte projection
sha1 <- function(x) .Call("digest", .Call("R_serialize", x, NULL, FALSE, NULL, PACKAGE = "base"), 2L, -1L, 14L, PACKAGE = "digest")

# 8 byte projection: more collisions
crc32 <- function(x) .Call("digest", .Call("R_serialize", x, NULL, FALSE, NULL, PACKAGE = "base"), 3L, -1L, 14L, PACKAGE = "digest")

#! \name{md5}
#! \alias{md5}
#! \title{ faster shortcut functions for in-memory digest }
#! \description{
#!   These functions project (serialize or hash) their input object and
#!   return a string. Because they avoid any R overhead, they are better
#!   suited for sapply() than the more general function 'digest'.
#! }
#! \usage{
#! md5(x)
#! sha1(x)
#! crc32(x)
#! id1(x)
#! }
#! %- maybe also 'usage' for other objects documented here.
#! \arguments{
#!   \item{x}{ a fully serializable R object }
#! }
#! \value{
#!   character scalar
#! }
#! \seealso{ \code{\link{digest}}, \code{\link[base]{serialize}} }
#! \examples{
#! md5(pi)
#! sha1(pi)
#! crc32(pi)
#! id1(pi)
#!
#! \dontshow{
#! if (!identical(paste(serialize(list(str="a string", double=pi), connection=NULL)[-(1:14)], collapse=""), id1(list(str="a string", double=pi))))
#!   stop("something has changed in serialization, please fix the internal .Calls in functions 'id1', 'md5', 'sha1', 'crc32'")
#!
#! if (!identical(digest(list(str="a string", double=pi), algo="md5"), md5(list(str="a string", double=pi))))
#!   stop("something has changed in package 'digest' or in serialization, please fix the internal .Calls in function 'md5'")
#!
#! if (!identical(digest(list(str="a string", double=pi), algo="sha1"), sha1(list(str="a string", double=pi))))
#!   stop("something has changed in package 'digest' or in serialization, please fix the internal .Calls in function 'sha1'")
#!
#! if (!identical(digest(list(str="a string", double=pi), algo="crc32"), crc32(list(s
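[Editor's illustration] The "factor 10 speedup" timings above compare different key functions for row duplicates (hashFUN is from the proposed patch, not base R). The underlying idea can be sketched with stock R: reduce each row to a single key, then call duplicated() on the key vector. With paste() keys this reproduces what duplicated.matrix does internally.

```r
# Detect duplicated matrix rows by reducing each row to one key.
set.seed(1)
m <- matrix(sample(0:1, 40, replace = TRUE), nrow = 10)

keys <- apply(m, 1, paste, collapse = "\r")  # the paste(collapse="\r") key
dup  <- duplicated(keys)

# Base R's duplicated() on a matrix (MARGIN = 1) uses the same
# string-key approach, so the results agree.
stopifnot(all(dup == duplicated(m)))
```

The post's point is that the paste() key is the slowest of the three options it benchmarks; md5 keys or raw list() elements scale better.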
Re: [Rd] Cell or PS3 Port
The main core of the Cell (the PPE) uses IBM's version of hyperthreading to expose two logical main CPUs to the OS, so code that is "simply" multi-threaded should still see an advantage. In addition, IBM provides an SDK which includes workflow management as well as libraries to support common linear algebra and other math functions on the sub-processors (called SPEs). They also provide an interface to a hardware RNG as well as 3 software types (2 pseudo, 1 quasi) that are coded for the SPE.

Each SPE has its own small, local memory store and communicates with main memory using a DMA queue. It seems to be a question of breaking up each task into units that are small enough to offload to an SPE. My initial direction will be to set up a rudimentary workflow manager. As an optimized function is encountered, a sufficient number of SPE threads will be spawned and execution of the main thread will wait for all results. As for the optimized functions, I intend to start with the ones that already have an analogous implementation in the IBM math libraries.

MPI has been employed by some Cell developers to allow multiple SPEs working on sections of the same task to communicate with each other. I like the idea of this approach, since it lays the groundwork to allow multiple Cell (or really any) processors to be clustered.

Luke Tierney wrote:
> I have been experimenting with ways of parallelizing many of the
> functions in the math library. There are two experimental packages
> available in http://www.stat.uiowa.edu/~luke/R/experimental: pnmath,
> based on OpenMP, and pnmath0, based on basic pthreads. I'm not sure
> to what degree the approach there would carry over to GPUs or Cell
> where the additional processors are different from the main processor
> and may not share memory (I forget how that works on Cell).
>
> The first issue is that you need some modifications to the some
> functions to ensure they are thread-safe. For the most part these are
> minor; a few functions would require major changes and I have not
> tackled them for now (Bessel functions, wilcox, signrank I believe).
> RNG functions are also not suitable for parallelization given the
> dependence on the sequential underlying RNG.
>
> It is not too hard to get parallel versions to use all available
> processor cores. The challenge is to make sure that the parallel
> versions don't run slower than the serial versions. They may if the
> amount of data is too small. What is too small for each function
> depends on the OS and the processor/memory architecture; if memory is
> not shared this gets more complicated still. For some very simple
> functions (floor, ceiling, sign) I could not see any reliable benefit
> of parallelization for reasonable data sizes on the systems I was
> using so I left those alone for now.
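[Editor's illustration] The offload strategy described in the reply (split work into SPE-sized units, spawn workers, wait for all results, reassemble) can be sketched in plain R, independent of the Cell SDK. The chunk size is an invented placeholder tied to the SPE's small local store; a real port would dispatch each chunk to an SPE thread via the SDK rather than computing it in-process as lapply() does here.

```r
# Split a length-n task into contiguous chunks small enough for an
# SPE's local store.
chunk_indices <- function(n, chunk_size) {
  split(seq_len(n), ceiling(seq_len(n) / chunk_size))
}

# "Dispatch" each chunk (stand-in for spawning SPE threads and waiting
# for all of them), then reassemble the results in order.
offload_map <- function(x, fun, chunk_size = 4096L) {
  idx <- chunk_indices(length(x), chunk_size)
  parts <- lapply(idx, function(i) fun(x[i]))
  unlist(parts, use.names = FALSE)
}

x <- runif(10000)
stopifnot(all.equal(offload_map(x, sqrt, chunk_size = 1024L), sqrt(x)))
```

Because the chunks are contiguous and processed independently, the same skeleton would also suit the MPI-clustered variant mentioned above.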