[Rd] Accessing ENVSXP and CLOSXP while processing parsed R code

2011-11-07 Thread Rob Anderson
Hello Guys,

Following up on my earlier mail, where I am trying to write an alternative
front-end for R, I have a question about accessing closures and
environments in R code.

Here's a function taken and modified a little from "*Lexical Scope and
Statistical Computing*":

=
f <- function() {
  Rmlfun <- function(x) {
    sumx <- sum(x)
    n <- length(x)
    function(mu)
      n*log(mu) - mu*sumx
  }
  efun <- Rmlfun(1:10)
  y1 <- efun(3)
  print(y1)
  efun2 <- Rmlfun(20:30)
  y2 <- efun2(3)
  print(y2)
}
=

Now the assignment efun <- Rmlfun(1:10) creates a closure where
*function(mu) n*log(mu)-mu*sumx* is returned and *sumx* and *n* are added
to the existing environment.

I can parse the code using *PROTECT(e =
R_ParseVector(tmp,1,&status,R_NilValue));* where tmp is the buffer
containing the same source. I can walk the resulting parser output and
build an alternative abstract syntax tree (AST).

I would like to include the information about closure/environments in my
AST so that I can possibly do some optimizations.

My question is, how can I get hold of this information?

One thing I noticed while 'walking' through the parser output is that I
never encounter a CLOSXP (which I check using TYPEOF()), even though a
closure is created in the above code. Is it the case that this information
is meant just for the internal "eval*" function and not exposed to
application writers?

Thanks,
Rob


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Accessing ENVSXP and CLOSXP while processing parsed R code

2011-11-07 Thread Duncan Murdoch

On 11-11-07 5:24 AM, Rob Anderson wrote:

> Hello Guys,
>
> Following up my earlier mail where I am trying to write an alternative
> front-end for R, I had a question about accessing the closures and
> environments in R code.
>
> Here's the function taken and modified a little from "*Lexical Scope and
> Statistical Computing*"
>
> =
> f <- function() {
>   Rmlfun <- function(x) {
>     sumx <- sum(x)
>     n <- length(x)
>     function(mu)
>       n*log(mu) - mu*sumx
>   }
>   efun <- Rmlfun(1:10)
>   y1 <- efun(3)
>   print(y1)
>   efun2 <- Rmlfun(20:30)
>   y2 <- efun2(3)
>   print(y2)
> }
> =
>
> Now assignment efun <- Rmlfun(1:10) creates a closure where
> *function(mu) n*log(mu)-mu*sumx* is returned and *sumx* and *n* are added
> to the existing environment.


That's not correct.  When you call Rmlfun, an evaluation frame 
(environment) is created.  It contains the argument x.  Then sumx and n 
are added to it.  Then the anonymous closure is created, with body 
n*log(mu)-mu*sumx, and the closure's environment is the evaluation frame 
from the call to Rmlfun.
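
The sequence described above can be checked from R itself. The following is a minimal sketch using the Rmlfun definition from the original post; the helper variable names (env1, frame_names, and so on) are introduced here purely for illustration.

```r
# Rmlfun as in the original post
Rmlfun <- function(x) {
  sumx <- sum(x)
  n <- length(x)
  function(mu) n * log(mu) - mu * sumx
}

efun  <- Rmlfun(1:10)
efun2 <- Rmlfun(20:30)

# The closure's environment is the evaluation frame of the Rmlfun call.
env1 <- environment(efun)

# That frame holds the argument x plus the locals sumx and n.
frame_names <- sort(ls(env1))

# Each call to Rmlfun creates its own frame, so the closures do not share state.
separate_frames <- !identical(env1, environment(efun2))
n1 <- get("n", envir = env1)
n2 <- get("n", envir = environment(efun2))
```

Inspecting frame_names, n1 and n2 shows that nothing is added to a pre-existing environment: each closure carries the frame of the call that created it.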





> I can parse the code using *PROTECT(e =
> R_ParseVector(tmp,1,&status,R_NilValue));* where tmp is the buffer
> containing the same source. I can walk the resultant parser output and
> build an alternative abstract syntax tree (AST).
>
> I would like to include the information about closure/environments in my
> AST so that I can possibly do some optimizations.


It's not there, except potentially.  When you call the function 
"function" to create the closure, that's when the closure is created. 
That doesn't happen at parse time.  Rmlfun is created when you evaluate 
f() and then the anonymous function is created when you call Rmlfun() 
within it.
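
An R-level analogue of the C-side check may make this concrete. This is only a sketch: it uses parse() and typeof() in place of R_ParseVector() and TYPEOF(), and the variable names are illustrative. A typeof() result of "language" corresponds to LANGSXP and "closure" to CLOSXP.

```r
# A one-line version of the Rmlfun definition, kept as source text
src <- "Rmlfun <- function(x) { sumx <- sum(x); n <- length(x); function(mu) n * log(mu) - mu * sumx }"

# Parsing only builds the expression tree; nothing is evaluated yet.
exprs <- parse(text = src)
assign_call <- exprs[[1]]      # the call to `<-`
fun_expr <- assign_call[[3]]   # right-hand side: a call to `function`

parsed_type <- typeof(fun_expr)             # a language object, not a closure
head_name <- as.character(fun_expr[[1]])    # name of the function being called

# Evaluating the assignment is what actually creates the closure.
env <- new.env()
eval(assign_call, envir = env)
evaluated_type <- typeof(env$Rmlfun)

# The inner anonymous function only becomes a closure when Rmlfun is called.
inner_type <- typeof(env$Rmlfun(1:10))
```

So a walker over parser output sees calls to `function`, and has to treat each one as a place where a closure will be created at run time.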




> My question is, how can I get hold of this information?
>
> One thing I noticed while 'walking' through the parser output, I never
> encounter a CLOSXP (which I check using TYPEOF()), even though in the
> above code, closure is created. Is it the case that this information is
> meant just for the internal "eval*" function and not exposed to
> application writers?


No, there's nothing hidden, it just didn't exist at the time you were 
looking for it.


Duncan Murdoch



Re: [Rd] Efficiency of factor objects

2011-11-07 Thread Matthew Dowle
Stavros Macrakis writes:
> 
> data.table certainly has some useful mechanisms, and I've been
> experimenting with it as an implementation mechanism, though it's not a
> drop-in substitute for factors.  Also, though it is efficient for set
> operations between small sets and large sets, it is not very efficient for
> operations between two large sets

As a general statement that could do with some clarification ;) data.table 
likes keys consisting of multiple ordered columns, e.g. (id,date). It is (I 
believe) efficient for joining two large 2+ column keyed data sets because the 
upper bound of each row's one-sided binary search is localised in that case (by 
group of the previous key column).

As I understand it, Stavros has a different type of 'two large datasets' : 
English language website data. Each set is one large vector of uniformly 
distributed unique strings. That appears to be quite a different problem to 
multiple columns of many times duplicated data.

Matthew

> Thanks everyone, and if you do come across a relevant CRAN package, I'd be
> very interested in hearing about it.
> 
>   -s
>



Re: [Rd] Efficiency of factor objects

2011-11-07 Thread Milan Bouchet-Valat
On Sunday, 6 November 2011 at 19:00 -0500, Stavros Macrakis wrote:
> Milan, Jeff, Patrick,
>
> Thank you for your comments and suggestions.
>
> Milan,
>
> This is far from a "completely theoretical problem".  I am performing
> text analytics on a corpus of about 2m documents.  There are tens of
> thousands of distinct words (lemmata).  It seems to me that the
> natural representation of words is as an "enumeration type" -- in R
> terms, a "factor".
Interesting. What does your data look like? I've used the tm package,
and for me there are only two representations of text corpora: a list of
texts, each of which is basically a character string with attributes, or
a document-term matrix, with documents as rows, terms as columns, and
counts at their intersection.

So I wonder how you're using factors. Do you have a factor containing
words for each text?

> Why do I think factors are the "natural way" of representing such
> things?  Because for most kinds of analysis, only their identity
> matters (not their spelling as words), but the human user would like
> to see names, not numbers. That is pretty much the definition of an
> enumeration type. In terms of R implementation, R is very efficient in
> dealing with integer identities and indexing (e.g. tabulate) and not
> very efficient in dealing with character identities -- indeed, 'table'
> first converts strings into factors.  Of course I could represent the
> lemmata as integers, and perform the translation between integers and
> strings myself, but that would just be duplicating the function of an
> enumeration type.
My point was that the efficiency of factors is due to the redundancy of
their levels. You usually have very few levels, and many observations
(in my work, often 10 levels and 100,000s of observations). If each
level only appears a few times on average, you don't save that much
memory by using a factor.
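
A rough way to see this is to compare object.size() for both representations. The sketch below uses made-up data and illustrative names, and assumes a 64-bit build of R, where a character vector element costs 8 bytes and an integer code 4 bytes.

```r
set.seed(1)
n <- 100000L

# Case 1: few levels, many observations (the usual factor use case)
few_chr <- sample(sprintf("level%02d", 1:10), n, replace = TRUE)
few_fac <- factor(few_chr)

# Case 2: every value is distinct, so there are as many levels as observations
uniq_chr <- sprintf("word%06d", seq_len(n))
uniq_fac <- factor(uniq_chr)

# Sizes in bytes for each representation
sizes <- sapply(
  list(few_chr = few_chr, few_fac = few_fac,
       uniq_chr = uniq_chr, uniq_fac = uniq_fac),
  function(v) as.numeric(object.size(v))
)
print(sizes)
```

With few levels the factor is the smaller object. With all-distinct values the factor is larger, because it stores the integer codes in addition to a levels vector that is as long as the data.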

Since you have a real use case for this, I withdraw my criticism that your
suggestion was useless. ;-) But I'm still not sure the R core developers
would want to follow it, since your application can be considered
non-standard and worth a specialized class.


Cheers



Re: [Rd] CRAN: How to list a non-Sweave doc under "Vignettes:" on package page?

2011-11-07 Thread Michael Friendly

On 11/6/2011 7:41 PM, Henrik Bengtsson wrote:

> Hi,
>
> is it possible to have non-Sweave vignettes(*) in inst/doc/ be listed
> under 'Downloads' on CRAN package pages?  For instance, in my R.rsp
> package I have an inst/doc/report.pdf (part of the source *.tar.gz)
> that is not detected/listed.  The PDF is not based on a Sweave
> vignette but an *.tex.rsp vignette that is dynamically created via
> inst/doc/Makefile.  It is listed
>
> (*) BTW, can the term "vignette" be used for any inst/doc/ document,
> or should it be reserved for Sweave+LaTeX-based documents?



I have a related problem/question, and a request to R-Core to consider
relaxing the requirements for vignettes when, for one reason or another,
they cannot be built entirely via Sweave.  In such cases, perhaps
package authors could provide alternative metadata in the form of a
00index.html or something similar to make such vignettes more visible.

Those who face this problem would then be able to figure out a
Makefile or manual way to maintain the metadata file. What would be
required to implement this?

In my case, my Guerry package has a vignette, inst/doc/MultiSpat.pdf, 
originally built entirely with Sweave.  However, the vignette require()d 
a package only on R-Forge, which the author does not wish
to release to CRAN.  At some point, ~ R 2.10, this triggered a 
WARNING/ERROR from the CRAN check daemon, in spite of the fact that the
vignette .Rnw file contained the following hack designed to make sure 
that all necessary packages were available anywhere:


\subsection{Installation and loading of required packages}
Several packages must be installed to run the different analyses:
<=FALSE, width=7, height=7>>=

pkg <- c("maptools","spdep","ade4","Guerry","spacemakeR")
inst.pkg <- row.names(installed.packages())
pkg2inst <- pmatch(pkg,inst.pkg)
if(any(is.na(pkg2inst[1:4])))
  install.packages(pkg[which(is.na(pkg2inst[1:4]))], repos="http://cran.at.r-project.org")

if(is.na(pkg2inst[5]))
  install.packages("spacemakeR", repos="http://R-Forge.R-project.org")

library(maptools)
library(ade4)
library(spdep)
library(spacemakeR)
library(Guerry)
@

However, this hack was deemed unacceptable for a CRAN package vignette.
In the end, the only solution that would satisfy the CRAN check daemon 
was to delete the source inst/doc/MultiSpat.Rnw file from the package.
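
For comparison, here is a sketch of a check that only reports missing packages instead of installing them. This is an illustration, not something proposed in the thread, and whether it would satisfy the CRAN checks is not established here. The function name missing_packages is hypothetical.

```r
# Hypothetical helper: return the subset of `pkg` that is not installed.
# `installed` defaults to the names of all packages found in the library paths.
missing_packages <- function(pkg, installed = rownames(installed.packages())) {
  pkg[!pkg %in% installed]
}

needed <- c("maptools", "spdep", "ade4", "Guerry", "spacemakeR")
absent <- missing_packages(needed)

# Report what is missing and leave installation to the reader.
if (length(absent) > 0) {
  message("Missing packages: ", paste(absent, collapse = ", "))
}
```

A vignette could use the result to skip the chunks that depend on unavailable packages, at the cost of not reproducing the full analysis on such machines.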


Consequently, the .pdf vignette remains in the package, but it is not
listed as a vignette on CRAN, nor is it found via vignette().


best,
-Michael


--
Michael Friendly Email: friendly AT yorku DOT ca
Professor, Psychology Dept.
York University  Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street    Web:   http://www.datavis.ca
Toronto, ONT  M3J 1P3 CANADA



Re: [Rd] Efficiency of factor objects

2011-11-07 Thread Stavros Macrakis
Matthew,

Yes, the case I am thinking of is a 1-column key; sorry for the
overgeneralization.  I haven't thought much about the multi-column key case.

-s

On Mon, Nov 7, 2011 at 12:48, Matthew Dowle wrote:

> Stavros Macrakis writes:
> >
> > data.table certainly has some useful mechanisms, and I've been
> > experimenting with it as an implementation mechanism, though it's not a
> > drop-in substitute for factors.  Also, though it is efficient for set
> > operations between small sets and large sets, it is not very efficient
> > for operations between two large sets
>
> As a general statement that could do with some clarification ;) data.table
> likes keys consisting of multiple ordered columns, e.g. (id,date). It is (I
> believe) efficient for joining two large 2+ column keyed data sets because
> the upper bound of each row's one-sided binary search is localised in that
> case (by group of the previous key column).
>
> As I understand it, Stavros has a different type of 'two large datasets' :
> English language website data. Each set is one large vector of uniformly
> distributed unique strings. That appears to be quite a different problem to
> multiple columns of many times duplicated data.
>
> Matthew
>
> > Thanks everyone, and if you do come across a relevant CRAN package, I'd
> > be very interested in hearing about it.
> >
> >   -s



Re: [Rd] CRAN: How to list a non-Sweave doc under "Vignettes:" on package page?

2011-11-07 Thread Berwin A Turlach
G'day Henrik,

On Sun, 6 Nov 2011 16:41:22 -0800
Henrik Bengtsson wrote:

> is it possible to have non-Sweave vignettes(*) in inst/doc/ be listed
> under 'Downloads' on CRAN package pages?  

As far as I know, only by a little trick.  Create an Sweave based
vignette that uses the pdfpages package to include the .pdf file that
you want to have listed.  This dummy vignette should then be listed on
CRAN.

See the lasso2 package for an example.
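
A minimal sketch of such a wrapper, written out from R so that it can be tried in a scratch directory. The file names report.pdf (from Henrik's message) and report-wrapper.Rnw are used for illustration only, and this is not the actual lasso2 vignette.

```r
# Lines of a dummy Sweave vignette that just pulls in an existing PDF.
# It contains no R code chunks; pdfpages does all the work.
wrapper <- c(
  "%\\VignetteIndexEntry{Report (wrapper around report.pdf)}",
  "\\documentclass{article}",
  "\\usepackage{pdfpages}",
  "\\begin{document}",
  "% include every page of the pre-built PDF unchanged",
  "\\includepdf[pages=-, fitpaper=true]{report.pdf}",
  "\\end{document}"
)

# Written to a temporary directory here; in a package it would sit in inst/doc/.
out <- file.path(tempdir(), "report-wrapper.Rnw")
writeLines(wrapper, out)
```

The wrapper is given a different base name from the included PDF so that building it does not overwrite report.pdf.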

The vignette in inst/doc/ in that package is actually a bit more
complicated than necessary.  As I think there is no point in having two
nearly identical copies of a PDF file in a package, I use .Rbuildignore
to keep the original PDF file out of the source package.  This
started to create a problem when R decided to rebuild vignettes during
the checking process and pdfpages decided to hang if the PDF file to be
included was missing.

HTH.

Cheers,

Berwin



Re: [Rd] CRAN: How to list a non-Sweave doc under "Vignettes:" on package page?

2011-11-07 Thread Hadley Wickham
> How CRAN behaves and how the help package system behaves may be two
> different problems.  My question is specifically on how CRAN works.
>
> To have inst/doc/ documents to be listed on the package's help page,
> you can add an inst/doc/index.html file, cf. Section 'Writing package
> vignettes' in 'Writing R Extensions'. You can use the following
> index.html file as a template:

But they still won't be listed under vignette() - I see this solution
as a temporary hack until non-Sweave vignettes become first class
citizens.

Hadley

-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
