Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows

2012-05-01 Thread Matthew Dowle

Antonio Piccolboni writes:
> Hi,
> I was wondering if there is anything more efficient than split to do the
> kind of conversion in the subject. If I create a data frame as in
> 
> system.time({fd =  data.frame(x=1:2000, y = rnorm(2000), id = paste("x",
> 1:2000, sep =""))})
>   user  system elapsed
>   0.004   0.000   0.004
> 
> and then I try to split it
> 
> > system.time(split(fd, 1:nrow(fd)))
>    user  system elapsed
>   0.333   0.031   0.415
> 
> You will be quick to notice the roughly two orders of magnitude difference
> in time between creation and conversion. Granted, it's not written anywhere
> that they should be similar but the latter seems interpreter-slow to me
> (split is implemented with a lapply in the data frame case) There is also a
> memory issue when I hit about 2 elements (allocating 3GB when
> interrupted). So before I resort to Rcpp, despite the electrifying feeling
> of approaching the bare metal and for the sake of getting things done, I
> thought I would ask the experts. Thanks
> 
> Antonio

Perhaps r-help or Stack Overflow would have been more appropriate to try first, 
before r-devel. If you did, please say so.

Answering anyway. Do you really want to split every single row? What's the 
bigger picture? Perhaps you don't need to split at all.

On the off chance that the example was just for exposition, and applying some 
(biased) guesswork, have you seen the data.table package? It doesn't use the 
split-apply-combine paradigm because, as your (extreme) example shows, that 
doesn't scale. When you use the 'by' argument of [.data.table, it allocates 
memory once for the largest group. Then it reuses that same memory for each 
group. That's one reason it's fast and memory efficient at grouping (an order 
of magnitude faster than tapply).
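
The cost of materialising one data frame per group can be seen even in base R. The following is only a sketch of the general idea (split the row indices, not the data frame); it is not how data.table is implemented, and the names fd, g, by_df and by_idx are made up for the illustration.

```r
set.seed(1)
fd <- data.frame(g = rep(c("a", "b", "c"), each = 4),
                 y = rnorm(12), stringsAsFactors = FALSE)

# split-apply-combine: builds one small data frame per group
by_df <- vapply(split(fd, fd$g), function(d) mean(d$y), numeric(1))

# same result, but only integer row indices are split;
# no per-group data frames are created
idx <- split(seq_len(nrow(fd)), fd$g)
by_idx <- vapply(idx, function(i) mean(fd$y[i]), numeric(1))

all.equal(by_df, by_idx)   # TRUE
```

With many small groups, the second form avoids most of the per-group data.frame overhead, which is the part that dominates in the 2000-row example.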

Independent timings :
http://www.r-bloggers.com/comparison-of-ave-ddply-and-data-table/

If you really do want to split every single row, then
DT[,,by=1:nrow(DT)]
will give perhaps two orders of magnitude speedup, but that's an unfair example 
because it isn't very realistic. Scaling applies both to the size of the 
data.frame and to how finely you want to split it up. Your example is extreme 
in the latter but not the former. data.table scales in both.

It's nothing to do with the interpreter, btw, just memory usage.

Matthew

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows

2012-05-01 Thread Prof Brian Ripley

On 01/05/2012 00:28, Antonio Piccolboni wrote:
> Hi,
> I was wondering if there is anything more efficient than split to do the
> kind of conversion in the subject. If I create a data frame as in
>
> system.time({fd =  data.frame(x=1:2000, y = rnorm(2000), id = paste("x",
> 1:2000, sep =""))})
>    user  system elapsed
>   0.004   0.000   0.004
>
> and then I try to split it
>
> system.time(split(fd, 1:nrow(fd)))
>    user  system elapsed
>   0.333   0.031   0.415
>
> You will be quick to notice the roughly two orders of magnitude difference
> in time between creation and conversion. Granted, it's not written anywhere

Unsurprising when you create three orders of magnitude more data frames, 
is it?  That's a list of 2000 data frames.  Try

system.time(for(i in 1:2000) data.frame(x = i, y = rnorm(1), id = 
paste0("x", i)))

> that they should be similar but the latter seems interpreter-slow to me
> (split is implemented with a lapply in the data frame case) There is also a
> memory issue when I hit about 2 elements (allocating 3GB when
> interrupted). So before I resort to Rcpp, despite the electrifying feeling
> of approaching the bare metal and for the sake of getting things done, I
> thought I would ask the experts. Thanks

You need to re-think your data structures: 1-row data frames are not 
sensible.

> Antonio


--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



Re: [Rd] The constant part of the log-likelihood in StructTS

2012-05-01 Thread Ravi Varadhan
This is not a problem at all.  The log likelihood function is a function of the 
model parameters and the data, but it is defined up to an additive arbitrary 
constant, i.e. L(\theta) and L(\theta) + k are completely equivalent, for any 
k. This does not affect model comparisons or hypothesis tests.

Ravi

From: r-devel-boun...@r-project.org [r-devel-boun...@r-project.org] on behalf 
of Jouni Helske [jounihel...@gmail.com]
Sent: Monday, April 30, 2012 7:37 AM
To: r-devel@r-project.org
Subject: [Rd] The constant part of the log-likelihood in StructTS

Dear all,

I'd like to discuss a possible bug in the function StructTS of the stats
package. It seems that the function returns the wrong value of the
log-likelihood, as the constant added to the relevant part of the
log-likelihood is misspecified. Here is a simple example:

> data(Nile)
> fit <- StructTS(Nile, type = "level")
> fit$loglik
[1] -367.5194

When computing the log-likelihood with other packages such as KFAS and FKF,
the loglikelihood value is around -645.

For the local level model, the log-likelihood is defined by -0.5*n*log(2*pi) -
0.5*sum(log(F_t) + v_t^2/F_t) (see for example Durbin and Koopman
(2001), page 30). But in StructTS, the log-likelihood is computed like this:

loglik <- -length(y) * res$value + length(y) * log(2 * pi),

where the first part coincides with the last part of the definition, but
the constant part has the wrong sign and is not multiplied by 0.5. Also, in
the case of missing observations, I think there should be sum(!is.na(y))
instead of length(y) in the constant term, as the likelihood is only
computed for those y which are observed.
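
For readers following the formula, here is a minimal sketch in R of the prediction error decomposition for the local level model, log L = -0.5*sum(log(2*pi) + log(F_t) + v_t^2/F_t), with the sum taken over observed y only. The function name local_level_loglik and the arguments a1 and P1 (initial state mean and variance) are invented for this illustration. It is not the code used by StructTS, KFAS or FKF, and it relies on a large P1 rather than an exact diffuse initialisation.

```r
# Gaussian log-likelihood of the local level model
#   y_t = alpha_t + eps_t,  alpha_{t+1} = alpha_t + eta_t
# computed with a scalar Kalman filter (illustrative sketch only).
local_level_loglik <- function(y, var_eps, var_eta, a1 = 0, P1 = 1e7) {
  a <- a1          # predicted state mean
  P <- P1          # predicted state variance
  ll <- 0
  for (t in seq_along(y)) {
    if (!is.na(y[t])) {
      Ft <- P + var_eps            # prediction error variance F_t
      v  <- y[t] - a               # prediction error v_t
      ll <- ll - 0.5 * (log(2 * pi) + log(Ft) + v^2 / Ft)
      K  <- P / Ft                 # Kalman gain
      a  <- a + K * v
      P  <- P * (1 - K)
    }
    # missing y[t]: no update and no likelihood contribution
    P <- P + var_eta               # one-step-ahead prediction
  }
  ll
}
```

Because missing values are skipped entirely, the constant term is effectively -0.5*sum(!is.na(y))*log(2*pi), in line with the remark about length(y) above.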

This does not affect the estimation of model parameters, but it could have
effects on model comparison or in some other cases.

Is there some reason for this kind of constant, or is it just a bug?

Best regards,

Jouni Helske
PhD student in Statistics
University of Jyväskylä
Finland



Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows

2012-05-01 Thread Antonio Piccolboni
It seems like people need to hear more context, happy to provide it. I am
implementing a serialization format (typedbytes, HADOOP-1722 if people want
the gory details) to make R and Hadoop interoperate better (RHadoop
project, package rmr). It is a row first format and it's already
implemented as a C extension for R for lists and atomic vectors, where each
element  of a vector is a row. I need to extend it to accept data frames
and I was wondering if I can use the existing C code by converting a data
frame to a list of its rows. It sounds like the answer is that it is not a
good idea, that's helpful too in a way because it restricts the options. I
thought I may be missing a simple primitive, like a t() for data frames
(that doesn't coerce to matrix). Thanks

Antonio



Re: [Rd] The constant part of the log-likelihood in StructTS

2012-05-01 Thread Jouni Helske
Ok, it seems that R's AIC and BIC functions warn about different constants,
so that's probably enough. The constants are not irrelevant though: if you
compute the log-likelihood of one model using StructTS, then fit an
alternative model using other functions such as arima(), which do take
the constant term into account, and use those log-likelihoods to compute,
for example, BIC, you get wrong results when checking which model gives the
lower BIC value. I hadn't thought about it before; I will have to be more
careful in future when comparing results from different packages etc.
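
To put a number on this: assuming the only discrepancy is the constant described in this thread (+n*log(2*pi) in StructTS against -0.5*n*log(2*pi) in the textbook definition), the offset in the log-likelihood is 1.5*n*log(2*pi). The back-of-the-envelope check below uses the loglik value quoted in the thread rather than calling StructTS, whose output may differ between R versions; the variable names are illustrative.

```r
n        <- length(Nile)            # 100 annual observations
reported <- -367.5194               # fit$loglik as quoted in this thread
shift    <- 1.5 * n * log(2 * pi)   # (+n*log(2*pi)) - (-0.5*n*log(2*pi))
adjusted <- reported - shift

round(shift, 2)       # 275.68
round(adjusted, 2)    # -643.2
round(2 * shift, 2)   # 551.36, the resulting offset in AIC or BIC
```

The adjusted value is close to the figure of around -645 reported for KFAS and FKF; the remaining gap is not explained by the constant alone. An offset of roughly 551 in BIC is far larger than typical differences between competing models, so a comparison against arima() would be decided by the constant.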

Jouni




Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows

2012-05-01 Thread Simon Urbanek

On May 1, 2012, at 1:26 PM, Antonio Piccolboni  wrote:

> It seems like people need to hear more context, happy to provide it. I am
> implementing a serialization format (typedbytes, HADOOP-1722 if people want
> the gory details) to make R and Hadoop interoperate better (RHadoop
> project, package rmr). It is a row first format and it's already
> implemented as a C extension for R for lists and atomic vectors, where each
> element  of a vector is a row. I need to extend it to accept data frames
> and I was wondering if I can use the existing C code by converting a data
> frame to a list of its rows. It sounds like the answer is that it is not a
> good idea,

Just think about it -- data frames are lists of *columns* because the type of 
each column is fixed. Treating them row-wise is extremely inefficient, because 
you can't use any vector type to represent such a thing (other than a generic 
vector containing vectors of length 1).
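
A small illustration of the point, as a sketch only: the columns are typed vectors, while each row has to become a generic list. The conversion via Map() is one possible base-R approach, not something proposed in this thread, and it assumes no column is named f or clashes with an argument of mapply(). The names fd and rows are illustrative.

```r
# Small data frame mirroring the thread's example (3 rows instead of 2000)
fd <- data.frame(x = 1:3, y = c(0.1, 0.2, 0.3),
                 id = paste0("x", 1:3), stringsAsFactors = FALSE)

is.list(fd)           # TRUE: a data frame is a list of columns
sapply(fd, typeof)    # x "integer", y "double", id "character"

# Row-wise view: one generic list per row, each holding length-1 vectors
rows <- do.call(Map, c(list(f = list), fd))
length(rows)          # 3
str(rows[[2]])        # list with x = 2L, y = 0.2, id = "x2"
```

This produces plain lists rather than 1-row data frames, so it avoids the data.frame construction cost, but every row still needs its own list plus one small vector per column.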


> that's helpful too in a way because it restricts the options. I
> thought I may be missing a simple primitive, like a t() for data frames
> (that doesn't coerce to matrix).

See above - I think you are misunderstanding data frames - t() makes no sense 
for data frames.
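
For reference, this is what t() does with a mixed-type data frame: it goes through as.matrix(), so everything is coerced to a single type. A minimal sketch, with fd and m as illustrative names:

```r
fd <- data.frame(x = 1:3, y = c(0.5, 1.5, 2.5),
                 id = paste0("x", 1:3), stringsAsFactors = FALSE)

m <- t(fd)
is.matrix(m)    # TRUE
typeof(m)       # "character": the numeric columns were converted too
dim(m)          # 3 3
rownames(m)     # "x" "y" "id"
```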

Cheers,
Simon





Re: [Rd] A doubt about substitute() after delayedAssign()

2012-05-01 Thread Philippe Grosjean

On 29/04/12 13:50, Duncan Murdoch wrote:

On 12-04-29 3:30 AM, Philippe Grosjean wrote:
 > Hello,
 >
 > ?delayedAssign presents substitute() as a way to look at the expression
 > in the promise. However,
 >
 > msg<- "old"
 > delayedAssign("x", msg)
 > msg<- "new!"
 > x #- new!
 > substitute(x) #- x (was 'msg' ?)
 >
 > Here, we just got 'x'... shouldn't we got 'msg'?
 >
 > Same result when the promise is not evaluated yet:
 >
 > delayedAssign("x", msg)
 > substitute(x)
 >
 > In a function, that works:
 >
 > foo<- function (x = msg) substitute(x)
 > foo()
 >
 > Did I misunderstood something? It seems to me that substitute() does not
 > behaves as documented for promises created using delayedAssign().

I don't think this is well documented, but substitute() doesn't act the
same when its "env" argument is the global environment. So this works
the way you'd expect:

e <- new.env()
msg <- "old"
delayedAssign("x", msg, assign=e)
msg <- "new"
e$x
substitute(x, e)

I forget what the motivation was for special-casing globalenv().
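
Building on the example above, a pure R helper can retrieve the promise expression for any environment other than the global one. This is only a sketch: promise_expr is a made-up name, and the helper inherits the globalenv() special case, so it returns the bare symbol there. For an ordinary (non-promise) variable it returns the value, as substitute() does.

```r
promise_expr <- function(name, env) {
  stopifnot(is.character(name), length(name) == 1L, is.environment(env))
  # Build the call substitute(<name>, env) and evaluate it in this frame
  eval(substitute(substitute(X, env), list(X = as.name(name))))
}

e <- new.env()
msg <- "old"
delayedAssign("x", msg, assign.env = e)

promise_expr("x", e)    # msg (the unevaluated expression)
msg <- "new"
e$x                     # "new": the promise is forced only here
```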

Duncan Murdoch


In the corresponding C code, there is a comment saying that this is for 
"historical reasons". Are these historical reasons so important that 
there is no way, using R code (not C code), to know whether a symbol is 
bound to a promise in .GlobalEnv? Anyway, I have filed a bug report 
because, at the very least, the documentation of ?delayedAssign and 
?substitute should be clarified, as well as the example for 
delayedAssign. But unless there is a good reason not to, it would be 
better to perform the substitution even in .GlobalEnv or, alternatively, 
to provide a function like promiseExpr() to get the expression.

Here are a couple of potentially useful functions (using the inline 
package for convenience; note also that I had to use the trick of 
passing the substituted name of the variable to get the promise at the C 
level, which would be unnecessary if these were special base 
functions that pass unevaluated arguments):


## is.promise(): check if a name is bound to a promise
require(inline)
code <- '
  SEXP obj;
  if (!isString(name) || length(name) != 1)
    error("name is not a single string");
  if (!isEnvironment(envir))
    error("envir should be an environment");
  obj = findVar(install(CHAR(STRING_ELT(name, 0))), envir);
  return ScalarLogical(TYPEOF(obj) == PROMSXP);
'
is.promise <- cfunction(signature(name = "character", envir = "environment"),
                        code)
formals(is.promise) <- alist(x =, name = deparse(substitute(x)),
                             envir = parent.frame(1))

## isEvaluated(): determine if a promise has already been evaluated;
## always returns TRUE if the name is bound to something other
## than a promise
code <- '
  SEXP obj;
  if (!isString(name) || length(name) != 1)
    error("name is not a single string");
  if (!isEnvironment(envir))
    error("envir should be an environment");
  obj = findVar(install(CHAR(STRING_ELT(name, 0))), envir);
  if (TYPEOF(obj) == PROMSXP && PRVALUE(obj) == R_UnboundValue) {
    return ScalarLogical(FALSE);
  } else {
    /* if it is not a promise, it is always evaluated! */
    return ScalarLogical(TRUE);
  }
'
isEvaluated <- cfunction(signature(name = "character", envir = "environment"),
                         code)
formals(isEvaluated) <- alist(x =, name = deparse(substitute(x)),
                              envir = parent.frame(1))

## promiseExpr(): retrieve the expression associated with a promise,
## even if it is in .GlobalEnv, which substitute() does not do!
code <- '
  SEXP obj;
  if (!isString(name) || length(name) != 1)
    error("name is not a single string");
  if (!isEnvironment(envir))
    error("envir should be an environment");
  obj = findVar(install(CHAR(STRING_ELT(name, 0))), envir);
  if (TYPEOF(obj) == PROMSXP) {
    return PREXPR(obj);
  } else {
    return R_NilValue;
  }
'
promiseExpr <- cfunction(signature(name = "character", envir = "environment"),
                         code)
formals(promiseExpr) <- alist(x =, name = deparse(substitute(x)),
                              envir = parent.frame(1))

## promiseEnv(): get the evaluation environment associated with a promise
code <- '
  SEXP obj;
  if (!isString(name) || length(name) != 1)
    error("name is not a single string");
  if (!isEnvironment(envir))
    error("envir should be an environment");
  obj = findVar(install(CHAR(STRING_ELT(name, 0))), envir);
  if (TYPEOF(obj) == PROMSXP) {
    return PRENV(obj);
  } else {
    return R_NilValue;
  }
'
promiseEnv <- cfunction(signature(name = "character", envir = "environment"),
                        code)
formals(promiseEnv) <- alist(x =, name = deparse(substitute(x)),
                             envir = parent.frame(1))

## reeval(): re-evaluate a promise that has already been evaluated.
## An environment for the evaluation is required since PRENV is set
## to NULL on promise evaluation
code <- '
  SEXP obj;
  if (!isString(name) || length(name) != 1)
    error("name is not a single string");
  if (!isEnvironment(envir))
    error("envir should be an environment");
  if (!is

Re: [Rd] A doubt about substitute() after delayedAssign()

2012-05-01 Thread Duncan Murdoch

On 12-05-01 4:21 PM, Philippe Grosjean wrote:

> In the corresponding C code, there is a comment telling that it is for
> "historical reasons". Are these historical reasons that important that
> there is no way using R code (not C code) to know if a symbol is bind to
> a promise in .GlobalEnv?


I don't know.  I believe I lost an argument similar to yours a few years 
ago, so I won't spend time on this again.


Duncan Murdoch


Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows

2012-05-01 Thread Antonio Piccolboni
On Tue, May 1, 2012 at 11:29 AM, Simon Urbanek
wrote:

>
> On May 1, 2012, at 1:26 PM, Antonio Piccolboni wrote:
>
> > It seems like people need to hear more context, happy to provide it. I am
> > implementing a serialization format (typedbytes, HADOOP-1722 if people want
> > the gory details) to make R and Hadoop interoperate better (RHadoop
> > project, package rmr). It is a row first format and it's already
> > implemented as a C extension for R for lists and atomic vectors, where each
> > element of a vector is a row. I need to extend it to accept data frames
> > and I was wondering if I can use the existing C code by converting a data
> > frame to a list of its rows. It sounds like the answer is that it is not a
> > good idea,
>
> Just think about it -- data frames are lists of *columns* because the type
> of each column is fixed. Treating them row-wise is extremely inefficient,
> because you can't use any vector type to represent such thing (other than a
> generic vector containing vectors of length 1).
>

Thanks, let's say this together with the experiments and other converging
opinions lays the question to rest.


> > that's helpful too in a way because it restricts the options. I
> > thought I may be missing a simple primitive, like a t() for data frames
> > (that doesn't coerce to matrix).
>
> See above - I think you are misunderstanding data frames - t() makes no
> sense for data frames.
>

I think you are misunderstanding my use of t(). Thanks


Antonio

