Re: [Rd] arbitrary size data frame or other structs, curious about issues involved.

2011-06-21 Thread Mike Marchywka



Thanks,

http://cran.r-project.org/doc/manuals/R-ints.html#Future-directions

Normally I'd take more time to digest these things before commenting, but
a few things struck me right away. First, use of floating point or double
as a replacement for int strikes me as "going the wrong way": often, to get
predictable performance, you try to tell the compiler you have ints rather
than any floating type, which it is free to "round." This is even ignoring
any performance issue. The other thing is that scaling should not just be a
matter of "make everything bigger", as the growth in data needs and in
computer resources is not uniform.

I guess my first thought on these constraints and resource issues
is to consider a paged data frame, depending upon the point at which
the 32-bit int constraint is imposed. A random-access data structure
does not always get accessed randomly; often access is purely sequential.
Further down the road, it would be nice if algorithms were implemented in a
block mode, or could communicate their access patterns to the data structure,
or at least tell it to prefetch things that will be needed soon.
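
To make the kind of block-oriented, sequential access I mean concrete, here is
a rough sketch in plain R (the file name and dimensions are made up) that walks
a matrix stored column-major in a flat binary file one column at a time, so
only a single block is ever in memory:

    n <- 1e7; p <- 50                                  # assumed dimensions
    con <- file("big.bin", open = "rb")                # hypothetical flat binary file
    colsums <- numeric(p)
    for (j in seq_len(p)) {
        block <- readBin(con, what = "double", n = n)  # one column read as one block
        colsums[j] <- sum(block)
    }
    close(con)

A paged data frame could hide exactly this kind of bookkeeping behind the usual
indexing interface.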

I guess I'm thinking mostly along the lines of things I've seen from Intel,
such as (the first things I could find on their site, as I have not looked at
this in detail in quite a while):

http://www.google.com/search?hl=en&source=hp&q=site%3Aintel.com+performance+optimization

since once you get past thrashing virtual memory, you'd like to preserve the
lower-level memory cache hit rates as well. These are probably not just
niceties, at least with VM, as I've personally seen implementation-related
speed issues make simple analyses impractical.

> Subject: RE: arbitrary size data frame or other structs, curious about issues 
> involved.
> From: jayemer...@gmail.com
> To: marchy...@hotmail.com; r-devel@r-project.org
> 
> Mike,
> 
> 
> Neither bigmemory nor ff are "drop in" solutions -- though useful,
> they are primarily for data storage and management and allowing
> convenient access to subsets of the data.  Direct analysis of the full
> objects via most R functions is not possible.  There are many issues
> that could be discussed here (and have, previously), including the use
> of 32-bit integer indexing.  There is a nice section "Future
> Directions" in the R Internals manual that you might want to look at.
> 
> Jay
> 
> 
> -  Original message:
> 
> We keep getting questions on r-help about memory limits  and
> I was curious to know what issues are involved in making
> common classes like dataframe work with disk and intelligent
> swapping? That is, sure you can always rely on OS for VM
> but in theory it should be possible to make a data structure
> that somehow knows what pieces you will access next and
> can keep those somewhere fast. Now of course algorithms
> "should" act locally and be block oriented but in any case
> could communicate with data structures on upcoming
> access patterns, see a few ms into the future and have the
> right stuff prefetched.
> 
> I think things like "bigmemory" exist but perhaps one
> issue was that this could not just drop in for data.frame
> or does it already solve all the problems?
> 
> Is memory management just a non-issue or is there something
> that needs to be done  to make large data structures work well?
> 
> 
> -- 
> John W. Emerson (Jay)
> Associate Professor of Statistics
> Department of Statistics
> Yale University
> http://www.stat.yale.edu/~jay
  

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Error using RcppGSL

2011-06-21 Thread oyvfos
Hi, I get an error using RcppGSL: fatal error: gsl/gsl_vector.h:No such file
or directory.  What is the best way to install these files as they seem to
be missing?
Thanks,
Oyvind


--
View this message in context: 
http://r.789695.n4.nabble.com/Error-using-RcppGSL-tp3613535p3613535.html
Sent from the R devel mailing list archive at Nabble.com.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] requesting a mentor for R development

2011-06-21 Thread Ben Bolker
Paul Johnson <...@gmail.com> writes:

> 
> I'd like to learn the process of revising R functions & packages and
> then submitting proposed patches to the R Core team.  Would someone be
> willing to mentor me through one example?

  I don't know about mentoring, but I'll give my two cents (some
specific and some general thoughts, and some of my own questions).

> For starters, consider an example.  I'd like to revise the t.test
> function to return the stderr value to the user.  We only need to
> change the "rval" in the third-from-the end line of
> stats:::t.test.default.
> 
> Change this:
> 
>  rval <- list(statistic = tstat, parameter = df, p.value = pval,
> conf.int = cint, estimate = estimate, null.value = mu,
> alternative = alternative, method = method, data.name = dname)
> class(rval) <- "htest"
> return(rval)
> 
> To this:
> 
>  rval <- list(statistic = tstat, parameter = df, stderr = stderr, p.value = pval,
> conf.int = cint, estimate = estimate, null.value = mu,
> alternative = alternative, method = method, data.name = dname)
> class(rval) <- "htest"
> return(rval)
> 
> Here is where I need help.
> 
> 1. What other changes in the R code & documents would be required for
> consistency?

  You would certainly want to modify the corresponding manual page.

  There is an issue with fitting this specific change into the existing
R framework.  't.test' returns an object of class 'htest', which is
printed using 'stats:::print.htest'.  I haven't actually tried it,
but it looks to me as though print.htest ('htest' is intended as a
generic class for statistical tests -- maybe the h stands for
"hypothesis", I don't know) will simply ignore "stderr" or
any other non-default component of an 'htest' object.
There is a '...' argument to 'print.htest', but it only serves
to pass options to print() for the estimate and the null values.

  I haven't been able to find any specific documentation about
htest.  print.htest() is used for a wide range of tests (in the
stats package and in many contributed packages), so obviously it
would be bad to make non-backward-compatible changes.  I think
it might be best to modify print.htest() so that it looks
for any additional/non-standard components of the object and
prints them after all the other output (they would also need
reasonable labels incorporated somehow).
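
  Very roughly, I mean something like the following sketch (this is not the
actual stats:::print.htest code, and the helper name is made up; the real
method would keep all of its existing output and just append this kind of
loop at the end):

print_extra_htest_bits <- function(x, ...) {
    ## the standard 'htest' components (the ones in the rval list quoted above)
    standard <- c("statistic", "parameter", "p.value", "conf.int",
                  "estimate", "null.value", "alternative", "method",
                  "data.name")
    ## print anything beyond those, using its list name as the label
    for (nm in setdiff(names(x), standard)) {
        cat(nm, ":\n", sep = "")
        print(x[[nm]], ...)
    }
    invisible(x)
}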


> 
> 2. How does R Core Team expect/accept patches?
> I do understand a bit about SVN and CVS.  I think I could mark this up
> and make a diff, but I'm uncertain about the details.

  I think that patches via "svn diff" against the latest SVN are
preferred. The only routes I know for submission are (1) sending
patch files to R-devel or (2) submitting them to the R bugs system
as a "wishlist" item.  I don't know which is preferred.

> 3. How would I best  prepare a suggestion like this so as to get it
> accepted into R?

  I think the accepted method (although I don't know whether this
is documented anywhere) is to do what you have done, submitting a
"request for discussion" on r-devel, getting it discussed, posting
patches for comment, and hoping for the best.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Error using RcppGSL

2011-06-21 Thread Douglas Bates
Questions like this would get a faster response on the Rcpp-devel list, to
which I am copying this reply.
On Jun 21, 2011 6:35 AM, "oyvfos"  wrote:
> Hi, I get an error using RcppGSL: fatal error: gsl/gsl_vector.h:No such
file
> or directory. What is the best way to install these files as they seem to
> be missing?
> Thanks,
> Oyvind
>
>
> --
> View this message in context:
http://r.789695.n4.nabble.com/Error-using-RcppGSL-tp3613535p3613535.html
> Sent from the R devel mailing list archive at Nabble.com.
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] arbitrary size data frame or other structs, curious about issues involved.

2011-06-21 Thread Simon Urbanek
Mike,

this is all nice, but AFAICS the first part misses the point that there is no
64-bit integer type in the API, so there is simply no alternative at the moment.
You just said that you don't like it, but you failed to provide a solution ...
As for the second part, the idea is not new and is a noble one, but AFAIK no one
so far has been able to draft a good proposal for what the API would look like.
It would be very desirable if someone did, though. (BTW, your link is useless -
linking Google searches is pointless, as the results vary by request location,
user settings, etc.)
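
FWIW, correctness is not really the issue with using doubles as indices: a
double represents integers exactly far beyond the 32-bit range, which is what
makes it the pragmatic stop-gap. A quick illustration at the R prompt:

    .Machine$integer.max   # 2147483647, the current 32-bit index limit
    2^53 - 1 == 2^53       # FALSE: integers up to 2^53 are still exact in a double
    2^53     == 2^53 + 1   # TRUE:  exactness is lost only beyond 2^53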

Cheers,
Simon


On Jun 21, 2011, at 6:33 AM, Mike Marchywka wrote:

> Thanks,
> 
> http://cran.r-project.org/doc/manuals/R-ints.html#Future-directions
> 
> Normally I'd take more time to digest these things before commenting, but
> a few things struck me right away. First, use of floating point or double
> as a replacement for int strikes me as "going the wrong way": often, to get
> predictable performance, you try to tell the compiler you have ints rather
> than any floating type, which it is free to "round." This is even ignoring
> any performance issue. The other thing is that scaling should not just be a
> matter of "make everything bigger", as the growth in data needs and in
> computer resources is not uniform.
> 
> I guess my first thought on these constraints and resource issues
> is to consider a paged data frame, depending upon the point at which
> the 32-bit int constraint is imposed. A random-access data structure
> does not always get accessed randomly; often access is purely sequential.
> Further down the road, it would be nice if algorithms were implemented in a
> block mode, or could communicate their access patterns to the data structure,
> or at least tell it to prefetch things that will be needed soon.
> 
> I guess I'm thinking mostly along the lines of things I've seen from Intel,
> such as (the first things I could find on their site, as I have not looked
> at this in detail in quite a while):
> 
> http://www.google.com/search?hl=en&source=hp&q=site%3Aintel.com+performance+optimization
> 
> since once you get past thrashing virtual memory, you'd like to preserve the
> lower-level memory cache hit rates as well. These are probably not just
> niceties, at least with VM, as I've personally seen implementation-related
> speed issues make simple analyses impractical.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] arbitrary size data frame or other structs, curious about issues involved.

2011-06-21 Thread Mike Marchywka



> 
> Mike,
> 
> this is all nice, but AFAICS the first part misses the point that there is no 
> 64-bit integer type in the API, so there is simply no alternative at the 
> moment. You just said that you don't like it, but you failed to provide a 
> solution ... As for the second part, the idea is not new and is a noble one, 
> but AFAIK no one so far has been able to draft a good proposal for what the 
> API would look like. It would be very desirable if someone did, though. (BTW, 
> your link is useless - linking Google searches is pointless, as the results 
> vary by request location, user settings, etc.)

I guess, in reverse order: the Google link was intended as a convenience for
those interested, since I could not find a specific link and didn't expect much
spam to be there ("it's all good"), so the results may not be predictable but,
just like floating point, should be close enough for the curious analyst. I'm
not trying to provide a solution until I understand the problem.

There are many issues with "big data", and I'll try to explain my concerns, but
they need to be discussed in a somewhat integrated way to see how they relate,
and to check whether my understanding of R is correct (before I dig into it,
I want to know what to look for).


A 32-bit int still has a cardinality in the billions, and there are separate
issues of index range versus memory size. A typical data frame may point to
thousands of rows with many columns of mixed type, none holding less than
4 bytes of content. So simply in terms of using up physical memory, I would
not think the 32-bit index is the limitation; a square array effectively
already has a 64-bit "pointer" to a given element (32*2, LOL). An
arbitrary-size frame, up to the limits of the indexing, could easily exceed
physical memory, and as I understand it R can bomb at that point, or even
with VM run into speed issues.
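
To put rough numbers on that (my own back-of-the-envelope arithmetic, nothing
R-specific): a single column of doubles at the full 32-bit index limit is
already about 16 GB, so RAM runs out long before the index does.

    rows <- 2^31 - 1     # the current index limit
    rows * 8 / 2^30      # GiB for one double column at that limit: about 16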

Simply being able to select the storage order could be a big deal depending on
the access pattern: rows, columns, bit-reversed, etc. This could prevent VM
thrashing well before you hit a 32-bit API limit, and it could be transparent
beyond adding a new constructor method. And in fact, when you have many large
operands, you may want to tell a given data-frame subclass to ONLY keep so
much in physical memory at a time. Resource contention and starvation
(fighting over food, i.e. data) can be a bottleneck.


data.frame(storage = "bit_reversed",
           physical_mem_limit = "some absolute or relative limit here")




In any case, you can imagine adding something like a paging method to a 32-bit
API that would be transparent to small data sets, although I'd have to give it
some thought. This would only make sense in cases where accesses tend to occur
in blocks, but that could cover a lot of situations.

I guess I can look at bigmemory and the related classes for some idea of what
is going on here.

For purely sequential access, I guess I was looking for some kind of streaming
data source; there, anything related to size should be well contained.
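
For concreteness, the kind of streaming, block-at-a-time access I have in mind
can already be faked in plain R over a text file (the file name and chunk size
here are made up):

    con <- file("big.csv", open = "r")                   # hypothetical file
    header <- strsplit(readLines(con, n = 1), ",")[[1]]  # column names
    repeat {
        lines <- readLines(con, n = 100000)              # one block of rows
        if (length(lines) == 0) break
        chunk <- read.csv(textConnection(lines), header = FALSE,
                          col.names = header)
        ## ... operate on 'chunk'; only one block is ever held in memory ...
    }
    close(con)

A real streaming data source would just wrap this sort of loop behind a
cleaner interface.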

Re: [Rd] [Rcpp-devel] Error using RcppGSL

2011-06-21 Thread Dirk Eddelbuettel

On 21 June 2011 at 09:50, Douglas Bates wrote:
| Questions like this would get a faster response on the Rcpp-devel list, to
| which I am copying this reply.

Quite right.  Dropping r-devel now.  You need to subscribe to rcpp-devel if
you want to post there.
 
| On Jun 21, 2011 6:35 AM, "oyvfos"  wrote:
| > Hi, I get an error using RcppGSL: fatal error: gsl/gsl_vector.h:No such file
| > or directory. What is the best way to install these files as they seem to
| > be missing?

What platform, ie what operating system?

Do you have GSL installed?  Including development headers and libraries?  Did
you _ever_ compile anything against GSL?
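
One quick check from within R, assuming GSL's development package is installed
and its gsl-config script is on the PATH (it ships with the -dev / -devel
packages on the common Linux distributions):

    Sys.which("gsl-config")               # "" means the development files are not visible
    system("gsl-config --cflags --libs")  # the include and link flags RcppGSL needs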

Dirk

| > Thanks,
| > Oyvind
| >
| >
| > --
| > View this message in context: http://r.789695.n4.nabble.com/
| Error-using-RcppGSL-tp3613535p3613535.html
| > Sent from the R devel mailing list archive at Nabble.com.
| >
| > __
| > R-devel@r-project.org mailing list
| > https://stat.ethz.ch/mailman/listinfo/r-devel
| 
| --
| ___
| Rcpp-devel mailing list
| rcpp-de...@lists.r-forge.r-project.org
| https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel

-- 
Gauss once played himself in a zero-sum game and won $50.
  -- #11 at http://www.gaussfacts.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel