Re: [Rd] arbitrary size data frame or other structs, curious about issues involved.
Thanks,

http://cran.r-project.org/doc/manuals/R-ints.html#Future-directions

Normally I'd take more time to digest these things before commenting, but a few things struck me right away. First, use of floating point or double as a replacement for int strikes me as going the wrong way: often, to get predictable performance, you try to tell the compiler you have ints rather than any floating type, which it is free to round. This is even ignoring any performance issue. The other thing is that scaling should not just be a matter of "make everything bigger", as the growth in both data needs and computer resources is not uniform.

My first thought on these constraints and resource issues is to consider a paged data frame, depending on the point at which the 32-bit int constraint is imposed. A random-access data structure does not always get accessed randomly; often access is purely sequential. Further down the road, it would be nice if algorithms were implemented in a block mode or could communicate their access patterns to the data structure, or at least tell it to prefetch things that will be needed soon.

I guess I'm thinking mostly along the lines of things I've seen from Intel, such as (the first things I could find on their site, as I have not looked in detail in quite a while):

http://www.google.com/search?hl=en&source=hp&q=site%3Aintel.com+performance+optimization

Once you get around thrashing virtual memory, you'd like to preserve the lower-level memory cache hit rates too. These are probably not just niceties, at least with VM; I've personally seen implementation-related speed issues make simple analyses impractical.

> Subject: RE: arbitrary size data frame or other structs, curious about issues involved.
> From: jayemer...@gmail.com
> To: marchy...@hotmail.com; r-devel@r-project.org
>
> Mike,
>
> Neither bigmemory nor ff are "drop-in" solutions -- though useful, they
> are primarily for data storage and management, allowing convenient
> access to subsets of the data. Direct analysis of the full objects via
> most R functions is not possible. There are many issues that could be
> discussed here (and have been, previously), including the use of 32-bit
> integer indexing. There is a nice section, "Future Directions", in the
> R Internals manual that you might want to look at.
>
> Jay
>
> ----- Original message: -----
>
> We keep getting questions on r-help about memory limits, and I was
> curious to know what issues are involved in making common classes like
> data.frame work with disk and intelligent swapping. That is, sure, you
> can always rely on the OS for VM, but in theory it should be possible
> to make a data structure that somehow knows what pieces you will access
> next and can keep those somewhere fast. Now of course algorithms
> "should" act locally and be block oriented, but in any case they could
> communicate with data structures about upcoming access patterns, see a
> few ms into the future, and have the right stuff prefetched.
>
> I think things like bigmemory exist, but perhaps one issue was that
> this could not just drop in for data.frame -- or does it already solve
> all the problems?
>
> Is memory management just a non-issue, or is there something that needs
> to be done to make large data structures work well?
>
> --
> John W. Emerson (Jay)
> Associate Professor of Statistics
> Department of Statistics
> Yale University
> http://www.stat.yale.edu/~jay

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Error using RcppGSL
Hi,

I get an error using RcppGSL: fatal error: gsl/gsl_vector.h: No such file or directory. What is the best way to install these files, as they seem to be missing?

Thanks,
Oyvind

--
View this message in context: http://r.789695.n4.nabble.com/Error-using-RcppGSL-tp3613535p3613535.html
Sent from the R devel mailing list archive at Nabble.com.
Re: [Rd] requesting a mentor for R development
Paul Johnson <...@gmail.com> writes:

> I'd like to learn the process of revising R functions & packages and
> then submitting proposed patches to the R Core team. Would someone be
> willing to mentor me through one example?

I don't know about mentoring, but I'll give my two cents (some specific and some general thoughts, and some of my own questions).

> For starters, consider an example. I'd like to revise the t.test
> function to return the stderr value to the user. We only need to
> change the "rval" in the third-from-the-end line of
> stats:::t.test.default.
>
> Change this:
>
>     rval <- list(statistic = tstat, parameter = df, p.value = pval,
>                  conf.int = cint, estimate = estimate, null.value = mu,
>                  alternative = alternative, method = method,
>                  data.name = dname)
>     class(rval) <- "htest"
>     return(rval)
>
> To this:
>
>     rval <- list(statistic = tstat, parameter = df, stderr = stderr,
>                  p.value = pval, conf.int = cint, estimate = estimate,
>                  null.value = mu, alternative = alternative,
>                  method = method, data.name = dname)
>     class(rval) <- "htest"
>     return(rval)
>
> Here is where I need help.
>
> 1. What other changes in the R code & documents would be required for
> consistency?

You would certainly want to modify the corresponding manual page. There is also an issue with fitting this specific change into the existing R framework: 't.test' returns an object of class 'htest', which is printed by 'stats:::print.htest'. I haven't actually tried it, but it looks to me as though print.htest (which is intended as a common print method for statistical tests -- maybe the "h" stands for "hypothesis", I don't know) will simply ignore "stderr" or any other non-default component of an 'htest' object. There is a '...' argument to 'print.htest', but it only passes options through to print() for the estimate and the null values. I haven't been able to find any specific documentation about htest.
print.htest() is used for a wide range of tests (in the stats package and in many contributed packages), so obviously it would be bad to make non-backward-compatible changes. I think it might be best to modify print.htest() so that it looks for any additional/non-standard components of the object and prints them after all the other output (they would also need reasonable labels incorporated somehow).

> 2. How does the R Core Team expect/accept patches?
> I do understand a bit about SVN and CVS. I think I could mark this up
> and make a diff, but I'm uncertain about the details.

I think patches made via "svn diff" against the latest SVN sources are preferred. The only routes I know of for submission are (1) sending patch files to R-devel or (2) submitting them to the R bug-tracking system as a "wishlist" item. I don't know which is preferred.

> 3. How would I best prepare a suggestion like this so as to get it
> accepted into R?

I think the accepted method (although I don't know whether this is documented anywhere) is to do what you have done: submit a "request for discussion" on r-devel, get it discussed, post patches for comment, and hope for the best.
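For question 2, the "svn diff" route can be sketched roughly as follows (the repository URL is R's public SVN, and the file paths are where t.test lives in the R source tree, but treat the workflow as an illustrative sketch rather than an official submission recipe):

```shell
# Check out the current R development sources.
svn checkout https://svn.r-project.org/R/trunk R-trunk
cd R-trunk

# Edit the function and its manual page, e.g.:
#   src/library/stats/R/t.test.R
#   src/library/stats/man/t.test.Rd

# Produce a unified diff against the checked-out revision, suitable
# for attaching to an R-devel post or a wishlist bug report.
svn diff > t.test-stderr.patch
```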
Re: [Rd] Error using RcppGSL
Questions like this would get a faster response on the Rcpp-devel list, to which I am copying this reply.

On Jun 21, 2011 6:35 AM, "oyvfos" wrote:
> Hi, I get an error using RcppGSL: fatal error: gsl/gsl_vector.h: No such
> file or directory. What is the best way to install these files as they
> seem to be missing?
> Thanks,
> Oyvind
Re: [Rd] arbitrary size data frame or other structs, curious about issues involved.
Mike,

this is all nice, but AFAICS the first part misses the point that there is no 64-bit integer type in the API, so there is simply no alternative at the moment. You just said that you don't like it, but you failed to provide a solution ... As for the second part, the idea is not new and it is noble, but AFAIK no one so far has been able to draft a good proposal for what the API would look like. It would be very desirable if someone did, though. (BTW, your link is useless -- linking Google searches is pointless, as the results vary by request location, user settings, etc.)

Cheers,
Simon

On Jun 21, 2011, at 6:33 AM, Mike Marchywka wrote:

> Thanks,
>
> http://cran.r-project.org/doc/manuals/R-ints.html#Future-directions
>
> Normally I'd take more time to digest these things before commenting,
> but a few things struck me right away. First, use of floating point or
> double as a replacement for int strikes me as going the wrong way:
> often, to get predictable performance, you try to tell the compiler you
> have ints rather than any floating type, which it is free to round.
> This is even ignoring any performance issue. The other thing is that
> scaling should not just be a matter of "make everything bigger", as the
> growth in both data needs and computer resources is not uniform.
>
> My first thought on these constraints and resource issues is to
> consider a paged data frame, depending on the point at which the 32-bit
> int constraint is imposed. A random-access data structure does not
> always get accessed randomly; often access is purely sequential.
> Further down the road, it would be nice if algorithms were implemented
> in a block mode or could communicate their access patterns to the data
> structure, or at least tell it to prefetch things that will be needed
> soon.
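On the "double as a replacement for int" point: an IEEE-754 double carries a 53-bit significand, so it represents every integer up to 2^53 exactly, well past the 2^31 - 1 ceiling of a 32-bit int index. A quick sketch of that boundary (Python here purely for brevity; C doubles and R numerics follow the same IEEE-754 arithmetic):

```python
# A 32-bit signed int index tops out at 2^31 - 1 (about 2.1e9 elements).
INT32_MAX = 2**31 - 1

# An IEEE-754 double has a 53-bit significand, so every integer up to
# 2^53 (about 9.0e15) has an exact double representation.
EXACT_LIMIT = 2**53

assert float(EXACT_LIMIT) == EXACT_LIMIT              # 2^53 itself is exact
assert float(EXACT_LIMIT - 1) == EXACT_LIMIT - 1      # everything below is exact
assert float(EXACT_LIMIT + 1) == float(EXACT_LIMIT)   # 2^53 + 1 rounds back down

print(INT32_MAX, EXACT_LIMIT)  # → 2147483647 9007199254740992
```

In other words, for any element count a double-indexed R object would realistically reach, a double length or index loses no integer precision; rounding only begins past 2^53.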
Re: [Rd] arbitrary size data frame or other structs, curious about issues involved.
> Mike,
>
> this is all nice, but AFAICS the first part misses the point that there
> is no 64-bit integer type in the API so there is simply no alternative
> at the moment. You just said that you don't like it, but you failed to
> provide a solution ... As for the second part, the idea is not new and
> is noble, but AFAIK no one so far was able to draft any good proposal
> as of what the API would look like. It would be very desirable if
> someone did, though. (BTW your link is useless - linking google
> searches is pointless as the results vary by request location, user
> setting etc.)

In reverse order: the Google link was intended as a convenience for those interested, since I could not find a specific link; I didn't expect much spam to be there ("it's all good"), so while the results may not be predictable, they should -- just like floating point -- be close enough for the curious analyst.

I'm not trying to provide a solution until I understand the problem. There are many issues with "big data", and I'll try to explain my concerns, but they need to be discussed in a somewhat integrated way to see how they relate, and to check that my understanding of R is correct (before I dig into it, I want to know what to look for). A 32-bit int still has a cardinality of multiple gigs, and the index range is a separate issue from memory size. A typical data frame may point to thousands of rows with many columns of mixed type, none with less than 4 bytes of content. So, simply in terms of exhausting physical memory, I would not think the 32-bit issue is the limitation; indeed, a square array already has, in effect, a 64-bit pointer to a given element (32 x 2, LOL). An arbitrary-size frame, up to the limits of the indexing, could easily exceed physical memory, and as I understand it R can bomb at that point, or even with VM can have speed issues. Simply being able to select the storage order -- by rows, by columns, bit-reversed, etc. -- could be a big deal depending on the access pattern.
This could prevent VM thrashing well before you hit a 32-bit API limit, and it could be transparent beyond adding a new constructor method. And when you have many larger operands, you may want to tell a given data.frame subclass to ONLY keep so much in physical memory at a time: resource contention and starvation -- fighting for food (data) -- can be a bottleneck. Something like:

    data.frame(storage = "bit_reversed",
               physical_mem_limit = "some absolute or relative limit here")

In any case, you may be able to imagine adding something like a paging method to a 32-bit API that would be transparent to small data sets, although I'd have to give it some thought. This would only make sense where accesses tend to occur in blocks, but that could cover a lot of situations. I guess I can look at bigmemory and the related classes for some idea of what is going on here. For purely sequential access, I was looking for some kind of streaming data source; there, anything related to size may be well contained.
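The paged-structure idea discussed in this thread can be made concrete with a toy sketch (Python for brevity; `PagedVector` and all its names are invented for illustration and are not an actual R, data.frame, or bigmemory API): keep only a bounded number of fixed-size pages resident, evict the least recently used, and observe that a purely sequential scan faults exactly once per page.

```python
from collections import OrderedDict

class PagedVector:
    """Toy block-paged vector: at most `max_pages` pages of `page_size`
    elements are 'resident' at once; the rest live in a backing store
    (a plain dict here, standing in for a file on disk)."""

    def __init__(self, data, page_size=4, max_pages=2):
        self.page_size = page_size
        self.max_pages = max_pages
        self.backing = {}  # page number -> list of values
        for i, value in enumerate(data):
            self.backing.setdefault(i // page_size, []).append(value)
        self.resident = OrderedDict()  # LRU cache of resident pages
        self.faults = 0

    def _page(self, p):
        if p not in self.resident:
            self.faults += 1
            if len(self.resident) >= self.max_pages:
                self.resident.popitem(last=False)  # evict least recently used
            self.resident[p] = self.backing[p]
        self.resident.move_to_end(p)  # mark as most recently used
        return self.resident[p]

    def __getitem__(self, i):
        return self._page(i // self.page_size)[i % self.page_size]

v = PagedVector(range(16), page_size=4, max_pages=2)
scan = [v[i] for i in range(16)]  # sequential scan touches each page once
print(v.faults)                   # → 4 (one fault per page)
```

A "storage order" option like the bit_reversed example above would only change how indices map onto pages; the residency bookkeeping stays the same, which is why the paging could be transparent to small data sets.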
Re: [Rd] [Rcpp-devel] Error using RcppGSL
On 21 June 2011 at 09:50, Douglas Bates wrote:
| Questions like this would get a faster response on the Rcpp-devel list, to
| which I am copying this reply.

Quite right. Dropping r-devel now. You need to subscribe to rcpp-devel if you want to post there.

| On Jun 21, 2011 6:35 AM, "oyvfos" wrote:
| > Hi, I get an error using RcppGSL: fatal error: gsl/gsl_vector.h: No such
| > file or directory. What is the best way to install these files as they
| > seem to be missing?

What platform, i.e. what operating system? Do you have GSL installed? Including development headers and libraries? Did you _ever_ compile anything against GSL?

Dirk

--
Gauss once played himself in a zero-sum game and won $50. -- #11 at http://www.gaussfacts.com
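The usual cause of that error is that only GSL's runtime library is present, not its development headers. A sketch of the fix on common platforms (package names follow Debian/Ubuntu and Fedora conventions; adjust for your distribution):

```shell
# Debian/Ubuntu: install development headers plus gsl-config
sudo apt-get install libgsl0-dev    # on newer releases the package is libgsl-dev

# Fedora/RHEL equivalent:
# sudo yum install gsl-devel

# Verify the headers can be found before rebuilding against RcppGSL:
gsl-config --cflags --libs
```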