Re: [Rd] Efficiency of factor objects

2011-11-07 Thread Stavros Macrakis
Matthew, Yes, the case I am thinking of is a 1-column key; sorry for the overgeneralization. I haven't thought much about the multi-column key case. -s On Mon, Nov 7, 2011 at 12:48, Matthew Dowle wrote: > Stavros Macrakis writes: > > data.table certainly has some us

Re: [Rd] Efficiency of factor objects

2011-11-07 Thread Milan Bouchet-Valat
On Sunday, 6 November 2011 at 19:00 -0500, Stavros Macrakis wrote: > Milan, Jeff, Patrick, > > Thank you for your comments and suggestions. > > Milan, > > This is far from a "completely theoretical problem". I am performing text analytics on a corpus of about 2m documents. There

Re: [Rd] Efficiency of factor objects

2011-11-07 Thread Matthew Dowle
Stavros Macrakis writes: > data.table certainly has some useful mechanisms, and I've been experimenting with it as an implementation mechanism, though it's not a drop-in substitute for factors. Also, though it is efficient for set operations between small sets and large set

[Rd] Efficiency of factor objects

2011-11-06 Thread Stavros Macrakis
Milan, Jeff, Patrick, Thank you for your comments and suggestions. Milan, This is far from a "completely theoretical problem". I am performing text analytics on a corpus of about 2m documents. There are tens of thousands of distinct words (lemmata). It seems to me that the natural representat

Re: [Rd] Efficiency of factor objects

2011-11-05 Thread Patrick Burns
Perhaps 'data.table' would be a package on CRAN that would be acceptable. On 05/11/2011 16:45, Jeffrey Ryan wrote: Or better still, extend R via the mechanisms in place. Something akin to a fast factor package. Any change to R causes downstream issues in (hundreds of?) millions of lines of dep

Re: [Rd] Efficiency of factor objects

2011-11-05 Thread Jeffrey Ryan
Or better still, extend R via the mechanisms in place. Something akin to a fast factor package. Any change to R causes downstream issues in (hundreds of?) millions of lines of deployed code. It almost seems hard to fathom that a package for this doesn't already exist. Have you searched CRAN? Je

Re: [Rd] Efficiency of factor objects

2011-11-05 Thread Milan Bouchet-Valat
On Friday, 4 November 2011 at 19:19 -0400, Stavros Macrakis wrote: > R factors are the natural way to represent factors -- and should be efficient since they use small integers. But in fact, for many (but not all) operations, R factors are considerably slower than integers, or even chara

[Rd] Efficiency of factor objects

2011-11-04 Thread Stavros Macrakis
R factors are the natural way to represent factors -- and should be efficient since they use small integers. But in fact, for many (but not all) operations, R factors are considerably slower than integers, or even character strings. This appears to be because whenever a factor vector is subsetted