[Rd] reshape scaling with large numbers of times/rows

2006-08-23 Thread Mitch Skinner
Hello,

I'm mailing r-devel because I think the performance problem I'm having
is best solved by re-writing reshape() in C.  I've been reading the
"writing R extensions" documentation and I have some questions about
the best way to write the C bits, but first let me describe my problem:

I'm trying to reshape a long data frame with ~70 subjects measured at
~4500 "times" (in my case, it's ~4500 genetic loci on a chromosome) into
a wide data frame with one column per locus ("time").  On my data
(~300,000 rows for chromosome 1) it takes about twenty minutes on a
3.4GHz P4.  Here's an R session that demonstrates it (this is pretty
close to how my data actually looks):

> version
               _
platform       i686-redhat-linux-gnu
arch           i686
os             linux-gnu
system         i686, linux-gnu
status
major          2
minor          3.1
year           2006
month          06
day            01
svn rev        38247
language       R
version.string Version 2.3.1 (2006-06-01)
> test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) +
1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))),
genotype=I(as.character(as.integer(runif(3e5, 1, 100)))))
> system.time(testWide <- reshape(test, v.names=c("genotype"),
timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
[1] 1107.506  120.755 1267.568    0.000    0.000

I believe that this can be done a lot faster, and I think the problem
comes from this loop in reshape.R (in reshapeWide):

for(i in seq(length = length(times))) {
    thistime <- data[data[,timevar] %in% times[i],]
    rval[,varying[,i]] <-
        thistime[match(rval[,idvar], thistime[,idvar]), v.names]
}

I don't really understand everything going on under the hood here, but I
believe the complexity of this loop is something like
O(length(times) * (nrow(data) + nrow(rval)*length(varying))).  The profile
shows the lion's share (90%) of the time being spent in [.data.frame.

What I'd like to do is write a C loop to go through the source (long)
data frame once (rather than once per time), and put the values into the
destination rows/columns using hash tables to look up the right
row/column.

Assuming the hashing is constant-time bounded, then the reshape becomes
O(nrow(data)*length(varying)).
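For comparison, the same single-pass idea can be sketched in plain R without any C: factor() plays the role of the hash table (mapping ids and times to integer indices), and a two-column index matrix fills every destination cell in one vectorized assignment.  fast_wide() and its arguments are illustrative names, not part of reshape():

```r
# Sketch: one pass over the long data, hash-like lookups via factor codes.
# fast_wide() is a hypothetical helper, not part of base R's reshape().
fast_wide <- function(data, idvar, timevar, v.name) {
  ids   <- factor(data[[idvar]])    # "hash" of subject -> row index
  times <- factor(data[[timevar]])  # "hash" of locus   -> column index
  out <- matrix(NA_character_,
                nrow = nlevels(ids), ncol = nlevels(times),
                dimnames = list(levels(ids), levels(times)))
  # One vectorized scatter: each long row lands in its (id, time) cell.
  out[cbind(as.integer(ids), as.integer(times))] <- as.character(data[[v.name]])
  out
}
```

This returns a character matrix rather than a data frame, but the cost is one pass over the long data plus the two factor() calls.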

I'd like to use the arbitrary-R-vector hashing functionality from
unique.c, but I'm not sure how to do it.  I think that functionality is
useful to other C code, but the functions that I'm interested in are not
part of the R api (they're defined with "static").  Assuming copy/paste
is out, I can see two options: 1. to remove "static" from the
declarations of some of the functions, and add prototypes for those
functions to a new src/library/stats/src/reshape.c file (or to one of
the header files in src/include), or 2. to add C functions to do the
reshaping to src/main/unique.c and call those from
src/library/stats/R/reshape.R.

This is all assuming that it makes sense to include this in mainline
R--obviously I think it's worthwhile, and I'm surprised other people
aren't complaining more.  I would be happy to write/rewrite until y'all
are happy with how it's done, of course.

I've written a proof-of-concept by copying and pasting the hashing
functions, which (on the above data frame) runs 20 times faster than the
R version of reshape.  It still needs some debugging, to put it mildly
(doesn't work properly on reals), but the basic idea appears to work.

The change I made to the reshape R function looks like this:
=
 for(i in seq(length = length(times))) {
-    thistime <- data[data[,timevar] %in% times[i],]
-    rval[,varying[,i]] <-
-        thistime[match(rval[,idvar], thistime[,idvar]), v.names]
+    for (j in seq(length(v.names))) {
+        rval[,varying[j,i]] <-
+            rep(vector(mode=typeof(data[[v.names[j]]]), 0),
+                length.out=nrow(rval))
+    }
 }

+colMap <- match(varying, names(rval))
+.Call("do_reshape_wide",
+      data[[idvar]], data[[timevar]], data[v.names],
+      rval, colMap,
+      v.names, times, rval[[idvar]])
+
 if (!is.null(new.row.names))
     row.names(rval) <- new.row.names
=

This part:
rep(vector(mode=typeof(data[[v.names[j]]]), 0), length.out=nrow(rval))
is to initialize the output with appropriately-typed vectors full of
NAs; if there's a better/right way to do this please let me know.
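For reference, recycling a zero-length vector with rep() does pad with NA of the matching type, which is why the initialization above works:

```r
# rep() pads past the end of a vector with NA, preserving the type:
rep(vector(mode = "integer", 0), length.out = 3)    # three integer NAs
rep(vector(mode = "character", 0), length.out = 2)  # two character NAs
```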

The do_reshape_wide C function looks like this:
=
SEXP do_reshape_wide(SEXP longIds, SEXP longTimes, SEXP longData,
                     SEXP wideFrame, SEXP colMap,
                     SEXP vnames, SEXP times, SEXP wideIds) {
    HashData idHash, timeHash;
    int longRows, numVarying;
    int rowNum, varying;
    int wideRow, curTime;
    SEXP wideCol, longCol;

    HashTableSetup(wideIds, &idHash);
PROTE

Re: [Rd] reshape scaling with large numbers of times/rows

2006-08-24 Thread Mitch Skinner
I'd like to thank everyone that's replied so far--more inline:

On Thu, 2006-08-24 at 11:16 +0100, Prof Brian Ripley wrote:
> Your example does not correspond to your description.  You have taken a 
> random number of loci for each subject and measured each a random number 
> of times:

You're right.  I was trying to come up with an example that didn't
require sending out a big hunk of data.  The overall number of
rows/columns and the data types/sizes in the example were true to life
but the relationship between columns was not.  Also, in my testing the
run time of the random example was pretty close to (actually faster
than) the run time on my real data.

In the real data, there's about one row per subject/locus pair (some
combinations are missing).  The genotype data does have character type;
I'd have to think a bit to see if I could make it into an integer
vector.  Aside from just making it a factor, of course.
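That factor route would look something like this (the genotype strings here are made up for illustration):

```r
# Round-tripping character genotypes through compact integer factor codes:
g <- c("A/A", "A/G", "G/G", "A/A")
f <- factor(g)
codes <- as.integer(f)     # integer representation, one code per level
back  <- levels(f)[codes]  # recover the original strings exactly
```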

Thanks to Gabor Grothendieck for demonstrating gl():

> betterTest=data.frame(subject=as.character(1:70),
locus=as.character(gl(4500, 70)),
genotype=as.character(as.integer(runif(4500*70, 1, 20))))
> sapply(betterTest, is.factor)
 subject    locus genotype
    TRUE     TRUE     TRUE
> system.time(wideTest <- reshape(betterTest, v.names="genotype",
timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
[1] 1356.209  178.867 2071.640    0.000    0.000
> dim(wideTest)
[1]   70 4501
> dim(betterTest)
[1] 315000  3

This was on a different machine (a 2.2 Ghz Athlon 64).  The only
difference I can think of between betterTest and my actual data is that
betterTest is ordered.

> Also, subject and locus are archetypal factors, and forcing them to be 
> character vectors is just making efficiency problems for yourself.

Hmm, that's the way they're coming out of the database.  I'm using
RdbiPgSQL from Bioconductor, and I assumed there was a reason why the
database interface wasn't turning things into factors.  Given my (low)
level of R knowledge, I'd have to think for a while to convince myself
that doing so wouldn't make a difference aside from being faster.  Of
course, if you're asserting that that's the case I'll take your word for
it.

> I have an R-level solution that takes 0.2 s on my machine, and involves no 
> changes to R.
> 
> However, you did not give your affiliation and I do not like giving free 
> consultancy to undisclosed commercial organizations.  Please in future use 
> a proper signature block so that helpers are aware of your provenance.

Ah, I hadn't really thought about this, but I see where you're coming
from.  I work here (my name and this email address are on the page):
http://egcrc.org/pis/white-c.htm
Please forgive my r-devel-newbieness; this is less of an issue on the
other mailing lists I follow.

When there's a chance (however slim, in this case) that something I
write will end up getting used by someone else, I usually use my
personal email address and general identity, because I know it'll follow
me if I change jobs.  The concern, of course, being that someone using
it will want to get in touch with me sometime in the far future.  I
don't exactly have a tenured position.

I really am trying to give at least as much as I'm taking; hopefully my
first email shows that I did a healthy bit of
thinking/reading/googling/coding before posting (maybe too much).
Apparently the c-solution isn't necessary, but doing this in 0.2s is
pretty amazing.  On the same size data frame?

Thanks,
Mitch Skinner                  Tel: 510-985-3192
Programmer/Analyst
Ernest Gallo Clinic & Research Center
University of California, San Francisco

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] reshape scaling with large numbers of times/rows

2006-08-24 Thread Mitch Skinner
On Thu, 2006-08-24 at 08:57 -0400, Gabor Grothendieck wrote:
> If your Z in reality is not naturally numeric try representing it as a
> factor and using
> the numeric levels as your numbers and then put the level labels back on:
> 
> m <- n <- 5
> DF <- data.frame(X = gl(m*n, 1), Y = gl(m, n), Z = letters[1:25])
> Zn <- as.numeric(DF$Z)
> system.time(w1 <- reshape(DF, timevar = "X", idvar = "Y", dir = "wide"))
> system.time({Zn <- as.numeric(DF$Z)
>w2 <- xtabs(Zn ~ Y + X, DF)
>w2[w2 > 0] <- levels(DF$Z)[w2]
>w2[w2 == 0] <- NA
> })

This is pretty slick, thanks.  It looks like it works for me.  For the
archives, this is how I got back to a data frame (as.data.frame(w2)
gives me a long version again):

> m <- 4500
> n <- 70
> DF <- data.frame(X = gl(m, n), Y = 1:n, Z = letters[1:25])
> system.time({Zn <- as.numeric(DF$Z)
+w2 <- xtabs(Zn ~ Y + X, DF)
+w2[w2 > 0] <- levels(DF$Z)[w2]
+w2[w2 == 0] <- NA
+WDF <- data.frame(Y=dimnames(w2)$Y)
+for (col in dimnames(w2)$X) { WDF[col]=w2[,col] }
+ })
[1] 131.888   1.240 135.945   0.000   0.000
> dim(WDF)
[1]   70 4501
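One possible shortcut (untested against the full data set): as.data.frame() on an xtabs result dispatches to the table method, which is why it comes back long; calling the matrix method directly keeps the wide shape and avoids the column-copying loop above:

```r
# as.data.frame.matrix() treats the contingency table as a plain matrix,
# giving a wide data frame directly (small stand-in data for illustration):
a <- rep(1:2, 2); b <- rep(1:2, each = 2); y <- c(1, 2, 3, 4)
tab <- xtabs(y ~ a + b)
wide <- as.data.frame.matrix(tab)  # one row per level of a, one column per level of b
```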

I'll have to look; maybe I can just use w2 as is.  Next time I guess
I'll try R-help first.

Thanks again,
Mitch



Re: [Rd] OT: authorship and contacts for releasing packages (Re: reshape scaling with large numbers of times/rows)

2006-08-25 Thread Mitch Skinner
If people are annoyed by the off-topic-ness, just let me know and I'll
take it off list.  Also, I don't expect to necessarily convert anyone to
my point of view, but I would like to try and articulate it a bit better
(sorry for the length).  More inline:

On Thu, 2006-08-24 at 20:23 +0100, Hin-Tak Leung wrote: 
> (1) in many contexts, your employer owns any result of
> "work for hire", so if you are paid to program on something,
> while you can put your names down, your employers are also entitled
> to erase all your claims of copyright or ownership on it. This sounds 
> very harsh, but that's how the reality is.

You're absolutely right about this.  My employer can certainly put their
name on it if they want, but so far they haven't said so.  If someone
asked them for help down the road I don't think they'd have much
interest/expertise in responding.

> (2) for most users, some reasonable expectation of *continual*
> support, by an organization, or at least more than 1 individual,
> is important. Using an anonymous or "consumer" mailer, doesn't give
> good impression about the _quality_  of the work, nor the
> _continuity_ of it. While you think it helps
> people to contact you *after* they have adopted your software,
> having an unprofessionally-looking addresses attached to it
> may deter people from *adopting* in the first place, so you lose
> before it even starts.

I have to disagree with this one.  I have a hard time imagining anyone
else at my work getting into the R internals.  However, even though I
don't own the work I do for my employer, I do feel attached to it enough
to help people with it down the road even if I change jobs.  So in my
case the best chance of continuing support is my personal contact
information.

I also disagree with the suggestion that my first messages were
anonymous because they didn't include an affiliation.  I know that a lot
of people find it easier to think primarily in terms of organizations,
but in my opinion one of the big strengths of free/open source software
developed collaboratively over the network is that individuals can get
involved.  This may be a little grandiose but I think there's a secular
shift (especially in software) away from the formal organization and
toward a less formal community of interested people.  I think the
individual is primary, rather than the organization.  Point being, it's
not anonymous because I put my name right on it.

As for whether or not that's unprofessional, I would hope that people
judge the quality of the work by the work itself.  As you know, with
free software this is much easier to do than it would have been
otherwise.

> (3) if somebody wants to contact you for a purpose that's important to
> them, they will find a way - e.g. look it up to see who might be in
> the same division or who might have been your boss, if e-mail bounces,
> askes the postmaster, etc, or brunt-force googling. What do you
> expect people will want to contact you for?

I've seen people on the linux kernel mailing list complaining that
contributors are often hard to track down after the fact.  Obviously, as
projects like R grow, they're going to attract contributions from a
larger number of people.  I'm just trying to spare people the potential
effort of tracking down the ex-boss or brute-force googling.

My main concern is that something I wrote will turn out to have been
incomplete or confusing, and someone using/extending it will want to ask
what the heck I was thinking.

> (4) lastly, there seems to be some wishful thinking between
> "...don't exactly have a tenured position..." and "...want to get
> in touch with me sometime in the far future...". Tenure is seldomly
> a result of "oh, I read-a-paper/use-a-software/whatever and it
> impresses me so much that I have to hunt down the person who did it,
> *whatever he is like*, *wherever he is*, *whatever he is doing*,
> and hire him to work for me...".

I'm not after tenure; I'm just a staff person.  The tenure thing was
just to remark that most people I know change jobs frequently enough
that work email addresses are not particularly stable.

The "far future" thing turns out to have been wishful thinking, yes,
though on the time scale of email-address-longevity "far future" isn't
terribly long.

> I have been hunted down twice for work I did as far as I remember, after
> I moved on from the context. One did a google, the other got it
> from my ex-boss - the latter wanted some answers on
> details, which I provided, and I did get an extended "thank you",
> plus a "if you ever want a reference, I am willing
> to testify to the quality of your work" statement, which is nice,
> but cannot be taken literally nor seriously.

I think it's worthwhile to make things like this easier.  How many
people would have contacted you if you had the same email address?
Also, maybe I'm naive, but I would have taken that "if you ever want a
reference" comment seriously.

I did diagnose and fix a bug in my code for an