Re: [Rd] reshape scaling with large numbers of times/rows

2006-08-24 Thread Gabor Grothendieck
If for each combination of X and Y there is at most one Z and Z is numeric
(it could be made so in your example) then you could use xtabs which is
faster:

> m <- n <- 10
> DF <- data.frame(X = gl(m*n, 1), Y = gl(m, n), Z = 10*(1:(n*m)))
> system.time(w1 <- reshape(DF, timevar = "X", idvar = "Y", dir = "wide"))
[1] 0.33 0.00 0.33   NA   NA
> system.time(w2 <- xtabs(Z ~ Y + X, DF))
[1] 0.01 0.00 0.01   NA   NA

On 8/24/06, Mitch Skinner <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I'm mailing r-devel because I think the performance problem I'm having
> is best solved by re-writing reshape() in C.  I've been reading the
> "writing R extensions" documentation and I have some questions about
> the best way to write the C bits, but first let me describe my problem:
>
> I'm trying to reshape a long data frame with ~70 subjects measured at
> ~4500 "times" (in my case, it's ~4500 genetic loci on a chromosome) into
> a wide data frame with one column per locus ("time").  On my data
> (~300,000 rows for chromosome 1) it takes about twenty minutes on a
> 3.4GHz P4.  Here's an R session that demonstrates it (this is pretty
> close to how my data actually looks):
>
> > version
>   _
> platform   i686-redhat-linux-gnu
> arch   i686
> os linux-gnu
> system i686, linux-gnu
> status
> major  2
> minor  3.1
> year   2006
> month  06
> day01
> svn rev        38247
> language   R
> version.string Version 2.3.1 (2006-06-01)
> > test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) +
> 1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))),
> genotype=I(as.character(as.integer(runif(3e5, 1, 100)))))
> > system.time(testWide <- reshape(test, v.names=c("genotype"),
> timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> [1] 1107.506  120.755 1267.568    0.000    0.000
>
> I believe that this can be done a lot faster, and I think the problem
> comes from this loop in reshape.R (in reshapeWide):
>
>     for(i in seq(length = length(times))) {
>         thistime <- data[data[,timevar] %in% times[i],]
>         rval[,varying[,i]] <-
>             thistime[match(rval[,idvar], thistime[,idvar]), v.names]
>     }
>
> I don't really understand everything going on under the hood here, but I
> believe the complexity of this loop is something like
> O(length(times)*(nrow(data) + nrow(rval)*length(varying))).  The profile
> shows the lion's share (90%) of the time being spent in [.data.frame.
>
> What I'd like to do is write a C loop to go through the source (long)
> data frame once (rather than once per time), and put the values into the
> destination rows/columns using hash tables to look up the right
> row/column.
>
> Assuming the hashing is constant-time bounded, then the reshape becomes
> O(nrow(data)*length(varying)).
>
> I'd like to use the arbitrary-R-vector hashing functionality from
> unique.c, but I'm not sure how to do it.  I think that functionality is
> useful to other C code, but the functions that I'm interested in are not
> part of the R API (they're defined with "static").  Assuming copy/paste
> is out, I can see two options: 1. to remove "static" from the
> declarations of some of the functions, and add prototypes for those
> functions to a new src/library/stats/src/reshape.c file (or to one of
> the header files in src/include), or 2. to add C functions to do the
> reshaping to src/main/unique.c and call those from
> src/library/stats/R/reshape.R.
>
> This is all assuming that it makes sense to include this in mainline
> R--obviously I think it's worthwhile, and I'm surprised other people
> aren't complaining more.  I would be happy to write/rewrite until y'all
> are happy with how it's done, of course.
>
> I've written a proof-of-concept by copying and pasting the hashing
> functions, which (on the above data frame) runs 20 times faster than the
> R version of reshape.  It still needs some debugging, to put it mildly
> (doesn't work properly on reals), but the basic idea appears to work.
>
> The change I made to the reshape R function looks like this:
> =
>  for(i in seq(length = length(times))) {
> -    thistime <- data[data[,timevar] %in% times[i],]
> -    rval[,varying[,i]] <-
> -        thistime[match(rval[,idvar], thistime[,idvar]), v.names]
> +    for (j in seq(length(v.names))) {
> +        rval[,varying[j,i]] <-
> +            rep(vector(mode=typeof(data[[v.names[j]]]), 0),
> +                length.out=nrow(rval))
> +    }
>  }
>
> +colMap <- match(varying, names(rval))
> +.Call("do_reshape_wide",
> +      data[[idvar]], data[[timevar]], data[v.names],
> +      rval, colMap,
> +      v.names, times, rval[[idvar]])
> +
>  if (!is.null(new.row.names))
>      row.names(rval) <- new.row.names
> =

Re: [Rd] reshape scaling with large numbers of times/rows

2006-08-24 Thread Mitchell Skinner
I wrote:
> It still needs some debugging, to put it mildly
> (doesn't work properly on reals), but the basic idea appears to work.

It works for reals on a 64-bit machine, but not on a 32-bit machine.  I
figure the culprit is this bit of C code:

SET_VECTOR_ELT(wideCol, wideRow, VECTOR_ELT(longCol, rowNum));

It seems like VECTOR_ELT and/or SET_VECTOR_ELT assumes that the vector
elements are word-sized.  If that's right, then they break on (at least)
real and complex vectors, right?  Is there a version of them that works
for arbitrary vectors?  Or am I using them wrong?

In other words, is there some way to do what I want in one line, or do I
have to do a switch(TYPEOF(longCol))?  I was hoping to keep the C code
type-agnostic.

Regards,
Mitch



Re: [Rd] reshape scaling with large numbers of times/rows

2006-08-24 Thread Martin Maechler
> "Mitchell" == Mitchell Skinner <[EMAIL PROTECTED]>
> on Thu, 24 Aug 2006 00:26:52 -0700 writes:

Mitchell> I wrote:
>> It still needs some debugging, to put it mildly
>> (doesn't work properly on reals), but the basic idea appears to work.

Mitchell> It works for reals on a 64-bit machine, but not on
Mitchell> a 32-bit machine.  I figure the culprit is this
Mitchell> bit of C code:


Mitchell> SET_VECTOR_ELT(wideCol, wideRow, VECTOR_ELT(longCol, rowNum));


Mitchell> It seems like VECTOR_ELT and/or SET_VECTOR_ELT
Mitchell> assumes that the vector elements are word-sized.

No.  It assumes that you use what the C API calls 'VECTOR' and what R
users usually call list()s !!!

Do reread that part of "Writing R Extensions" and rather
look at   REAL(.)[.]  and  REAL(.)[] = ..

Martin



Re: [Rd] reshape scaling with large numbers of times/rows

2006-08-24 Thread Prof Brian Ripley
On Thu, 24 Aug 2006, Martin Maechler wrote:

> > "Mitchell" == Mitchell Skinner <[EMAIL PROTECTED]>
> > on Thu, 24 Aug 2006 00:26:52 -0700 writes:
> 
> Mitchell> I wrote:
> >> It still needs some debugging, to put it mildly
> >> (doesn't work properly on reals), but the basic idea appears to work.
> 
> Mitchell> It works for reals on a 64-bit machine, but not on
> Mitchell> a 32-bit machine.  I figure the culprit is this
> Mitchell> bit of C code:
> 
> 
> Mitchell> SET_VECTOR_ELT(wideCol, wideRow, VECTOR_ELT(longCol, rowNum));
> 
> 
> Mitchell> It seems like VECTOR_ELT and/or SET_VECTOR_ELT
> Mitchell> assumes that the vector elements are word-sized.
> 
> No.  It assumes that you use what the C API calls 'VECTOR' and what R
> users usually call list()s !!!

Since this is R-devel, it might be worth giving a fuller explanation (and 
I had started writing one before seeing Martin's reply).

VECTOR_ELT is used for several types of SEXP, most commonly VECSXP (called 
'list' at R level) but also EXPRSXP ('expression') and WEAKREFSXP ('weak 
reference'), and can also be used for STRSXP ('character vector', but 
STRING_ELT is better).

The R-devel version of R catches such user errors, so it is well worth
using for code development.

> Do reread that part of "Writing R Extensions" and rather
> look at   REAL(.)[.]  and  REAL(.)[] = ..

And on the comment on wanting 'type-agnostic' C code: this is impossible,
as C stores the various types differently; note also that REAL(x)[i] =
is allowed but VECTOR_ELT(x, i) = is not.

Another hint: if you find yourself writing REAL(.)[.] repeatedly, rewrite 
it as

double *ra = REAL(a);

ra[i] = ...

since REAL(a) is a function call and can add appreciably to the execution 
time if used inside a loop.  (This is not true for internal code in R, 
BTW, where REAL is a macro.)

And yes, I know there are examples like this in "Writing R Extensions" in 
R 2.3.1, so this is another reason to use R-devel 

-- 
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



Re: [Rd] reshape scaling with large numbers of times/rows

2006-08-24 Thread Prof Brian Ripley
Your example does not correspond to your description.  You have taken a 
random number of loci for each subject and measured each a random number 
of times:

> with(test, table(table(subject, locus)))

     0      1      2      3      4      5      6      7      8
118021 114340  54963  17848   4288    814    136     16      5

Why would you want to make that into a wide data frame?  'testWide' only 
contains a subset of the original data frame since you are violating the 
assumptions of reshape().

Also, subject and locus are archetypal factors, and forcing them to be 
character vectors is just making efficiency problems for yourself.

I have an R-level solution that takes 0.2 s on my machine, and involves no 
changes to R.

However, you did not give your affiliation and I do not like giving free 
consultancy to undisclosed commercial organizations.  Please in future use 
a proper signature block so that helpers are aware of your provenance.


On Wed, 23 Aug 2006, Mitch Skinner wrote:

> Hello,
> 
> I'm mailing r-devel because I think the performance problem I'm having
> is best solved by re-writing reshape() in C.  I've been reading the
> "writing R extensions" documentation and I have some questions about
> the best way to write the C bits, but first let me describe my problem:
> 
> I'm trying to reshape a long data frame with ~70 subjects measured at
> ~4500 "times" (in my case, it's ~4500 genetic loci on a chromosome) into
> a wide data frame with one column per locus ("time").  On my data
> (~300,000 rows for chromosome 1) it takes about twenty minutes on a
> 3.4GHz P4.  Here's an R session that demonstrates it (this is pretty
> close to how my data actually looks):
> 
> > version
>_
> platform   i686-redhat-linux-gnu
> arch   i686
> os linux-gnu
> system i686, linux-gnu
> status
> major  2
> minor  3.1
> year   2006
> month  06
> day01
> svn rev        38247
> language   R
> version.string Version 2.3.1 (2006-06-01)
> > test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) +
> 1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))),
> genotype=I(as.character(as.integer(runif(3e5, 1, 100)))))
> > system.time(testWide <- reshape(test, v.names=c("genotype"),
> timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> [1] 1107.506  120.755 1267.568    0.000    0.000
> 
> I believe that this can be done a lot faster, and I think the problem
> comes from this loop in reshape.R (in reshapeWide):
> 
> for(i in seq(length = length(times))) {
>     thistime <- data[data[,timevar] %in% times[i],]
>     rval[,varying[,i]] <-
>         thistime[match(rval[,idvar], thistime[,idvar]), v.names]
> }
> 
> I don't really understand everything going on under the hood here, but I
> believe the complexity of this loop is something like
> O(length(times)*(nrow(data) + nrow(rval)*length(varying))).  The profile
> shows the lion's share (90%) of the time being spent in [.data.frame.
> 
> What I'd like to do is write a C loop to go through the source (long)
> data frame once (rather than once per time), and put the values into the
> destination rows/columns using hash tables to look up the right
> row/column.
> 
> Assuming the hashing is constant-time bounded, then the reshape becomes
> O(nrow(data)*length(varying)).
> 
> I'd like to use the arbitrary-R-vector hashing functionality from
> unique.c, but I'm not sure how to do it.  I think that functionality is
> useful to other C code, but the functions that I'm interested in are not
> part of the R API (they're defined with "static").  Assuming copy/paste
> is out, I can see two options: 1. to remove "static" from the
> declarations of some of the functions, and add prototypes for those
> functions to a new src/library/stats/src/reshape.c file (or to one of
> the header files in src/include), or 2. to add C functions to do the
> reshaping to src/main/unique.c and call those from
> src/library/stats/R/reshape.R.
> 
> This is all assuming that it makes sense to include this in mainline
> R--obviously I think it's worthwhile, and I'm surprised other people
> aren't complaining more.  I would be happy to write/rewrite until y'all
> are happy with how it's done, of course.
> 
> I've written a proof-of-concept by copying and pasting the hashing
> functions, which (on the above data frame) runs 20 times faster than the
> R version of reshape.  It still needs some debugging, to put it mildly
> (doesn't work properly on reals), but the basic idea appears to work.
> 
> The change I made to the reshape R function looks like this:
> =
>  for(i in seq(length = length(times))) {
> -    thistime <- data[data[,timevar] %in% times[i],]
> -    rval[,varying[,i]] <-
> -        thistime[match(rval[,idvar], thistime[,idvar]), v.names]

Re: [Rd] reshape scaling with large numbers of times/rows

2006-08-24 Thread Mitch Skinner
I'd like to thank everyone that's replied so far--more inline:

On Thu, 2006-08-24 at 11:16 +0100, Prof Brian Ripley wrote:
> Your example does not correspond to your description.  You have taken a 
> random number of loci for each subject and measured each a random number 
> of times:

You're right.  I was trying to come up with an example that didn't
require sending out a big hunk of data.  The overall number of
rows/columns and the data types/sizes in the example were true to life
but the relationship between columns was not.  Also, in my testing the
run time of the random example was pretty close to (actually faster
than) the run time on my real data.

In the real data, there's about one row per subject/locus pair (some
combinations are missing).  The genotype data does have character type;
I'd have to think a bit to see if I could make it into an integer
vector.  Aside from just making it a factor, of course.

Thanks to Gabor Grothendieck for demonstrating gl():

> betterTest=data.frame(subject=as.character(1:70),
locus=as.character(gl(4500, 70)),
genotype=as.character(as.integer(runif(4500*70, 1, 20))))
> sapply(betterTest, is.factor)
 subject    locus genotype
    TRUE     TRUE     TRUE
> system.time(wideTest <- reshape(betterTest, v.names="genotype",
timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
[1] 1356.209  178.867 2071.640    0.000    0.000
> dim(wideTest)
[1]   70 4501
> dim(betterTest)
[1] 315000      3

This was on a different machine (a 2.2 GHz Athlon 64).  The only
difference I can think of between betterTest and my actual data is that
betterTest is ordered.

> Also, subject and locus are archetypal factors, and forcing them to be 
> character vectors is just making efficiency problems for yourself.

Hmm, that's the way they're coming out of the database.  I'm using
RdbiPgSQL from Bioconductor, and I assumed there was a reason why the
database interface wasn't turning things into factors.  Given my (low)
level of R knowledge, I'd have to think for a while to convince myself
that doing so wouldn't make a difference aside from being faster.  Of
course, if you're asserting that that's the case I'll take your word for
it.

> I have an R-level solution that takes 0.2 s on my machine, and involves no 
> changes to R.
> 
> However, you did not give your affiliation and I do not like giving free 
> consultancy to undisclosed commercial organizations.  Please in future use 
> a proper signature block so that helpers are aware of your provenance.

Ah, I hadn't really thought about this, but I see where you're coming
from.  I work here (my name and this email address are on the page):
http://egcrc.org/pis/white-c.htm
Please forgive my r-devel-newbieness; this is less of an issue on the
other mailing lists I follow.

When there's a chance (however slim, in this case) that something I
write will end up getting used by someone else, I usually use my
personal email address and general identity, because I know it'll follow
me if I change jobs.  The concern, of course, being that someone using
it will want to get in touch with me sometime in the far future.  I
don't exactly have a tenured position.

I really am trying to give at least as much as I'm taking; hopefully my
first email shows that I did a healthy bit of
thinking/reading/googling/coding before posting (maybe too much).
Apparently the C solution isn't necessary, but doing this in 0.2s is
pretty amazing.  On the same size data frame?

Thanks,
Mitch Skinner                Tel: 510-985-3192
Programmer/Analyst
Ernest Gallo Clinic & Research Center
University of California, San Francisco



Re: [Rd] reshape scaling with large numbers of times/rows

2006-08-24 Thread Gabor Grothendieck
If your Z in reality is not naturally numeric, try representing it as a
factor, using the numeric levels as your numbers, and then putting the
level labels back on:

m <- n <- 5
DF <- data.frame(X = gl(m*n, 1), Y = gl(m, n), Z = letters[1:25])
Zn <- as.numeric(DF$Z)
system.time(w1 <- reshape(DF, timevar = "X", idvar = "Y", dir = "wide"))
system.time({Zn <- as.numeric(DF$Z)
             w2 <- xtabs(Zn ~ Y + X, DF)
             w2[w2 > 0] <- levels(DF$Z)[w2]
             w2[w2 == 0] <- NA
})


On 8/24/06, Gabor Grothendieck <[EMAIL PROTECTED]> wrote:
> If for each combination of X and Y there is at most one Z and Z is numeric
> (it could be made so in your example) then you could use xtabs which is
> faster:
>
> > m <- n <- 10
> > DF <- data.frame(X = gl(m*n, 1), Y = gl(m, n), Z = 10*(1:(n*m)))
> > system.time(w1 <- reshape(DF, timevar = "X", idvar = "Y", dir = "wide"))
> [1] 0.33 0.00 0.33   NA   NA
> > system.time(w2 <- xtabs(Z ~ Y + X, DF))
> [1] 0.01 0.00 0.01   NA   NA
>
> On 8/24/06, Mitch Skinner <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> > I'm mailing r-devel because I think the performance problem I'm having
> > is best solved by re-writing reshape() in C.  I've been reading the
> > "writing R extensions" documentation and I have some questions about
> > the best way to write the C bits, but first let me describe my problem:
> >
> > I'm trying to reshape a long data frame with ~70 subjects measured at
> > ~4500 "times" (in my case, it's ~4500 genetic loci on a chromosome) into
> > a wide data frame with one column per locus ("time").  On my data
> > (~300,000 rows for chromosome 1) it takes about twenty minutes on a
> > 3.4GHz P4.  Here's an R session that demonstrates it (this is pretty
> > close to how my data actually looks):
> >
> > > version
> >   _
> > platform   i686-redhat-linux-gnu
> > arch   i686
> > os linux-gnu
> > system i686, linux-gnu
> > status
> > major  2
> > minor  3.1
> > year   2006
> > month  06
> > day01
> > svn rev        38247
> > language   R
> > version.string Version 2.3.1 (2006-06-01)
> > > test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) +
> > 1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))),
> > genotype=I(as.character(as.integer(runif(3e5, 1, 100)))))
> > > system.time(testWide <- reshape(test, v.names=c("genotype"),
> > timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> > [1] 1107.506  120.755 1267.568    0.000    0.000
> >
> > I believe that this can be done a lot faster, and I think the problem
> > comes from this loop in reshape.R (in reshapeWide):
> >
> >     for(i in seq(length = length(times))) {
> >         thistime <- data[data[,timevar] %in% times[i],]
> >         rval[,varying[,i]] <-
> >             thistime[match(rval[,idvar], thistime[,idvar]), v.names]
> >     }
> >
> > I don't really understand everything going on under the hood here, but I
> > believe the complexity of this loop is something like
> > O(length(times)*(nrow(data) + nrow(rval)*length(varying))).  The profile
> > shows the lion's share (90%) of the time being spent in [.data.frame.
> >
> > What I'd like to do is write a C loop to go through the source (long)
> > data frame once (rather than once per time), and put the values into the
> > destination rows/columns using hash tables to look up the right
> > row/column.
> >
> > Assuming the hashing is constant-time bounded, then the reshape becomes
> > O(nrow(data)*length(varying)).
> >
> > I'd like to use the arbitrary-R-vector hashing functionality from
> > unique.c, but I'm not sure how to do it.  I think that functionality is
> > useful to other C code, but the functions that I'm interested in are not
> > part of the R API (they're defined with "static").  Assuming copy/paste
> > is out, I can see two options: 1. to remove "static" from the
> > declarations of some of the functions, and add prototypes for those
> > functions to a new src/library/stats/src/reshape.c file (or to one of
> > the header files in src/include), or 2. to add C functions to do the
> > reshaping to src/main/unique.c and call those from
> > src/library/stats/R/reshape.R.
> >
> > This is all assuming that it makes sense to include this in mainline
> > R--obviously I think it's worthwhile, and I'm surprised other people
> > aren't complaining more.  I would be happy to write/rewrite until y'all
> > are happy with how it's done, of course.
> >
> > I've written a proof-of-concept by copying and pasting the hashing
> > functions, which (on the above data frame) runs 20 times faster than the
> > R version of reshape.  It still needs some debugging, to put it mildly
> > (doesn't work properly on reals), but the basic idea appears to work.
> >
> > The change I made to the reshape R function looks like this:
> > =
> > for(i in seq(length = length(times))) {
> > -    thistime <- data[data[,timevar] %in% times[i],]

Re: [Rd] reshape scaling with large numbers of times/rows

2006-08-24 Thread Mitch Skinner
On Thu, 2006-08-24 at 08:57 -0400, Gabor Grothendieck wrote:
> If your Z in reality is not naturally numeric, try representing it as a
> factor, using the numeric levels as your numbers, and then putting the
> level labels back on:
> 
> m <- n <- 5
> DF <- data.frame(X = gl(m*n, 1), Y = gl(m, n), Z = letters[1:25])
> Zn <- as.numeric(DF$Z)
> system.time(w1 <- reshape(DF, timevar = "X", idvar = "Y", dir = "wide"))
> system.time({Zn <- as.numeric(DF$Z)
>              w2 <- xtabs(Zn ~ Y + X, DF)
>              w2[w2 > 0] <- levels(DF$Z)[w2]
>              w2[w2 == 0] <- NA
> })

This is pretty slick, thanks.  It looks like it works for me.  For the
archives, this is how I got back to a data frame (as.data.frame(w2)
gives me a long version again):

> m <- 4500
> n <- 70
> DF <- data.frame(X = gl(m, n), Y = 1:n, Z = letters[1:25])
> system.time({Zn <- as.numeric(DF$Z)
+     w2 <- xtabs(Zn ~ Y + X, DF)
+     w2[w2 > 0] <- levels(DF$Z)[w2]
+     w2[w2 == 0] <- NA
+     WDF <- data.frame(Y=dimnames(w2)$Y)
+     for (col in dimnames(w2)$X) { WDF[col]=w2[,col] }
+ })
[1] 131.888   1.240 135.945   0.000   0.000
> dim(WDF)
[1]   70 4501

I'll have to look; maybe I can just use w2 as is.  Next time I guess
I'll try R-help first.

Thanks again,
Mitch



Re: [Rd] reshape scaling with large numbers of times/rows

2006-08-24 Thread Prof Brian Ripley
Here's the essence of a solution (0.14 sec for this bit)

res <- with(betterTest, {
    subjects <- levels(subject)
    loci <- levels(locus)
    ## replace "" by as.character(NA) if you prefer
    res <- matrix("", length(subjects), length(loci),
                  dimnames = list(subjects, loci))
    ind <- cbind(as.integer(subject), as.integer(locus))
    res[ind] <- as.character(genotype)
    res
})

This produces a character matrix, mainly because that is what I had 
before.  However, I think a matrix is probably a better data structure 
than a wide data frame, but it can easily be converted (which will take a
little longer, 1.5s if you use as.data.frame but there are much faster 
ways).  It is certainly possible to do the same thing with a factor matrix 
result, but there seems to be a problem with the subsetting method for 
such objects.

I would suggest generally avoiding data frames where all the columns are 
of one type and you are concerned with efficiency.


On Thu, 24 Aug 2006, Mitch Skinner wrote:

> I'd like to thank everyone that's replied so far--more inline:
> 
> On Thu, 2006-08-24 at 11:16 +0100, Prof Brian Ripley wrote:
> > Your example does not correspond to your description.  You have taken a 
> > random number of loci for each subject and measured each a random number 
> > of times:
> 
> You're right.  I was trying to come up with an example that didn't
> require sending out a big hunk of data.  The overall number of
> rows/columns and the data types/sizes in the example were true to life
> but the relationship between columns was not.  Also, in my testing the
> run time of the random example was pretty close to (actually faster
> than) the run time on my real data.
> 
> In the real data, there's about one row per subject/locus pair (some
> combinations are missing).  The genotype data does have character type;
> I'd have to think a bit to see if I could make it into an integer
> vector.  Aside from just making it a factor, of course.
> 
> Thanks to Gabor Grothendieck for demonstrating gl():
> 
> > betterTest=data.frame(subject=as.character(1:70),
> locus=as.character(gl(4500, 70)),
> genotype=as.character(as.integer(runif(4500*70, 1, 20))))
> > sapply(betterTest, is.factor)
>  subject    locus genotype
>     TRUE     TRUE     TRUE
> > system.time(wideTest <- reshape(betterTest, v.names="genotype",
> timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> [1] 1356.209  178.867 2071.640    0.000    0.000
> > dim(wideTest)
> [1]   70 4501
> > dim(betterTest)
> [1] 315000      3
> 
> This was on a different machine (a 2.2 GHz Athlon 64).  The only
> difference I can think of between betterTest and my actual data is that
> betterTest is ordered.
> 
> > Also, subject and locus are archetypal factors, and forcing them to be 
> > character vectors is just making efficiency problems for yourself.
> 
> Hmm, that's the way they're coming out of the database.  I'm using
> RdbiPgSQL from Bioconductor, and I assumed there was a reason why the
> database interface wasn't turning things into factors.  Given my (low)
> level of R knowledge, I'd have to think for a while to convince myself
> that doing so wouldn't make a difference aside from being faster.  Of
> course, if you're asserting that that's the case I'll take your word for
> it.
> 
> > I have an R-level solution that takes 0.2 s on my machine, and involves no 
> > changes to R.
> > 
> > However, you did not give your affiliation and I do not like giving free 
> > consultancy to undisclosed commercial organizations.  Please in future use 
> > a proper signature block so that helpers are aware of your provenance.
> 
> Ah, I hadn't really thought about this, but I see where you're coming
> from.  I work here (my name and this email address are on the page):
> http://egcrc.org/pis/white-c.htm
> Please forgive my r-devel-newbieness; this is less of an issue on the
> other mailing lists I follow.
> 
> When there's a chance (however slim, in this case) that something I
> write will end up getting used by someone else, I usually use my
> personal email address and general identity, because I know it'll follow
> me if I change jobs.  The concern, of course, being that someone using
> it will want to get in touch with me sometime in the far future.  I
> don't exactly have a tenured position.
> 
> I really am trying to give at least as much as I'm taking; hopefully my
> first email shows that I did a healthy bit of
> thinking/reading/googling/coding before posting (maybe too much).
> Apparently the C solution isn't necessary, but doing this in 0.2s is
> pretty amazing.  On the same size data frame?
> 
> Thanks,
> Mitch Skinner                Tel: 510-985-3192
> Programmer/Analyst
> Ernest Gallo Clinic & Research Center
> University of California, San Francisco
> 

-- 
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Re: [Rd] reshape scaling with large numbers of times/rows

2006-08-24 Thread Gabor Grothendieck
On 8/24/06, Mitch Skinner <[EMAIL PROTECTED]> wrote:
> On Thu, 2006-08-24 at 08:57 -0400, Gabor Grothendieck wrote:
> > If your Z in reality is not naturally numeric, try representing it as a
> > factor, using the numeric levels as your numbers, and then putting the
> > level labels back on:
> >
> > m <- n <- 5
> > DF <- data.frame(X = gl(m*n, 1), Y = gl(m, n), Z = letters[1:25])
> > Zn <- as.numeric(DF$Z)
> > system.time(w1 <- reshape(DF, timevar = "X", idvar = "Y", dir = "wide"))
> > system.time({Zn <- as.numeric(DF$Z)
> >              w2 <- xtabs(Zn ~ Y + X, DF)
> >              w2[w2 > 0] <- levels(DF$Z)[w2]
> >              w2[w2 == 0] <- NA
> > })
>
> This is pretty slick, thanks.  It looks like it works for me.  For the
> archives, this is how I got back to a data frame (as.data.frame(w2)
> gives me a long version again):
>
> > m <- 4500
> > n <- 70
> > DF <- data.frame(X = gl(m, n), Y = 1:n, Z = letters[1:25])
> > system.time({Zn <- as.numeric(DF$Z)
> +     w2 <- xtabs(Zn ~ Y + X, DF)
> +     w2[w2 > 0] <- levels(DF$Z)[w2]
> +     w2[w2 == 0] <- NA
> +     WDF <- data.frame(Y=dimnames(w2)$Y)
> +     for (col in dimnames(w2)$X) { WDF[col]=w2[,col] }
> + })
> [1] 131.888   1.240 135.945   0.000   0.000
> > dim(WDF)
> [1]   70 4501
>
> I'll have to look; maybe I can just use w2 as is.  Next time I guess
> I'll try R-help first.
>
> Thanks again,
> Mitch
>

Also try
  na.omit(as.data.frame(w2))



[Rd] Huge matrices in .Call

2006-08-24 Thread Dominick Samperi
I assume that the fact that SEXP's passed through .Call
are read-only is a convention, consistent with R semantics.

This can be awkward when huge matrices are passed
as in:
newP <- .Call("myfunc", P)
because this would cause the huge matrix to be copied.

Is there any danger if I violate the convention in cases
like this and modify the SEXP matrix parameter directly,
instead of forcing R to make a copy of it every time
myfunc is called? (This seems to work in my testing.)
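
Concretely, the sort of thing I mean is (a sketch only; myfunc stands
for my real code):

#include <Rinternals.h>

SEXP myfunc(SEXP P)
{
    int i, n = LENGTH(P);
    double *p = REAL(P);
    for (i = 0; i < n; i++)
        p[i] = 2 * p[i];   /* writes straight into P's storage, no copy */
    return P;
}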

Thanks,
ds



Re: [Rd] Huge matrices in .Call

2006-08-24 Thread Hin-Tak Leung
Dominick Samperi wrote:
> I assume that the fact that SEXP's passed through .Call
> are read-only is a convention, consistent with R semantics.
> 
> This can be awkward when huge matrices are passed
> as in:
> newP <- .Call("myfunc", P)
> because this would cause the huge matrix to be copied.

Yes, I think that's the practice - basically what you want
is in-place modification of objects, which is contrary to
what most R people would recommend.

> Is there any danger if I violate the convention in cases
> like this and modify the SEXP matrix parameter directly,
> instead of forcing R to make a copy of it every time
> myfunc is called? (This seems to work in my testing.)

Prof. Ripley is probably going to say something about this...
but I think as long as (1) you tell the garbage collector
that you are going to mess around with the input, with
PROTECT() (this is probably automatic for input to .Call()
so may not be necessary), and (2) you are doing exact in-place
changes, no resizing, no messing around with the attributes,
no relocations elsewhere, etc, then nobody needs to get upset
about you messing around with the internals of a matrix, so to speak,
and technically it should work, *although doing so is very much
frowned upon*.

In which case, if I were writing this, I would get myfunc() to
return logical(TRUE/FALSE) to indicate success/failure, or some
more detailed operation status.  Returning the input SEXP is probably
dangerous and will probably confuse the garbage collector.  I don't know
for sure, but I would avoid it, so I would write it as:

status <- .Call("myfunc", P)
if (status) {
    cat("Ha, I have got a new P!\n")
    print(P)
} else {
    cat("things don't work out and P is probably very broken\n")
    print(P)
}

(somebody will yell at me for corroborating this, but I have uses
for in-place modification of big objects too...)

HTL



Re: [Rd] Huge matrices in .Call

2006-08-24 Thread Thomas Lumley
On Thu, 24 Aug 2006, Hin-Tak Leung wrote:

> Dominick Samperi wrote:
>> I assume that the fact that SEXP's passed through .Call
>> are read-only is a convention, consistent with R semantics.

Yes

>> This can be awkward when huge matrices are passed
>> as in:
>> newP <- .Call("myfunc", P)
>> because this would cause the huge matrix to be copied.

Indeed.

>> Is there any danger if I violate the convention in cases
>> like this and modify the SEXP matrix parameter directly,
>> instead of forcing R to make a copy of it every time
>> myfunc is called? (This seems to work in my testing.)
>
> Prof. Ripley is probably going to say something about this...
> but I think as long as (1) you tell the garbage collector
> that you are going to mess around with the input, with
> PROTECT() (this is probably automatic for input to .Call()
> so may not be necessary), and (2) you are doing exact in-place
> changes, no resizing, no messing around with the attributes,
> no relocations elsewhere, etc, then nobody needs to get upset
> about you messing around with the internals of a matrix, so to speak,
> and technically it should work, *although doing so is very much
> frowned upon*.

That isn't quite sufficient.  If another variable (perhaps in a function 
that calls your function) is using the same storage as your variable P 
then both will be changed.  This may not be what the user expects.

Also, if you return newP as well as modifying P they will point to the 
same storage so modifying one will modify the other in the future (at 
least until one of them gets duplicated).
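
If you do modify arguments in C, the usual defensive idiom is to
duplicate unless nothing else can reference the object; a sketch
(untested, for a numeric P as in the example under discussion).  Note
that arguments arriving via .Call normally have NAMED set, so this
usually does copy, which is exactly R's semantics:

#include <Rinternals.h>

SEXP myfunc(SEXP P)
{
    int i, n, dup = NAMED(P) > 0;   /* may anything else reference P? */
    SEXP Q = dup ? PROTECT(duplicate(P)) : P;
    double *q = REAL(Q);
    n = LENGTH(Q);
    for (i = 0; i < n; i++)
        q[i] = 2 * q[i];            /* the "in-place" work, on our own copy */
    if (dup) UNPROTECT(1);
    return Q;
}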

It's probably less dangerous with a matrix than with a list or data frame 
that might share only part of its memory with another variable. However, R 
does provide external pointer objects as a way to pass objects by 
reference without breaking the behaviour of the rest of the language. It's 
more work to use them but cleaner.
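
A bare-bones sketch of the external pointer route (untested; make_mat
and mat_finalize are invented names):

#include <stdlib.h>
#include <Rinternals.h>

static void mat_finalize(SEXP ptr)
{
    void *p = R_ExternalPtrAddr(ptr);
    if (p) { free(p); R_ClearExternalPtr(ptr); }
}

SEXP make_mat(SEXP nr, SEXP nc)
{
    /* allocate the matrix outside the R heap and hand R an opaque
       handle; later .Call()s can update it in place via the handle */
    double *p = calloc((size_t) asInteger(nr) * asInteger(nc),
                       sizeof(double));
    SEXP ptr = PROTECT(R_MakeExternalPtr(p, R_NilValue, R_NilValue));
    R_RegisterCFinalizerEx(ptr, mat_finalize, TRUE);
    UNPROTECT(1);
    return ptr;
}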

-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



[Rd] OT: authorship and contacts for releasing packages (Re: reshape scaling with large numbers of times/rows)

2006-08-24 Thread Hin-Tak Leung
Rather off-topic, not really about R or development at all,
but I think the issue is vaguely relevant for potential package writers.

Mitch Skinner wrote:

> When there's a chance (however slim, in this case) that something I
> write will end up getting used by someone else, I usually use my
> personal email address and general identity, because I know it'll follow
> me if I change jobs.  The concern, of course, being that someone using
> it will want to get in touch with me sometime in the far future.  I
> don't exactly have a tenured position.


I think referring to anonymous-mailer/personal contact addresses in
professional work is unwise.  I usually use the affiliation/context
for which the work was done in code/documentation, because:

(1) in many contexts, your employer owns any result of
"work for hire", so if you are paid to program on something,
while you can put your name down, your employer is also entitled
to erase all your claims of copyright or ownership on it.  This sounds
very harsh, but that's the reality.

(2) for most users, some reasonable expectation of *continued*
support, by an organization, or at least more than one individual,
is important.  Using an anonymous or "consumer" mailer doesn't give
a good impression of the _quality_ of the work, nor of the
_continuity_ of it.  While you may think it helps
people to contact you *after* they have adopted your software,
having an unprofessional-looking address attached to it
may deter people from *adopting* it in the first place, so you lose
before it even starts.

(3) if somebody wants to contact you for a purpose that's important to
them, they will find a way - e.g. look up who might be in
the same division or who might have been your boss, ask
the postmaster if the e-mail bounces, or fall back on brute-force
googling.  What do you expect people will want to contact you for?
I do recommend putting down your full name (including middle ones,
if you have a fairly common surname and first name), but your work
really should be associated with the context it was done in, if you
actually want it to be adopted by others.

(4) lastly, there seems to be some wishful thinking between
"...don't exactly have a tenured position..." and "...want to get
in touch with me sometime in the far future...".  Tenure is seldom
a result of "oh, I read-a-paper/use-a-software/whatever and it
impresses me so much that I have to hunt down the person who did it,
*whatever he is like*, *wherever he is*, *whatever he is doing*,
and hire him to work for me...".

I have been hunted down twice for work I did, as far as I remember, after
I moved on from the context.  One did a google search, the other got my
address from my ex-boss - the latter wanted some answers on
details, which I provided, and I did get an extended "thank you",
plus a "if you ever want a reference, I am willing
to testify to the quality of your work" statement, which is nice,
but cannot be taken literally or seriously.  I can just about imagine
the conversation going further as "I know somebody who might be
interested in your expertise, and I am willing to put your name
forward, if you are interested".  But that's *after* they obtained
what they contacted you for.  It is almost unthinkable that you
would get hunted down for a job offer based on some
software/mailing-list postings etc. that you wrote...

HTL



[Rd] Command line length limits in R

2006-08-24 Thread Prof Brian Ripley
I've been trying to track down some of the issues with command line length 
limits, and those writing GUIs/front-ends need to pay some attention to 
the issue.

src/unix/system.txt says

 *int   R_ReadConsole(char *prompt, char *buf, int buflen, int hist)
 *
 *  This function prints the given prompt at the console and then
 *  does a gets(3)-like operation, transferring up to "buflen" characters
 *  into the buffer "buf".  The last two characters are set to "\n\0"
 *  to preserve sanity.

but that isn't actually true for some of the standard interfaces and seems 
undesirable.  (Also, for 'characters' read 'bytes'.)

What happens is that all calls to R_ReadConsole have buflen=1024.  (These 
can be an input line of R code, or from scan() or from a stdin() 
connection.) If this is command input, the result is parsed, and if 
incomplete and not an error, more input is requested until it is complete.

Suppose the user entered a very long line. Should the first 1024 bytes be 
syntactically complete, this will not do what he expected and it will be 
as if a LF had been inserted after 1024 bytes.  But that is unlikely, and 
he may well get away with it, unless R_ReadConsole actually does as
documented and inserts "\n\0" (the Rgui and readline consoles do, but 
reading from a file in Linux or Windows does not).

It seems the correct advice is that R_ReadConsole should only send a "\n" 
when there is no more input available, e.g. on EOF.  I am changing the 
calling code to have a 1025 char buffer with last '\0' to ensure 
null-termination.
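
One plausible reading of that advice, as a sketch of a simple
stdin-based front end (untested; the signature is the one from
system.txt quoted above):

#include <stdio.h>

int R_ReadConsole(char *prompt, char *buf, int buflen, int hist)
{
    fputs(prompt, stdout);               /* 'hist' is ignored here */
    fflush(stdout);
    if (fgets(buf, buflen, stdin) == NULL)
        return 0;                        /* no input left at all: EOF */
    /* fgets() stores a "\n" only when the whole line fitted, so an
       over-long line reaches R in buflen-1 byte pieces with no newline
       invented in the middle; R_ReadConsole is simply called again */
    return 1;
}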

Some consoles try to ensure that the user cannot enter more than 1024 
bytes.  That's a bit awkward in an MBCS, and also when input is being
pasted in (possibly in the middle of a line).  Generally this does not 
work too well: e.g. the readline console truncates at 1022 chars and 
appends a \n.

I have no idea how an R user was expected to know there was a line limit.  
I've added some documentation to R-intro (but I will need to correct it as 
I didn't know the readline console used 1022).  The limit applies to 
redirected input from a file, but not to source().


GUI writers: please take a look at how your interface handles very long 
input lines, particularly quoted strings (which seem the most likely 
reason for long lines).

-- 
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
