Re: [Rd] reshape scaling with large numbers of times/rows
If for each combination of X and Y there is at most one Z and Z is numeric
(it could be made so in your example) then you could use xtabs, which is
faster:

> m <- n <- 10
> DF <- data.frame(X = gl(m*n, 1), Y = gl(m, n), Z = 10*(1:(n*m)))
> system.time(w1 <- reshape(DF, timevar = "X", idvar = "Y", dir = "wide"))
[1] 0.33 0.00 0.33   NA   NA
> system.time(w2 <- xtabs(Z ~ Y + X, DF))
[1] 0.01 0.00 0.01   NA   NA

On 8/24/06, Mitch Skinner <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I'm mailing r-devel because I think the performance problem I'm having
> is best solved by re-writing reshape() in C. I've been reading the
> "Writing R Extensions" documentation and I have some questions about
> the best way to write the C bits, but first let me describe my problem:
>
> I'm trying to reshape a long data frame with ~70 subjects measured at
> ~4500 "times" (in my case, ~4500 genetic loci on a chromosome) into a
> wide data frame with one column per locus ("time"). On my data
> (~300,000 rows for chromosome 1) it takes about twenty minutes on a
> 3.4GHz P4. Here's an R session that demonstrates it (this is pretty
> close to how my data actually looks):
>
> > version
>                _
> platform       i686-redhat-linux-gnu
> arch           i686
> os             linux-gnu
> system         i686, linux-gnu
> status
> major          2
> minor          3.1
> year           2006
> month          06
> day            01
> svn rev        38247
> language       R
> version.string Version 2.3.1 (2006-06-01)
>
> > test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) +
>     1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))),
>     genotype=I(as.character(as.integer(runif(3e5, 1, 100)))))
> > system.time(testWide <- reshape(test, v.names=c("genotype"),
>     timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> [1] 1107.506  120.755 1267.568    0.000    0.000
>
> I believe that this can be done a lot faster, and I think the problem
> comes from this loop in reshape.R (in reshapeWide):
>
>     for(i in seq(length = length(times))) {
>         thistime <- data[data[,timevar] %in% times[i],]
>         rval[,varying[,i]] <-
>             thistime[match(rval[,idvar],thistime[,idvar]), v.names]
>     }
>
> I don't really understand everything going on under the hood here, but
> I believe the complexity of this loop is something like
> O(length(times) * (nrow(data) + nrow(rval)*length(varying))). The
> profile shows the lion's share (90%) of the time being spent in
> [.data.frame.
>
> What I'd like to do is write a C loop to go through the source (long)
> data frame once (rather than once per time), and put the values into
> the destination rows/columns using hash tables to look up the right
> row/column.
>
> Assuming the hashing is constant-time bounded, the reshape becomes
> O(nrow(data)*length(varying)).
>
> I'd like to use the arbitrary-R-vector hashing functionality from
> unique.c, but I'm not sure how to do it. I think that functionality is
> useful to other C code, but the functions I'm interested in are not
> part of the R API (they're defined with "static"). Assuming copy/paste
> is out, I can see two options: 1. remove "static" from the declarations
> of some of the functions, and add prototypes for those functions to a
> new src/library/stats/src/reshape.c file (or to one of the header files
> in src/include), or 2. add C functions to do the reshaping to
> src/main/unique.c and call those from src/library/stats/R/reshape.R.
>
> This is all assuming that it makes sense to include this in mainline
> R--obviously I think it's worthwhile, and I'm surprised other people
> aren't complaining more. I would be happy to write/rewrite until y'all
> are happy with how it's done, of course.
>
> I've written a proof of concept by copying and pasting the hashing
> functions, which (on the above data frame) runs 20 times faster than
> the R version of reshape. It still needs some debugging, to put it
> mildly (it doesn't work properly on reals), but the basic idea appears
> to work.
>
> The change I made to the reshape R function looks like this:
> =
>      for(i in seq(length = length(times))) {
> -        thistime <- data[data[,timevar] %in% times[i],]
> -        rval[,varying[,i]] <-
> -            thistime[match(rval[,idvar],thistime[,idvar]), v.names]
> +        for (j in seq(length(v.names))) {
> +            rval[,varying[j,i]] <-
> +                rep(vector(mode=typeof(data[[v.names[j]]]), 0),
> +                    length.out=nrow(rval))
> +        }
>      }
>
> +    colMap <- match(varying, names(rval))
> +    .Call("do_reshape_wide",
> +          data[[idvar]], data[[timevar]], data[v.names],
> +          rval, colMap,
> +          v.names, times, rval[[idvar]])
> +
>      if (!is.null(new.row.names))
>          row.names(rval) <- new.row.names
> =
Re: [Rd] reshape scaling with large numbers of times/rows
I wrote:
> It still needs some debugging, to put it mildly
> (it doesn't work properly on reals), but the basic idea appears to work.

It works for reals on a 64-bit machine, but not on a 32-bit machine. I
figure the culprit is this bit of C code:

    SET_VECTOR_ELT(wideCol, wideRow, VECTOR_ELT(longCol, rowNum));

It seems like VECTOR_ELT and/or SET_VECTOR_ELT assume that the vector
elements are word-sized. If that's right, then they break on (at least)
real and complex vectors, right? Is there a version of them that works
for arbitrary vectors? Or am I using them wrong?

In other words, is there some way to do what I want in one line, or do I
have to do a switch(TYPEOF(longCol))? I was hoping to keep the C code
type-agnostic.

Regards,
Mitch
Re: [Rd] reshape scaling with large numbers of times/rows
> "Mitchell" == Mitchell Skinner <[EMAIL PROTECTED]> > on Thu, 24 Aug 2006 00:26:52 -0700 writes: Mitchell> I wrote: >> It still needs some debugging, to put it mildly >> (doesn't work properly on reals), but the basic idea appears to work. Mitchell> It works for reals on a 64-bit machine, but not on Mitchell> a 32-bit machine. I figure the culprit is this Mitchell> bit of c code: Mitchell> SET_VECTOR_ELT(wideCol, wideRow, VECTOR_ELT(longCol, rowNum)); Mitchell> It seems like VECTOR_ELT and/or SET_VECTOR_ELT Mitchell> assumes that the vector elements are word-sized. No. It assumes that use what the C API calls 'VECTOR' and R users usually call list()s !!! Do reread that part in the "Writing R Extensions" and rather look at REAL(.)[.] and REAL(.)[] = .. Martin __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] reshape scaling with large numbers of times/rows
On Thu, 24 Aug 2006, Martin Maechler wrote:

> [...]
>
> No. It assumes that you use what the C API calls a 'VECTOR' and what R
> users usually call a list()!

Since this is R-devel, it might be worth giving a fuller explanation (and
I had started writing one before seeing Martin's reply).

VECTOR_ELT is used for several types of SEXP, most commonly VECSXP
(called 'list' at R level) but also EXPRSXP ('expression') and WEAKREFSXP
('weak reference'), and it can also be used for STRSXP ('character
vector', but STRING_ELT is better). The R-devel version of R catches such
user errors, so it is well worth using for code development.

> Do reread that part of "Writing R Extensions", and rather look at
> REAL(.)[.] and REAL(.)[.] = ..

And on the comment about wanting 'type-agnostic' C code: this is
impossible, as C stores the types differently; note also that
REAL(.)[.] = .. is allowed but VECTOR_ELT(x, i) = .. is not.

Another hint: if you find yourself writing REAL(.)[.] repeatedly, rewrite
it as

    double *ra = REAL(a);
    ra[i] = ...

since REAL(a) is a function call and can add appreciably to the execution
time if used inside a loop. (This is not true for internal code in R,
BTW, where REAL is a macro.) And yes, I know there are examples like this
in "Writing R Extensions" in R 2.3.1, so this is another reason to use
R-devel.

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
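For illustration, a minimal sketch of a .Call routine written per that
hint, with the REAL() calls hoisted out of the loop; the routine itself
is invented, not from the thread:

    #include <Rinternals.h>

    /* Hypothetical .Call routine: return a numeric vector holding the
       squares of 'a'. REAL() is called twice in total, not 2*n times. */
    SEXP square_all(SEXP a)
    {
        R_len_t n = length(a);
        SEXP ans = PROTECT(allocVector(REALSXP, n));
        double *ra = REAL(a), *rans = REAL(ans);  /* hoisted out of the loop */
        for (R_len_t i = 0; i < n; i++)
            rans[i] = ra[i] * ra[i];
        UNPROTECT(1);
        return ans;
    }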
Re: [Rd] reshape scaling with large numbers of times/rows
Your example does not correspond to your description. You have taken a
random number of loci for each subject and measured each a random number
of times:

> with(test, table(table(subject, locus)))

     0      1      2      3      4      5      6      7      8
118021 114340  54963  17848   4288    814    136     16      5

Why would you want to make that into a wide data frame? 'testWide' only
contains a subset of the original data frame, since you are violating the
assumptions of reshape().

Also, subject and locus are archetypal factors, and forcing them to be
character vectors is just making efficiency problems for yourself.

I have an R-level solution that takes 0.2 s on my machine and involves no
changes to R.

However, you did not give your affiliation, and I do not like giving free
consultancy to undisclosed commercial organizations. Please in future use
a proper signature block so that helpers are aware of your provenance.

On Wed, 23 Aug 2006, Mitch Skinner wrote:

> [original message quoted in full; see above]
Re: [Rd] reshape scaling with large numbers of times/rows
I'd like to thank everyone who's replied so far--more inline:

On Thu, 2006-08-24 at 11:16 +0100, Prof Brian Ripley wrote:
> Your example does not correspond to your description. You have taken a
> random number of loci for each subject and measured each a random
> number of times:

You're right. I was trying to come up with an example that didn't require
sending out a big hunk of data. The overall number of rows/columns and
the data types/sizes in the example were true to life, but the
relationship between columns was not. Also, in my testing the run time of
the random example was pretty close to (actually faster than) the run
time on my real data.

In the real data, there's about one row per subject/locus pair (some
combinations are missing). The genotype data does have character type;
I'd have to think a bit to see if I could make it into an integer vector.
Aside from just making it a factor, of course.

Thanks to Gabor Grothendieck for demonstrating gl():

> betterTest=data.frame(subject=as.character(1:70),
    locus=as.character(gl(4500, 70)),
    genotype=as.character(as.integer(runif(4500*70, 1, 20))))
> sapply(betterTest, is.factor)
 subject    locus genotype
    TRUE     TRUE     TRUE
> system.time(wideTest <- reshape(betterTest, v.names="genotype",
    timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
[1] 1356.209  178.867 2071.640    0.000    0.000
> dim(wideTest)
[1]   70 4501
> dim(betterTest)
[1] 315000      3

This was on a different machine (a 2.2 GHz Athlon 64). The only
difference I can think of between betterTest and my actual data is that
betterTest is ordered.

> Also, subject and locus are archetypal factors, and forcing them to be
> character vectors is just making efficiency problems for yourself.

Hmm, that's the way they're coming out of the database. I'm using
RdbiPgSQL from Bioconductor, and I assumed there was a reason why the
database interface wasn't turning things into factors. Given my (low)
level of R knowledge, I'd have to think for a while to convince myself
that doing so wouldn't make a difference aside from being faster. Of
course, if you're asserting that that's the case, I'll take your word for
it.

> I have an R-level solution that takes 0.2 s on my machine, and involves
> no changes to R.
>
> However, you did not give your affiliation and I do not like giving
> free consultancy to undisclosed commercial organizations. Please in
> future use a proper signature block so that helpers are aware of your
> provenance.

Ah, I hadn't really thought about this, but I see where you're coming
from. I work here (my name and this email address are on the page):
http://egcrc.org/pis/white-c.htm
Please forgive my r-devel newbie-ness; this is less of an issue on the
other mailing lists I follow.

When there's a chance (however slim, in this case) that something I write
will end up getting used by someone else, I usually use my personal email
address and general identity, because I know it'll follow me if I change
jobs. The concern, of course, being that someone using it will want to
get in touch with me sometime in the far future. I don't exactly have a
tenured position.

I really am trying to give at least as much as I'm taking; hopefully my
first email shows that I did a healthy bit of
thinking/reading/googling/coding before posting (maybe too much).
Apparently the C solution isn't necessary, but doing this in 0.2 s is
pretty amazing. On the same size data frame?

Thanks,
Mitch Skinner                       Tel: 510-985-3192
Programmer/Analyst
Ernest Gallo Clinic & Research Center
University of California, San Francisco
Re: [Rd] reshape scaling with large numbers of times/rows
If your Z in reality is not naturally numeric, try representing it as a
factor, using the numeric level codes as your numbers, and then putting
the level labels back on:

m <- n <- 5
DF <- data.frame(X = gl(m*n, 1), Y = gl(m, n), Z = letters[1:25])
Zn <- as.numeric(DF$Z)
system.time(w1 <- reshape(DF, timevar = "X", idvar = "Y", dir = "wide"))
system.time({Zn <- as.numeric(DF$Z)
   w2 <- xtabs(Zn ~ Y + X, DF)
   w2[w2 > 0] <- levels(DF$Z)[w2]
   w2[w2 == 0] <- NA
})

On 8/24/06, Gabor Grothendieck <[EMAIL PROTECTED]> wrote:
> If for each combination of X and Y there is at most one Z and Z is
> numeric (it could be made so in your example) then you could use xtabs,
> which is faster:
>
> [...]
>
> On 8/24/06, Mitch Skinner <[EMAIL PROTECTED]> wrote:
> > [...]
Re: [Rd] reshape scaling with large numbers of times/rows
On Thu, 2006-08-24 at 08:57 -0400, Gabor Grothendieck wrote:
> If your Z in reality is not naturally numeric, try representing it as a
> factor, using the numeric level codes as your numbers, and then putting
> the level labels back on:
>
> [...]

This is pretty slick, thanks. It looks like it works for me. For the
archives, this is how I got back to a data frame (as.data.frame(w2) gives
me a long version again):

> m <- 4500
> n <- 70
> DF <- data.frame(X = gl(m, n), Y = 1:n, Z = letters[1:25])
> system.time({Zn <- as.numeric(DF$Z)
+    w2 <- xtabs(Zn ~ Y + X, DF)
+    w2[w2 > 0] <- levels(DF$Z)[w2]
+    w2[w2 == 0] <- NA
+    WDF <- data.frame(Y=dimnames(w2)$Y)
+    for (col in dimnames(w2)$X) { WDF[col]=w2[,col] }
+ })
[1] 131.888   1.240 135.945   0.000   0.000
> dim(WDF)
[1]   70 4501

I'll have to look; maybe I can just use w2 as is. Next time I guess I'll
try R-help first.

Thanks again,
Mitch
Re: [Rd] reshape scaling with large numbers of times/rows
Here's the essence of a solution (0.14 sec for this bit):

res <- with(betterTest, {
    subjects <- levels(subject)
    loci <- levels(locus)
    ## replace "" by as.character(NA) if you prefer
    res <- matrix("", length(subjects), length(loci),
                  dimnames = list(subjects, loci))
    ind <- cbind(as.integer(subject), as.integer(locus))
    res[ind] <- as.character(genotype)
    res
})

This produces a character matrix, mainly because that is what I had
before. However, I think a matrix is probably a better data structure
than a wide data frame, and it can easily be converted (which will take a
little longer, 1.5 s if you use as.data.frame, but there are much faster
ways). It is certainly possible to do the same thing with a factor matrix
result, but there seems to be a problem with the subsetting method for
such objects.

I would suggest generally avoiding data frames where all the columns are
of one type and you are concerned with efficiency.

On Thu, 24 Aug 2006, Mitch Skinner wrote:

> I'd like to thank everyone who's replied so far--more inline:
>
> [...]
>
> Apparently the C solution isn't necessary, but doing this in 0.2 s is
> pretty amazing. On the same size data frame?

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
Re: [Rd] reshape scaling with large numbers of times/rows
On 8/24/06, Mitch Skinner <[EMAIL PROTECTED]> wrote:
> This is pretty slick, thanks. It looks like it works for me. For the
> archives, this is how I got back to a data frame (as.data.frame(w2)
> gives me a long version again):
>
> [...]
>
> I'll have to look; maybe I can just use w2 as is. Next time I guess
> I'll try R-help first.

Also try na.omit(as.data.frame(w2)).
[Rd] Huge matrices in .Call
I assume that the fact that SEXPs passed through .Call are read-only is a
convention, consistent with R semantics.

This can be awkward when huge matrices are passed, as in:

    newP <- .Call("myfunc", P)

because this would cause the huge matrix to be copied.

Is there any danger if I violate the convention in cases like this and
modify the SEXP matrix parameter directly, instead of forcing R to make a
copy of it every time myfunc is called? (This seems to work in my
testing.)

Thanks,
ds
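For reference, the conventional copy-semantics-preserving shape of such a
routine is sketched below; the body is a placeholder, not Dominick's
actual code. The copy in question is the duplicate() call:

    #include <Rinternals.h>

    /* The conventional pattern: duplicate the input, modify the copy,
       return it. The caller's P is untouched. */
    SEXP myfunc(SEXP P)
    {
        SEXP newP = PROTECT(duplicate(P));  /* full copy of the matrix */
        double *p = REAL(newP);
        R_len_t n = length(newP);
        for (R_len_t i = 0; i < n; i++)
            p[i] += 1.0;                    /* placeholder work */
        UNPROTECT(1);
        return newP;
    }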
Re: [Rd] Huge matrices in .Call
Dominick Samperi wrote:
> I assume that the fact that SEXPs passed through .Call are read-only is
> a convention, consistent with R semantics.
>
> This can be awkward when huge matrices are passed, as in:
>     newP <- .Call("myfunc", P)
> because this would cause the huge matrix to be copied.

Yes, I think that's the practice - basically what you want is in-place
modification of objects, which is contrary to what most R people would
recommend.

> Is there any danger if I violate the convention in cases like this and
> modify the SEXP matrix parameter directly, instead of forcing R to make
> a copy of it every time myfunc is called? (This seems to work in my
> testing.)

Prof. Ripley is probably going to say something about this... but I think
as long as (1) you tell the garbage collector that you are going to mess
around with the input, with PROTECT() (this is probably automatic for
input to .Call(), so may not be necessary), and (2) you make exact
in-place changes - no resizing, no messing around with the attributes, no
relocations elsewhere, etc. - then nobody needs to get upset about you
messing around with the internals of a matrix, so to speak, and
technically it should work, *although doing so is very much frowned
upon*.

In that case, if I were writing this, I would have myfunc() return a
logical (TRUE/FALSE) to indicate success/failure, or some more detailed
operation status. Returning the SEXP input is probably dangerous and will
probably confuse the garbage collector. I don't know for sure, but I
would avoid it. So I would write it as:

    status <- .Call("myfunc", P)
    if (status) {
        cat("Ha, I have got a new P!\n")
        print(P)
    } else {
        cat("things didn't work out and P is probably very broken\n")
        print(P)
    }

(Somebody will yell at me for corroborating this, but I have uses for
in-place modification of big objects too...)

HTL
Re: [Rd] Huge matrices in .Call
On Thu, 24 Aug 2006, Hin-Tak Leung wrote:

> Dominick Samperi wrote:
>> I assume that the fact that SEXPs passed through .Call are read-only
>> is a convention, consistent with R semantics.

Yes.

>> This can be awkward when huge matrices are passed, as in:
>>     newP <- .Call("myfunc", P)
>> because this would cause the huge matrix to be copied.

Indeed.

>> Is there any danger if I violate the convention in cases like this
>> and modify the SEXP matrix parameter directly [...]?
>
> I think as long as (1) you tell the garbage collector [...] and (2)
> you make exact in-place changes [...] technically it should work,
> *although doing so is very much frowned upon*.

That isn't quite sufficient. If another variable (perhaps in a function
that calls your function) is using the same storage as your variable P,
then both will be changed. This may not be what the user expects. Also,
if you return newP as well as modifying P, they will point to the same
storage, so modifying one will modify the other in the future (at least
until one of them gets duplicated).

It's probably less dangerous with a matrix than with a list or data frame
that might share only part of its memory with another variable.

However, R does provide external pointer objects as a way to pass objects
by reference without breaking the behaviour of the rest of the language.
It's more work to use them, but cleaner.

    -thomas

Thomas Lumley              Assoc. Professor, Biostatistics
[EMAIL PROTECTED]          University of Washington, Seattle
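A minimal sketch of that external-pointer route, assuming the big object
lives in a C-allocated buffer owned by the pointer; all names here are
hypothetical:

    #include <Rinternals.h>
    #include <stdlib.h>

    /* Finalizer: free the C-side buffer when the external pointer is
       garbage-collected. */
    static void big_finalizer(SEXP ptr)
    {
        double *p = (double *) R_ExternalPtrAddr(ptr);
        if (p) { free(p); R_ClearExternalPtr(ptr); }
    }

    /* Allocate an n-element buffer and hand R an external pointer to it. */
    SEXP big_alloc(SEXP n_)
    {
        R_len_t n = asInteger(n_);
        double *p = (double *) calloc(n, sizeof(double));
        SEXP ptr = PROTECT(R_MakeExternalPtr(p, R_NilValue, R_NilValue));
        R_RegisterCFinalizerEx(ptr, big_finalizer, TRUE);
        UNPROTECT(1);
        return ptr;
    }

    /* Modify the buffer in place through the pointer: by-reference
       semantics without touching any ordinary R vector. */
    SEXP big_set(SEXP ptr, SEXP i_, SEXP value_)
    {
        double *p = (double *) R_ExternalPtrAddr(ptr);
        p[asInteger(i_) - 1] = asReal(value_);  /* 1-based index from R */
        return R_NilValue;
    }

From R one would hold the returned pointer in a variable and pass it to
big_set(); ordinary copy-on-assign never touches the buffer, which is the
clean by-reference behaviour Thomas describes.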
[Rd] OT: authorship and contacts for releasing packages (Re: reshape scaling with large numbers of times/rows)
Rather off-topic - not really about R, nor about development at all - but
I think the issue is vaguely relevant for potential package writers.

Mitch Skinner wrote:
> When there's a chance (however slim, in this case) that something I
> write will end up getting used by someone else, I usually use my
> personal email address and general identity, because I know it'll
> follow me if I change jobs. The concern, of course, being that someone
> using it will want to get in touch with me sometime in the far future.
> I don't exactly have a tenured position.

I think referring to anonymous mailers/personal contact addresses in
professional work is unwise. I usually use the affiliation/context for
which the work is done in code/documentation, because:

(1) in many contexts, your employer owns any result of "work for hire",
so if you are paid to program something, while you can put your name
down, your employer is also entitled to erase all your claims of
copyright or ownership on it. This sounds very harsh, but that's how the
reality is.

(2) for most users, some reasonable expectation of *continued* support,
by an organization, or at least by more than one individual, is
important. Using an anonymous or "consumer" mailer doesn't give a good
impression of the _quality_ of the work, nor of its _continuity_. While
you may think it helps people contact you *after* they have adopted your
software, having an unprofessional-looking address attached to it may
deter people from *adopting* it in the first place, so you lose before it
even starts.

(3) if somebody wants to contact you for a purpose that's important to
them, they will find a way - e.g. look up who is in the same division or
who might have been your boss, ask the postmaster if e-mail bounces, or
brute-force googling. What do you expect people will want to contact you
for? I do recommend putting down your full name (including middle names,
if you have a fairly common surname and first name), but your work really
should be associated with the context it is in, if you actually want it
to be adopted by others.

(4) lastly, there seems to be some wishful thinking between "...don't
exactly have a tenured position..." and "...want to get in touch with me
sometime in the far future...". Tenure is seldom the result of "oh, I
read a paper/used some software/whatever, and it impressed me so much
that I have to hunt down the person who did it, *whatever he is like*,
*wherever he is*, *whatever he is doing*, and hire him to work for
me...".

I have been hunted down twice for work I did, as far as I remember, after
I had moved on from the context. One did a Google search; the other got
my address from my ex-boss. The latter wanted some answers on details,
which I provided, and I did get an extended "thank you", plus a "if you
ever want a reference, I am willing to testify to the quality of your
work" statement, which is nice, but cannot be taken literally nor
seriously. I can just about imagine the conversation going further as "I
know somebody who might be interested in your expertise, and I am willing
to put your name forward, if you are interested". But that is *after*
they obtained what they contacted you for. It is almost unthinkable that
you get hunted down for a job offer based on some
software/mailing-list-postings etc. that you wrote...

HTL
[Rd] Command line length limits in R
I've been trying to track down some of the issues with command line
length limits, and those writing GUIs/front-ends need to pay some
attention to the issue.

src/unix/system.txt says

 *  int R_ReadConsole(char *prompt, char *buf, int buflen, int hist)
 *
 *  This function prints the given prompt at the console and then
 *  does a gets(3)-like operation, transferring up to "buflen" characters
 *  into the buffer "buf". The last two characters are set to "\n\0"
 *  to preserve sanity.

but that isn't actually true for some of the standard interfaces, and it
seems undesirable. (Also, for 'characters' read 'bytes'.)

What happens is that all calls to R_ReadConsole have buflen = 1024.
(These can be an input line of R code, or input for scan(), or from a
stdin() connection.) If this is command input, the result is parsed, and
if it is incomplete and not an error, more input is requested until it is
complete.

Suppose the user entered a very long line. Should the first 1024 bytes be
syntactically complete, this will not do what he expected, and it will be
as if a LF had been inserted after 1024 bytes. But that is unlikely, and
he may well get away with it, unless R_ReadConsole actually does as
documented and inserts "\n\0" (the Rgui and readline consoles do, but
reading from a file on Linux or Windows does not).

It seems the correct advice is that R_ReadConsole should only send a "\n"
when there is no more input available, e.g. on EOF. I am changing the
calling code to use a 1025-char buffer with the last byte '\0' to ensure
null termination.

Some consoles try to ensure that the user cannot enter more than 1024
bytes. That's a bit awkward in an MBCS, and also when input is being
pasted in (possibly in the middle of a line). Generally this does not
work too well: e.g. the readline console truncates at 1022 chars and
appends a \n. I have no idea how an R user was expected to know there was
a line limit. I've added some documentation to R-intro (but I will need
to correct it, as I didn't know the readline console used 1022). The
limit applies to redirected input from a file, but not to source().

GUI writers: please take a look at how your interface handles very long
input lines, particularly quoted strings (which seem the most likely
reason for long lines).

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
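Under that advice, a minimal stdio-based sketch of the callback might
look like this; a real front-end would read from its own console widget
rather than stdin, and this is an illustration only. fgets() conveniently
stores the newline only when the line actually ended within the buffer:

    #include <stdio.h>

    /* Sketch of a front-end R_ReadConsole following the advice above:
       a "\n" reaches R only when the underlying input line really ended
       (fgets stored one) so an over-long line is delivered in
       buflen-sized chunks without a spurious line break. */
    int R_ReadConsole(char *prompt, char *buf, int buflen, int hist)
    {
        (void) hist;                 /* history handling omitted in sketch */
        fputs(prompt, stdout);
        fflush(stdout);
        if (fgets(buf, buflen, stdin) == NULL)
            return 0;                /* EOF: tell R no input is available */
        /* If the line was longer than buflen - 1 bytes, fgets stops short
           of the newline; the next call delivers the rest, and only the
           final chunk carries the "\n". */
        return 1;
    }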