Re: [R] Large dataset operations

Claudia Beleites Fri, 11 Mar 2011 13:04:17 -0800

Haakon,

As replicates imply that they all have the same data type, you can put them into a matrix, which is often faster and needs less memory (though whether that really matters depends on the number of replicates you have: with only a few replicates you won't see much effect anyway).

But I find it handy to have the matrix of replicates available as data$rep.


data <- data.frame (plateNo = a, Well = b, rep = I (cbind (c, d, e)))
> data
   plateNo Well rep.c rep.d rep.e
1        1  A01  1312   963  1172
2        1  A02 10464  6715  5628
3        1  A03  3301  3257  3281
4        1  A04  3895  3350  3496
5        1  A05  8731  7389  5701
6        2  A01  7893  6748  5920
7        2  A02  2912  2385  2586
8        2  A03   985   785   809
9        2  A04  1346  1018  1001
10       2  A05   794   314   486
> dim (data)
[1] 10  3
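
A minimal sketch of what the I (cbind (...)) construction does, using the vectors a to e from the original post quoted below. The three replicate columns end up as one matrix column, which is why dim (data) reports 3 columns and not 5:

```r
# Example data from the original post
a <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
b <- c("A01", "A02", "A03", "A04", "A05", "A01", "A02", "A03", "A04", "A05")
c <- c(1312, 10464, 3301, 3895, 8731, 7893, 2912, 985, 1346, 794)
d <- c(963, 6715, 3257, 3350, 7389, 6748, 2385, 785, 1018, 314)
e <- c(1172, 5628, 3281, 3496, 5701, 5920, 2586, 809, 1001, 486)

# I() keeps the cbind() result as a single matrix column
data <- data.frame(plateNo = a, Well = b, rep = I(cbind(c, d, e)))

ncol(data)          # 3: plateNo, Well, rep
dim(data$rep)       # 10 3: the replicates form a 10 x 3 matrix
colnames(data$rep)  # "c" "d" "e"
data$rep[2, "d"]    # replicate d of the second row (plate 1, well A02)
```

Ordinary matrix indexing then works on data$rep, so whole-matrix arithmetic replaces the column-by-column loop.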

Then:
data$norm <- data$rep / apply (data$rep, 2, ave, plateNo = data$plateNo)

You can also move the division into the apply call:

data$norm <- apply (data$rep, 2, function (x) x / ave (x, plateNo = data$plateNo))
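
As a sketch of what ave () contributes here: for each element it returns the mean of that element's group, so dividing by its result normalises every value by its plate mean. One way to convince yourself is to check that the normalised values average to 1 within each plate and replicate. The names reps, plate_mean and check are only for illustration:

```r
# Plate numbers and replicate values from the original post
a <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
c <- c(1312, 10464, 3301, 3895, 8731, 7893, 2912, 985, 1346, 794)
d <- c(963, 6715, 3257, 3350, 7389, 6748, 2385, 785, 1018, 314)
e <- c(1172, 5628, 3281, 3496, 5701, 5920, 2586, 809, 1001, 486)
reps <- cbind(c, d, e)

# ave() is applied to each column; a is the grouping variable.
# The result has the same shape as reps, each entry being its plate mean.
plate_mean <- apply(reps, 2, ave, a)
norm <- reps / plate_mean

# Mean of the normalised values per plate (rows) and replicate (columns)
check <- apply(norm, 2, tapply, a, mean)
check
```

Every entry of check should be 1 (up to floating point error).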

If you always have the same number of wells per plate, you could also "fold" the data$rep matrix into an array:

# dims: wells per plate, plates, replicates (rows must be sorted by plate)
arep <- array (data$rep, dim = c (5, 2, 3))
anorm <- arep / rep (colMeans (arep), each = 5)
dim (anorm) <- dim (data$rep)
data$norm <- anorm
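
A sketch for checking a fold against the ave () result. It assumes the rows are sorted by plate with 5 wells each, so the array dimensions are (wells per plate, plates, replicates). The names reps and ref are only for illustration:

```r
# Plate numbers and replicate values from the original post
a <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
c <- c(1312, 10464, 3301, 3895, 8731, 7893, 2912, 985, 1346, 794)
d <- c(963, 6715, 3257, 3350, 7389, 6748, 2385, 785, 1018, 314)
e <- c(1172, 5628, 3281, 3496, 5701, 5920, 2586, 809, 1001, 486)
reps <- cbind(c, d, e)

# Reference: per-plate normalisation via ave()
ref <- reps / apply(reps, 2, ave, a)

# Fold: 5 wells x 2 plates x 3 replicates
arep <- array(reps, dim = c(5, 2, 3))

# colMeans() averages over the first dimension (wells), giving a
# 2 x 3 matrix of plate means; rep(..., each = 5) stretches it back
# to the layout of arep
anorm <- arep / rep(colMeans(arep), each = 5)
dim(anorm) <- dim(reps)

# Compare values only; ref carries column names, anorm does not
all.equal(anorm, ref, check.attributes = FALSE)
```

If the comparison does not return TRUE on your real data, the rows are probably not ordered plate by plate, or the plates differ in size.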


Here are some microbenchmark results:
Unit: nanoseconds
         min      lq  median      uq     max
[1,] 1525160 1561280 1627620 1685020 3575719
[2,] 1505641 1539500 1560301 1649081 3538001
[3,]  113321  115041  115821  116881  155681
[4,] 2589800 2627280 2662540 2794920 4646399

1 and 2 are the two apply versions above.
3 is the array version.
4 is your loop version.
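
The benchmarking call itself is not shown above. The following is one way such a comparison could be set up, assuming the microbenchmark package is installed. The wrapper functions norm_apply, norm_array and norm_loop are hypothetical names introduced for this sketch, and the loop is a simplified matrix-based stand-in for the code in the original post:

```r
# Plate numbers and replicate values from the original post
a <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
c <- c(1312, 10464, 3301, 3895, 8731, 7893, 2912, 985, 1346, 794)
d <- c(963, 6715, 3257, 3350, 7389, 6748, 2385, 785, 1018, 314)
e <- c(1172, 5628, 3281, 3496, 5701, 5920, 2586, 809, 1001, 486)
reps <- cbind(c, d, e)

# apply/ave variant
norm_apply <- function(m, plate) m / apply(m, 2, ave, plate)

# array variant; assumes rows sorted by plate, equal wells per plate
norm_array <- function(m, n_wells, n_plates) {
  arr <- array(m, dim = c(n_wells, n_plates, ncol(m)))
  out <- arr / rep(colMeans(arr), each = n_wells)
  dim(out) <- dim(m)
  out
}

# explicit double loop, recomputing the plate mean for every cell
norm_loop <- function(m, plate) {
  out <- m
  for (j in seq_len(ncol(m))) {
    for (i in seq_len(nrow(m))) {
      out[i, j] <- m[i, j] / mean(m[plate == plate[i], j])
    }
  }
  out
}

# Only run the timing if the package is available
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    apply = norm_apply(reps, a),
    array = norm_array(reps, 5, 2),
    loop  = norm_loop(reps, a),
    times = 100
  ))
}
```

Timings on 10 rows mostly measure call overhead, so for the real 300 000 data points it is worth rerunning the comparison on the full data set.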

HTH

Claudia


On 11.03.2011 18:38, hi Berven wrote:


Hello all,

I'm new to R and trying to figure out how to perform calculations on a large dataset (300 000
data points). I have already written some code to do this, but it is awfully slow. What I want to do is
add a new column for each "rep_" column, in which each value is divided by
the mean of all values where "PlateNo" is the same. My data is in the following format:

data

PlateNo  Well  rep_1  rep_2  rep_3
      1   A01   1312    963   1172
      1   A02  10464   6715   5628
      1   A03   3301   3257   3281
      1   A04   3895   3350   3496
      1   A05   8731   7389   5701
      2   A01   7893   6748   5920
      2   A02   2912   2385   2586
      2   A03    985    785    809
      2   A04   1346   1018   1001
      2   A05    794    314    486

To generate it, copy:
a<- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
b<- c("A01", "A02", "A03", "A04", "A05", "A01", "A02", "A03", "A04", "A05")
c<- c(1312, 10464,  3301,  3895,  8731,  7893,  2912,   985,  1346,   794)
d<- c(963, 6715, 3257, 3350, 7389, 6748, 2385, 785, 1018,  314)
e<- c(1172, 5628, 3281, 3496, 5701, 5920, 2586,  809, 1001,  486)
data<- data.frame(plateNo = a, Well = b, rep_1 = c, rep_2 = d, rep_3 = e)

Here is the code I have come up with:

rows <- length(data$plateNo)
reps <- 3
norm <- list()
for (rep in 1:reps) {
    x <- paste("rep_", rep, sep = "")
    normx <- paste("normalised_", rep, sep = "")
    for (row in 1:rows) {
        plateMean <- mean(data[[x]][data$plateNo == data$plateNo[row]])
        wellData <- data[[x]][row]
        norm[[normx]][row] <- wellData / plateMean
    }
}


Any help or tips would be greatly appreciated!
Thanks,
Haakon                                          

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

