Re: [R] Conditional editing of rows in a data frame

David Winsemius Thu, 28 Jan 2010 05:28:09 -0800


On Jan 28, 2010, at 7:05 AM, Irene Gallego Romero wrote:

Dear R users,

I have a dataframe (main.table) with ~30,000 rows and 6 columns, of
which here are a few rows:

     id chr window         gene     xp.norm    xp.top
129 1_32   1     32       TAS1R1  1.28882115     FALSE
130 1_32   1     32       ZBTB48  1.28882115     FALSE
131 1_32   1     32       KLHL21  1.28882115     FALSE
132 1_32   1     32        PHF13  1.28882115     FALSE
133 1_33   1     33        PHF13  1.02727430     FALSE
134 1_33   1     33        THAP3  1.02727430     FALSE
135 1_33   1     33      DNAJC11  1.02727430     FALSE
136 1_33   1     33       CAMTA1  1.02727430     FALSE
137 1_34   1     34       CAMTA1  1.40312732      TRUE
138 1_35   1     35       CAMTA1  1.52104538     FALSE
139 1_36   1     36       CAMTA1  1.04853732     FALSE
140 1_37   1     37       CAMTA1  0.64794094     FALSE
141 1_38   1     38       CAMTA1  1.23026086      TRUE
142 1_38   1     38        VAMP3  1.23026086      TRUE
143 1_38   1     38         PER3  1.23026086      TRUE
144 1_39   1     39         PER3  1.18154967      TRUE
145 1_39   1     39         UTS2  1.18154967      TRUE
146 1_39   1     39      TNFRSF9  1.18154967      TRUE
147 1_39   1     39        PARK7  1.18154967      TRUE
148 1_39   1     39       ERRFI1  1.18154967      TRUE
149 1_40   1     40      no_gene  1.79796879     FALSE
150 1_41   1     41      SLC45A1  0.20193560     FALSE

I want to create two new columns, xp.bg and xp.n.top, using the
following criteria:

If gene is the same in consecutive rows, xp.bg is the minimum value of
xp.norm in those rows; if gene is not the same, xp.bg is simply the
value of xp.norm for that row;

Assuming that gene values are adjacent in a dataframe named df1, then this would work:


df1$xp.bg<- with(df1, ave(xp.norm, gene, FUN=min))


Likewise, if there's a run of contiguous xp.top = TRUE values,
xp.n.top is the minimum value in that range, and if xp.top is false or
NA, xp.n.top is NA, or 0 (I don't care).


df1$seqgrp <- c(0, diff(df1$xp.top))
df1$seqgrp2 <- cumsum(df1$seqgrp != 0)
df1$xp.n.top <- with(df1, ave(xp.norm, seqgrp2, FUN=min))
is.na(df1$xp.n.top) <- !df1$xp.top

> df1$xp.bg<- with(df1, ave(xp.norm, gene, FUN=min))
> df1

      id chr window    gene   xp.norm xp.top seqgrp seqgrp2 xp.n.top     xp.bg
129 1_32   1     32  TAS1R1 1.2888211  FALSE      0       0       NA 1.2888211
130 1_32   1     32  ZBTB48 1.2888211  FALSE      0       0       NA 1.2888211
131 1_32   1     32  KLHL21 1.2888211  FALSE      0       0       NA 1.2888211
132 1_32   1     32   PHF13 1.2888211  FALSE      0       0       NA 1.0272743
133 1_33   1     33   PHF13 1.0272743  FALSE      0       0       NA 1.0272743
134 1_33   1     33   THAP3 1.0272743  FALSE      0       0       NA 1.0272743
135 1_33   1     33 DNAJC11 1.0272743  FALSE      0       0       NA 1.0272743
136 1_33   1     33  CAMTA1 1.0272743  FALSE      0       0       NA 0.6479409
137 1_34   1     34  CAMTA1 1.4031273   TRUE      1       1 1.403127 0.6479409
138 1_35   1     35  CAMTA1 1.5210454  FALSE     -1       2       NA 0.6479409
139 1_36   1     36  CAMTA1 1.0485373  FALSE      0       2       NA 0.6479409
140 1_37   1     37  CAMTA1 0.6479409  FALSE      0       2       NA 0.6479409
141 1_38   1     38  CAMTA1 1.2302609   TRUE      1       3 1.181550 0.6479409
142 1_38   1     38   VAMP3 1.2302609   TRUE      0       3 1.181550 1.2302609
143 1_38   1     38    PER3 1.2302609   TRUE      0       3 1.181550 1.1815497
144 1_39   1     39    PER3 1.1815497   TRUE      0       3 1.181550 1.1815497
145 1_39   1     39    UTS2 1.1815497   TRUE      0       3 1.181550 1.1815497
146 1_39   1     39 TNFRSF9 1.1815497   TRUE      0       3 1.181550 1.1815497
147 1_39   1     39   PARK7 1.1815497   TRUE      0       3 1.181550 1.1815497
148 1_39   1     39  ERRFI1 1.1815497   TRUE      0       3 1.181550 1.1815497
149 1_40   1     40 no_gene 1.7979688  FALSE     -1       4       NA 1.7979688
150 1_41   1     41 SLC45A1 0.2019356  FALSE      0       4       NA 0.2019356
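
To make the diff()/cumsum() step easier to follow, here is a minimal sketch on a small made-up data frame. The name toy and its values are hypothetical and only illustrate how the run labels arise; they are not taken from main.table.

```r
# Toy data (hypothetical values, not from the original table)
toy <- data.frame(
  xp.norm = c(1.3, 1.4, 1.5, 0.6, 1.2, 1.1, 1.8),
  xp.top  = c(FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)
)

# diff() on a logical vector gives 1 at FALSE->TRUE, -1 at TRUE->FALSE,
# and 0 where the value does not change; pad with 0 for the first row
change <- c(0, diff(toy$xp.top))

# Every non-zero change starts a new run, so the running count labels runs
run.id <- cumsum(change != 0)

# Minimum of xp.norm within each run, then set the FALSE runs to NA
toy$xp.n.top <- ave(toy$xp.norm, run.id, FUN = min)
is.na(toy$xp.n.top) <- !toy$xp.top

print(toy)
```

Note that this sketch assumes xp.top has no NA values; an NA there would propagate through diff() and cumsum(), so such rows would probably need to be recoded to FALSE first.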

And if the adjacent-gene assumption of the first request above were not met, then the first portion of this method could be used instead to create group indices.
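
One way this might look, as a sketch only: since diff() does not work on character or factor columns, the version below compares each gene with the one in the previous row to build the run index. The data frame toy, the column xp.bg.pooled and the values are hypothetical, chosen so that one gene occurs in two separate runs (as no_gene does in the real table).

```r
# Toy data (hypothetical): gene "A" appears in two separate runs
toy <- data.frame(
  gene    = c("A", "A", "B", "A", "A", "C"),
  xp.norm = c(1.2, 0.9, 1.5, 0.4, 0.7, 1.1),
  stringsAsFactors = FALSE
)

# Start a new group whenever gene differs from the previous row
gene.run <- cumsum(c(TRUE, toy$gene[-1] != toy$gene[-nrow(toy)]))

# Minimum within each run of consecutive identical genes
toy$xp.bg <- ave(toy$xp.norm, gene.run, FUN = min)

# For comparison: grouping on gene alone pools both runs of "A"
toy$xp.bg.pooled <- ave(toy$xp.norm, toy$gene, FUN = min)

print(toy)
```

With grouping on gene.run the two runs of "A" get their own minima, whereas grouping on gene alone gives every "A" row the overall minimum.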


--
David.


So, in the above example,
xp.bg for rows 136:141 should be 0.64794094, and is equal to xp.norm
for all other rows,
xp.n.top for row 137 is 1.40312732, 1.18154967 for rows 141:148, and
0/NA for all other rows.

Is there a way to combine indexing and if statements or some such to
accomplish this? I want to do this without using split(main.table,
main.table$gene), because there's about 20,000 unique entries for
gene, and one of the entries, no_gene, is repeated throughout. I
thought briefly of subsetting the rows where xp.top is TRUE, but I
then don't know how to set the range for min, so that it only looks at
what would originally have been consecutive rows, and searching the
help has not proved particularly useful.

Thanks in advance,
Irene Gallego Romero




David Winsemius, MD
Heritage Laboratories
West Hartford, CT

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
