Hi,
I have not followed this thread from the beginning, but have you tried
the foverlaps() function from the data.table package?
Something along the lines of:
---
library(data.table)
# create the tables (use as.data.table() or setDT() if you
# start with a data.frame)
mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000))
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"),
                       Start = c(0, 5001), End = c(5000, 10000))
# add a dummy variable to be able to define Position as an interval
mapfile[, Position2 := Position]
# add keys
setkey(mapfile, Chr, Position, Position2)
setkey(Chr.Arms, Chr, Start, End)
# use data.table::foverlaps (see ?foverlaps)
mapfile <- foverlaps(mapfile, Chr.Arms, type = "within")
# remove the dummy variable
mapfile[, Position2 := NULL]
# recreate original order
setorder(mapfile, Chr, Name)
---
BTW, there is a typo in your *SOLUTION*. I guess you wanted to write
data.table(Name = c("S1", "S2", "S3"), Chr = 1, Position = c(3000, 6000,
1000), key = "Chr") instead of data.frame(Name = c("S1", "S2", "S3"),
Chr = 1, Position = c(3000, 6000, 1000), key = "Chr").
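To see why this matters: data.frame() has no key argument, so key = "Chr" is silently treated as one more column. A quick check, using the same toy data (the object name df is only for illustration):

```r
# data.frame() passes unknown named arguments through as columns,
# so key = "Chr" becomes a column called "key" filled with "Chr"
df <- data.frame(Name = c("S1", "S2", "S3"), Chr = 1,
                 Position = c(3000, 6000, 1000), key = "Chr")
names(df)
```

The loop in the *SOLUTION* still runs on such a data.frame; the extra column is just carried along.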
HTH,
Denes
On 01/30/2016 07:48 PM, Gaius Augustus wrote:
I'll look into the Intervals idea. The data.table code posted might not
work (because I don't believe it would put the rows in the correct order if
the chromosomes are interspersed), however, it did make me think about
possibly assigning based on values...
*SOLUTION*
mapfile <- data.frame(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.frame(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
                       End = c(5000, 10000), key = "Chr")
for (i in 1:nrow(Chr.Arms)) {
  cur.row <- Chr.Arms[i, ]
  mapfile$Arm[mapfile$Chr == cur.row$Chr &
              mapfile$Position >= cur.row$Start &
              mapfile$Position <= cur.row$End] <- cur.row$Arm
}
This took out the need for the intermediate table/vector. This worked for
me, and was VERY fast. Took <5 minutes on a dataframe with 35 million rows.
Thanks for the help,
Gaius
On Sat, Jan 30, 2016 at 10:50 AM, Gaius Augustus <gaiusjaugus...@gmail.com>
wrote:
I'll look into the Intervals idea. The data.table code posted might not
work (because I don't believe it would put the rows in the correct order if
the chromosomes are interspersed), however, it did make me think about
possibly assigning based on values...
Something like:
library(data.table)
mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
                       End = c(5000, 10000), key = "Chr")
for (i in 1:nrow(Chr.Arms)) {
  cur.row <- Chr.Arms[i, ]
  # assign this arm's label (not the whole Arm column) to the matching rows
  mapfile[Chr == cur.row$Chr & Position >= cur.row$Start &
          Position <= cur.row$End, Arm := cur.row$Arm]
}
This might take out the need for the intermediate table/vector. Not sure
yet if it'll work, but we'll see. I'm interested to know if anyone else
has any ideas, too.
Thanks,
Gaius
On Fri, Jan 29, 2016 at 11:34 PM, Ulrik Stervbo <ulrik.ster...@gmail.com>
wrote:
Hi Gaius,
Could you use data.table and loop over the small Chr.arms?
library(data.table)
mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
                       End = c(5000, 10000), key = "Chr")
Arms <- data.table()
for (i in 1:nrow(Chr.Arms)) {
  cur.row <- Chr.Arms[i, ]
  # match on chromosome as well as position, so that rows from other
  # chromosomes are not picked up
  Arm <- mapfile[Chr == cur.row$Chr & Position >= cur.row$Start &
                 Position <= cur.row$End]
  Arm <- Arm[, Arm := cur.row$Arm][]
  Arms <- rbind(Arms, Arm)
}
# Or use plyr to loop over each possible arm
library(plyr)
Arms <- ddply(Chr.Arms, .variables = "Arm", function(cur.row, mapfile){
  mapfile <- mapfile[ Position >= cur.row$Start & Position <= cur.row$End]
  mapfile <- mapfile[ , Arm:=cur.row$Arm][]
  return(mapfile)
}, mapfile = mapfile)
I have just started to use data.table and I have the feeling the code
above can be greatly improved - maybe the loop can be dropped entirely?
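On dropping the loop: one possibility is an update join that matches each SNP to the arm whose interval contains its position. This is a minimal sketch, assuming data.table 1.9.8 or later (the version that added non-equi joins) and using the toy tables from this thread without keys; it has not been tried on the full data.

```r
library(data.table)

mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000))
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"),
                       Start = c(0, 5001), End = c(5000, 10000))

# non-equi update join: for every row of Chr.Arms, find the rows of mapfile
# with the same Chr and Start <= Position <= End, and write that arm's
# label into a new Arm column (i.Arm is the Arm column of Chr.Arms)
mapfile[Chr.Arms, on = .(Chr, Position >= Start, Position <= End),
        Arm := i.Arm]
```

Because := updates mapfile by reference, the rows stay in their original order, and any SNP that falls in no interval is left as NA.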
Hope this helps
Ulrik
On Sat, 30 Jan 2016 at 03:29 Gaius Augustus <gaiusjaugus...@gmail.com>
wrote:
I have two dataframes. One has chromosome arm information, and the other
has SNP position information. I am trying to assign each SNP an arm
identity. I'd like to create this new column based on comparing it to the
reference file.
*1) Mapfile (has millions of rows)*
Name  Chr  Position
S1    1    3000
S2    1    6000
S3    1    1000
*2) Chr.Arms file (has 39 rows)*
Chr  Arm  Start  End
1    p    0      5000
1    q    5001   10000
*R Script that works, but slow:*
Arms <- c()
for (line in 1:nrow(Mapfile)){
  Arms[line] <- Chr.Arms$Arm[ Mapfile$Chr[line] == Chr.Arms$Chr &
                              Mapfile$Position[line] > Chr.Arms$Start &
                              Mapfile$Position[line] < Chr.Arms$End]
}
Mapfile$Arm <- Arms
*Output Table:*
Name  Chr  Position  Arm
S1    1    3000      p
S2    1    6000      q
S3    1    1000      p
In words: I want each line to look up the location (1. find the right Chr,
2. find the line where START < POSITION < END), then get the ARM
information and place it in a new column.
This R script works, but surely there is a more time/processing efficient
way to do it.
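For illustration, one vectorised way to express the same lookup in base R is to merge the two tables on Chr and then keep only the rows where the position lies inside the arm's interval. This is a rough sketch on the toy data above, assuming every Name is unique; the intermediate names candidates and hits are made up for the example.

```r
Mapfile <- data.frame(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000),
                      stringsAsFactors = FALSE)
Chr.Arms <- data.frame(Chr = 1, Arm = c("p", "q"),
                       Start = c(0, 5001), End = c(5000, 10000),
                       stringsAsFactors = FALSE)

# pair every SNP with every arm on the same chromosome
candidates <- merge(Mapfile, Chr.Arms, by = "Chr")

# keep the pairs where START < POSITION < END
hits <- candidates[candidates$Position > candidates$Start &
                   candidates$Position < candidates$End, ]

# copy the arm back in the original row order (NA if no arm matched)
Mapfile$Arm <- hits$Arm[match(Mapfile$Name, hits$Name)]
```

The merge creates one row per SNP for each arm on its chromosome, so memory use grows accordingly on a table with millions of rows.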
Thanks in advance for any help,
Gaius
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.