Hi Gaius,
On 01/29/2016 10:52 AM, Gaius Augustus wrote:
I have two dataframes. One has chromosome arm information, and the other
has SNP position information. I am trying to assign each SNP an arm
identity. I'd like to create this new column based on comparing it to the
reference file.
*1) Mapfile (has millions of rows)*
Name  Chr  Position
S1    1    3000
S2    1    6000
S3    1    1000
*2) Chr.Arms file (has 39 rows)*
Chr  Arm  Start  End
1    p    0      5000
1    q    5001   10000
*R Script that works, but slow:*
Arms <- c()
for (line in 1:nrow(Mapfile)){
  Arms[line] <- Chr.Arms$Arm[ Mapfile$Chr[line] == Chr.Arms$Chr &
                              Mapfile$Position[line] > Chr.Arms$Start &
                              Mapfile$Position[line] < Chr.Arms$End]
}
Mapfile$Arm <- Arms
*Output Table:*
Name  Chr  Position  Arm
S1    1    3000      p
S2    1    6000      q
S3    1    1000      p
In words: for each line, I want to look up the location (1. find the right
Chr, 2. find the row where START < POSITION < END), then get the ARM
information and place it in a new column.
This R script works, but surely there is a more time/processing efficient
way to do it.
You could use the GenomicRanges package for this:
1) Turn 'Mapfile' and 'Chr.Arms' into GRanges objects:
  library(GenomicRanges)
  query <- makeGRangesFromDataFrame(Mapfile, start.field="Position",
                                    end.field="Position")
  subject <- makeGRangesFromDataFrame(Chr.Arms)
2) Call findOverlaps() on them:
  Mapfile2Chr.Arms <- findOverlaps(query, subject, select="arbitrary")
3) Use the result of findOverlaps() to create the column to add to
'Mapfile':
  Mapfile$Arm <- Chr.Arms$Arm[Mapfile2Chr.Arms]
  Mapfile
  #   Name Chr Position Arm
  # 1   S1   1     3000   p
  # 2   S2   1     6000   q
  # 3   S3   1     1000   p
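This should be very fast: findOverlaps() handles all the SNPs in a single
vectorized call instead of looping over the rows of 'Mapfile' in R.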
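For convenience, here is a sketch that runs the three steps end to end on the toy data from your post. The two data frames are typed in by hand from your tables, and the 'hits' variable name and the NA count at the end are just illustrative additions, so adapt as needed for your real files:

```r
library(GenomicRanges)

# Toy data typed in from the tables in the question
Mapfile <- data.frame(Name = c("S1", "S2", "S3"),
                      Chr = c(1, 1, 1),
                      Position = c(3000, 6000, 1000),
                      stringsAsFactors = FALSE)
Chr.Arms <- data.frame(Chr = c(1, 1),
                       Arm = c("p", "q"),
                       Start = c(0, 5001),
                       End = c(5000, 10000),
                       stringsAsFactors = FALSE)

# 1) Each SNP becomes a width-1 range; each arm becomes a Start..End range
query <- makeGRangesFromDataFrame(Mapfile, start.field = "Position",
                                  end.field = "Position")
subject <- makeGRangesFromDataFrame(Chr.Arms)

# 2) With select="arbitrary", the result is an integer vector parallel to
#    'query': the row of 'Chr.Arms' that each SNP hits, or NA if it hits none
hits <- findOverlaps(query, subject, select = "arbitrary")

# 3) Use that vector to index the Arm column
Mapfile$Arm <- Chr.Arms$Arm[hits]

# Illustrative check: number of SNPs that did not fall in any arm
sum(is.na(hits))
```

One difference from your loop to keep in mind: GRanges ranges are closed intervals, so a SNP whose Position is exactly equal to a Start or End value is counted as inside that arm, whereas your loop uses strict < and > and would skip it.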
Note that GenomicRanges is a Bioconductor package:
http://bioconductor.org/packages/GenomicRanges
Make sure you follow the Installation instructions on that page.
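On recent versions of R, those instructions usually amount to something like the sketch below. It assumes the BiocManager package from CRAN, which is not mentioned above, so defer to the package page if it says something different for your R version:

```r
# Assumption: BiocManager (from CRAN) is the installer recommended on the
# package page for your version of R
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install GenomicRanges and its Bioconductor dependencies
BiocManager::install("GenomicRanges")
```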
Cheers,
H.
Thanks in advance for any help,
Gaius
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpa...@fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319