Hi Gaius,

On 01/29/2016 10:52 AM, Gaius Augustus wrote:
I have two dataframes. One has chromosome arm information, and the other
has SNP position information. I am trying to assign each SNP an arm
identity.  I'd like to create this new column based on comparing it to the
reference file.

*1) Mapfile (has millions of rows)*

Name    Chr   Position
S1      1      3000
S2      1      6000
S3      1      1000

*2) Chr.Arms file (has 39 rows)*

Chr    Arm    Start   End
1      p      0       5000
1      q      5001    10000


*R Script that works, but slow:*
Arms <- c()
for (line in 1:nrow(Mapfile)) {
  # Pick the arm on the same chromosome whose range contains this position
  Arms[line] <- Chr.Arms$Arm[Mapfile$Chr[line] == Chr.Arms$Chr &
                             Mapfile$Position[line] > Chr.Arms$Start &
                             Mapfile$Position[line] < Chr.Arms$End]
}
Mapfile$Arm <- Arms


*Output Table:*

Name   Chr   Position   Arm
S1      1     3000      p
S2      1     6000      q
S3      1     1000      p


In words: for each line, I want to look up the location (1) find the right
Chr, 2) find the line where START < POSITION < END), then get the ARM
information and place it in a new column.

This R script works, but surely there is a more time/processing efficient
way to do it.

You could use the GenomicRanges package for this:

1) Turn 'Mapfile' and 'Chr.Arms' into GRanges objects:

  library(GenomicRanges)
  query <- makeGRangesFromDataFrame(Mapfile, start.field="Position",
                                             end.field="Position")
  subject <- makeGRangesFromDataFrame(Chr.Arms)

2) Call findOverlaps() on them:

  Mapfile2Chr.Arms <- findOverlaps(query, subject, select="arbitrary")

3) Use the result of findOverlaps() to create the column to add to
  'Mapfile':

  Mapfile$Arm <- Chr.Arms$Arm[Mapfile2Chr.Arms]
  Mapfile
  #   Name Chr Position Arm
  # 1   S1   1     3000   p
  # 2   S2   1     6000   q
  # 3   S3   1     1000   p

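This should be very fast, since findOverlaps() processes all the positions
in one vectorized call instead of looping over the rows of 'Mapfile'.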
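If installing a Bioconductor package is not an option, the same lookup can be sketched in base R. This is only an illustration under some assumptions, not the GenomicRanges method above: assign_arms() is a hypothetical helper name, the toy data frames simply retype the example tables from the question, and the arms on a chromosome are assumed not to overlap. Like GRanges, it treats Start and End as inclusive, whereas the original loop uses strict inequalities.

```r
# Toy versions of the two tables from the question.
Mapfile <- data.frame(Name = c("S1", "S2", "S3"),
                      Chr = c(1, 1, 1),
                      Position = c(3000, 6000, 1000),
                      stringsAsFactors = FALSE)
Chr.Arms <- data.frame(Chr = c(1, 1),
                       Arm = c("p", "q"),
                       Start = c(0, 5001),
                       End = c(5000, 10000),
                       stringsAsFactors = FALSE)

# Hypothetical helper (not from any package): look up the arm of each SNP,
# one chromosome at a time, with findInterval().
assign_arms <- function(map, arms) {
  out <- rep(NA_character_, nrow(map))
  for (chr in unique(map$Chr)) {
    a <- arms[arms$Chr == chr, ]
    if (nrow(a) == 0L) next
    a <- a[order(a$Start), ]          # findInterval() needs sorted breaks
    rows <- which(map$Chr == chr)
    # Index of the last arm whose Start is <= Position (0 if there is none).
    idx <- findInterval(map$Position[rows], a$Start)
    idx[idx == 0L] <- NA
    # Keep the match only if Position is also at or before that arm's End.
    inside <- !is.na(idx) & map$Position[rows] <= a$End[idx]
    out[rows[inside]] <- as.character(a$Arm[idx[inside]])
  }
  out
}

Mapfile$Arm <- assign_arms(Mapfile, Chr.Arms)
Mapfile
```

Positions that fall outside every arm on their chromosome get NA rather than raising an error. The loop here runs once per chromosome, not once per SNP, so it stays cheap even with millions of rows.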

Note that GenomicRanges is a Bioconductor package:

  http://bioconductor.org/packages/GenomicRanges

Make sure you follow the Installation instructions on that page.

Cheers,
H.


Thanks in advance for any help,
Gaius

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
