OK, so the problem is not the SNPs. Look at the "chr" columns instead.
I will bet that you get 0 when you try:
length(intersect(data_lane6_snps$chr, data_lane6_snps_rsid$chr))
... since one is using a format of "chrNN" and the other is using just
"NN". You need to get the chromosome naming convention straightened out.
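One way to straighten out the naming is sketched below on toy data. The data frames `snps` and `rsid` and all of their values are made up for illustration; only the column names `chr`, `SNP`, `depth` and `rsid` come from this thread. The sketch assumes the only difference between the two conventions is the "chr" prefix, which may not hold for every label in the real data.

```r
# Toy stand-ins for the two data frames in this thread (values are made up).
snps <- data.frame(
  chr   = factor(c("chr1", "chr1", "chr11", "chrX")),
  SNP   = c(100L, 222L, 100L, 500L),
  depth = c(1L, 2L, 2L, 3L)
)
rsid <- data.frame(
  chr  = factor(c("1", "11", "X", "2")),
  SNP  = c(100L, 100L, 500L, 222L),
  rsid = c("rs1", "rs2", "rs3", "rs4")
)

# No chromosome labels in common while one side carries the "chr" prefix.
n_before <- length(intersect(snps$chr, rsid$chr))

# Strip the prefix and store both key columns as character, so that
# merge() compares plain labels rather than factors with different levels.
snps$chr <- sub("^chr", "", as.character(snps$chr))
rsid$chr <- as.character(rsid$chr)

n_after <- length(intersect(snps$chr, rsid$chr))

# Inner join on both keys: a row survives only if chr AND SNP agree.
merged <- merge(snps, rsid, by = c("chr", "SNP"))
print(merged)
```

In the toy data, position 222 occurs on chromosome 1 in one frame and on chromosome 2 in the other, so it is correctly left out of the result. In the real data the first frame has 25 chromosome levels and the second has 24, so at least one label will remain unmatched even after the prefix is stripped; it is worth comparing `unique()` of both columns after the conversion.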
--
David.
On Apr 6, 2010, at 4:53 PM, Abhishek Pratap wrote:
Just so you know:
length(intersect(data_lane6_snps$SNP, data_lane6_snps_rsid$SNP))
796120
I just need to include the chr condition as well, which is where I am
stuck.
-Abhi
On Tue, Apr 6, 2010 at 4:51 PM, Abhishek Pratap <abhishek....@gmail.com> wrote:
Hi David
I understand that, looking at the SNP values shown, the two columns
appear to hold different values, which would explain the empty merge
result. However, the columns still have ~700K SNPs in common. What I
am looking for is a merge where both SNP and chr match. If I match
only on the SNP column I get partially correct results, since two
chromosomes can each have a SNP at the same bp position, so the merge
needs to take both SNP position and chromosome into account.
Thanks!
-Abhi
On Tue, Apr 6, 2010 at 4:42 PM, David Winsemius <dwinsem...@comcast.net> wrote:
On Apr 6, 2010, at 4:03 PM, Abhishek Pratap wrote:
Hi David
Here it is. You can ignore the bio jargon if it sounds confusing.
Sometimes it is essential to have domain details.
The data types of the columns (SNP, chr) on which I am applying merge
are the same in both data frames.
merge(data_lane6_snps, data_lane6_snps_rsid, by = c("SNP", "chr"))
> str(data_lane6_snps)
'data.frame': 7724462 obs. of 10 variables:
$ chr : Factor w/ 25 levels "chr1","chr10",..: 1 1 1 1 1 1 1 1 1 1 ...
$ SNP : int 100 101 103 108 179 180 191 197 218 222 ...
$ reference : Factor w/ 5 levels "A","C","G","N",..: 2 2 5 2 2 5 2 2 1 5 ...
$ genotype : Factor w/ 10 levels "A","C","G","K",..: 1 1 1 8 2 2 3 8 2 2 ...
$ consensus_qual: int 0 0 0 4 33 33 19 19 19 19 ...
$ snp_qual : int 0 0 0 4 0 33 19 19 19 19 ...
$ rms_qual : int 0 0 0 0 21 21 21 21 21 21 ...
$ depth : int 1 1 1 1 2 2 2 2 2 2 ...
$ bases : Factor w/ 453774 levels "^!,","^!,^!,",..: 5 5 5 410998 49793 155731 284998 416878 133393 133393 ...
$ base_quality : Factor w/ 555104 levels "`","``","```",..: 359 359 359 54813 92856 92856 92856 92856 92539 55424 ...
> str(data_lane6_snps_rsid)
'data.frame': 797807 obs. of 4 variables:
$ chr : Factor w/ 24 levels "1","10","11",..: 3 3 3 3 3 3 3 3 3 3 ...
$ SNP : int 68143872 11071026 69423434 12394791 1302846 95330693 3921381 57122299 41899656 76990037 ...
Comparing this line with the line for "SNP" in the data frame above, I
am not seeing much similarity in the range of values. There are also
about 10 times fewer observations. What was your plan for the
non-matching cases? Did you really mean that you wanted a right outer
join?
You might get some information by trying:
length(intersect(data_lane6_snps$SNP, data_lane6_snps_rsid$SNP))
That would tell you how many potential matches you might have on the
basis of SNP numbers, although an SNP match might or might not be a
full match, given that chr matching is also being specified.
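The check suggested above can be extended to the (chr, SNP) pair, and the question about non-matching cases can be tried out at the same time. This is only a sketch on toy data: `x`, `y` and their values are hypothetical, and the chromosome labels are assumed to already use the same naming convention in both frames.

```r
# Toy data (made-up values); chr labels already share one convention.
x <- data.frame(chr = c("1", "1", "2", "3"),
                SNP = c(100L, 222L, 222L, 700L),
                depth = c(1L, 2L, 2L, 5L),
                stringsAsFactors = FALSE)
y <- data.frame(chr = c("1", "2", "2", "X"),
                SNP = c(100L, 222L, 900L, 700L),
                rsid = c("rs1", "rs2", "rs3", "rs4"),
                stringsAsFactors = FALSE)

# Distinct positions shared on SNP alone: ignores which chromosome they are on.
n_snp <- length(intersect(x$SNP, y$SNP))

# Distinct (chr, SNP) pairs shared, using a pasted composite key.
key_x <- paste(x$chr, x$SNP, sep = ":")
key_y <- paste(y$chr, y$SNP, sep = ":")
n_pair <- length(intersect(key_x, key_y))

# The default inner join keeps only the matched pairs.
inner <- merge(x, y, by = c("chr", "SNP"))

# all.y = TRUE keeps every row of y and fills x's columns with NA where
# there is no match, i.e. a right outer join.
right <- merge(x, y, by = c("chr", "SNP"), all.y = TRUE)
```

Here `n_snp` is larger than `n_pair` because position 700 appears on different chromosomes in the two frames. If `n_pair` comes out as 0 on the real data while `n_snp` is large, the chr columns are the place to look.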
$ end : int 68143872 11071026 69423434 12394791 1302846 95330693 3921381 57122299 41899656 76990037 ...
$ rsid: Factor w/ 797807 levels "rs10","rs10000010",..: 100229 685690 505395 470219 780326 29342 29263 327909 434159 723152 ...
On Tue, Apr 6, 2010 at 3:59 PM, David Winsemius <dwinsem...@comcast.net> wrote:
On Apr 6, 2010, at 3:54 PM, Abhishek Pratap wrote:
Hi Guys
I have two data frames which I would like to merge on two conditions.
I am doing the following (abstract form)
new.data.frame <- merge(df1,df2, by=c("Col1","Col2"))
So I am guessing that you really wanted just this:
new.data.frame <- merge(df1,df2)
?merge
Since the default for merge is: by = intersect(names(x), names(y)),
this would have been equivalent to
new.data.frame <- merge(df1,df2, by=c("chr", "SNP") )
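The equivalence can be checked on a small example. The frames below are toy data with made-up values; as in the thread, the only column names they share are `chr` and `SNP`.

```r
# Toy frames (made-up values) whose only shared column names are chr and SNP.
df1 <- data.frame(chr = c("1", "1", "2"),
                  SNP = c(100L, 222L, 222L),
                  depth = c(1L, 2L, 2L))
df2 <- data.frame(chr = c("1", "2"),
                  SNP = c(100L, 222L),
                  rsid = c("rs1", "rs2"))

# The columns merge() joins on when `by` is not given.
shared <- intersect(names(df1), names(df2))

implicit <- merge(df1, df2)                       # default by
explicit <- merge(df1, df2, by = c("chr", "SNP")) # same keys, spelled out

print(shared)
print(identical(implicit, explicit))
```

Spelling out `by` is still the safer habit: if the two frames later gain another column with the same name, the default would silently join on that column as well.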
See above regarding the possibility that you have non-congruent SNP
labeling problems.
What does
str(df1) ; str(df2)
... show?
It is giving me a null result.
Basically I need to apply two conditions.
I also tried sqldf, but it runs forever. Will indexing help?
temp <- sqldf("SELECT a.chr, a.SNP, a.snp_qual, a.rms_qual, a.depth, b.rsid
               FROM data_lane6_snps a, data_lane6_snps_rsid b
               WHERE a.SNP = b.SNP
                 AND a.chr = b.chr")
Thanks!
-Abhi
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
West Hartford, CT