Re: [R] Merging data frames on two conditions

Abhishek Pratap Tue, 06 Apr 2010 14:14:51 -0700

You found the error: the two data frames use different naming conventions
for chr. I should be able to fix that pretty easily.
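
For the archive, here is a minimal sketch of the kind of fix I have in mind.
The toy data frames `snps` and `rsid` and their values are made up for
illustration; they only mimic the chr/SNP columns of data_lane6_snps and
data_lane6_snps_rsid. It assumes the only difference between the two
conventions is a leading "chr" prefix. The str() output shows 25 chr levels
in one frame and 24 in the other, so at least one label will have no
counterpart after the prefix is removed.

```r
# Hypothetical toy stand-ins for the two data frames (values are made up).
snps <- data.frame(
  chr   = factor(c("chr1", "chr1", "chr2", "chrX")),
  SNP   = c(100L, 101L, 100L, 250L),
  depth = c(1L, 2L, 2L, 3L)
)
rsid <- data.frame(
  chr  = factor(c("1", "2", "X", "3")),
  SNP  = c(100L, 100L, 250L, 100L),
  rsid = c("rs1", "rs2", "rs3", "rs4"),
  stringsAsFactors = FALSE
)

# Before the fix: no chr label is shared, so a merge on both keys is empty.
n_before <- length(intersect(as.character(snps$chr), as.character(rsid$chr)))

# Strip the leading "chr" and keep the key as character in both frames,
# so the comparison does not depend on factor level sets.
snps$chr <- sub("^chr", "", as.character(snps$chr))
rsid$chr <- as.character(rsid$chr)

n_after <- length(intersect(snps$chr, rsid$chr))

# Inner join on both keys: a row is kept only if chr AND SNP match.
merged <- merge(snps, rsid, by = c("chr", "SNP"))
print(merged)
```

Comparing `n_before` and `n_after` is the same intersect() check David
suggested, run before and after harmonizing the labels. Passing all = TRUE
or all.y = TRUE to merge() would keep the non-matching rows if an outer join
is what is wanted.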


In case the problem persists, I will contact the list.

Thanks!
-Abhi

On Tue, Apr 6, 2010 at 5:01 PM, David Winsemius <dwinsem...@comcast.net> wrote:

> OK, not the SNPs. So look at the "chr" values. I will bet that you get 0
> when you try:
>
> length(intersect(data_lane6_snps$chr, data_lane6_snps_rsid$chr))
>
>
> ... since one is using a format of "chrNN" and the other is using just
> "NN". You need to get the chromosome naming convention straightened out.
>
> --
> David.
>
>
> On Apr 6, 2010, at 4:53 PM, Abhishek Pratap wrote:
>
>> Just so you know:
>>
>> length(intersect(data_lane6_snps$SNP, data_lane6_snps_rsid$SNP))
>> 796120
>>
>> I just need to include the chr condition now where I am stuck.
>>
>> -Abhi
>>
>> On Tue, Apr 6, 2010 at 4:51 PM, Abhishek Pratap <abhishek....@gmail.com>
>> wrote:
>> Hi David
>>
>> I understand that, judging from the SNP values shown, the two columns look
>> like they hold different values, which would explain the empty merge
>> result. However, the columns still have ~700K SNPs in common. What I am
>> looking for is a merge where both SNP and chr match. If I match only on the
>> SNP column I get partially correct results, since two chromosomes can have
>> a SNP at the same bp position, so the merge needs to take both SNP position
>> and chromosome into account.
>>
>> Thanks!
>> -Abhi
>>
>>
>> On Tue, Apr 6, 2010 at 4:42 PM, David Winsemius <dwinsem...@comcast.net>
>> wrote:
>>
>> On Apr 6, 2010, at 4:03 PM, Abhishek Pratap wrote:
>>
>> Hi David
>>
>> Here it is. You can ignore the bio jargon if it sounds confusing.
>>
>> Sometimes it is essential to have domain details.
>>
>>
>> The corresponding data type of column (SNP, chr) on which I am applying
>> merge is same.
>>
>> merge(data_lane6_snps, data_lane6_snps_rsid, by = c("SNP", "chr"))
>>
>>
>> str(data_lane6_snps)
>> 'data.frame':   7724462 obs. of  10 variables:
>>  $ chr           : Factor w/ 25 levels "chr1","chr10",..: 1 1 1 1 1 1 1 1
>> 1 1 ...
>>  $ SNP           : int  100 101 103 108 179 180 191 197 218 222 ...
>>  $ reference     : Factor w/ 5 levels "A","C","G","N",..: 2 2 5 2 2 5 2 2
>> 1 5 ...
>>  $ genotype      : Factor w/ 10 levels "A","C","G","K",..: 1 1 1 8 2 2 3 8
>> 2 2 ...
>>  $ consensus_qual: int  0 0 0 4 33 33 19 19 19 19 ...
>>  $ snp_qual      : int  0 0 0 4 0 33 19 19 19 19 ...
>>  $ rms_qual      : int  0 0 0 0 21 21 21 21 21 21 ...
>>  $ depth         : int  1 1 1 1 2 2 2 2 2 2 ...
>>  $ bases         : Factor w/ 453774 levels "^!,","^!,^!,",..: 5 5 5 410998
>> 49793 155731 284998 416878 133393 133393 ...
>>  $ base_quality  : Factor w/ 555104 levels "`","``","```",..: 359 359 359
>> 54813 92856 92856 92856 92856 92539 55424 ...
>>
>> > str(data_lane6_snps_rsid)
>> 'data.frame':   797807 obs. of  4 variables:
>>  $ chr : Factor w/ 24 levels "1","10","11",..: 3 3 3 3 3 3 3 3 3 3 ...
>>  $ SNP : int  68143872 11071026 69423434 12394791 1302846 95330693 3921381
>> 57122299 41899656 76990037 ...
>>
>> Looking at this line and the line for "SNP" in the data frame above, I am
>> not seeing much similarity in range. There are also about 10 times fewer
>> observations. What was your plan for the non-matching cases? Did you really
>> mean that you wanted a right outer join?
>>
>> You might get information by trying:
>>
>> length(intersect(data_lane6_snps$SNP, data_lane6_snps_rsid$SNP))
>>
>> That would tell you how many potential matches you might have on the basis
>> of SNP numbers, although an SNP match might or might not be a full match
>> given the chr matching that is also being specified.
>>
>>
>>
>>  $ end : int  68143872 11071026 69423434 12394791 1302846 95330693 3921381
>> 57122299 41899656 76990037 ...
>>  $ rsid: Factor w/ 797807 levels "rs10","rs10000010",..: 100229 685690
>> 505395 470219 780326 29342 29263 327909 434159 723152 ...
>>
>>
>> On Tue, Apr 6, 2010 at 3:59 PM, David Winsemius <dwinsem...@comcast.net>
>> wrote:
>>
>> On Apr 6, 2010, at 3:54 PM, Abhishek Pratap wrote:
>>
>> Hi Guys
>>
>> I have two data frames which I would like to merge on two conditions.
>>
>> I am doing the following  (abstract form)
>>
>> new.data.frame <- merge(df1,df2, by=c("Col1","Col2"))
>>
>> So I am guessing that you really wanted just this:
>>
>> new.data.frame <- merge(df1,df2)
>>
>> ?merge
>>
>> Since the default for merge is:  by = intersect(names(x), names(y)), this
>> would have been equivalent to
>>
>> new.data.frame <- merge(df1,df2, by=c("chr", "SNP") )
>>
>> See above regarding the possibility that you have non-congruent SNP
>> labeling problems.
>>
>>
>>
>>
>>
>> What does
>>
>>  str(df1) ; str(df2)
>>
>> ... show?
>>
>>
>>
>> It is giving me a null result.
>>
>> Basically I need to apply two conditions.
>>
>> I also tried sqldf but it is running forever. Will indexing help ?
>>
>> temp <- sqldf("SELECT a.chr, a.SNP, a.snp_qual, a.rms_qual, a.depth, b.rsid
>>   FROM data_lane6_snps a,
>>        data_lane6_snps_rsid b
>>   WHERE a.SNP = b.SNP
>>     AND a.chr = b.chr")
>>
>> Thanks!
>> -Abhi
>>
>>      [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>> David Winsemius, MD
>> West Hartford, CT
> David Winsemius, MD
> West Hartford, CT
>
>

