Re: [R] Counting occurances of a letter by a factor

Thomas Lumley Fri, 10 Sep 2010 16:10:20 -0700

On Fri, 10 Sep 2010, Davis, Brian wrote:

Thomas,


I don't *believe* I have the heterozygote coded both ways.  However, I haven't 
checked thoroughly.  I do notice that some SNPs (X's) only have 2 levels, say AA 
and AT.  One could build the multiplication table from the levels I suppose.  
(Or force the factor to include all three levels)



I would force the factor to include all three levels.
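
A minimal sketch of what forcing the levels could look like. The SNP values, the grouping vector and the names snp, y, geno and at are made up for illustration (they are not Brian's data), and it assumes the alleles are A and T:

```r
# Hypothetical SNP in which only two of the three genotypes were observed
snp <- c("AA", "AT", "AA", NA, "AT")
y   <- c("L", "U", "L", "U", "L")

# Declaring the levels explicitly keeps the unobserved genotype TT in the table
geno <- factor(snp, levels = c("AA", "AT", "TT"))
tab  <- table(geno, y)

# Copies of each allele carried by AA, AT and TT, in that order
at <- rbind(A = c(2, 1, 0), T = c(0, 1, 2))
allele_counts <- at %*% tab
allele_counts
```

Because the table always has three rows, the same 2 x 3 matrix can be reused for every SNP that shares those alleles.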

   -thomas

Brian

-----Original Message-----
From: Thomas Lumley [mailto:tlum...@u.washington.edu]
Sent: Friday, September 10, 2010 3:22 PM
To: Davis, Brian
Cc: Phil Spector; r-help@r-project.org
Subject: Re: [R] Counting occurances of a letter by a factor

On Fri, 10 Sep 2010, Davis, Brian wrote:

In my quest for brevity I think I sacrificed too much clarity.

I'll try to be a little less brief in the hopes of being more clear.

Say I have data frame like this as before:

DF<-data.frame(c("CC", "CC", NA, "CG", "GG", "GC"), c("L", "U", "L", "U", "L", 
NA))
colnames(DF)<-c("X", "Y")
DF

    X    Y
1   CC    L
2   CC    U
3 <NA>    L
4   CG    U
5   GG    L
6   GC <NA>

I need to count the frequency of the unique individual characters in DF$X at 
each factor level in DF$Y

So for DF$Y == "L"  there are 2 "C"'s and 2 "G"'s
and for DF$Y == "U" there are 3 "C"'s and 1 "G"

The NA's should not contribute to the counts.

If I had an individual character in DF$X instead of a string, like:

DF2<-data.frame(c("C", "C", NA, "C", "G", "G"), c("L", "U", "L", "U", "L", NA))
colnames(DF2)<-c("X", "Y")
DF2

    X    Y
1    C    L
2    C    U
3 <NA>    L
4    C    U
5    G    L
6    G <NA>

Then table gives me exactly what I need.

table(DF2)

   Y
X   L U
  C 1 2
  G 1 0


I would use table() as the first step

table(DF[,1],DF[,2])


     L U
  CC 1 1
  CG 0 1
  GC 0 0
  GG 1 0


and then multiply by a matrix that counts C and G:

cg<-rbind(C=c(2,1,1,0),G=c(0,1,1,2))
cg%*%table(DF[,1],DF[,2])


    L U
  C 2 3
  G 2 1

If the genotype is a factor then you don't have to worry about empty genotypes.
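
If the alleles differ from SNP to SNP, the counting matrix can probably be built from the row names of the table rather than typed by hand. This is only a sketch; the names tab, genotypes, alleles and result are introduced here for illustration:

```r
DF <- data.frame(X = c("CC", "CC", NA, "CG", "GG", "GC"),
                 Y = c("L", "U", "L", "U", "L", NA))

tab       <- table(DF$X, DF$Y)
genotypes <- rownames(tab)                      # "CC" "CG" "GC" "GG"
letters_by_genotype <- strsplit(genotypes, "")
alleles   <- sort(unique(unlist(letters_by_genotype)))

# One column per genotype, one row per allele: copies of that allele
cg <- sapply(letters_by_genotype,
             function(l) as.vector(table(factor(l, levels = alleles))))
dimnames(cg) <- list(alleles, genotypes)

result <- cg %*% tab
result
```

For this DF the generated cg is the same as the hand-written matrix above, so the product gives the same C and G counts.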

Also, do you actually get the heterozygotes coded both ways?  When I have had 
to do this it has been simplified by having the heterozygotes all coded the 
same way (i.e., only one of CG and GC appears), so that as.numeric() on the 
factor variable, minus 1, gives the number of copies of the alphabetically 
later allele.
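
A sketch of that recoding, assuming two-letter genotypes. The helper names canon, geno and copies_G are illustrative. Note that as.numeric() on a factor returns the level codes 1, 2, 3, so the sketch subtracts 1 to get 0, 1 or 2 copies:

```r
x <- c("CC", "CC", NA, "CG", "GG", "GC")

# Sort the letters within each genotype so that "GC" becomes "CG"
canon <- vapply(strsplit(x, ""),
                function(l) paste(sort(l), collapse = ""),
                character(1))
canon[is.na(x)] <- NA            # keep missing genotypes missing

geno <- factor(canon, levels = c("CC", "CG", "GG"))

# Level codes are 1, 2, 3; subtracting 1 gives the number of G alleles
copies_G <- as.numeric(geno) - 1
copies_G
```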

         -thomas

Hopefully this is a little bit clearer what I'm trying to accomplish.

Brian

-----Original Message-----
From: Phil Spector [mailto:spec...@stat.berkeley.edu]
Sent: Friday, September 10, 2010 2:52 PM
To: Davis, Brian
Subject: Re: [R] Counting occurances of a letter by a factor

Brian -
   Here's the only thing I can come up with to give the
same result as your "ans", but it doesn't seem to correspond
with your description of the problem.

DF1 = DF
DF1$X = sapply(strsplit(as.character(DF$X),''),'[',1)
DF2 = DF
DF2$X = sapply(strsplit(as.character(DF$X),''),'[',2)
newDF = rbind(DF1,DF2)
table(newDF$Y,newDF$X)


    C G
  L 2 2
  U 3 1
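
The same idea can be sketched without hard-coding positions 1 and 2, by splitting every genotype into its letters and repeating Y once per letter. The names parts, long and res are illustrative, and lengths() assumes R 3.2.0 or later:

```r
DF <- data.frame(X = c("CC", "CC", NA, "CG", "GG", "GC"),
                 Y = c("L", "U", "L", "U", "L", NA))

# Split each genotype into letters; an NA genotype stays a single NA
parts <- strsplit(as.character(DF$X), "")

# Long format: one row per letter, with Y repeated to match
long <- data.frame(allele = unlist(parts),
                   Y      = rep(DF$Y, lengths(parts)))

# table() drops NA in either column by default
res <- table(long$Y, long$allele)
res
```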

                                        - Phil Spector
                                         Statistical Computing Facility
                                         Department of Statistics
                                         UC Berkeley
                                         spec...@stat.berkeley.edu



On Fri, 10 Sep 2010, Davis, Brian wrote:

I'm trying to find a more elegant way of doing this.  What I'm trying to 
accomplish is to count the frequency of letters (major / minor alleles)  in  a 
string grouped by the factor levels in another column of my data frame.

Ex.

DF<-data.frame(c("CC", "CC", NA, "CG", "GG", "GC"), c("L", "U", "L", "U", "L", 
NA))
colnames(DF)<-c("X", "Y")
DF

    X    Y
1   CC    L
2   CC    U
3 <NA>    L
4   CG    U
5   GG    L
6   GC <NA>

I have an ugly solution, which works if you know the factor levels of Y in 
advance.

ans<-rbind(table(unlist(strsplit(as.character(DF[DF[ ,'Y'] == 'L', 1]), ""))),
           table(unlist(strsplit(as.character(DF[DF[ ,'Y'] == 'U', 1]), ""))))

rownames(ans)<-c("L", "U")
ans

  C G
L 2 2
U 3 1


I've played with table, xtabs, tabulate, aggregate, tapply, etc. but haven't 
found a combination that gives a more general solution to this problem.

Any ideas?

Brian
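
A sketch of how the rbind() approach above might be generalised so the levels of Y need not be known in advance, using split(). The names x, alleles and count_letters are illustrative:

```r
DF <- data.frame(X = c("CC", "CC", NA, "CG", "GG", "GC"),
                 Y = c("L", "U", "L", "U", "L", NA))

x <- as.character(DF$X)

# All letters that occur anywhere, so every group gets the same columns
alleles <- sort(unique(unlist(strsplit(x[!is.na(x)], ""))))

# Count letters in one group; table() ignores the NA genotypes
count_letters <- function(g) {
  table(factor(unlist(strsplit(g, "")), levels = alleles))
}

# split() makes one group per level of Y and drops rows where Y is NA
ans <- do.call(rbind, lapply(split(x, DF$Y), count_letters))
ans
```

Fixing the factor levels inside count_letters() keeps a column for an allele even in a group where it never occurs.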

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Thomas Lumley
Professor of Biostatistics
University of Washington, Seattle

