[R] Deleting columns where the frequency of values are too disparate

Josh B Sun, 18 Jan 2009 23:05:01 -0800

Hello R-help community,

I have another question about filtering datasets.


Please consider the following "toy" data matrix example, called "x" for 
simplicity. There are 20 different individuals ("ID"), with information about 
the alleles (A, T, G, C) at six different loci ("Locus1" - "Locus6") for each 
of these 20 individuals. At any single locus (e.g., "Locus1" or "Locus2", ... 
or "Locus6"), each individual carries one of two alleles from the set of 
A, T, C, G. For example, at Locus1 individuals have either the A or the T 
allele only; at Locus2 the individuals can have either C or G only; at Locus3 
the individuals can have either T or G only.

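ID  Locus1  Locus2  Locus3  Locus4  Locus5  Locus6
1   A       G       T       A       A       C
2   A       G       G       A       C       C
3   A       C       G       G       C       C
4   A       C       G       G       C       C
5   A       G       G       G       A       C
6   T       G       G       G       C       C
7   T       C       G       G       C       C
8   T       C       G       G       A       C
9   T       G       G       G       C       C
10  T       C       G       G       C       C
11  A       G       G       G       A       C
12  A       C       G       G       C       C
13  A       G       G       G       C       C
14  A       G       G       G       A       C
15  A       C       G       G       C       C
16  T       C       G       G       C       C
17  T       G       G       G       A       C
18  T       G       G       G       C       C
19  T       G       G       G       C       C
20  T       C       G       G       A       C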

I want to delete any columns from the dataset where the rarer of the two 
alleles has a frequency of ten percent or less. In other words, I would like to 
delete Locus3, Locus4, and Locus6 in this data matrix, because the frequency of 
the rare allele is not greater than ten percent (and conversely, the frequency 
of the common allele is not less than ninety percent). Please note that the 
frequency of the rare allele in Locus6 is equal to zero (conversely, the 
frequency of the common allele is equal to one hundred percent).

Would one of you know of a simple way to write this sort of code? (In my real 
dataset, there are 1096 loci, so this cannot be done easily "by eye.")

Thanks again in advance for any suggestions!
Josh B.
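
For illustration, here is a minimal sketch of one way the filter described 
above might be written in base R. It assumes the data are held in a data frame 
named "x" with an "ID" column followed by one character column per locus; the 
helper name rare_allele_freq and the threshold variable are made up for this 
example and are not part of the original post. A locus with only one observed 
allele (such as Locus6) is treated as having a rare-allele frequency of zero.

```r
# Toy data from the post, rebuilt column by column (assumed layout).
x <- data.frame(
  ID     = 1:20,
  Locus1 = strsplit("AAAAATTTTTAAAAATTTTT", "")[[1]],
  Locus2 = strsplit("GGCCGGCCGCGCGGCCGGGC", "")[[1]],
  Locus3 = c("T", rep("G", 19)),
  Locus4 = c(rep("A", 2), rep("G", 18)),
  Locus5 = strsplit("ACCCACCACCACCACCACCA", "")[[1]],
  Locus6 = rep("C", 20),
  stringsAsFactors = FALSE
)

# Hypothetical helper: frequency of the rarest allele in one column.
# Missing values are ignored; a locus with a single allele returns 0.
rare_allele_freq <- function(alleles) {
  counts <- table(alleles)
  if (length(counts) < 2) return(0)
  min(counts) / sum(counts)
}

threshold <- 0.10                      # drop loci at or below this frequency
loci  <- setdiff(names(x), "ID")       # every column except the ID column
freqs <- sapply(x[loci], rare_allele_freq)

# Keep ID plus the loci whose rare allele is strictly above the threshold.
x_filtered <- x[, c("ID", loci[freqs > threshold]), drop = FALSE]
names(x_filtered)
```

On the toy data this keeps ID, Locus1, Locus2 and Locus5, and drops Locus3 
(5%), Locus4 (exactly 10%) and Locus6 (0%). Because the loci are selected by 
name rather than by position, the same lines should carry over unchanged to a 
data frame with 1096 locus columns, provided the non-locus columns are listed 
in the setdiff() call.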


      

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
