> Please consider the following "toy" data matrix example, called "x"
> for simplicity. There are 20 different individuals ("ID"), with
> information about the alleles (A, T, G, C) at six different loci
> ("Locus1" - "Locus6") for each of these 20 individuals. At any
> single locus (e.g., "Locus1" or "Locus2", ... or "Locus6"), the
> individuals have either one allele (from the set of A, T, C, G) or one
> other allele (from the set of A, T, C, G). For example, at Locus1
> individuals have either the A or T allele only; at Locus2 the
> individuals can have either C or G only; at Locus3 the individuals
> can have either T or G only.
>
> IDLocus1Locus2Locus3Locus4Locus5Locus6
> 1AGTAAC
> 2AGGACC
> 3ACGGCC
> 4ACGGCC
> 5AGGGAC
> 6TGGGCC
> 7TCGGCC
> 8TCGGAC
> 9TGGGCC
> 10TCGGCC
> 11AGGGAC
> 12ACGGCC
> 13AGGGCC
> 14AGGGAC
> 15ACGGCC
> 16TCGGCC
> 17TGGGAC
> 18TGGGCC
> 19TGGGCC
> 20TCGGAC
>
> I want to delete any columns from the dataset where the rarer of the
> two alleles has a frequency of ten percent or less. In other words,
> I would like to delete Locus3, Locus4, and Locus6 in this data
> matrix, because the frequency of the rare allele is not greater than
> ten percent (and conversely, the frequency of the common allele is
> not less than ninety percent). Please note that the frequency of the
> rare allele in Locus6 is equal to zero (conversely, the frequency of
> the common allele is equal to one hundred percent).
>
> Would one of you know of a simple way to write this sort of code? (In
> my real dataset, there are 1096 loci, so this cannot be done easily
> "by eye.")

Most of the problem is just organising the data into a sensible form.

# read in data (one "ID + bases" string per line)
data <- readLines(tc <- textConnection("1AGTAAC
2AGGACC
3ACGGCC
4ACGGCC
5AGGGAC
6TGGGCC
7TCGGCC
8TCGGAC
9TGGGCC
10TCGGCC
11AGGGAC
12ACGGCC
13AGGGCC
14AGGGAC
15ACGGCC
16TCGGCC
17TGGGAC
18TGGGCC
19TGGGCC
20TCGGAC")); close(tc)

# retrieve the useful bit: strip the leading ID digits
loci <- sub("^[[:digit:]]+", "", data)

# you may also want this: the ID itself, not just the row position
ID <- as.integer(sub("[^[:digit:]].*$", "", data))

# find out how many of each base occurs at each locus
freqs <- list()
n <- length(ID)
for(i in 1:6)
{
   assign(paste("locus", i, sep=""), factor(substring(loci,i,i), levels=c("A","C","G","T")))
   freqs[[i]] <- summary(get(paste("locus", i, sep="")))
}
freqs

# flag loci where 90% or more of cases share the same base
# (>= rather than >, so that a locus at exactly 90%, such as Locus4, is flagged)
loci.to.remove <- sapply(freqs, function(x) any(x / n >= 0.9))

Regards,
Richie.

Mathematical Sciences Unit
HSL

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
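
As a follow-up to the reply above, which stops once the loci have been flagged, here is a minimal sketch of how the columns might actually be dropped without hard-coding the number of loci (useful for the 1096-locus case). It is not the code from the reply. The names rows, bases, major.share, keep and x.filtered are made up for illustration, and the sketch assumes exactly one character per locus, no missing calls, and at most two alleles per locus, as described in the question.

```r
# Toy data from the question, one "ID + six bases" string per individual
rows <- c("1AGTAAC", "2AGGACC", "3ACGGCC", "4ACGGCC", "5AGGGAC",
          "6TGGGCC", "7TCGGCC", "8TCGGAC", "9TGGGCC", "10TCGGCC",
          "11AGGGAC", "12ACGGCC", "13AGGGCC", "14AGGGAC", "15ACGGCC",
          "16TCGGCC", "17TGGGAC", "18TGGGCC", "19TGGGCC", "20TCGGAC")

# Split each row into its ID (leading digits) and its string of bases
ID    <- sub("[^[:digit:]].*$", "", rows)
bases <- sub("^[[:digit:]]+", "", rows)

# Character matrix with one row per individual and one column per locus;
# the number of loci is taken from the data rather than fixed at 6
x <- do.call(rbind, strsplit(bases, ""))
dimnames(x) <- list(ID, paste0("Locus", seq_len(ncol(x))))

# Share of the commonest base at each locus. With at most two alleles,
# the rare-allele frequency is 1 minus this share.
major.share <- apply(x, 2, function(col) max(table(col)) / length(col))

# Keep a locus only if its commonest base is below 90%, i.e. the rare
# allele is strictly above 10%; monomorphic loci (share 1) are dropped too
keep <- major.share < 0.9
x.filtered <- x[, keep, drop = FALSE]

colnames(x.filtered)
```

On the toy data this should leave Locus1, Locus2 and Locus5, matching the loci the question asks to retain. Comparing the common-allele share against 0.9 directly, rather than computing 1 minus the share and comparing against 0.1, avoids a floating-point subtraction at the boundary case of exactly 90%.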