________________________________________
From: David Winsemius [dwinsem...@comcast.net]
Sent: Sunday, July 03, 2011 7:08 PM
To: Bansal, Vikas
Cc: Dennis Murphy; r-help@r-project.org
Subject: Re: [R] For help in R coding

On Jul 3, 2011, at 1:07 PM, Bansal, Vikas wrote:

> Yes, you are right. The unlist operation is unnecessary; I tried it
> yesterday and the code works without it. But I have one more problem
> that I have worked on all day without finding a solution. As I told
> you, I am new to R. I want to ask how I can use an if condition in
> the following code:
>
> df=read.table("Case2.pileup",fill=T,sep="\t",colClasses = "character")
> txtvec <- readLines(textConnection(df[,9]))
> dad=data.frame(A = (sapply(gregexpr("A|a", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 )),
> C = (sapply(gregexpr("C|c", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 )),
> G = (sapply(gregexpr("G|g", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 )),
> T = (sapply(gregexpr("T|t", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 )),
> N = (sapply(gregexpr("\\,|\\.", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 )))
>
> Now my problem is that my data frame also has the letters A, C, G and T
> in the 3rd column. The commas (,) and dots (.) in column 9 stand for
> the letter that is in column 3. I want to use an if condition like this:
>
> if column 3 of my data frame has A then
>   A = (sapply(gregexpr("\\,|\\.", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 ))
> else
>   A = (sapply(gregexpr("A|a", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 )),
> if column 3 of my data frame has C then
>   C = (sapply(gregexpr("\\,|\\.", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 ))
> else
>   C = (sapply(gregexpr("C|c", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 )),
> if column 3 of my data frame has G then
>   G = (sapply(gregexpr("\\,|\\.", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 ))
> else
>   G = (sapply(gregexpr("G|g", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 )),
> if column 3 of my data frame has T then
>   T = (sapply(gregexpr("\\,|\\.", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 ))
> else
>   T = (sapply(gregexpr("T|t", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 ))
>
> So I want to write code that gives output like this:
>
> DATA FRAME (Input)
>
> col3   col 9
> T      .a,g,,
> A      .t,t,,
> A      .,c,c,
> C      .,a,,,
> G      .,t,t,t
> A      .c,,g,^!.
> A      .g,ggg.^!,
> A      .$,,,,,.,
> C      a,g,,t,
> T      ,,,,,.,^!.
> T      ,$,,,,.,.
>
> output
>
> A C G T
> 1 0 1 4
> 4 0 0 2
> 4 2 0 0
> 1 5 0 0
> 0 0 4 3
>
> This is the output for the first five rows.

I was unable to follow the logic, and because complete output was not
offered, I am unable to check my guesses against your full specifications.

Oh, sorry. I will explain it again. As I told you, my data frame has ten
columns, but I am working on the 3rd and 9th columns. To calculate the
number of A, C, G, T, . and , we used the following code:

> df=read.table("Case2.pileup",fill=T,sep="\t",colClasses = "character")
> txtvec <- readLines(textConnection(df[,9]))
> dad=data.frame(A = (sapply(gregexpr("A|a", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 )),
> C = (sapply(gregexpr("C|c", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 )),
> G = (sapply(gregexpr("G|g", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 )),
> T = (sapply(gregexpr("T|t", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 )),
> N = (sapply(gregexpr("\\,|\\.", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 )))

Now, in the 3rd column of my data frame I have the characters A, C, G or T,
so my 3rd and 9th columns look like this:

col3   col 9
> T      .a,g,,
> A      .t,t,,
> A      .,c,c,
> C      .,a,,,
> G      .,t,t,t
> A      .c,,g,^!.
> A      .g,ggg.^!,
> A      .$,,,,,.,
> C      a,g,,t,
> T      ,,,,,.,^!.
> T      ,$,,,,.,.

Initially we were working on the 9th column only, counting A, C, G, T,
(.) and (,) separately using the code you provided, shown above. But now
I want that if column 3 has T, the count of T should be set equal to the
number of . and , characters, as in the output I showed you:

output
>
> A C G T
> 1 0 1 4
> 4 0 0 2
> 4 2 0 0
> 1 5 0 0
> 0 0 4 3

In the first row of my input I have T in the 3rd column, so T = the total
number of . and , which is 4, and a and g are 1 each. In the second row of
my input I have A in the 3rd column, so A should equal the total number of
(.) and (,), which is 4, and the remaining are the 2 T.

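The rule described above (the column named by the reference base receives the count of "." and ",", while every other column receives its own letter count) can be sketched as follows. This is only a minimal sketch under stated assumptions, not code from the thread: `ref`, `reads`, `count_matches` and `n_ref` are hypothetical names, the data are the first five rows quoted above, and characters that follow `^` in a pileup string (mapping-quality codes) are not stripped.

```r
# Example data copied from the first five rows quoted in the thread:
# ref is column 3 (reference base), reads is column 9 (pileup read string).
ref   <- c("T", "A", "A", "C", "G")
reads <- c(".a,g,,", ".t,t,,", ".,c,c,", ".,a,,,", ".,t,t,t")

# Hypothetical helper: number of regex matches in each element of x.
# regmatches() yields character(0) when nothing matches, so lengths() gives 0.
count_matches <- function(pattern, x) {
  lengths(regmatches(x, gregexpr(pattern, x)))
}

bases <- c("A", "C", "G", "T")

# Case-insensitive letter counts, one column per base (pattern e.g. "[Aa]").
counts <- as.data.frame(lapply(
  setNames(bases, bases),
  function(b) count_matches(sprintf("[%s%s]", b, tolower(b)), reads)
))

# "." and "," both stand for the reference base; inside [] the dot is literal.
n_ref <- count_matches("[.,]", reads)

# Assumption taken from the description: in each row, the column named by
# the reference base is overwritten with n_ref (not added to the letter count).
for (b in bases) {
  is_ref <- toupper(ref) == b
  counts[[b]][is_ref] <- n_ref[is_ref]
}

print(counts)
```

With a real file, `ref` and `reads` would presumably be `df[,3]` and `df[,9]`. The loop works on whole columns at once, so no per-row if statement is needed.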
That is why I wrote this using an if condition:

> if column 3 of my data frame has A then
>   A = (sapply(gregexpr("\\,|\\.", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 ))
> else
>   A = (sapply(gregexpr("A|a", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 )),
> if column 3 of my data frame has C then
>   C = (sapply(gregexpr("\\,|\\.", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 ))
> else
>   C = (sapply(gregexpr("C|c", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 )),
> if column 3 of my data frame has G then
>   G = (sapply(gregexpr("\\,|\\.", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 ))
> else
>   G = (sapply(gregexpr("G|g", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 )),
> if column 3 of my data frame has T then
>   T = (sapply(gregexpr("\\,|\\.", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 ))
> else
>   T = (sapply(gregexpr("T|t", (df[,9])), function(x) if ( x[[1]] != -1) length(x) else 0 ))

The code is the same; I just want to add a condition so that it checks
whether the character in column 3 is A and, if so, sets the number of A
equal to the total number of . and , characters.

Should I explain better, or can you please tell me which part is not clear?

> -- David.
>
> Can you please help me use this if condition in your code, or can we
> do it with some other construct instead of an if condition?
>
> ________________________________________
> From: David Winsemius [dwinsem...@comcast.net]
> Sent: Sunday, July 03, 2011 3:57 AM
> To: Bansal, Vikas
> Cc: Dennis Murphy; r-help@r-project.org
> Subject: Re: [R] For help in R coding
>
> On Jul 2, 2011, at 4:46 PM, Bansal, Vikas wrote:
>
>> DEAR ALL,
>> I TRIED THIS CODE AND THIS IS RUNNING PERFECTLY...
>>
>> df=read.table("Case2.pileup",fill=T,sep="\t",colClasses = "character")
>> txt=df[,9]
>> txtvec <- readLines(textConnection(txt))
>> dad=data.frame(A = unlist(sapply(gregexpr("A|a", txtvec), function(x) if ( x[[1]] != -1) length(x) else 0 )),
>> C = unlist(sapply(gregexpr("C|c", txtvec), function(x) if ( x[[1]] != -1) length(x) else 0 )),
>> G = unlist(sapply(gregexpr("G|g", txtvec), function(x) if ( x[[1]] != -1) length(x) else 0 )),
>> T = unlist(sapply(gregexpr("T|t", txtvec), function(x) if ( x[[1]] != -1) length(x) else 0 )),
>> N = unlist(sapply(gregexpr("\\,|\\.", txtvec), function(x) if ( x[[1]] != -1) length(x) else 0 )))
>>
>
> The unlist operation is unnecessary since the sapply operation returns
> a vector. (It doesn't hurt, but it is unnecessary.)
>
>> Thanking you,
>> Warm Regards
>> Vikas Bansal
>> Msc Bioinformatics
>> Kings College London
>> ________________________________________
>> From: David Winsemius [dwinsem...@comcast.net]
>> Sent: Saturday, July 02, 2011 9:04 PM
>> To: Dennis Murphy
>> Cc: r-help@r-project.org; Bansal, Vikas
>> Subject: Re: [R] For help in R coding
>>
>> On reflection and a bit of testing I think the best approach would be
>> to use gregexpr. For counting the number of commas, this appears quite
>> straightforward.
>>
>> > sapply(gregexpr("\\,", txtvec), function(x) if ( x[[1]] != -1) length(x) else 0 )
>> [1] 3 3 3 4 3 3 2 6 4 6 6
>>
>> It easily generalizes to period and the `|` (or) operation on letters.
>> (I did need to add the check, since the length of the gregexpr result
>> is always at least one, but it has value -1 when there is no match.)
>>
>> > sapply(gregexpr("t|T", txtvec), function(x) if ( x[[1]] != -1) length(x) else 0 )
>> [1] 0 2 0 0 3 0 0 0 1 0 0
>>
>> On Jul 2, 2011, at 3:22 PM, Dennis Murphy wrote:
>>
>>> Hi:
>>>
>>> There seems to be a problem if the string ends in , or ., which makes
>>> it difficult for strsplit() to pick up if it is splitting on those
>>> characters. Here is an alternative, splitting on individual characters
>>> and using charmatch() instead:
>>>
>>> charsum <- function(s, char) {
>>>     u <- strsplit(s, "")
>>>     sum(sapply(u, function(x) charmatch(x, char)), na.rm = TRUE)
>>> }
>>>
>>> unname(sapply(txtvec, function(x) charsum(x, ',')))
>>> unname(sapply(txtvec, function(x) charsum(x, '.')))
>>>
>>> Putting this into a data frame,
>>>
>>> dfout <- data.frame(periods = unname(sapply(txtvec, function(x) charsum(x, '.'))),
>>>                     commas = unname(sapply(txtvec, function(x) charsum(x, ','))) )
>>> txtvec
>>>
>>> HTH,
>>> Dennis
>>>
>>> On Sat, Jul 2, 2011 at 10:19 AM, David Winsemius <dwinsem...@comcast.net> wrote:
>>>>
>>>> On Jul 2, 2011, at 12:34 PM, Bansal, Vikas wrote:
>>>>
>>>>>>> Dear all,
>>>>>>>
>>>>>>> I am doing a project on variant calling using R. I am working on a
>>>>>>> pileup file. There are 10 columns in my data frame and I want to
>>>>>>> count the number of A, C, G and T in each row for column 9. An
>>>>>>> example of column 9 is given below:
>>>>>>>
>>>>>>> .a,g,,
>>>>>>> .t,t,,
>>>>>>> .,c,c,
>>>>>>> .,a,,,
>>>>>>> .,t,t,t
>>>>>>> .c,,g,^!.
>>>>>>> .g,ggg.^!,
>>>>>>> .$,,,,,.,
>>>>>>> a,g,,t,
>>>>>>> ,,,,,.,^!.
>>>>>>> ,$,,,,.,.
>>>>>>>
>>>>>>> This is a bit confusing for me, as these characters are in one
>>>>>>> column. How can we scan them for each row to print the number of
>>>>>>> A, C, G and T for each row?
>>>>>>
>>>>>> Seems a bit clunky but this does the job (first the data):
>>>>>>
>>>>>> > txt <- " .a,g,,
>>>>>> + .t,t,,
>>>>>> + .,c,c,
>>>>>> + .,a,,,
>>>>>> + .,t,t,t
>>>>>> + .c,,g,^!.
>>>>>> + .g,ggg.^!,
>>>>>> + .$,,,,,.,
>>>>>> + a,g,,t,
>>>>>> + ,,,,,.,^!.
>>>>>> + ,$,,,,.,."
>>>>>>
>>>>>> > txtvec <- readLines(textConnection(txt))
>>>>>>
>>>>>> Now the clunky solution. Basically, it subtracts 1 from the counts
>>>>>> of "fragments" that result from splitting on each letter in turn.
>>>>>> Could be made prettier with a function that did the job.
>>>>>>
>>>>>> > data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit, split="a"), length) , "-", 1)),
>>>>>> + C = unlist(lapply( lapply( sapply(txtvec, strsplit, split="c"), length) , "-", 1)),
>>>>>> + G = unlist(lapply( lapply( sapply(txtvec, strsplit, split="g"), length) , "-", 1)),
>>>>>> + T = unlist(lapply( lapply( sapply(txtvec, strsplit, split="t"), length) , "-", 1)) )
>>>>>>             A C G T
>>>>>>  .a,g,,     1 0 1 0
>>>>>> .t,t,,      0 0 0 2
>>>>>> .,c,c,      0 2 0 0
>>>>>> .,a,,,      1 0 0 0
>>>>>> .,t,t,t     0 0 0 2
>>>>>> .c,,g,^!.   0 1 1 0
>>>>>> .g,ggg.^!,  0 0 4 0
>>>>>> .$,,,,,.,   0 0 0 0
>>>>>> a,g,,t,     1 0 1 1
>>>>>> ,,,,,.,^!.  0 0 0 0
>>>>>> ,$,,,,.,.   0 0 0 0
>>>>>>
>>>>>> Has the advantage that the input data ends up as rownames, which
>>>>>> was a surprise.
>>>>>>
>>>>>> If you wanted to count "A" and "a" as equivalent, then the split
>>>>>> argument should be "a|A"
>>>>>>
>>>>>
>>>>>>> AS YOU MENTIONED THAT IF I WANT TO COUNT A AND a I SHOULD SPLIT
>>>>>>> LIKE THIS.
>>>>>
>>>>> BUT CAN I COUNT . AND , ALSO USING-
>>>>> data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit, split=".|,"), length) , "-", 1)),
>>>>>
>>>>> I TRIED IT BUT ITS NOT WORKING. IT IS GIVING THE OUTPUT BUT AT SOME
>>>>> PLACES IT IS SHOWING MORE NUMBER OF . AND , AND SOMEWHERE IT IS NOT
>>>>> EVEN CALCULATING AND JUST SHOWING 0.
>>>>
>>>> You need to use valid regex expressions for 'split'. Since "." and
>>>> "," are special characters they need to be escaped when you want the
>>>> literals to be recognized as such.
>>>>
>>>> I haven't figured out why but you need to drop the final operation
>>>> of subtracting 1 from the values when counting commas:
>>>>
>>>> data.frame(periods = unlist(lapply( lapply( sapply(txtvec, strsplit, split="\\."), length) , "-", 1))
>>>>            ,commas = unlist( lapply( sapply(txtvec, strsplit, split="\\,"), length) ) )
>>>>             periods commas
>>>> .a,g,,            1      3
>>>> .t,t,,            1      3
>>>> .,c,c,            1      3
>>>> .,a,,,            1      4
>>>> .,t,t,t           1      4
>>>> .c,,g,^!.         1      4
>>>> .g,ggg.^!,        2      2
>>>> .$,,,,,.,         2      6
>>>> a,g,,t,           0      4
>>>> ,,,,,.,^!.        1      7
>>>> ,$,,,,.,.         1      7
>>>>
>>>> --
>>>>
>>>> David Winsemius, MD
>>>> West Hartford, CT
>>>>
>>>> ______________________________________________
>>>> R-help@r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT
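
A likely explanation for the "subtract 1" puzzle in the quoted message above, offered as a sketch rather than a diagnosis of the original session: the R documentation for strsplit() notes that a match at the end of a string produces no trailing empty piece, so the number of fragments equals the number of separators for strings that end in the separator, and is one more than that otherwise. Counting the matches directly avoids the issue. The strings in `x` are made up for illustration, and `count_char` is a hypothetical helper name.

```r
# Made-up strings: one without and one with a trailing comma.
x <- c("a,b", "a,b,")

# Both split into the same two pieces ("a" and "b"), because strsplit()
# does not return an empty string after a trailing separator.
n_pieces <- lengths(strsplit(x, ",", fixed = TRUE))

# Counting the matches themselves does not depend on where they fall.
# gregexpr() returns -1 for "no match", so only positive positions count.
# fixed = TRUE treats the character literally, so "." needs no escaping.
count_char <- function(char, x) {
  vapply(gregexpr(char, x, fixed = TRUE), function(m) sum(m > 0L), integer(1))
}

n_commas <- count_char(",", x)
n_dots   <- count_char(".", c(".a,g,,", "a,g,,t,"))

print(n_pieces)  # 2 2
print(n_commas)  # 1 2
print(n_dots)    # 1 0
```

This is consistent with the gregexpr-based counts shown earlier in the thread, which report 3 commas for `.,t,t,t` where the strsplit table reports 4.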