On 14-01-01 10:55 PM, Joshua Banta wrote:
Dear Listserve,

I have a data-parsing question for you. I recognize this is more in the domain of 
PERL/Python, but I don't know those languages! On the other hand, I am pretty good 
overall with R, so I'd rather get the job done within the R "ecosphere."

Here is what I want to do. Consider the following data:

string <- "ATCGCCCGTA[AGA]TAACCG"

I want to alter string so that it looks like this:

ATCGCCCGTA[A][G][A]TAACCG

In other words, I want to design a piece of code that will scan a character 
string, find bracketed groups of characters, break up each character within the 
bracket into its own individual bracketed character, and then put the group of 
individually bracketed characters back into the character string. The lengths 
of the character strings enclosed by a bracket will vary, but in every case, I 
want to do the same thing: break up each character within the bracket into its 
own individual bracketed character, and then put the group of individually 
bracketed characters back into the character string.

So, for example, another string may look like this:

string2 <- "ATTATACGCA[AAATGCCCCA]GCTA[AT]GCATTA"

I want to alter string2 so that it looks like this:

"ATTATACGCA[A][A][A][T][G][C][C][C][C][A]GCTA[A][T]GCATTA"

R is fine for that sort of operation, using regular expressions for matching and sub() or gsub() for substitution. For example, this code finds all the bracketed strings of 1 or more ATCG letters:

matches <- gregexpr("[[][ATCG]+]", string)

In the result, which looks like this for your example string,

[[1]]
[1] 11
attr(,"match.length")
[1] 5
attr(,"useBytes")
[1] TRUE


the 11 is the start of the bracketed expression and the 5 is the length of the match, brackets included. (There may be other starts and lengths if there are multiple bracketed expressions.) Use substr() to extract the matches.
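
As a rough illustration of that extraction step, the starts and lengths can be pulled out of the gregexpr() result and turned into start/stop positions. This is only a sketch: the variable names (starts, lengths, bracketed) are made up for the example, and it uses substring(), the vectorised sibling of substr(), so that several matches in one string can be extracted in one call.

```r
string <- "ATCGCCCGTA[AGA]TAACCG"
matches <- gregexpr("[[][ATCG]+]", string)

# Start positions and match lengths for the first (and only) input string
starts  <- as.integer(matches[[1]])
lengths <- attr(matches[[1]], "match.length")

# substring() recycles the string over all start/stop pairs
bracketed <- substring(string, starts, starts + lengths - 1)
bracketed
```

For the example string this should give the single match "[AGA]", brackets included.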

You need to be a little careful putting the string back together after adding the extra brackets, because `substr<-` won't replace a string with one of a different length. I use this version instead:

`mysubstr<-` <- function(x, start, stop, value)
  paste0(substr(x, 1, start-1), value, substr(x, stop+1, nchar(x)))

I'll leave the details of the substitutions to you...
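
For readers who want to see one way the pieces might fit together, here is a minimal sketch. It is not necessarily what was intended above: split_brackets() is a hypothetical helper name, and it assumes each input is a single string whose bracketed groups contain only the letters A, T, C and G. It works from the last match back to the first, so that lengthening the string never shifts the positions of matches still to be processed.

```r
# Replacement-function version of substr() that tolerates a length change
`mysubstr<-` <- function(x, start, stop, value)
  paste0(substr(x, 1, start - 1), value, substr(x, stop + 1, nchar(x)))

# Hypothetical helper: turn "[AGA]" into "[A][G][A]" everywhere in x
split_brackets <- function(x) {
  m <- gregexpr("[[][ATCG]+]", x)[[1]]
  if (m[1] == -1) return(x)              # no bracketed group found
  starts <- as.integer(m)
  stops  <- starts + attr(m, "match.length") - 1
  # Go from the last match to the first so earlier offsets stay valid
  for (i in rev(seq_along(starts))) {
    inner <- substr(x, starts[i] + 1, stops[i] - 1)   # letters without brackets
    value <- paste0("[", strsplit(inner, "")[[1]], "]", collapse = "")
    mysubstr(x, starts[i], stops[i]) <- value
  }
  x
}

split_brackets("ATCGCCCGTA[AGA]TAACCG")
split_brackets("ATTATACGCA[AAATGCCCCA]GCTA[AT]GCATTA")
```

To apply it to a whole character vector, something like sapply(x, split_brackets, USE.NAMES = FALSE) should do.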

Duncan Murdoch

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
