Hello Allan,

Thanks for the response; it gives me hope. I appreciate [3] and might even go with that. For posterity, here's the code (assuming pastebin never expires):
[1] Test string: http://pastebin.com/FyAFzmTv
[2] Pattern (modified as per your suggestion): http://pastebin.com/s7VT0r5K

pattern <- readLines(url("http://pastebin.com/raw.php?i=s7VT0r5K"), warn=FALSE)
test <- readLines(url("http://pastebin.com/raw.php?i=rbAvR2dK"), warn=FALSE)
regexpr(pattern, test, perl=TRUE)

Thanks
Saptarshi

On Thu, Mar 17, 2011 at 12:12 AM, Allan Engelhardt <all...@cybaea.com> wrote:

> Some comments:
>
> 1. [^\s] matches everything up to a literal 's', unless perl=TRUE.
> 2. The (.*) is greedy, so you'll need (.*?)"\s"(.*?)"\s"(.*?)"$ or similar
>    at the end of the expression
>
> With those changes (and removing a space inserted by the newsgroup posting)
> the expression works for me.
>
> > (pat <- readLines("/tmp/b.txt")[1])
> [1]
> "^(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})\\s([^\\s]*)\\s([^\\s]*)\\s\\[([^\\]]+)\\]\\s\"([A-Z]*)\\s([^\\s]*)\\s([^\\s]*)\"\\s([^\\s]+)\\s(\\d+)\\s\"(.*?)\"\\s\"(.*?)\"\\s\"(.*?)\"$"
> > regexpr(pat, test, perl=TRUE)
> [1] 1
> attr(,"match.length")
> [1] 436
>
> 3. Consider a different approach, e.g. scan(textConnection(test),
>    what=character(0))
>
> Hope this helps
>
> Allan
>
> On 16/03/11 22:18, Saptarshi Guha wrote:
>
>> Hello R users,
>>
>> I have this regex see [1] for apache log lines. I tried using R to parse
>> some data (only because I wanted to stay in R).
>> A sample line is [2]
>>
>> (a) I saved the line in [1] into "~/tmp/a.txt" and [2] into "/tmp/a.txt"
>>
>> pat<- readLines("~/tmp/a.txt")
>> test<- readLines("/tmp/a.txt")
>> test
>> grep(pat,test)
>>
>> returns integer(0)
>>
>> The same query works in python via re.match(....) (i.e does return groups)
>>
>> Using readLines, the regex is escaped for me. Does Python and R use
>> different regex styles?
>>
>> Cheers
>> Saptarshi
>>
>> [1]
>>
>> ^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s([^\s]*)\s([^\s]*)\s\[([^\]]+)\]\s"([A-Z]*)\s([^\s]*)\s([^\s]*)"\s([^\s]+)\s(\d+)\s"(.*)"\s"(.*)"\s"(.*)"$
>>
>> [2]
>> 220.213.119.925 addons.mozilla.org - [10/Jan/2001:01:55:07 -0800] "GET
>> /blocklist/3/%8ce33983c0-fd0e-11dc-12aa-0800200c9a66%7D/4.0b5/Fennec/20110217140304/Android_arm-eabi-gcc3/chrome:%2F%2Fglobal%2Flocale%2Fintl.properties/beta/Linux%
>> 202.6.32.9/default/default/6/6/1/ HTTP/1.1" 200 3243 "-" "Mozilla/5.0
>> (Android; Linux armv7l; rv:2.0b12pre) Gecko/20110217 Firefox/4.0b12pre
>> Fennec/4.0b5" "BLOCKLIST_v3=110.163.217.169.1299218425.9706"
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
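
In case the pastebin links do expire, here is a minimal, self-contained sketch of the two approaches discussed in this thread: the pattern with non-greedy (.*?) groups run through regexpr(..., perl=TRUE), and the scan(textConnection(...)) alternative from point 3. The log line below is a shortened, made-up example in the same layout as [2], not the original data, and the variable names fields and tokens are only illustrative. The capture groups are read from the "capture.start" and "capture.length" attributes that regexpr returns when perl=TRUE.

```r
# Hypothetical, shortened log line in the same layout as [2] (not the original data).
test <- paste0(
  '10.0.0.1 addons.mozilla.org - [10/Jan/2001:01:55:07 -0800] ',
  '"GET /blocklist/3/ HTTP/1.1" 200 3243 "-" ',
  '"Mozilla/5.0 (Android; Linux armv7l)" "BLOCKLIST_v3=abc"'
)

# Pattern from the thread with non-greedy (.*?) groups.
# Backslashes are doubled because this is an R string literal,
# not text read from a file with readLines().
pat <- paste0(
  '^(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})\\s([^\\s]*)\\s([^\\s]*)',
  '\\s\\[([^\\]]+)\\]\\s"([A-Z]*)\\s([^\\s]*)\\s([^\\s]*)"',
  '\\s([^\\s]+)\\s(\\d+)\\s"(.*?)"\\s"(.*?)"\\s"(.*?)"$'
)

# perl=TRUE is needed so that \s inside [...] means "whitespace".
m <- regexpr(pat, test, perl = TRUE)

# With perl=TRUE, regexpr() also reports where each capture group starts
# and how long it is; use those to cut the groups out of the line.
starts <- as.vector(attr(m, "capture.start"))
lens   <- as.vector(attr(m, "capture.length"))
fields <- substring(test, starts, starts + lens - 1)

# Alternative (point 3): let scan() split on whitespace while keeping
# double-quoted fields together. Note the bracketed timestamp contains
# a space and therefore comes back as two tokens.
con <- textConnection(test)
tokens <- scan(con, what = character(0), quiet = TRUE)
close(con)
```

Here fields holds the twelve capture groups in order (IP, host, user, timestamp, method, path, protocol, status, size, and the three quoted fields), while tokens holds the whitespace-separated pieces with the quoted request kept as one element. The scan route avoids the regex entirely but leaves the timestamp to be reassembled by hand.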