Thank you, Tobi, for taking the time to comment on my issues. I will ponder the following.
2017-06-17 18:06 GMT-03:00 Tobias Boege <tabo...@gmail.com>:

> On Sat, 17 Jun 2017, Fernando Cabral wrote:
>> Still beating my head against the wall due to my lack of knowledge about
>> the PCRE methods and properties... Because of this, I have progressed not
>> only very slowly but also -- I feel -- in a very inelegant way. So perhaps
>> you guys who are more acquainted with PCRE might be able to hint me on a
>> better solution.
>>
>> I want to search a long string that can contain a sentence, a paragraph or
>> even a full text. I wanna find and isolate every word it contains. A word
>> is defined as any sequence of alphabetic characters followed by a
>> non-alphabetic character.
>
> The Mathematician in me can't resist to point this out: you hopefully wanted
> to define "word in a string" as "a *longest* sequence of alphabetic characters
> followed by a non-alphabetic character (or the end of the string)". Using your
> definition above, the words in "abc:" would be "c", "bc" and "abc", whereas
> you probably only wanted "abc" (the longest of those).

Right, the longest sequence. But I can't see why my definition is not
equivalent to yours, even though it is simpler. "A word is defined as any
sequence of alphabetic characters followed by a non-alphabetic character"
has to be the longest, no matter what. See, in "abc", "a" and "ab" are not
followed by a non-alphabetic character, so you have to keep advancing.
"abc" is followed by a non-alphabetic character, so it complies with the
definition. So I think we can do without stating that it has to be the
longest sequence. If I am wrong, I still can't see why.

>> The sample code below does work, but I don't feel it is as elegant and as
>> fast as it could and should be. Especially the way I am traversing the
>> string from the beginning to the end. It looks awkward and slow. There must
>> be a more efficient way, like working only with offsets and lengths instead
>> of copying the string again and again.
>
> You think worse of String.Mid() than it deserves, IMHO. Gambas strings
> are triples of a pointer to some data, a start index and a length, and
> the built-in string functions take care not to copy a string when it's
> not necessary. The plain Mid$() function (dealing with ASCII strings only)
> is implemented as a constant-time operation which simply takes your input
> string and adjusts the start index and length to give you the requested
> portion of the string. The string doesn't even have to be read, much less
> copied, to do this.
>
> Now, the String.Mid() function is somewhat more complicated, because
> UTF-8 strings have variable-width characters, which makes it difficult
> to map byte indices to character positions. To implement String.Mid(),
> your string has to be read, but, again, not copied.

Right. Since I am working with Portuguese, it has to be UTF-8, so I can't
avoid using String.Mid(). But my understanding is still that the string has
to be copied, because I am doing

    str = String.Mid(str, HowMany)

In this case I would guess it has to be copied because the original contents
are shrunk, and this happens again and again until nothing is left to be
scanned. I understand Gambas does not do garbage collection the way old
BASIC used to, but still, I suppose it will eventually have to recover the
unused memory.

> Extracting a part of a string is a non-destructive operation in Gambas
> and no copying takes place. (Concatenating strings, on the other hand,
> will copy.) So, there is some reading overhead (if you need UTF-8 strings),
> but it's smaller than you probably thought.

As per above, in this case I am not only extracting, but also overwriting
the contents of the variable itself.
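One way to avoid reassigning the string at all is to leave the subject untouched and advance a position into it. The following is only a minimal sketch of that idea, not something taken from gb.pcre's documentation. Assumptions: iPos is a hypothetical variable name; r.Offset is taken to be the 0-based byte offset of the match within the string passed to Exec(); and the byte-based Len() and Mid$() are used on purpose, which is only safe because every cut falls exactly at the end of a match and therefore never inside a multi-byte UTF-8 character.

```gambas
#!/usr/bin/gbs3

Use "gb.pcre"

Public Sub Main()
  Dim r As New RegExp
  Dim s As String
  Dim iPos As Integer = 1   ' hypothetical name: 1-based byte position of the unscanned part

  ' A greedy run of alphabetic characters; no second group is needed.
  r.Compile("[[:alpha:]]+", RegExp.UTF8)
  s = "abc12345def ghi jklm mno p1"

  While iPos <= Len(s)
    ' Mid$() only adjusts start index and length, so s itself is never rewritten.
    r.Exec(Mid$(s, iPos))
    If r.Offset = -1 Then Break   ' no further word
    Print " ->";; r.Text
    ' Assumption: Offset is a 0-based byte offset relative to the substring.
    iPos += r.Offset + Len(r.Text)
  Wend
End
```

Whether this is actually faster than the String.Mid() loop would have to be measured; the sketch only shows that the subject string does not need to shrink in order to be traversed.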
>> Dim Alphabetics As String = "abc...zyzABC...ZYZ"
>> Dim re As New RegExp
>> Dim matches As New String[]
>> Dim RawText As String
>> Dim i As Integer
>>
>> re.Compile("([" & Alphabetics & "]+?)([^" & Alphabetics & "]+)", RegExp.UTF8)
>> RawText = "abc12345def ghi jklm mno p1"
>>
>> Do While RawText
>>   re.Exec(RawText)
>>   If re.Offset = -1 Then Break   ' stop when no word is left
>>   matches.Add(re[1].Text)        ' the word itself
>>   ' skip the word and the non-alphabetic run that follows it
>>   RawText = String.Mid(RawText, String.Len(re.Text) + 1)
>> Loop
>>
>> For i = 0 To matches.Count - 1
>>   Print matches[i]
>> Next
>>
>> The above code correctly finds "abc, def, ghi, jklm, mno, p". But the
>> tricks I have used are cumbersome (like advancing with String.Mid() and
>> resorting to re[1].Text and re.Text).
>
> Well, I think you can't use PCRE alone to solve your problem, if you want
> to capture a variable number of words in your submatches. I did a bit of
> reading and from what I gather [1][2] capturing group numbers are assigned
> based on the verbatim regular expression, i.e. the number of submatches
> you can receive is limited by the number of "(...)" constructs in your
> expression; and the (otherwise very nifty) recursion operator (?R) does
> not give you an unlimited number of capturing groups, sadly.

What I need is to grab one word at a time. The reason I am using two
submatches, "([:Alpha:])([:^Alpha:])", is that I don't care about the
non-alphabetic part. This way I can ignore the second submatch, but it
still helps me skip to the next word (since Len(re.Text) comprises the
length of both submatches).

> Anyway, I think by changing your regular expression, you can let PCRE take
> care of the string advancement, like so:

For the time being, I will use the loop the way you proposed below. It
seems cleaner than my solution. As for performance, I'll check later which
one is faster.
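To make the two-submatch trick concrete, here is a small sketch showing what the whole match and the first submatch contain after a single Exec(). It uses only calls that already appear in this thread (Compile, Exec, Text, r[1].Text, String.Len); the POSIX classes stand in for the explicit alphabet string, which is an assumption made for brevity.

```gambas
#!/usr/bin/gbs3

Use "gb.pcre"

Public Sub Main()
  Dim r As New RegExp

  ' Group 1 is the word; group 2 is the run of non-alphabetic characters after it.
  r.Compile("([[:alpha:]]+)([[:^alpha:]]+)", RegExp.UTF8)
  r.Exec("abc12345def ghi")

  Print "whole match:";; r.Text              ' word plus separator
  Print "word only:";; r[1].Text             ' first submatch
  Print "advance by:";; String.Len(r.Text)   ' characters to skip with String.Mid()
End
```

For this subject the whole match should be "abc12345" and the first submatch "abc", so the next search starts 8 characters further on, right at "def".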
Thanks a lot

- fernando

> #!/usr/bin/gbs3
>
> Use "gb.pcre"
>
> Public Sub Main()
>   Dim r As New RegExp
>   Dim s As String
>
>   ' Group 1: the next word. Group 2: everything after the separator.
>   r.Compile("([[:alpha:]]+)[[:^alpha:]]+(.*$)", RegExp.UTF8)
>   s = "abc12345def ghi jklm mno p1"
>   Print "Subject:";; s
>   Do
>     r.Exec(s)
>     If r.Offset = -1 Then Break   ' no match left
>     Print " ->";; r[1].Text
>     s = r[2].Text                 ' continue with the rest of the subject
>   Loop While s
> End
>
> Output:
>
>   Subject: abc12345def ghi jklm mno p1
>    -> abc
>    -> def
>    -> ghi
>    -> jklm
>    -> mno
>    -> p
>
> But, I think, this is less efficient than using String.Mid(). The trailing
> group (.*$) _may_ make the PCRE library read the entire subject every time.
> And I believe gb.pcre will copy your submatch string when returning it.
> If you care deeply about this, you'll have to trace the code in gb.pcre
> and main/gbx (the interpreter) to see what copies strings and what doesn't.
>
> Regards,
> Tobi
>
> [1] http://www.regular-expressions.info/recursecapture.html
>     (Capturing Groups Inside Recursion or Subroutine Calls)
> [2] http://www.rexegg.com/regex-recursion.html
>     (Groups Contents and Numbering in Recursive Expressions)
>
> --
> "There's an old saying: Don't change anything... ever!" -- Mr. Monk
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Gambas-user mailing list
> Gambas-user@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/gambas-user

--
Fernando Cabral
Blog: http://fernandocabral.org
Twitter: http://twitter.com/fjcabral
e-mail: fernandojosecab...@gmail.com
Facebook: f...@fcabral.com.br
Telegram: +55 (37) 99988-8868
Wickr ID: fernandocabral
WhatsApp: +55 (37) 99988-8868
Skype: fernandojosecabral
Landline: +55 (37) 3521-2183
Mobile: +55 (37) 99988-8868

As long as there is a single person in the world without a home or
without food, no politician or scientist can boast of anything.