Thufir <[EMAIL PROTECTED]> posted [EMAIL PROTECTED], excerpted below, on Sat, 12 May 2007 07:57:13 +0000:
> for score files, some e-mail addresses have underscores or other
> characters. when must the escape be used? Just for dots?

That's an interesting question, as the answer is somewhat complicated.

FWIW, unescaped dots /will/ match, but they match any (single) character. Thus, a regex of gmail.com would indeed match a literal gmail.com, but would also match gmailqcom, [EMAIL PROTECTED], etc. It would NOT match gmail..com or gmailxxcom, because that's TWO characters between gmail and com, and a dot matches only ONE character.

Generally, you'll need to watch it for anything that's not a-z or 0-9. Many but not all symbols and punctuation have special meanings in regex. (The underscore you asked about is not one of them; it matches literally with no escape.)

Some special chars to be alert for and what they mean (note that not all these are valid in all headers, but that won't affect whether they match in the scorefile or not):

* matches any number (zero or more) of the preceding, so .* matches anything (literally, zero or more of any character).

? matches zero or one of the preceding, so is useful for matching something that may or may not be there. .? therefore matches a single character that may or may not be there. boots? would match boot or boots (s may or may not be there) but not bootsss (except that the expression as shown isn't anchored, so any junk including s's on either side would match).

\ is the escape character, so to match a literal \, use \\.

+ matches one or more of the preceding, so ..* is exactly the same as .+ , both meaning one or more of any character.

^ anchors at the left, $ at the right, where "anchor" means there's nothing outside the specified match. Thus, using the above boots? example without anchors, bootsss would still match, as would bootadsfasdfe and simply boot. To make it match /only/ boot or boots, you'd use ^boots?$ . Anything additional on the line would fail the match.

(However, in our case we are talking about header lines, with the "match" only being on the value of the header.
Header lines by definition have a header name, such as from, followed by a colon, followed by a space, followed by the value. Therefore, what is actually being matched is anything after the header name, colon, and space. Also note that header lines may be "folded" if they are too long. IDR the full folding spec from the RFC, but you may look it up if interested. Meanwhile, just keep in mind that the header may extend over multiple lines. Most often, this will occur with headers such as the path header or the references header, which get appended to, in the first case as the post propagates from server to server, in the second as replies get nested in the thread. Long propagation paths or deep thread reply nesting commonly cause header folding.)

[] indicates a character class. Any of the enumerated characters will match. A range may be indicated with a dash (which can be matched literally by placing it first, after a ^ if any), and a ^ as the first character negates. As with a -, a ] must be placed first (or escaped), as otherwise it would indicate the end of the character class. Thus, [a-zA-Z0-9] indicates all ASCII letters plus digits. (Note that normally, regex are case sensitive, so [a-z] and [A-Z] are different. However, pan is normally case insensitive, so it won't matter to it. You can force case sensitivity by using keyword= instead of keyword:.) [bcf]at would match bat, cat, and fat, but not mat. Also see the POSIX character classes, below.

You can specify a limited range of repeats (as opposed to + and *, which are unlimited) by using {n,m}, where n and m are the minimum and maximum number of repeats. Leaving out the maximum makes it unlimited at that end. Thus, ba+d matches bad, baaaaaaaad, baaaaaaaaaaaaaaaaaaaaad, etc, but not bd (ba*d would also match bd, zero or more a's). ba{1,3}d matches bad, baad, and baaad, but not bd or baaaad. ba{0,3}d would be the same as ba?a?a?d and match bd and up to three a's. (Some implementations also accept ba{,3}d for that, but PCRE has traditionally treated {,3} as literal text, so spell out the 0.) ba{1,}d would be the same as ba+d.
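
The rules covered so far are easy to check in any regex engine. Here is a minimal sketch using Python's re module as a stand-in. pan itself uses PCRE, so take this as an illustration of the general rules, not of pan's exact behavior. The helper name matches is made up for the example.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if pattern is found anywhere in text (unanchored search)."""
    return re.search(pattern, text) is not None

# An unescaped dot matches any single character; an escaped dot is literal.
assert matches(r"gmail.com", "gmailqcom")
assert not matches(r"gmail.com", "gmailxxcom")   # two characters, one dot
assert not matches(r"gmail\.com", "gmailqcom")
assert matches(r"gmail\.com", "gmail.com")

# Unanchored, boots? is found inside longer text; anchored, it is not.
assert matches(r"boots?", "bootsss")
assert not matches(r"^boots?$", "bootsss")
assert matches(r"^boots?$", "boot")

# Character classes and bounded repeats.
assert matches(r"^[bcf]at$", "cat")
assert not matches(r"^[bcf]at$", "mat")
assert matches(r"^ba{1,3}d$", "baaad")
assert not matches(r"^ba{1,3}d$", "baaaad")
assert not matches(r"^ba+d$", "bd")
assert matches(r"^ba*d$", "bd")
```

If every assert passes silently, the patterns behave as described. The anchors are added in the later examples only so that the "does not match" cases are meaningful.
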
ba{2,}d would require two or more a's, and so on.

**IMPORTANT** I've not specifically tested pan, but some regex implementations treat unescaped {} as literals and escaped {} as range indicators, while others treat escaped {} as literals and unescaped {} as range indicators. IF USING {} THEREFORE, TEST YOUR SCORES BEFORE RELYING ON THEM!!

Parentheses, (), indicate grouping. (They also save the included match for further use, say in substitution, but pan's scoring doesn't need or do substitution so you can safely ignore that for now.) | indicates alternatives. Thus, (dog)|(cat) will match the three letters dog, OR match the three letters cat. It will NOT match dat or cog, and it will not match dogcat as a whole (though, unanchored, it would still find the dog within dogcat). Note that it's occasionally useful to match a sequence which may or may not be there, as in ((dog)|): the dog may be there or not.

Escaped letters often have other meanings, depending on the regex implementation. pan uses pcre, perl compatible regular expressions, an extremely rich matching language one could (and many have) literally write /chapters/ on. I'll just mention a couple such escaped letter matches that you may find useful: \s matches any whitespace character (space, tab, newline and so on), and \t matches a tab. The capitalized form \S matches any NON-whitespace character; there is no corresponding \T for "not tab", so use [^\t] for that. A general idea you can look up for more if desired is word borders and word characters, with \b, \B, \w, and \W.

Finally, the original regex matching language, as so many things computer, was designed with ASCII in mind. All those "funny" international symbols, with `, ^, etc in combination with letters, can make things "interesting". Also, at least one western charset has Z as a letter somewhere in the middle of its alpha chars, so A-Z won't have the intended effect there! To address these and other "interesting" situations without making things /too/ complicated for those not using those character sets, POSIX character classes were added.
There are also collating element equivalency classes, which work somewhat similarly but which I'm not going to cover here.

Back to POSIX character class matching. Within a [] character class, one can further include POSIX character classes, denoted with [:classname:]. (It's important to note that these are recognized within [] only, so to use them alone, you use [[:classname:]].) Standard POSIX character classes include: alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, and xdigit. Thus, instead of [a-zA-Z0-9], which may not include "exotic" alphabetic characters, [[:alnum:]] may be used. [[:space:]] includes vertical and horizontal space, so spaces, tabs, line and form feeds, and carriage returns. [[:print:]] is all printable characters, [[:cntrl:]] is the reverse, control characters. Lower and upper won't really matter in our context since pan is case insensitive by default. graph is all graphical characters; it is the same as print except that print includes the space character while graph does not.

In addition to what has already been covered, there are all sorts of additional matchings: positive and negative lookahead and lookbehind (suppose you use .*ad but don't want it to match covad, a DSL provider, for instance; a negative lookbehind may be just the ticket), even ways of executing external programs and returning the results for the match (I doubt pan implements that but honestly haven't tried), all /sorts/ of fancy and exotic stuff. As I mentioned above, literally chapters, if not entire books, could be (and have been) written on the subject of regex.

However, that should be good for an introduction. Basically, anytime you use something outside of an alphanumeric literal match, consider the possibility that it may need escaping, and test before relying on it.
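
Testing can be as simple as running a candidate pattern against a handful of sample header values. Below is a minimal sketch in Python. Python's re module is not pcre, so it only approximates what pan will do. The sample values and the helper name hits are invented for the illustration, and re.escape is a Python convenience, not something pan offers.

```python
import re

# Hypothetical header values, made up for this sketch.
samples = [
    "first_last@example.com",
    "first_last@exampleXcom",
    "someone@covad.net",
    "great ad here",
]

def hits(pattern, values):
    """Return the values in which pattern is found, ignoring case."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [v for v in values if rx.search(v)]

# Unescaped, the dot also matches the X in the second sample.
print(hits(r"first_last@example.com", samples))

# re.escape backslash-escapes whatever is special, here only the dot.
literal = re.escape("first_last@example.com")
print(hits(literal, samples))

# Negative lookbehind: find "ad" unless "cov" comes immediately before it.
print(hits(r"(?<!cov)ad", samples))
```

The first call lists both first_last samples, the second only the real address, and the third only "great ad here", since the ad in covad is rejected by the lookbehind.
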
Do that and keep in mind the basics, .*+?()[]{}\ plus the ^, $ and | covered above, that do need escaping, and you'll be covered well over 90% of the time, certainly within pan's limited usage, for scoring headers. The rest is nice to know for those special cases, but not generally necessary, and can be looked up (save this post, or google on "regular expressions", or even "perl compatible regular expressions") if necessary.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

_______________________________________________
Pan-users mailing list
Pan-users@nongnu.org
http://lists.nongnu.org/mailman/listinfo/pan-users