Ed Goforth <[EMAIL PROTECTED]> posted [EMAIL PROTECTED], excerpted below, on Sun, 16 Jul 2006 02:35:41 -0400:
> And a follow-up question on the "Articles -> Create Score" interface.
> It always shows the option for "contains" grayed out for the Author
> field. If I write a rule by hand like
>
> %BOS
> [*]
> Score: =-9999
> From: aol.com
> %EOS
>
> will Pan's scorefile parser barf? Is this a legal rule?

It's not a legal rule, but not for the reason you think.

All the expressions are regular expressions, or more precisely a particular variant called Perl-compatible regular expressions (thus PAN's dependency on pcre). If you select something else when setting up a score in the UI, PAN actually converts it into a regular expression before putting it in the score file.

It appears you posted using Mozilla Thunderbird for Linux, so maybe you know all about regular expressions. However, if you did, I'd assume you would have recognized them and seen the problems with the above. So, assuming you aren't familiar with regular expressions, I suggest you do some googling, as they are really quite powerful. There's a whole lot more to it than I'll cover here, but this will get you started.

In general, letters and numbers (normally) stand for themselves, but many kinds of punctuation have alternate meanings. In particular, wildcards work rather differently than the shell patterns you used above and are likely quite familiar with. A single dot (.) stands for "any character". A question mark (?) means zero or one of, and a star (*) means any number of (zero or more of).

Thus, your group expression is invalid. The star says zero or more of... but there's nothing for there to be zero or more /of/! What you /really/ want is [.*], meaning zero or more of any character. IOW, .* is the same thing in regex as * is in shellex.
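To see the star problem for yourself, here's a minimal sketch in Python. Note the assumptions: Python's re module stands in for pcre (it is not what PAN uses, though the two agree on these basics), fnmatch stands in for shell-style patterns, and the group name is made up for illustration.

```python
import fnmatch
import re

# A bare star has nothing to repeat, so the regex compiler rejects it.
try:
    re.compile("*")
    bare_star_ok = True
except re.error:
    bare_star_ok = False

group = "alt.test"  # hypothetical group name, for illustration only

# ".*" is the regex counterpart of the shell's "*": zero or more of any character.
regex_matches = re.fullmatch(".*", group) is not None
shell_matches = fnmatch.fnmatchcase(group, "*")

print(bare_star_ok)   # False
print(regex_matches)  # True
print(shell_matches)  # True
```

The same star is fine as a shell pattern and an error as a regex, which is why a rule written shell-style can trip up a regex parser.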
Similarly, .? means any character that may or may not be there (zero or one of), and .+ means one or more of any character (+ must have at least one, * can match zero).

Likewise, your aol.com will work, but it will match stuff you probably didn't intend as well, because the dot matches not only a dot, but any character. So it would match aolecom, aolzcom, etc, but NOT aoleecom or aol..com (the dot matches a single character only), and NOT aolcom (there MUST be a single character there to match).

A backslash (\) char is the escape character. Thus, to match a literal dot, not "any character", you'd write it this way: aol\.com

Other "control" characters include the parentheses (()) for grouping, and the pipe (|) for alternatives. Thus "(aol|hotmail)\.com" would match either aol.com or hotmail.com.

Braces ({}) allow you to specify a frequency. "b(an){2}a" would therefore match "banana" (the "an" is repeated once, therefore occurring exactly twice). You can also use them to specify a range. "b(an){2,3}a" would therefore match "banana" and "bananana", and "b(an){2,}a" would match two or more "an"s, so "bananananananana" but not "bana" (only one "an", the regex specified two or more).

There are also brackets ([]), altho I'm not sure how they'd be used here, given that brackets are used to denote the group expression. Brackets signify an enumeration, "one of the following", so "b[aeiou]n" would match ban, ben, bin, bon, bun, but not bbn or byn. One can also include a range using the - in the enumeration, so [a-z0-9] matches the ASCII digits and lowercase letters (all ASCII alphanumerics if the matching is case insensitive). (Note that in some locales, "z" isn't actually the last letter of the alphabet, so if you are in one of those locales, things will change accordingly. There are what are called "POSIX character classes" to deal with that, but we'll skip that here. Look it up if desired.) If you want to include a /literal/ "-", make it the first character of the enumeration, so [-09] is those three characters specifically, while [0-9] is the range.
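The examples above can be checked with a short sketch. It uses Python's re module as a stand-in for pcre (an assumption; PAN itself isn't involved), and the helper names found_in and whole are made up for this illustration.

```python
import re

def found_in(pattern, candidates):
    """Candidates that contain a match anywhere (the regex is unanchored)."""
    rx = re.compile(pattern)
    return [c for c in candidates if rx.search(c)]

def whole(pattern, candidates):
    """Candidates that the regex matches from start to end."""
    rx = re.compile(pattern)
    return [c for c in candidates if rx.fullmatch(c)]

# Unescaped dot vs. escaped dot vs. alternation.
hosts = ["aol.com", "aolecom", "aoleecom", "aol..com", "aolcom", "hotmail.com"]
print(found_in(r"aol.com", hosts))             # ['aol.com', 'aolecom']
print(found_in(r"aol\.com", hosts))            # ['aol.com']
print(found_in(r"(aol|hotmail)\.com", hosts))  # ['aol.com', 'hotmail.com']

# Frequencies: exact count, a range, and an open-ended minimum.
words = ["bana", "banana", "bananana", "banananana"]
print(whole(r"b(an){2}a", words))    # ['banana']
print(whole(r"b(an){2,3}a", words))  # ['banana', 'bananana']
print(whole(r"b(an){2,}a", words))   # ['banana', 'bananana', 'banananana']

# Enumerations, and a literal "-" versus a range.
print(whole(r"b[aeiou]n", ["ban", "ben", "bbn", "byn"]))  # ['ban', 'ben']
print(whole(r"[-09]", ["-", "0", "5", "9"]))              # ['-', '0', '9']
print(whole(r"[0-9]", ["-", "0", "5", "9"]))              # ['0', '5', '9']
```

The frequency checks match the whole word on purpose. Unanchored, "b(an){2}a" would also be found inside "bananana", since "banana" is a substring of it.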
If you want to match anything /except/ a particular enumeration, use ^ at the beginning, so [^a-z] would be anything /but/ a lowercase letter (in ASCII, anyway).

Outside of a bracket enumeration ([...]), ^ and $ are anchors, meaning the beginning and ending of a line. Thus, ^$ indicates an empty line, ^aol\.com$ would match /exactly/ "aol.com" on a line by itself, etc. If a regex isn't anchored, it can match anywhere in the line, so ".*aol\.com.*" is exactly the same as "aol\.com", which is exactly the same as "^.*aol\.com.*", because all three match "aol.com" anywhere on the line. Note that anchors don't "eat up" characters. They specify a position rather than matching any specific character.

Normally, regular expressions are case sensitive, so you might see [Aa][Oo][Ll] if it can be AOL or aol or AoL or AOl or... . However, I believe PAN's matching is always case insensitive. (I'm not sure on that tho, so check it.)

In some cases the backslash turns /on/ the special meaning. Thus, while "s" matches the letter, "\s" matches whitespace (space, tab, newline, carriage return, form-feed), while \S matches a NON-whitespace (anything BUT a whitespace). Similarly, \w means any "word" character (alphanumeric plus _), and \W (cap) means any NON-word character. BTW, \t=tab, \n=newline, \r=carriage return, \f=form feed. Also, \d=digit (thus equivalent to [0-9]), \D=non-digit. There are also additional anchors: \b anchors at a word boundary, and \B of course means NOT a word boundary.

Then there are things like look-ahead and look-behind assertions, both positive and negative, and other "advanced" techniques I won't go into here, in part because I'm not sure how much of the advanced stuff PAN actually handles. There's always google if you are interested in learning more.

OK, here's a real-world example to show you just how complicated these things can get.
This is taken from my privoxy (a privacy enhancing ad-filtering page-rewriting web proxy based on junkbuster) filter file (the filters are regular expression based), where it matches the file format code (PDF, DOC etc). The rest of the filter then turns that part red, so it stands out, but we only covered matching, not substitution, so I'm not posting the whole filter, only the regex match portion.

    (<font\scolor\s*=\s*)\#[a-f0-9]{3,6}\s*(>\s*File\sFormat:\s*)</font>([^<>]*)<a\b

I didn't mention it above, but besides grouping, parentheses save that portion of the expression to be used again later. That's what some of the parentheses here are for.

The above matches HTML code similar to the following:

    <font color = #aaaaaa> File Format: </font> a bunch of stuff not angle brackets <a[word-boundary]

It's

    (                open grouping
    <font            literal
    \s               a single whitespace
    color            literal
    \s*              spaces
    =                literal
    \s*              spaces
    )                close grouping
    \#               literal #
    [a-f0-9]{3,6}    three to six hex digits (0-9a-f)
    \s*              spaces
    (                open grouping
    >                literal
    \s*              spaces
    File\sFormat:    literal File Format: (with a single whitespace between the words)
    \s*              spaces
    )                close grouping
    </font>          literal
    (                open grouping
    [^<>]*           wildcard grouping of everything up until a < or > (denoting an HTML tag; this is in the brackets with the ^ denoting everything /but/, with the star repeating the bracket content any number of times)
    )                close grouping
    <a               literal
    \b               word boundary

Note the wildcard everything /but/ technique there, followed by an instance of what it couldn't match in the wildcard. If I hadn't used the "anything but" technique, it could try to grab nearly the entire web page from there on, until the last web page anchor tag (<a ...>), as matches are normally "greedy", matching all they possibly can. Telling it to match anything /but/ an html tag marker stops it from going too far.

...

Quite a whirlwind tour. Probably a bit overwhelming. However, it should demonstrate a bit of the power that regular expressions allow, and if you save it for reference and go over it a few times, it might give you some ideas for just how flexible regular expressions, and therefore scoring, can be.
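For anyone who wants to poke at the filter regex above, here's a rough sketch of the "anything but" technique next to a plain greedy wildcard. It assumes Python's re module as a stand-in for the pcre that privoxy uses, and the HTML snippet is invented for the test, not taken from a real page.

```python
import re

# The part of the regex in front of the third group, as posted above.
HEAD = r"(<font\scolor\s*=\s*)\#[a-f0-9]{3,6}\s*(>\s*File\sFormat:\s*)</font>"

careful = re.compile(HEAD + r"([^<>]*)<a\b")  # "anything but < or >"
greedy = re.compile(HEAD + r"(.*)<a\b")       # plain greedy wildcard, for contrast

# Hypothetical HTML with two anchor tags after the "File Format:" label.
html = ('<font color=#6f6f6f> File Format: </font>PDF/Adobe Acrobat - '
        '<a href="doc.html">View as HTML</a> <a href="next.html">Next</a>')

careful_match = careful.search(html)
greedy_match = greedy.search(html)

print(careful_match.group(1))  # <font color=
print(careful_match.group(2))  # > File Format:  (with spaces on both sides of the label)
print(careful_match.group(3))  # PDF/Adobe Acrobat -
print(greedy_match.group(3))   # runs on through the first anchor tag to the last "<a"
```

With [^<>]* the third group stops at the first tag marker. With .* it swallows the first anchor tag whole and only stops at the last "<a" it can find, which is the greedy behavior described above.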
=8^) Of course, the above assumes, as I said, that it wasn't all review anyway. =8^)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

_______________________________________________
Pan-users mailing list
Pan-users@nongnu.org
http://lists.nongnu.org/mailman/listinfo/pan-users