[Pan-users] Re: scorefile in 0.14.2.91

Duncan Sun, 16 Jul 2006 05:26:52 -0700

Ed Goforth <[EMAIL PROTECTED]> posted
[EMAIL PROTECTED], excerpted below, on  Sun, 16 Jul 2006
02:35:41 -0400:


> And a follow-up question on the "Articles -> Create Score" interface.
> It always shows the option for "contains" grayed out for the Author
> field.  If I write a rule by hand like
> 
> %BOS
> [*]
> Score: =-9999
>   From: aol.com
> %EOS
> 
> will Pan's scorefile parser barf?  Is this a legal rule?

It's not a legal rule, but not for the reason you think.  All the
expressions are regular expressions, specifically the variant called
perl compatible regular expressions (thus PAN's dependency on pcre).  If
you select something else when setting up a score in the UI, PAN converts
it into a regular expression before putting it in the score file.

It appears you posted using Mozilla Thunderbird for Linux, so maybe you
know all about regular expressions.  However, if you do, I would have
assumed you would have recognized them and seen the problems with the
above.  Thus assuming you aren't familiar with regular expressions, I
suggest you do some googling, as they are really quite powerful.  There's a
whole lot more to it than I'll cover here but this will get you started.

In general, letters and numbers (normally) stand for themselves, but many
kinds of punctuation have alternate meanings.

In particular, wildcards work rather differently than the shell patterns
you used above and are likely quite familiar with.  A single dot (.)
stands for "any character".  A question mark (?) means zero or one of, a
star (*) means any number of (zero or more of).

Thus, your group expression is invalid.  The star says zero or more of...
but there's nothing for there to be zero or more /of/!  What you /really/
want is [.*], meaning zero or more of any character.  IOW, .* is the same
thing in a regex as * is in a shell pattern.  Similarly with .?, meaning
any character, may or may not be there (zero or one of), and .+, one or
more of (+ must have at least one, * can match zero).
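
If you want to try the quantifiers outside PAN, here's a minimal sketch in
Python.  Python's re module is a different engine from pcre, so take it as
an illustration of the syntax only, not of PAN's parser.  The helper names
(matches_fully, compiles) are mine, made up for the example.

```python
import re

def matches_fully(pattern: str, text: str) -> bool:
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

def compiles(pattern: str) -> bool:
    """True if the regex engine accepts pattern at all."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles("*"))                    # False: nothing for the star to repeat
print(compiles(".*"))                   # True
print(matches_fully(".*", ""))          # True: zero characters is fine for *
print(matches_fully(".+", ""))          # False: + needs at least one
print(matches_fully(".?", "x"))         # True
print(matches_fully(".?", "xy"))        # False: at most one character
print(matches_fully(".*", "alt.binaries.test"))  # True
```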

Similarly, your aol.com will work, but it will match stuff you probably
didn't intend as well, because the dot matches not only a dot, but any
character, so it would match aolecom, aolzcom, etc, but NOT aoleecom or
aol..com (the dot matches a single character only), and NOT aolcom (there
MUST be a single character there to match).

A backslash (\) char is the escape character.  Thus, to match a literal
dot, not "any character", you'd write that this way: aol\.com
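
The difference between the bare dot and the escaped dot is easy to check
with a quick sketch, again in Python's re module rather than pcre itself.
The candidate strings are the ones mentioned above.

```python
import re

unescaped = re.compile(r"aol.com")   # dot = any single character
escaped = re.compile(r"aol\.com")    # backslash-dot = a literal dot

candidates = ["aol.com", "aolecom", "aolzcom", "aoleecom", "aol..com", "aolcom"]

# Map each candidate to (matched by aol.com, matched by aol\.com).
results = {
    text: (bool(unescaped.search(text)), bool(escaped.search(text)))
    for text in candidates
}

for text, (loose, strict) in results.items():
    print(f"{text:10} aol.com={loose!s:5} aol\\.com={strict}")
```

Only "aol.com" itself gets past the escaped version.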

Other "control" characters include parentheses (()) for grouping, and
the pipe (|) for alternatives.  Thus "(aol|hotmail)\.com" would match
either aol.com or hotmail.com.
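
A small sketch of grouping plus alternatives, in Python's re module (not
pcre, but the syntax is the same for this).  The addresses are made up.

```python
import re

provider = re.compile(r"(aol|hotmail)\.com")

def which_provider(address: str):
    """Return the alternative that matched, or None if neither did."""
    found = provider.search(address)
    return found.group(1) if found else None

print(which_provider("someone@aol.com"))      # aol
print(which_provider("someone@hotmail.com"))  # hotmail
print(which_provider("someone@example.org"))  # None
print(which_provider("someone@aolxcom"))      # None: the dot is escaped
```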

Braces ({}) allow you to specify a frequency.  "b(an){2}a" would
therefore match "banana" (the "an" is repeated once, therefore occurring
exactly twice).  You can also use it to specify a range.  "b(an){2,3}a"
would therefore match "banana" and "bananana", and "b(an){2,}a" would
match two or more "an"s, so "bananananananana" but not "bana" (only one
"an", the regex specified two or more).

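The banana examples can be checked the same way.  This is a sketch in
Python's re module, using fullmatch so the whole word has to match; the
helper name (whole) is just for the example.

```python
import re

def whole(pattern: str, word: str) -> bool:
    """True if the entire word matches pattern."""
    return re.fullmatch(pattern, word) is not None

long_banana = "b" + "an" * 7 + "a"   # seven "an"s

print(whole(r"b(an){2}a", "banana"))       # True: exactly two
print(whole(r"b(an){2}a", "bananana"))     # False: three is too many
print(whole(r"b(an){2,3}a", "banana"))     # True
print(whole(r"b(an){2,3}a", "bananana"))   # True
print(whole(r"b(an){2,3}a", "banananana")) # False: four is too many
print(whole(r"b(an){2,}a", long_banana))   # True: two or more
print(whole(r"b(an){2,}a", "bana"))        # False: only one "an"
```
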
There's also brackets ([]), altho I'm not sure how they'd be used here given
brackets are used to denote the group expression.  Brackets signify an
enumeration, "one of the following", so "b[aeiou]n" would match ban, ben,
bin, bon, bun, but not bbn or byn.  One can also include a range using the
- in the enumeration, so [a-z0-9] matches all lowercase letters and digits
in ASCII.  (Note that in some locales, "z" isn't actually the last letter
of the alphabet, so if you are in one of those locales, things will change
accordingly.  There are so-called "POSIX character classes" to deal with
that, but we'll skip that here.  Look it up if desired.)  If you want to
include a /literal/ "-", make it the first character of the enumeration, so
[-09] is those three characters specifically, while [0-9] is the range.  If
you want to match anything /except/ a particular enumeration, use ^ at the
beginning, so [^a-z] would be anything /but/ a lowercase letter (in ASCII,
anyway).
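
Here's a sketch of enumerations, ranges, the literal "-" and the negated
form, in Python's re module (a stand-in for pcre).  The helper name
(accepted) is made up for the example.

```python
import re

def accepted(pattern: str, words):
    """Return the words that match pattern in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

# Enumeration: exactly one of the listed vowels between b and n.
print(accepted(r"b[aeiou]n", ["ban", "ben", "bin", "bon", "bun", "bbn", "byn"]))

# Range: lowercase letters and digits only (matching is case sensitive here).
print(accepted(r"[a-z0-9]", ["q", "7", "Q", "-"]))   # ['q', '7']

# A leading "-" is literal, so this is the three characters -, 0 and 9.
print(accepted(r"[-09]", ["-", "0", "9", "5"]))      # ['-', '0', '9']

# A leading "^" negates: anything but a lowercase ASCII letter.
print(accepted(r"[^a-z]", ["m", "M", "5", "."]))     # ['M', '5', '.']
```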

Outside of a bracket enumeration, ^ and $ are anchors, meaning the
beginning and ending of a line.  Thus, ^$ indicates an empty line,
^aol\.com$ would match /exactly/ "aol.com" on a line by itself, etc.  If a
regex isn't anchored, it can match anywhere in the line, so ".*aol\.com.*"
accepts exactly the same lines as "aol\.com", and exactly the same lines
as "^.*aol\.com.*", because all three match "aol.com" anywhere on the
line.  Note that anchors don't "eat up" characters.  They specify a
position but don't match any characters themselves.
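
A sketch of the anchors in Python's re module (not pcre, but ^ and $ work
the same way on single-line strings like these).  The sample strings and
the helper name (found) are made up.

```python
import re

def found(pattern: str, line: str) -> bool:
    """True if pattern matches somewhere in line."""
    return re.search(pattern, line) is not None

print(found(r"^$", ""))                        # True: an empty line
print(found(r"^$", "x"))                       # False
print(found(r"^aol\.com$", "aol.com"))         # True
print(found(r"^aol\.com$", "mail.aol.com"))    # False: something before it
print(found(r"^aol\.com$", "aol.com.au"))      # False: something after it

# The three unanchored spellings accept exactly the same lines.
spellings = [r"aol\.com", r".*aol\.com.*", r"^.*aol\.com.*"]
lines = ["aol.com", "user@aol.com (AOL user)", "aolxcom"]
verdicts = [[found(p, line) for line in lines] for p in spellings]
print(verdicts[0] == verdicts[1] == verdicts[2])   # True

# An anchor matches a position, not a character: the match is empty.
print(re.search(r"^", "abc").span())           # (0, 0)
```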

Normally, regular expressions are case sensitive, so you might see
[Aa][Oo][Ll] if it can be AOL or aol or AoL or AOl or... .  However, I
believe PAN's matching is always case insensitive.  (I'm not sure on that
tho so check it.)
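
To see the difference between the spelled-out form and a case insensitive
match, here's a sketch in Python's re module.  It says nothing about what
PAN itself does; re.IGNORECASE is Python's flag, not a scorefile setting.

```python
import re

strict = re.compile(r"aol")                  # case sensitive by default
spelled = re.compile(r"[Aa][Oo][Ll]")        # every case variant by hand
relaxed = re.compile(r"aol", re.IGNORECASE)  # let the engine ignore case

variants = ["aol", "AOL", "AoL", "AOl"]

strict_hits = [v for v in variants if strict.search(v)]
spelled_hits = [v for v in variants if spelled.search(v)]
relaxed_hits = [v for v in variants if relaxed.search(v)]

print(strict_hits)    # ['aol']
print(spelled_hits)   # ['aol', 'AOL', 'AoL', 'AOl']
print(relaxed_hits)   # ['aol', 'AOL', 'AoL', 'AOl']
```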

In some cases the backslash turns /on/ the special meaning.  Thus, while
"s" matches the letter, "\s" matches whitespace (space, tab, newline,
carriage return, form-feed), while \S matches a NON-whitespace (anything
BUT a whitespace). Similarly \w means any "word" character (alphanumeric
plus _), \W (cap) means any NON-word character.  BTW, \t=tab, \n=newline,
\r=carriage return, \f=form feed.  Also, \d=digit (thus equivalent to
[0-9]), \D=non-digit.
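
A sketch of the backslash classes in Python's re module.  One assumption
to flag: by default Python treats \d and \w as Unicode-aware, so I pass
re.ASCII to get the ASCII meanings described above.  The sample strings
are made up.

```python
import re

# \d+ : runs of digits
digits = re.findall(r"\d+", "0.14.2.91", re.ASCII)
print(digits)        # ['0', '14', '2', '91']

# \D : strip everything that is not a digit
only_digits = re.sub(r"\D", "", "0.14.2.91", flags=re.ASCII)
print(only_digits)   # 014291

# \s+ : split on any run of whitespace (space, tab, newline)
fields = re.split(r"\s+", "one two\tthree\nfour")
print(fields)        # ['one', 'two', 'three', 'four']

# \S+ : runs of non-whitespace
chunks = re.findall(r"\S+", " a  b\tc ")
print(chunks)        # ['a', 'b', 'c']

# \w+ : "word" characters are alphanumerics plus the underscore
words = re.findall(r"\w+", "snake_case, kebab-case", re.ASCII)
print(words)         # ['snake_case', 'kebab', 'case']

# \W : anything that is not a word character
non_words = re.findall(r"\W", "a_b-c d", re.ASCII)
print(non_words)     # ['-', ' ']
```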

There are also additional anchors, \b anchors at a word boundary, \B of
course means NOT a word boundary.
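
Word boundaries are easiest to see by example.  A sketch in Python's re
module (standing in for pcre), with made-up sample strings:

```python
import re

def found(pattern: str, text: str) -> bool:
    """True if pattern matches somewhere in text."""
    return re.search(pattern, text) is not None

# \b sits between a word character and a non-word character (or the edge).
print(found(r"\baol\b", "user@aol.com"))    # True: "@" and "." are non-word
print(found(r"\baol\b", "user@paol.com"))   # False: "p" runs into "aol"
print(found(r"\baol\b", "aolmail"))         # False: "m" follows directly

# \B is the opposite: a position that is NOT a word boundary.
print(found(r"\Baol", "paol"))              # True: inside a word
print(found(r"\Baol", "aol"))               # False: starts at a boundary
```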

Then there's things like look-ahead and look-behind assertions, both
positive and negative, and other "advanced" techniques I won't go into
here, in part because I'm not sure how much of the advanced stuff PAN
actually handles. There's always google if you are interested in learning
more.

OK, here's a real-world example to show you just how complicated these
things can get.  This is taken from my privoxy (a privacy enhancing
ad-filtering page-rewriting web proxy based on junkbuster) filter file
(the filters are regular expression based), where it matches the file
format code (PDF, DOC etc).  The rest of the filter then turns that part
red, so it stands out, but we only covered matching, not substitution, so
I'm not posting the whole filter, only the regex match portion.

(<font\scolor\s*=\s*)\#[a-f0-9]{3,6}\s*(>\s*File\sFormat:\s*)</font>([^<>]*)<a\b

I didn't mention it above but besides grouping, parentheses save that
portion of the expression to be used again later.  That's what some of the
parentheses here are.

The above matches HTML code similar to the following:

<font color = #aaaaaa> File Format: </font> a bunch of stuff not angle
brackets <a[word-boundary]

It's open grouping (
literal <font color
spaces \s*
literal =
spaces
close grouping )
literal #
three to six hex digits (0-9a-f)
spaces
open grouping (
literal >
spaces
literal File Format:
spaces
close grouping )
literal </font>
open grouping (
wildcard grouping of everything up until a < or > (denoting an HTML tag,
  this is in the brackets with the ^ denoting everything /but/, with the
  star repeating the bracket content any number of times)
close grouping )
literal <a
word boundary \b
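
To see the saved groups in action, here's a sketch that runs the same
expression through Python's re module (not privoxy or pcre).  The sample
HTML is made up to fit the shape shown above; the variable names are mine.

```python
import re

file_format = re.compile(
    r"(<font\scolor\s*=\s*)\#[a-f0-9]{3,6}\s*"
    r"(>\s*File\sFormat:\s*)</font>([^<>]*)<a\b"
)

# Hypothetical input, shaped like the HTML described above.
sample = ('<font color = #aaaaaa> File Format: </font>'
          ' PDF/Adobe Acrobat - <a href="x">View as HTML</a>')

hit = file_format.search(sample)
print(repr(hit.group(1)))   # '<font color = '
print(repr(hit.group(2)))   # '> File Format: '
print(repr(hit.group(3)))   # ' PDF/Adobe Acrobat - '

# The \b after <a keeps other tags that start with "a" from matching.
other_tag = sample.replace("<a href", "<abbr title")
print(file_format.search(other_tag))   # None
```

The color value itself sits outside the groups, which is what lets a
substitution keep groups one to three and swap in a different color.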

Note the wildcard everything /but/ technique there, followed by an
instance of what it couldn't match in the wildcard.  If I hadn't used the
"anything but" technique, it could try to grab nearly the entire web page
from there on, until the last web page anchor tag (<a ...>), as matches are
normally "greedy", matching all they possibly can.  Telling it to match
anything /but/ an html tag marker stops it from going too far.
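
Here's a sketch of that greedy behavior in Python's re module (standing in
for pcre), comparing a plain .* against the "anything but" class.  The
HTML snippet is made up.

```python
import re

# Hypothetical page fragment with two anchor tags after the </font>.
page = '</font> PDF <a href="1">one</a> and <a href="2">two</a>'

# Greedy .* runs all the way to the LAST "<a" it can still match before.
greedy = re.search(r"</font>(.*)<a\b", page).group(1)
print(repr(greedy))    # ' PDF <a href="1">one</a> and '

# [^<>]* cannot cross a tag marker, so it stops at the FIRST "<a".
bounded = re.search(r"</font>([^<>]*)<a\b", page).group(1)
print(repr(bounded))   # ' PDF '
```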

...

Quite a whirlwind tour.  Probably a bit overwhelming.  However, it should
demonstrate a bit of the power that regular expressions allow, and if you
save it for reference and go over it a few times, it might give you some
ideas for just how flexible regular expressions, and therefore scoring, can
be. =8^)

Of course, the above assumes as I said, that it wasn't all review anyway.
=8^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



_______________________________________________
Pan-users mailing list
Pan-users@nongnu.org
http://lists.nongnu.org/mailman/listinfo/pan-users
