[Pan-users] Re: Plonk Author

Duncan Thu, 29 Sep 2005 06:45:02 -0700

Tim Kynerd posted <[EMAIL PROTECTED]>,
excerpted below,  on Thu, 29 Sep 2005 11:37:06 +0200:

> On 29 Sep 2005, at 10:54, Brad Rogers wrote:
> 
>> Hello All,
>>
>> What do I need to set "Group" to, when plonking an author, such that it
>> matches all ng's?
>>
>> I keep seeing articles that I'd rather not.  He doesn't change his
>> e-mail address, and I set the filter to ignore the subject, but I
>> obviously haven't fully understood the "Group" definition.
>>
>> Thanks in advance.
>>
> 
> I'm not 100% sure this works (I haven't checked it carefully), but I  
> set the Group condition to "is not bla.bla.bla". I can be fairly sure  
> there will never be such a group on Usenet, so this condition should  
> apply to all groups.

There are several ways to set "apply to all groups".  If you look at the
score file itself, PAN converts them all to regular expressions (regexs)
before writing them (the other options being plain English for some
prewritten regex magic), so that would be the "native" way to set an
expression.

Using regular expressions, "." substitutes for any character (if you want
/just/ a ".", escape it with a "\", thus "\.").  "*" means "zero or more of
the preceding character".  Thus, ".*" means "zero or more of any
character".  That pretty well covers all groups.
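
A quick way to convince yourself of that is to try it outside PAN.  This is
only a sketch in Python's re module, on the assumption that it agrees with
PAN's regex engine for these basics; the group names are just samples I
picked for illustration.

```python
import re

# Sample group names, chosen only for illustration.
groups = ["comp.lang.python", "alt.test", "news.software.readers"]

# ".*" is zero or more of any character, so every name matches.
match_all = re.compile(r".*")
all_matched = all(match_all.search(g) is not None for g in groups)

# An unescaped "." stands for any one character;
# an escaped "\." stands for a literal dot only.
any_char = re.search(r"a.t", "alt") is not None       # "." matches the "l"
literal_dot = re.search(r"a\.t", "alt") is not None   # no literal "a.t" in "alt"

print(all_matched, any_char, literal_dot)
```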

Another way to do it is to observe that all real USENET groups (and
probably almost all private newsgroups, perhaps with a very few
exceptions) contain the real "." char as a separator, at  least once. 
Thus, you can select "contains", and fill in a ".".  Again, that should
match all groups.

As I noted, of course, PAN will internally convert that to a regex before
writing it to the score file.  "Containing a dot" in regex is simply "\."
(the dot must be escaped, as noted above).  Note that there are no anchor
characters.  A "^" at the left indicates that the line begins with the
sequence; a "$" at the right (if it doesn't follow an escaping "\", of
course) indicates that the line ends with the sequence.  These are called
"anchors" because they anchor the sequence at the beginning or end (or
both) of a line.  Without them, the "." can be anywhere on the line, so
it'll pick up any newsgroup name with a dot in it, which is basically all
of them.  (IDR if the RFC mandates a second level name, therefore at least
one dot, or not, but if there are exceptions, they are few and far between.)
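
To see the difference the anchors make, here is a small sketch, again in
Python's re module rather than PAN itself.  The names are made up, and
"junk" is a hypothetical dotless group included only to show a non-match.

```python
import re

# Made-up names; "junk" is a hypothetical group with no dot in it.
names = ["alt.test", "comp.lang.c", "junk"]

# Unanchored "\." finds a literal dot anywhere in the name.
has_dot = [n for n in names if re.search(r"\.", n)]

# "^" pins the sequence to the start of the name, "$" to the end.
starts_alt = [n for n in names if re.search(r"^alt\.", n)]
ends_c = [n for n in names if re.search(r"\.c$", n)]

print(has_dot, starts_alt, ends_c)
```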

Of course, the same "zero or more instances" of any old character can be
used to mean the same thing -- any group.  Thus, "c*" would be any group
containing zero or more instances of the c char, therefore, any group. 
Any letter or number could be used, as those don't have special meaning
(as *, ., \ and most other punctuation do) if not escaped.
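
The "zero or more" trick is easy to check.  This sketch uses Python's re
module as a stand-in for PAN's matcher (an assumption on my part), with
sample names including an empty one.

```python
import re

# Sample names, including an empty string, for illustration only.
names = ["alt.test", "comp.lang.c", ""]

# "c*" is satisfied by zero occurrences of c, so it matches every name,
# even ones with no c in them and even the empty string.
zero_or_more_c = all(re.search(r"c*", n) is not None for n in names)

# A bare "c" needs a real c somewhere in the name.
needs_c = [n for n in names if re.search(r"c", n)]

print(zero_or_more_c, needs_c)
```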

BTW, because "\" is the escape char, "\\" escapes the special meaning of
the second backslash, converting two into one.  Thus, in regex, \\
converts into a single \.
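
A sketch of that in Python, assuming its re module treats the backslash the
way PAN's engine does.  Note that Python string literals have their own
backslash escaping, so the example uses raw strings (the r prefix) to keep
that extra layer out of the way; the sample text is arbitrary.

```python
import re

# In a raw string, this is exactly two characters: backslash, backslash.
pattern = r"\\"

# Arbitrary sample text containing one backslash.
sample = r"C:\temp"

match = re.search(pattern, sample)
found = match.group()

# The two-character regex matched a single backslash in the text.
print(len(pattern), len(found))
```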

I mentioned the anchor chars.  That of course presents its own way to mean
"all groups".  "^$" matches only an empty line, which of course means "no
groups", so select "does not match regular expression" and fill in "^$", to
"not match a blank group name", thus matching all groups. =8^)  How does
that convert to a regular expression in its own right?  That's a bit more
complicated. Honestly, I had to try /that/ one to see, and I still don't
quite understand the resulting notation.  I'll have to look it up to see
if I can find documentation for it.
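
The "does not match" logic itself is simple enough to sketch.  This shows
only the negated test, done in Python's re module; it does not attempt to
reproduce whatever notation PAN writes to the score file.

```python
import re

# "^$" matches only a completely empty name.
blank = re.compile(r"^$")

# Sample names plus an empty one, for illustration.
names = ["alt.test", "comp.lang.c", ""]

# "Does not match ^$" keeps every name that is not blank.
not_blank = [n for n in names if not blank.search(n)]

print(not_blank)
```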

Completing this whirlwind intro to regex: the () chars group a
sub-expression, as one might expect.  The [] chars create an itemized
character class.  Within a class, the - acts as a range character, and as
the initial char of a class, the ^ means "not".  The | means "or".  Thus,
[0-9] is one numeric character.  [-0-9] means a dash or numeric character
(the dash at the beginning can't indicate a range, so it matches itself).
[a-zA-Z0-9]+ means one or more alphanumeric chars (+ means one or more, as
opposed to * meaning zero or more).  [^0-9] would be a char that's NOT
numeric.  (tom|jerry) means one /or/ the other of them.
(tom(my)?|jerr[iy]) means tom, tommy, jerry, or jerri (? means zero or one
occurrences, so the preceding item may or may not exist).  Etc.
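
Those examples can be checked one by one.  This is a sketch in Python's re
module, assumed to agree with PAN on these constructs; whole() is a helper
name I made up so that each pattern has to match the entire test string.

```python
import re

def whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

digit = whole(r"[0-9]", "7")                  # one numeric character
dash_or_digit = whole(r"[-0-9]", "-")         # leading dash is literal
alnum_run = whole(r"[a-zA-Z0-9]+", "abc123")  # one or more alphanumerics
empty_alnum = whole(r"[a-zA-Z0-9]+", "")      # "+" needs at least one char
non_digit = whole(r"[^0-9]", "x")             # anything but a digit
digit_rejected = whole(r"[^0-9]", "5")        # a digit fails the negated class

# Grouping, alternation and the optional "?" together.
names = r"(tom(my)?|jerr[iy])"
candidates = ["tom", "tommy", "jerry", "jerri", "jerr", "tomm"]
accepted = [w for w in candidates if whole(names, w)]

print(accepted)
```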

Finally, note that as implied by the alphanumeric example, regexs are
normally case sensitive, so those tom and jerry examples above would NOT
match the capitalized names.  [Jj][Ee][Rr][Rr][Yy] would match jERrY in
any case.  (There are additional allowances for specifying case
insensitive matching, but that's in the flag section of the match, which
is out of the scope of the discussion here.)  IDR if PAN's matching is
case sensitive or not, but unless a special exception has been made, it
will be.
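
For what it's worth, here is how the case behaviour looks in Python's re
module.  The re.IGNORECASE flag is Python's own way of asking for case
insensitive matching; I'm not claiming PAN exposes the same thing.

```python
import re

# Lowercase pattern against a capitalized name: no match by default.
plain = re.search(r"jerry", "Jerry") is not None

# Spelling out both cases for each letter matches any mix of case.
bracketed = re.fullmatch(r"[Jj][Ee][Rr][Rr][Yy]", "jERrY") is not None

# Python-specific flag for case insensitive matching.
flagged = re.search(r"jerry", "Jerry", re.IGNORECASE) is not None

print(plain, bracketed, flagged)
```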

Editing the score file directly should be /far/ easier now, with a bit of
regex knowledge.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman in
http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html

_______________________________________________
Pan-users mailing list
Pan-users@nongnu.org
http://lists.nongnu.org/mailman/listinfo/pan-users
