Re: [Pan-users] Ignoring specific threads

Duncan Mon, 15 Sep 2014 16:36:50 -0700

JCA posted on Mon, 15 Sep 2014 15:06:31 -0600 as excerpted:

> I was wondering if Pan can do the following:
> 
>    Let's take a user U in a given group G. U is a crank, a
> troll or something like that. I would like to tell Pan to ignore not
> only all posts from U but also all threads initiated by U. Is this
> possible with Pan?


Ignoring threads by a specific person isn't necessarily impossible, but 
it's not /directly/ possible, either.  You'll sort of be relying on a bit 
of a side-effect of something else, and hoping that you can get a good 
match without catching too many unrelated posts in the process.  Tho if 
it /does/ catch other posts you can potentially use score ordering or 
incremental scoring to rescue them.

IOW, this will be advanced score usage that could be complicated to set 
up and not necessarily worth the hassle, but in theory it can be done... 
sort-of.

Here's the deal.  Proper threading uses the references header.  This 
header contains a multi-generational list of "parent" post message-IDs.  
To score on threads or subthreads you score on the appropriate message-ID 
in the references header, and anything that matches will get the assigned 
score.
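
To make the structure concrete, here's a minimal Python sketch (not 
anything Pan itself runs; the function name and the message-IDs are made 
up for illustration) that splits a References header into its 
message-IDs.  It assumes the usual layout, oldest ancestor first and the 
direct parent last.

```python
import re

def parse_references(header):
    """Return the message-IDs found in a References header, in order."""
    # Each message-ID is wrapped in <...>; IDs are separated by whitespace.
    return re.findall(r"<[^<>\s]+>", header)

# Hypothetical header from a reply two levels deep in a thread.
refs = parse_references(
    "<root.1@example.com> <reply.2@example.org> <reply.3@example.net>"
)
thread_root = refs[0]     # the post that started the thread
direct_parent = refs[-1]  # the post this message replies to
```

Scoring on thread_root catches the whole thread, while scoring on a 
later ID catches only the subthread below that post.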

The problem is that message-IDs are designed to uniquely ID specific 
messages, so a match of an entire message-ID will match only the single 
(sub)thread in reply to that specific message.  (Message-IDs are assigned 
to both email and news messages.  News message format is almost entirely 
the same as email message format, with a few news-specific headers added 
and a few mail-specific headers generally omitted, altho both news and 
mail headers can be present and normally won't conflict with each other.)  
To match all threads originated by a specific author, you need to find 
something unique about that author's message-IDs that you can score on, 
that won't catch other authors' message-IDs as well.  To the extent that 
you can do so, you can filter threads replying to that person.  To the 
extent that you cannot, because the fixed part of the target author's 
message-IDs also appears in the message-IDs of others, you score their 
messages also.

As it happens, message-IDs are set either by the posting client, or by 
the server posted to, if the posting client didn't set one.  There are no 
hard rules governing the algorithm used to get a globally unique ID, one 
that is extremely unlikely to apply to a different message.  (Message-IDs 
are used to track messages, so if two different messages get the same ID, 
only the first one seen by a particular server or client will normally 
appear.)  There are only general rules on the characters an ID can 
contain and on the general format, which is similar to an email address, 
userpart @ domainpart.  (I deliberately spaced it out to avoid triggering 
gmane's email address obfuscation.)
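
As a rough illustration of that format, here's a small Python sketch that 
splits a message-ID into its two halves so you can eyeball which part 
might be stable for a given poster.  The function name and the sample ID 
are hypothetical, and it assumes a well-formed ID with a single @.

```python
def split_message_id(message_id):
    """Split '<userpart@domainpart>' into (userpart, domainpart)."""
    inner = message_id.strip().strip("<>")  # drop the enclosing <>
    user_part, _, domain_part = inner.rpartition("@")
    return user_part, domain_part

# Hypothetical message-ID, written without the spaces used in the text.
user_part, domain_part = split_message_id("<abc123.xyz@sample.com>")
```

Comparing the two halves across several posts from the same author is a 
quick way to see what, if anything, stays fixed.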

If the posting client doesn't include a message-ID, then the server will 
set one.  Usually the domain side of these is the domain name of the news 
service provider the message was posted to, say @ giganews.com, or some 
such.  Of course scoring on that will catch all users who post to that 
NSP, with clients that don't set the message-ID themselves.

Clients that set the message-ID can use a similar pattern.  Pan, for 
instance, uses the domain name of the email address you are posting with.  
The Agent (and freeagent) client at least used to use the agent domain 
name instead.  Of course, in most cases either one of these will result 
in a domain name match that matches far more than one poster.

So the domain name side of the message-ID can be useful in narrowing 
things down, but ordinarily won't be enough by itself to identify a 
single poster, so you'll need to match something from the user side of 
the message-ID as well.

But the user side of the message-ID tends to be almost entirely 
unstandardized, except of course that there are some restrictions on the 
characters that can be used, and the idea is to ultimately have something 
unique enough that no other message will have the same message-ID, 
despite a lot of other messages from the same poster and others normally 
having the same domain side.

So what you'll want to try to do is look at the message-ID of a post from 
the target author, and **TRY** to find a match that's as unique to his 
posts as possible, but still dependably identifies ALL his posts.

If you're lucky, he uses a news server or client that nobody else posting 
to the group in question uses, and between limiting the score to that 
domain-name side of the message-ID, plus anything that's unique on the 
user side, and limiting that score to a specific group, it'll "just 
work".  Tho of course there's always the possibility that a new poster 
will appear that matches as well, that you'll miss.


But chances are pretty good you won't find a good enough match and that 
other posters will match that score as well.  But if it's only a few 
other posters that get caught in the net, all hope is not yet lost.  

Pan uses two types of scoring, absolute scoring, where a matching rule 
sets that score and no further rules are processed, and incremental 
scoring, where the score is simply increased or decreased by the value in 
the score.

Ignore is a score of -9999 or lower.  Normally, setting an ignore sets an 
absolute score of -9999, but a post can also be ignored if no absolute 
scores apply but the total of all incremental scores ends up being -9999 
or lower.  So if the net cast by your would-be references-header 
message-ID ignore is too wide and catches others as well, you have two 
possible methods to counteract that.
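
Here's a simplified Python model of those two score types, just to pin 
down the logic.  It is NOT Pan's actual code: the rule representation 
(a list of (kind, value) pairs for the rules that matched, in scorefile 
order) and the function names are assumptions made for the sketch.

```python
IGNORE_THRESHOLD = -9999  # a score at or below this means "ignored"

def final_score(matched_rules):
    """Combine the rules that matched a post, in scorefile order."""
    total = 0
    for kind, value in matched_rules:
        if kind == "absolute":
            return value  # absolute: set the score, stop processing
        total += value    # incremental: adjust the running total
    return total

def is_ignored(score):
    return score <= IGNORE_THRESHOLD

# An absolute ignore, and two incrementals that only add up to an ignore.
absolute_ignore = final_score([("absolute", -9999)])
summed_ignore = final_score([("incremental", -5000), ("incremental", -5000)])
# An earlier absolute rule wins over a later absolute ignore.
rescued_first = final_score([("absolute", 0), ("absolute", -9999)])
```

In this model absolute_ignore and summed_ignore both end up ignored, 
while rescued_first does not, because the first absolute rule ends 
processing before the ignore is reached.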

If you want to use an absolute score ignore, then counteracting it is as 
simple as setting another absolute score that catches the "mistakes", 
that gets processed first (appears before the too wide score in the 
scorefile, which you can edit for order as necessary).

The problem here is that the references header will contain message-ids 
from multiple generations of parent, and the ones that contain the target 
may well contain the false-positive IDs as well.  So an absolute score 
isn't likely to do what you need, because trying to undo it for the false-
positives will likely undo too much as well.

Which leaves incremental scoring.  The idea here would be to find a mix 
of scores such that in the end, all the matches for the target posts end 
up at -9999 or lower, while incrementals add just enough score back to 
the false-positives to rescue them from the ignore, bringing their score 
up to at least -9998, if not up further, to zero or positive.  That's 
definitely an art unto itself; or as I said above, "advanced".
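
A toy Python version of that balancing act, with made-up patterns and 
score values (the "friend." prefix, the sample.com domain and the 5000 
are all hypothetical, and this only mimics incremental rules matched 
against the References header, it isn't how Pan is implemented):

```python
import re

IGNORE_THRESHOLD = -9999  # at or below this, the post is ignored

# Hypothetical incremental rules: (regex tried against References, delta).
RULES = [
    (re.compile(r"@sample\.com>"), -9999),                # the too-wide net
    (re.compile(r"<friend\.[^@>]*@sample\.com>"), 5000),  # rescue pattern
]

def incremental_score(references):
    """Sum the deltas of every rule whose regex matches the header."""
    return sum(delta for pattern, delta in RULES if pattern.search(references))

target_thread = "<troll.0001@sample.com>"
false_positive = "<friend.0042@sample.com>"

print(incremental_score(target_thread))   # -9999, ignored
print(incremental_score(false_positive))  # -4999, rescued
```

Note that a References header containing both kinds of ID would collect 
both deltas and be rescued too, which is where the tuning gets fiddly.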

Meanwhile, something that may help:  In your example you specified 
threads INITIATED by U.  As it happens, regular-expression matches have a 
way to specify BEGINS WITH and/or ENDS WITH.  If you're only worried 
about matching threads where U is the original poster, the ^ character at 
the beginning of the regex can be used to specify "begins with".  You can 
then use a wildcard that omits the ">" character used to terminate each 
message-ID, thus forcing the match to only apply to the first one.  
Something like this (spaces again inserted either side of the @) :

References: ^[^>]* @ sample\.com>

^ means begins-with.  The [] encloses a character-set, with ^ as the 
first character meaning "not".  * means "any number of matches of the 
previous".  So what that means is:

References header, begins with, any-number-of-characters-not-including->, 
@ sample.com, >.

Thus the first message-ID in the references header would have to have 
sample.com as the domain name portion.
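
You can try the pattern outside Pan before committing it to a scorefile.  
The sketch below uses Python's re module, on the assumption that it 
treats these particular constructs (^, a negated character set, *) the 
same way Pan's regex engine does.  The message-IDs are made up, and the 
regex is written without the spaces around the @.

```python
import re

# Same pattern as above, minus the anti-obfuscation spaces.
FIRST_ID_AT_SAMPLE = re.compile(r"^[^>]*@sample\.com>")

# Thread started from sample.com, then a reply from elsewhere.
started_by_sample = "<a1@sample.com> <b2@example.org>"
# Thread started elsewhere, with sample.com only appearing later.
replied_by_sample = "<b2@example.org> <a1@sample.com>"

matches_started = bool(FIRST_ID_AT_SAMPLE.search(started_by_sample))
matches_replied = bool(FIRST_ID_AT_SAMPLE.search(replied_by_sample))
```

Only the first header matches, since [^>]* can't step past the ">" that 
closes the first message-ID.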


But something else to keep in mind as well:  Some clients are broken and 
do not include a properly populated References header in replies.  These 
clients will often attempt to thread by the contents of the subject 
header, instead. Obviously, no references header, no match on a 
references-header score. =:^(

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


_______________________________________________
Pan-users mailing list
Pan-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/pan-users
