Re: [Pan-users] Scoring based on arbitrary headers?

Duncan Mon, 05 Jan 2015 20:11:14 -0800

Jim Henderson posted on Mon, 05 Jan 2015 17:55:41 +0000 as excerpted:

> What would it take to be able to score articles based on an arbitrary 
> header?
> 
> Say, for example, I get an X-Forwarded-For: header - if I wanted to do 
> some simple matching (say even substring-based matching), is there 
> currently a mechanism that would let me do this (say, manually editing 
> the score file), or would it require changes to underlying code in Pan?


With three caveats, AFAIK scoring by arbitrary header should "just work" 
in current pan.

Caveats:

1) I haven't tried this myself, and I'm too lazy ATM to do the list (or 
git log) search to verify, but I'm almost certain that Heinrich said it 
works now.  If it seems to fail when you try it, I can try to dig up the 
message, but I expect it /does/ work, tho I can't vouch for that as I'm 
not using the feature.

2) (The big one.)  While arbitrary header scoring should work, due to the 
nature of NNTP it's not as efficient: pan must download (at least part 
of) the message before it can apply that score.  You can't score after 
downloading only "headers", as you can with pan's normal GUI scoring 
options.

Here's the deal.  What pan calls "headers" is actually "overviews".  If 
you go back in list history you'll see I used to make a big deal about 
this, and for some time insisted on calling it the "overview pane" rather 
than the "header pane", because that's what it shows, overviews, *NOT* 
all, or even generally /most/, headers.

Overviews, in NNTP, consist of a strictly selected subset of message 
headers and other message metadata -- generally those found most useful 
before full message download.  From RFC 3977 section 8.3.2, the first 
eight fields of an overview MUST be, in order:

"0" or article number (see below)
Subject header content
From header content
Date header content
Message-ID header content
References header content[1]
:bytes metadata item
:lines metadata item

A news admin MAY configure additional headers or metadata[2] for 
overviews, and the xref and distribution (if present) headers are 
commonly included.  Anything else is entirely optional and left to 
provider/admin policy.
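
To make that field order concrete, here's a minimal Python sketch that 
splits one tab-separated overview line into the eight mandatory fields 
plus any extras.  The names (Overview, parse_overview_line) are my own 
for illustration, not anything from pan, and the sample line is made up 
in the style of the RFC's examples.

```python
from typing import NamedTuple


class Overview(NamedTuple):
    """The eight mandatory overview fields, plus any optional extras."""
    number: int
    subject: str
    from_: str
    date: str
    message_id: str
    references: str
    bytes_: int
    lines: int
    extra: dict


def parse_overview_line(line: str) -> Overview:
    """Split one tab-separated overview line (hypothetical helper)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 8:
        raise ValueError("an overview line needs at least eight fields")
    number, subject, from_, date, msgid, refs, nbytes, nlines = fields[:8]
    extra = {}
    for field in fields[8:]:
        if field.startswith(":"):
            # Extra metadata item: name, one space, then the value.
            name, _, value = field.partition(" ")
        else:
            # Extra header: the whole header line, "Name: content".
            name, _, value = field.partition(":")
        extra[name.lower()] = value.strip()
    # Some servers leave :bytes or :lines empty; treat that as 0 here.
    return Overview(int(number), subject, from_, date, msgid, refs,
                    int(nbytes or 0), int(nlines or 0), extra)


sample = ("3000234\tI am just a test article\t"
          "\"Demo User\" <nobody@example.com>\t"
          "6 Oct 1998 04:38:40 -0500\t<45223423@example.com>\t"
          "<45454@example.net>\t1234\t17\t"
          "Xref: news.example.com misc.test:3000363")
overview = parse_overview_line(sample)
print(overview.subject, overview.extra)
```

Note that nothing like X-Forwarded-For shows up here unless the server 
admin chose to add it to the overview.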

Now here's the kicker.  NNTP provides the OVER command (XOVER on older 
servers) to fetch this information for an individual message or for a 
range of articles based on article number, and it's precisely this 
OVERVIEW information that most news clients, including pan, display as 
the article list, BEFORE THE ARTICLES THEMSELVES ARE DOWNLOADED (at least 
to local cache).

As a consequence, even tho pan should score on the contents of arbitrary 
headers just fine, IF THE HEADER ISN'T IN THE OVERVIEW, PAN CAN'T SCORE 
ON IT UNTIL THE ARTICLE IS DOWNLOADED.

Which /does/ cripple scoring on non-overview headers to a significant 
extent, but there's nothing to be done about it.
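
The overview/non-overview split can be sketched as a simple lookup.  
This is only an illustration in Python under my own assumptions: the 
function name is made up, and the list of extra headers stands in for 
whatever a given server adds to its overviews (Xref is common, but 
that's provider policy).

```python
# Headers every overview must carry, per the list above.
MANDATORY_OVERVIEW_HEADERS = frozenset(
    {"subject", "from", "date", "message-id", "references"}
)


def scorable_before_download(header, extra_overview_headers=()):
    """Return True if the header's content arrives with the overview.

    extra_overview_headers is a hypothetical list of whatever optional
    headers this particular server includes in its overviews.
    """
    name = header.rstrip(":").strip().lower()
    extras = {h.rstrip(":").strip().lower() for h in extra_overview_headers}
    return name in MANDATORY_OVERVIEW_HEADERS or name in extras


print(scorable_before_download("Subject"))
print(scorable_before_download("X-Forwarded-For"))
print(scorable_before_download("Xref", extra_overview_headers=["Xref:"]))
```

Anything for which this returns False can only be scored once (at least 
part of) the article itself has been fetched.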

And as long as it works, even crippled, it's still useful.  If for 
instance you're ignore-scoring based on a non-overview header, pan must 
download the full message to do so, but the ignore is still automated.  
That's not as good as being able to avoid downloading those messages at 
all, but at least you don't have to /manually/ see and deal with messages 
you presumably found offensive enough to want to ignore, which is still 
CONSIDERABLY better than NOTHING! =:^)

What I do NOT know, because as I said I've not actually tried it here, is 
if pan will automatically rescore when it downloads the message and can 
do so, or if you'll need to manually trigger a rescore.

If you have to manually trigger the rescore, there's another 
implication.  You'll presumably need to do what I normally do for 
binaries anyway: download a slew of messages to cache for later 
processing, then come back when they're all in local cache and go thru 
them again, in this case triggering the rescore as the first order of 
business.  At least for binaries, that means configuring your cache size 
considerably larger than pan's default 10 MiB.

Of course, if pan already does a second scoring pass after download to 
cache, or better yet, after download of just the headers so it can cancel 
big binary downloads before they're finished, then you shouldn't have to 
change the cache size.  But I don't know if it handles that automatically 
or not.

So if you test this, please post your results. =:^)

3) Yes, you must edit the scorefile manually to score on arbitrary 
headers.  This is for two reasons.  First, obviously that's an infinite 
list of possible headers, which doesn't fit well with pan's scoring GUI.  
Of course the GUI could include the ability to specify your own header, 
but that's where the second reason comes in.  Pan's GUI, particularly 
when Charles was primary dev, was kept simple and ideally intuitive, and 
explaining the technical implications and limitations of non-overview 
header scoring is ANYTHING but simple.  Thus the obvious solution, make 
it possible, but only by editing the scorefile directly.  Those technical 
enough to be willing to do that should be technical enough to appreciate 
the implications of non-overview scoring, and motivated/desperate enough 
to still appreciate the more limited benefits it offers.

Since you mentioned that as a possibility, presumably you're already 
familiar with the scorefile format.  Just in case you aren't, or in case 
you need a refresher and don't have the link handy, here's the 
boilerplate:

http://slrn.sourceforge.net/docs/score.txt

That of course is the slrn scorefile doc.  Pan uses the same basic 
format, but is case-insensitive by default, and doesn't handle some of 
the more advanced features (like external file-includes and nested 
conditionals).  Also, last I was aware, pan had a bug: it ORed all 
scoring conditions (the documented behavior for a double-colon Score:: 
line) even where the Score: line had a single colon and thus, by the 
documentation, the conditions should be ANDed.
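
For anyone unsure what the AND/OR difference means in practice, here's a 
small Python sketch of the documented semantics.  It's my own 
illustration (entry_matches and its arguments are made-up names), not 
pan's or slrn's actual code.

```python
import re


def entry_matches(conditions, headers, match_any=False):
    """Evaluate a score entry's header tests against one message.

    conditions: list of (header name, regex) pairs from the entry.
    headers:    dict of the message's headers, keyed by lowercase name.
    match_any:  False models a single-colon "Score:" line (all tests
                must match); True models "Score::" (any test suffices).
    """
    results = [
        re.search(pattern, headers.get(name.lower(), ""), re.IGNORECASE)
        is not None
        for name, pattern in conditions
    ]
    return any(results) if match_any else all(results)


conditions = [("Subject", "pan"), ("From", "spammer")]
headers = {"subject": "Re: pan scoring", "from": "jim@example.com"}

print(entry_matches(conditions, headers))                  # AND
print(entry_matches(conditions, headers, match_any=True))  # OR
```

Here only the Subject test matches, so the entry applies under OR but 
not under AND.  If the bug is still there, a multi-condition entry you 
meant as AND would fire as if it were OR, so it may be safest to test 
with single-condition entries first.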

See my past posts on the scorefile format for further details...

So basically, to score on an arbitrary header, you'd create the score as 
normal, but use the desired header instead of subject/from/etc.
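
As a rough illustration, here's a small Python helper that writes out 
such an entry in the basic slrn layout (group line, Score: line, then 
the header test).  The helper name, the address pattern and the -9999 
score are all my own assumptions for the example (-9999 being, as far as 
I know, what pan treats as "ignored"), and as I said I haven't verified 
that pan actually applies an entry like this.

```python
def make_score_entry(header, pattern, score, group="*"):
    """Build one scorefile entry testing a single header (a sketch).

    group="*" is meant to apply the entry to all newsgroups.
    """
    name = header.rstrip(":")
    return f"[{group}]\nScore: {score}\n{name}: {pattern}\n"


# Hypothetical rule: ignore anything forwarded from 192.0.2.x.
entry = make_score_entry("X-Forwarded-For", r"192\.0\.2\.", -9999)
print(entry)
# [*]
# Score: -9999
# X-Forwarded-For: 192\.0\.2\.
```

Paste the resulting three lines into the scorefile by hand (with pan not 
running, to be safe), then see whether the score takes effect once the 
articles are in cache.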

Good luck, hope the limitations don't ruin it for you, and hope you can 
confirm it working for us! =:^)

---
[1] References header and threading:  Pan uses it too.  Some clients (MSOE 
among them) thread by subject as well and if the subject changes, 
consider it a new thread, but that's not generally considered valid, and 
leads to confusion when people hit reply (and thus have a references 
header in their message) and think by changing the subject they're 
starting a new thread.  The /valid/ way to start a new thread is with a 
new message, *NOT* a reply to an old one.

[2] Headers and metadata:  The distinction is this:  Headers are always 
literal content within the message.  Metadata is always calculated.  
Thus, for example, the :bytes metadata and Bytes: header are two entirely 
different things and may well have different content.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


