Dominique Dumont via Pan-users posted on Wed, 20 Aug 2025 09:32:58 +0200
as excerpted:

> On Wednesday, 30 July 2025 03:24:42 Central European Summer Time Robin
> Laing wrote:
> 
>> Groups I look at have a large number of articles.  Due to obfuscation
>> usage, they get a very large number, in the 100's of thousands in a few
>> months.  There are original posts as well.
> 
> You can try to define scores and actions to remove these obfuscated
> posts:
> - define a score to ignore them
> - set an actions (in preferences) to
> delete article scoring at -9999 or less

Good idea if that's possible, which it may not be.

I haven't done regular binary downloads in years so I'm honestly not sure 
what modern "obfuscation" looks like (as compared to say, all the spam 
garbage I used to try to ignore&delete back in the day, with some methods 
easier to manage than others), but upon reading the message I had 
interpreted "obfuscation" as deliberately confounding automated 
"censorship" methods.  If that's the case it may make pan's ignore&kill 
methods, or indeed, pretty much any such automated handling, difficult or 
impossible.

As an example there was a troll that randomized their name.  Of course 
that kills any chance at ignoring by author, and of course subject, etc, 
wasn't stable either.  Now I /suspected/ that some other header in their 
posts remained stable enough to score against, but at that time pan could 
only score on normal overview header content (see discussion below) so 
headers like NNTP-Posting-Host, provider, path, even user agent, weren't 
available to score on.

I argued that despite the downsides of post-download-to-cache score/
filtering (again, see below), having it available was still better than 
having to deal with such posts manually, and IIRC, that was one of the 
changes Heinrich made, so I *think* entire post content (including both 
non-overview headers and full body) can be scored on now.

> For instance here's a filter I use:
> 
> %BOS
> %Score created by Pan on Sat Feb  4 19:22:42 2023
> [alt.binaries.*]
> Score:: -9999
> Subject: [0-9a-f-]{15,}
> %EOS

If I'm not mistaken (it has been awhile...), what that does in English is:

For any subject containing a run of 15 or more characters drawn from the 
hex digits (0-9, a-f) plus "-" (so basically long hash- or UUID-like 
strings without spaces), score -9999 ("ignore" level in pan-score-speak).

That's a reasonably good one.  And it's on subject so available in the 
overviews.
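
For anyone who wants to check what that pattern catches before trusting 
it with a -9999, here's a minimal Python sketch.  It only exercises the 
regex itself: the sample subjects are made up, looks_obfuscated() is a 
name I invented for illustration, and pan's own matching (case handling 
in particular) isn't modeled.

```python
import re

# Same pattern as the Subject line in the quoted score entry.
OBFUSCATED = re.compile(r"[0-9a-f-]{15,}")


def looks_obfuscated(subject: str) -> bool:
    """True if the subject contains a run of 15+ hex digits or hyphens."""
    return OBFUSCATED.search(subject) is not None


# Made-up subjects, for illustration only.
samples = [
    "3f9a1c0de4b7a2f8c6d1e0 [01/50] yEnc",  # one long hex run
    "Holiday pictures 2024 [1/12]",          # only short runs
    "deadbeef-cafe-f00d [3/9]",              # hyphens count toward the run
    "abcdefghijklmnopqrstuvwxyz",            # letters past f break the run
]

for subject in samples:
    verdict = "ignore" if looks_obfuscated(subject) else "keep"
    print(f"{verdict:6}  {subject}")
```

Note the last sample: a long ordinary word is not caught, because the 
character class stops at "f".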

> That said, I do not know if pan remove the articles when downloading
> headers (in which case you should be fine) or if it applies scores and
> actions after loading all headers (in which case, you may face similar
> memory usage and crash)

I /think/ pan scores twice now.  The first pass is on receiving the 
overviews, but I'm not sure whether it scores (and applies rules based on 
the scores) as-it-goes, which is what you'd want here since it would 
eliminate the memory usage of deleted messages, or only after it's done 
with initial processing, which wouldn't.  The second pass is a rescore 
after caching, when all headers and message content are available (which 
obviously uses full memory).

For anyone a bit lost by the distinction between overviews and headers, 
here's my attempt at putting the RFC[1] tech-speak in plain English.  Note 
that while I believe the following to be correct as far as it goes, I'm 
glossing over the details both for brevity and because in some cases I may 
not understand them myself.

In RFC-speak, "headers" are the first lines of a message, separated into 
individual headers by (unescaped[2]) CRLF (carriage-return, line feed) 
characters, and from the message body by the first (unescaped) blank line 
of the message -- that is, two CRLF sequences in a row.
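
To make the header/body split concrete, here's a small Python sketch.  
It's a simplification under stated assumptions, not a full parser: 
split_message() is a name I made up, the sample message is invented, and 
wrapped (continuation) header lines are simply rejoined with a single 
space.

```python
from typing import Dict, Tuple


def split_message(raw: str) -> Tuple[Dict[str, str], str]:
    """Split a raw message into a header dict and the body text."""
    # The first blank line (two CRLFs in a row) ends the headers.
    head, _, body = raw.partition("\r\n\r\n")
    headers: Dict[str, str] = {}
    name = None
    for line in head.split("\r\n"):
        if line[:1] in (" ", "\t") and name is not None:
            # Continuation of the previous (wrapped) header line.
            headers[name] += " " + line.strip()
        else:
            name, _, value = line.partition(":")
            headers[name] = value.strip()
    return headers, body


# Invented sample message; note the wrapped References header.
raw = (
    "From: Someone <someone@example.invalid>\r\n"
    "Subject: A test\r\n"
    "References: <a@example.invalid>\r\n"
    " <b@example.invalid>\r\n"
    "\r\n"
    "Body line one.\r\n"
    "\r\n"
    "Body line two.\r\n"
)

headers, body = split_message(raw)
print(headers["Subject"])
print(headers["References"])
print(repr(body))
```

Only the first blank line counts as the separator, so the later blank 
line between the two body lines stays part of the body.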

The most common headers include Subject, Author (the From header), Date, 
message length (in lines, size, or both), content (body) format, etc.  
Less common headers, often displayed only if the "raw" message is shown, 
include NNTP-Posting-Host (commonly the IP posted from; cheap or ISP 
providers often leave it unencoded, while "dedicated" news providers, 
where news is the specific service the end-user customer pays for, 
usually encrypt it so only the provider itself can track the poster), 
service provider, the path the message took from the poster's news 
provider to the downloader's news provider (actually common but not as 
critical as subject, author...), etc.

You will note that some of these headers are critical to deciding whether 
to download the message at all (at least to local cache, whether then 
saved or not), as well as to threading display.  Specifically, Subject, 
Author, Date, Message-ID (used to request a specific message), Xref 
(per-group and per-server sequential message number, used by the client 
to track read messages, etc), References (used for threading), and size 
(in lines or bytes) are quite critical either to the 
pre-whole-message-download display in the "headers" pane, or to 
identifying the message itself.

NNTP has what's called an "overview" that lets the client fetch these 
critical headers for display before downloading the entire message to 
local cache.  But this overview doesn't include less critical headers or 
(obviously) the entire message (body).
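
As a rough illustration of what one overview entry carries, here's a 
Python sketch that parses a single overview line.  It assumes the 
tab-separated field order RFC 3977 gives for overview responses (article 
number, Subject, From, Date, Message-ID, References, byte count, line 
count, then optional extras such as Xref).  The sample line is invented, 
and Overview and parse_overview() are illustrative names, not pan code.

```python
from typing import NamedTuple


class Overview(NamedTuple):
    number: int
    subject: str
    author: str
    date: str
    message_id: str
    references: str
    size_bytes: int
    line_count: int


def parse_overview(line: str) -> Overview:
    """Parse one tab-separated overview line into its standard fields."""
    f = line.rstrip("\r\n").split("\t")
    # Servers may leave the count fields empty; treat that as zero here.
    size = int(f[6]) if f[6] else 0
    lines = int(f[7]) if f[7] else 0
    # Anything after the eighth field (e.g. Xref) is ignored in this sketch.
    return Overview(int(f[0]), f[1], f[2], f[3], f[4], f[5], size, lines)


# Invented sample line, for illustration only.
sample = (
    "12345\tSome subject\tPoster <poster@example.invalid>\t"
    "Wed, 20 Aug 2025 09:32:58 +0200\t<abc@example.invalid>\t\t"
    "4096\t60\tXref: news.example.invalid alt.test:12345\r\n"
)

ov = parse_overview(sample)
print(ov.subject, ov.author, ov.size_bytes, ov.line_count)
```

Everything a scoring rule can test at this stage is in that handful of 
fields; headers like NNTP-Posting-Host simply aren't there yet.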

Originally pan could score only on the overviews, and thus only on the 
critical headers included in the overviews.  When scoring on overviews is 
possible, it's by far preferred, since it allows one to "ignore" (on the 
negative side) or "watch" (on the positive side) messages before the full 
message is downloaded to local cache, thereby saving on download traffic 
and time.

Pan's rules/actions then allow one to act on the score, say 
auto-downloading "watched" (+9999 or above) messages, auto-deleting 
"ignored" (-9999 or below) messages, perhaps auto-marking-as-read but not 
deleting anything scored below zero but not yet reaching -9999/ignored, 
etc.  It's obvious why doing so without actually downloading (to cache) 
the message is preferred, when possible.
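
The score-to-action idea can be sketched in a few lines of Python.  This 
is only a model of the example rules just described, using the +9999 and 
-9999 thresholds from the text; action_for() and the action names are 
hypothetical, and pan's real rules are set in its preferences, not coded 
like this.

```python
WATCHED = 9999    # "watched" threshold from the text
IGNORED = -9999   # "ignored" threshold from the text


def action_for(score: int) -> str:
    """Map an article score to one of the example actions."""
    if score >= WATCHED:
        return "download"    # auto-download watched articles
    if score <= IGNORED:
        return "delete"      # auto-delete ignored articles
    if score < 0:
        return "mark-read"   # negative, but not yet ignored
    return "keep"            # zero or mildly positive: leave alone


for score in (9999, 500, 0, -1, -9998, -9999):
    print(f"{score:6}  {action_for(score)}")
```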

But what about that troll randomizing his Author headers I mentioned 
above?  Messages such as that are often impossible to score-ignore and 
delete from the overviews alone, but once the message is downloaded to 
cache, there's all the non-overview headers (and indeed, the body itself) 
available to score against if desired.

Obviously this isn't the greatest situation to be forced into since you 
have to download the message to cache in order to score on the full 
message (both body and non-overview headers), but I still think it's a win 
if pan can score and process them automatically, even if post-download, 
thereby allowing the human user to avoid having to do so manually.

And as I said above, I /think/ non-overview (re)scoring is one of the 
features Heinrich added to pan, tho one has to directly edit the score 
file to do it (pan's GUI only allows scoring on overview headers).  Tho 
FWIW, that was long after I needed the feature to deal with that author-
randomizing troll, and after I quit doing regular binary downloads as 
well, so I don't believe I ever used the feature personally.

Now to bring this back to topic...

Obviously post-cache (re)scoring, even if possible (which if I'm correct 
it now is), besides still requiring the download-to-cache, won't help much 
with the memory issue, as that's well after pan has incurred the memory 
cost.  But (I contend anyway) it's still useful if it allows avoiding 
having to manually deal with posts you'd prefer not to.

Meanwhile, as DD pointed out, the overview-level scoring could be either 
as-pan-first-sees-it, thereby eliminating the memory overhead, or later, 
after it has constructed the full overview picture in memory, thereby 
incurring the memory cost regardless.  As I'm not a coder and haven't 
picked that up from watching the pan lists over the years, I don't know 
which it is either.  He (or some other coder who could more easily than I 
extract that info from the actual source) would be in a better position 
than I to find out.
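
To show why the timing matters for memory, here's a toy Python 
comparison of the two possibilities.  To be clear, this is not pan's 
code, which I haven't read.  The subjects are generated stand-ins, the 
function names are invented, and "peak" just counts how many entries are 
held at once.

```python
import re
from typing import Iterable, Iterator, List, Tuple

IGNORE = re.compile(r"[0-9a-f-]{15,}")


def fake_subjects(n: int) -> Iterator[str]:
    """Stand-in for subjects arriving from the server one at a time."""
    for i in range(n):
        # Every tenth subject is "real"; the rest imitate obfuscated posts.
        yield f"real post {i}" if i % 10 == 0 else f"{i:032x}"


def score_as_it_goes(stream: Iterable[str]) -> Tuple[List[str], int]:
    """Drop ignored entries on arrival; they are never stored."""
    kept = [s for s in stream if not IGNORE.search(s)]
    return kept, len(kept)


def score_after_loading(stream: Iterable[str]) -> Tuple[List[str], int]:
    """Store everything first, then filter."""
    everything = list(stream)
    peak = len(everything)  # all entries are in memory at this point
    kept = [s for s in everything if not IGNORE.search(s)]
    return kept, peak


kept_a, peak_a = score_as_it_goes(fake_subjects(1000))
kept_b, peak_b = score_after_loading(fake_subjects(1000))
print(len(kept_a), peak_a)
print(len(kept_b), peak_b)
```

Both approaches end up keeping the same posts; the only difference is how 
many entries had to be held in memory along the way.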

Long story short, be aware that even if it's possible to successfully 
score and automate deletion of a post, that doesn't mean it's not going to 
incur that memory cost.  If it's possible to score on overview-included 
headers such as subject and author, it'll depend on when pan actually does 
that processing, before or after it has incurred the full memory cost.  If 
scoring on non-overview headers (or full content such as signature) is 
necessary, then even if it works (which I believe it should now, tho as 
mentioned, only by directly editing the scorefile as it's not in the pan 
GUI), it's going to incur full memory cost regardless -- there's simply no 
way around it.

---
[1] The RFCs, "Requests for Comments", are the specifications that 
standardize internet formats and the like so multiple implementations can 
properly interoperate with one another.  The RFCs specifically for net-
news (NNTP) reference the ones for a common internet message format 
originally specified in the email RFC context but also used for news.

[2] Escapes are used to allow wrapping ("folding") of header lines that 
would otherwise exceed the 998 character (1000 minus terminating CRLF 
sequence) header line-length limit, with the continuation line starting 
with whitespace, and, in later additions, to encode non-ASCII characters 
not allowed by the original RFCs into allowed ASCII.  As a common example, 
the References header, which includes the message-IDs of the messages this 
one is a followup to for threading purposes, can often exceed 998 
characters, because the individual message-IDs can be quite long and in a 
deep thread there can be quite a number of them.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


_______________________________________________
Pan-users mailing list
https://lists.nongnu.org/mailman/listinfo/pan-users
