Dominique Dumont via Pan-users posted on Wed, 20 Aug 2025 09:32:58 +0200
as excerpted:
> On Wednesday, 30 July 2025 03:24:42 Central European Summer Time Robin
> Laing wrote:
>
>> Groups I look at have a large number of articles. Due to obfuscation
>> usage, they get a very large number, in the 100's of thousands in a few
>> months. There are original posts as well.
>
> You can try to define scores and actions to remove these obfuscated
> posts:
> - define a score to ignore them
> - set an action (in preferences) to
> delete article scoring at -9999 or less
Good idea if that's possible, which it may not be.
I haven't done regular binary downloads in years so I'm honestly not sure
what modern "obfuscation" looks like (as compared to say, all the spam
garbage I used to try to ignore&delete back in the day, with some methods
easier to manage than others), but upon reading the message I had
interpreted "obfuscation" as deliberately confounding automated
"censorship" methods. If that's the case it may make pan's ignore&kill
methods, or indeed, pretty much any such automated handling, difficult or
impossible.
As an example there was a troll that randomized their name. Of course
that kills any chance at ignoring by author, and of course subject, etc,
wasn't stable either. Now I /suspected/ that some other header in their
posts remained stable enough to score against, but at that time pan could
only score on normal overview header content (see discussion below) so
headers like NNTP-Posting-Host, provider, path, even user agent, weren't
available to score on.
I argued that despite the downsides of post-download-to-cache score/
filtering (again, see below), having it available was still better than
having to deal with such posts manually, and IIRC, that was one of the
changes Heinrich made, so I *think* entire post content (including both
non-overview headers and full body) can be scored on now.
> For instance here's a filter I use:
>
> %BOS
> %Score created by Pan on Sat Feb 4 19:22:42 2023
> [alt.binaries.*]
> Score:: -9999
> Subject: [0-9a-f-]{15,}
> %EOS
If I'm not mistaken (it has been awhile...), what that does in English is:
For any subject containing a run of 15 or more hex-digit (0-9, a-f, plus
-) characters (so basically long hash- or UUID-like strings without
spaces), score -9999 ("ignore" level in pan-score-speak).
That's a reasonably good one. And it's on subject so available in the
overviews.
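To sanity-check what that pattern catches, here's a minimal Python sketch.
It is not pan's code: the sample subjects are made up, and I'm assuming an
unanchored, case-insensitive match, which is what I believe a "Subject:"
line in the scorefile does.

```python
import re

# Same pattern as the scorefile line above. IGNORECASE is an assumption
# about how pan treats a "Subject:" test; drop it if that turns out wrong.
OBFUSCATED = re.compile(r"[0-9a-f-]{15,}", re.IGNORECASE)


def looks_obfuscated(subject: str) -> bool:
    """True if the subject contains 15+ consecutive hex digits or hyphens."""
    return OBFUSCATED.search(subject) is not None


# Made-up sample subjects, for illustration only.
samples = [
    "3f9a1c0e-77b2-4d1a-9c3e-5a6b7c8d9e0f [01/42]",  # UUID-like run
    "Holiday pictures 2025 (1/3)",                    # no long hex run
    "deadbeefcafe12",                                 # only 14 characters
]
results = [looks_obfuscated(s) for s in samples]
```

Only the first sample has a long enough run, so an ordinary subject that
happens to contain a few hex-looking letters or a year is left alone.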
> That said, I do not know if pan removes the articles when downloading
> headers (in which case you should be fine) or if it applies scores and
> actions after loading all headers (in which case, you may face similar
> memory usage and crash)
I /think/ pan scores twice now. The first pass is on receiving overviews,
but I'm not sure whether it scores (and applies rules based on scores)
as-it-goes, which is desirable in this case as it would eliminate the
memory usage of deleted messages, or only after it's done with initial
processing, which wouldn't. The second is a rescore after caching, when
all headers and message content are available (which obviously uses full
memory).
For anyone a bit lost by the distinction between overviews and headers,
here's my attempt at putting the RFC[1] tech-speak in plain English. Note
that while I believe the following to be correct as far as it goes, I'm
glossing over the details both for brevity and because in some cases I may
not understand them myself.
In RFC-speak, "headers" are the first lines of a message, separated into
individual headers by (unescaped[2]) CRLF (carriage-return, line feed)
characters, and from the message body by the first (unescaped) blank line
of the message -- that is, two CRLF sequences in a row.
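As a rough illustration of that split, here's a small Python sketch. It's
my own simplification, not anything from pan: it treats a line starting
with a space or tab as a wrapped continuation of the previous header (see
[2]), keeps only the last copy of a repeated header, and runs on a made-up
message.

```python
def split_message(raw: str) -> tuple[dict[str, str], str]:
    """Split a raw article into a header dict and the body text."""
    # The first blank line (two CRLFs in a row) ends the header block.
    head, _, body = raw.partition("\r\n\r\n")
    headers: dict[str, str] = {}
    name = None
    for line in head.split("\r\n"):
        if line[:1] in (" ", "\t") and name is not None:
            # Wrapped line: it continues the previous header's value.
            headers[name] += " " + line.strip()
        else:
            name, _, value = line.partition(":")
            headers[name] = value.strip()
    return headers, body


# Made-up message, for illustration only.
raw = (
    "Subject: test post\r\n"
    "References: <a@example.invalid>\r\n"
    "\t<b@example.invalid>\r\n"
    "\r\n"
    "Body line one.\r\n"
    "\r\n"
    "Body line two.\r\n"
)
headers, body = split_message(raw)
```

Note that only the first blank line counts: the blank line between the
two body lines stays part of the body.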
The most common headers include Subject, From (the author), Date, message
length (in lines, bytes, or both), content (body) format, etc. Less
common headers, often displayed only if the "raw" message is shown,
include NNTP-Posting-Host (commonly the IP posted from; cheap or ISP
providers often leave it unencoded, while "dedicated" news providers,
where news is the specific service the end-user customer pays for, usually
encrypt it so only they can track the poster), service provider, the path
the message took from the poster's news provider to the downloader's
(actually common but not as critical as subject, author...), etc.
You will note that some of these headers are critical to the decision of
whether to download the message (at least to local cache, whether then
saved or not), or not, as well as to threading display. Specifically,
Subject, Author, Date, Message-ID (used to request a specific message),
XRef (per-group and per-server sequential message number, used by the
client to track read messages, etc), References (used for threading), and
size (in lines or bytes), are quite critical to either pre-whole-message-
download display in the "headers" pane, or to ID of the message itself.
NNTP has what's called an "overview" that lets the client fetch these
critical headers for display before downloading the entire message to
local cache. But this overview doesn't include less critical headers or
(obviously) the entire message (body).
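For the curious, an overview record is just one tab-separated line per
article. As far as I understand the NNTP RFC, the default fields come in a
fixed order: article number, Subject, From, Date, Message-ID, References,
byte count, line count, optionally followed by extras such as Xref. A
minimal Python sketch of picking one apart follows; the sample line and
the names are mine, not pan's.

```python
from typing import NamedTuple


class Overview(NamedTuple):
    number: int
    subject: str
    author: str        # the From header
    date: str
    message_id: str
    references: str
    byte_count: int
    line_count: int
    extras: list[str]  # optional trailing fields, e.g. "Xref: ..."


def parse_overview(line: str) -> Overview:
    """Parse one tab-separated overview line into named fields."""
    f = line.rstrip("\r\n").split("\t")
    return Overview(int(f[0]), f[1], f[2], f[3], f[4], f[5],
                    int(f[6] or 0), int(f[7] or 0), f[8:])


# Made-up overview line, for illustration only.
line = "\t".join([
    "1234",
    "Re: pan scoring",
    "Someone <someone@example.invalid>",
    "Wed, 20 Aug 2025 09:32:58 +0200",
    "<msg2@example.invalid>",
    "<msg1@example.invalid>",
    "2048",
    "40",
    "Xref: news.example.invalid example.group:1234",
])
ov = parse_overview(line)
```

Everything a scorefile rule can test before download has to come out of a
record like this, which is why a header such as NNTP-Posting-Host is out
of reach at this stage.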
Originally pan could score only on the overviews, and thus only on the
critical headers included in the overviews. When scoring on overviews is
possible, it's by-far preferred since it allows one to "ignore" (negative
side, or "watch", on the positive side) messages before the full message
is downloaded to local cache, thereby saving on download traffic and time.
Pan's rules/actions then allow one to act on the score, say auto-
downloading "watched" (+9999 or above) messages, auto-deleting "ignored"
(-9999 or below) messages, perhaps auto-marking-as-read but not deleting
anything scored below zero but not yet reaching -9999/ignored, etc, and
it's obvious why doing so without actually downloading (to cache) the
message is preferred, when possible.
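Put as a tiny Python sketch, the example setup just described looks like
this. The thresholds are the ones mentioned above; the action names and
this particular mapping are only the example configuration, not pan's
defaults or its actual code.

```python
WATCHED = 9999    # "watched" threshold, per the text above
IGNORED = -9999   # "ignored" threshold, per the text above


def action_for(score: int) -> str:
    """Map a score to the action chosen in the example configuration."""
    if score >= WATCHED:
        return "download"    # auto-download watched messages
    if score <= IGNORED:
        return "delete"      # auto-delete ignored messages
    if score < 0:
        return "mark-read"   # negative, but not yet at ignore level
    return "none"            # zero or mildly positive: leave alone


# Scores chosen to hit each branch.
actions = [action_for(s) for s in (9999, -9999, -50, 0)]
```

When the score comes from overview fields alone, the "delete" branch costs
nothing beyond the overview fetch itself.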
But what about that troll randomizing his Author headers I mentioned
above? Messages such as that are often impossible to score-ignore and
delete from the overviews alone, but once the message is downloaded to
cache, there's all the non-overview headers (and indeed, the body itself)
available to score against if desired.
Obviously this isn't the greatest situation to be forced into since you
have to download the message to cache in order to score on the full
message (both body and non-overview headers), but I still think it's a win
if pan can score and process them automatically, even if post-download,
thereby allowing the human user to avoid having to do so manually.
And as I said above, I /think/ non-overview (re)scoring is one of the
features Heinrich added to pan, tho one has to directly edit the score
file to do it (pan's GUI only allows scoring on overview headers). Tho
FWIW, that was long after I needed the feature to deal with that author-
randomizing troll, and after I quit doing regular binary downloads as
well, so I don't believe I ever used the feature personally.
Now to bring this back to topic...
Obviously post-cache (re)scoring, even if possible (which if I'm correct
it now is), besides still requiring the download-to-cache, won't help much
with the memory issue, as that's well after pan has incurred the memory
cost. But (I contend anyway) it's still useful if it allows avoiding
having to manually deal with posts you'd prefer not to.
Meanwhile, as DD pointed out, the overview-level scoring could be either
as-pan-first-sees-it, thereby eliminating the memory overhead, or later,
after it has constructed the full overview picture in memory, thereby
incurring the memory cost regardless. As I'm not a coder, and didn't pick
that up from watching the pan lists over the years, I don't know which it
is either. He (or some other coder who can read the actual source more
easily than I can) would be in a better position to find out.
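To make the two possibilities concrete, here's a toy Python sketch. It
says nothing about which one pan actually does; the function names and the
fake "overviews" (plain strings) are invented for the illustration, and
"peak" is simply the largest number of entries held at once.

```python
def batch_filter(overviews, is_ignored):
    """Load everything first, then drop ignored entries."""
    held = list(overviews)      # every entry is in memory at this point
    peak = len(held)
    kept = [o for o in held if not is_ignored(o)]
    return kept, peak


def streaming_filter(overviews, is_ignored):
    """Score each entry as it arrives; ignored ones are never stored."""
    kept = []
    for o in overviews:
        if not is_ignored(o):
            kept.append(o)
    return kept, len(kept)      # never holds more than what it keeps


def fake_overviews():
    # 1000 made-up entries, of which only every 100th is a real post.
    return (f"real post {i}" if i % 100 == 0 else f"junk-{i}"
            for i in range(1000))


def is_junk(subject):
    return subject.startswith("junk-")


batch_kept, batch_peak = batch_filter(fake_overviews(), is_junk)
stream_kept, stream_peak = streaming_filter(fake_overviews(), is_junk)
```

Both end up keeping the same 10 entries, but the first holds all 1000 at
its peak while the second never holds more than the 10 it keeps. With
hundreds of thousands of obfuscated posts, that's the difference that
matters here.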
Long story short, be aware that even if it's possible to successfully
score and automate deletion of a post, that doesn't mean it's not going to
incur that memory cost. If it's possible to score on overview-included
headers such as subject and author, it'll depend on when pan actually does
that processing, before or after it has incurred the full memory cost. If
scoring on non-overview headers (or full content such as signature) is
necessary, then even if it works (which I believe it should now, tho as
mentioned, only by directly editing the scorefile as it's not in the pan
GUI), it's going to incur full memory cost regardless -- there's simply no
way around it.
---
[1] The RFCs, "Requests for Comments", are the specifications that
standardize internet formats and the like so multiple implementations can
properly interoperate with one another. The RFCs specifically for net-
news (NNTP) reference the ones for a common internet message format
originally specified in the email RFC context but also used for news.
[2] Escapes are used to allow wrapping ("folding" in RFC-speak) of header
lines that would otherwise exceed the 998 character (1000 minus
terminating CRLF sequence) header line-length limit, and, in later
additions, to encode non-ASCII characters not allowed by the original RFCs
into allowed ASCII. As a common example, the References header, which
includes the message-IDs of the messages this one is a followup to for
threading purposes, can often exceed 998 characters, because the
individual message-IDs can be quite long and in a deep thread there can be
quite a number of them.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman