Re: [Pan-users] memory usage in fake multi-part posts

Duncan Tue, 22 May 2018 00:07:04 -0700

Daisy Flanders posted on Sat, 19 May 2018 21:53:57 +0000 as excerpted:

>  I was wrong, the subject_lookup table is only a part of the problem.
>  Articles are never really deleted and apparently never unref their
>  members. An Article is created before any filtering rules are applied
>  and subsequently removing it frees little if any memory.


First, please refrain from posting HTML, preferably at all but at least 
to the pan list.  Keep in mind that many regulars here not only accept 
but fully approve of pan's treatment of HTML as plain text[1].  Plus, 
many regulars (including me) actually use pan to read the list, via 
news.gmane.org's list2news, for instance, and I suppose you know how 
ugly pan makes your HTML posts look based on seeing what it does to 
others.


As for the topic at hand...  

Without getting into the code detail (as I'm /not/ a programmer, tho as a 
gentooer that follows live-git for many packages including pan and 
applies patches to some, again including pan, I can often follow 
programmer discussion and even hack up my own patches from time to time), 
I /can/ say (as a 1.5+ decade list regular) that pan's memory use has 
been a historic problem.

It's certainly better than it used to be.  I remember when a couple 
hundred K articles/parts would bring pan to a crawl, assuming it wasn't 
simply killed due to memory use.  Then, after some code changes such as 
combining string usage so for instance a common poster's name only 
occurred once in memory and the rest were references to it... pan could 
handle a couple million.
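
As a rough illustration of that string-combining idea (a sketch only, not 
pan's actual code, which is C++; the StringPool class and the example 
address are made up for this illustration): every header asks a pool for 
its author string, and the pool hands back one shared object for all 
equal strings.

```python
class StringPool:
    """Keep one shared copy of each distinct string (hypothetical helper)."""

    def __init__(self):
        self._pool = {}

    def intern(self, value):
        # setdefault returns the already-stored object when an equal string
        # is present, otherwise it stores and returns this one.
        return self._pool.setdefault(value, value)

    def __len__(self):
        return len(self._pool)


def parse_author():
    # Built at runtime so that each call yields a separate string object,
    # as it would when parsed out of separate headers.
    return "".join(["Common Poster <poster", "@example.invalid>"])


pool = StringPool()

# 1000 headers from the same poster share a single stored string.
authors = [pool.intern(parse_author()) for _ in range(1000)]
same_poster_entries = len(pool)

# If the name varies per post, nothing can be shared and the pool grows.
varied = [pool.intern(f"Poster {i} <p{i}@example.invalid>") for i in range(1000)]
total_entries = len(pool)
```

Here same_poster_entries is 1 while total_entries is 1001, which is the 
difference between a poster whose name can be deduplicated and one whose 
name changes on every post.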

Now you're saying you've seen it handle a couple billion parts, millions 
of multipart articles.  Quite a bit better, even if what is considered 
"normal" memory (and even the size of the pointer we use to address it, 
32-bit to 64-bit) has also scaled in that time.

But there have always been a couple things that have limited pan.  One is 
that it doesn't have a persistent threading implementation, so each time 
it loads a group, it has to load all the articles and rebuild its message 
threading model for the group "from scratch".  Of necessity, this takes 
quite some memory, and quite some time to load a group, particularly when 
the cache is set high enough not to expire messages out of cache right 
away.
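
To make the "from scratch" cost concrete, here is a minimal sketch of 
rebuilding a thread tree from headers alone.  It assumes each header 
carries a message-id and a References list (oldest ancestor first); the 
build_threads function and the sample ids are hypothetical, and pan's real 
threading code is not shown in this thread.

```python
from collections import defaultdict


def build_threads(headers):
    """headers: list of (message_id, references) pairs, where references
    lists ancestor message-ids, oldest first.
    Returns (roots, children) for the whole group."""
    # Every header has to be in memory before any parent can be resolved.
    known = {mid for mid, _ in headers}
    children = defaultdict(list)
    roots = []
    for mid, refs in headers:
        # Attach to the nearest ancestor that is actually present.
        parent = next((r for r in reversed(refs) if r in known), None)
        if parent is None:
            roots.append(mid)
        else:
            children[parent].append(mid)
    return roots, dict(children)


headers = [
    ("<a@x>", []),
    ("<b@x>", ["<a@x>"]),
    ("<c@x>", ["<a@x>", "<b@x>"]),
    ("<d@x>", ["<missing@x>"]),  # parent not available, so it becomes a root
]
roots, children = build_threads(headers)
```

Nothing in this sketch survives a restart: the known set and the children 
map are rebuilt over every header each time the group is opened.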

(I have multiple pan instances here, with most of my activity being on my 
"text" instance as opposed to "binaries", and the text instance set to 
not expire and with a large enough cache that it doesn't delete messages 
there either.  For groups such as this one as well as my old and now 
defunct ISP-newsgroups, I have an entire archive, going back to 
approximately the turn of the century.  It hasn't been a problem since I 
switched to SSDs, but back on spinning rust I /had/ to set pan to start 
with X/KDE, so it would have everything finally loaded sometime later 
when I clicked on it in the tray, and I wouldn't have to wait 5+ minutes 
for it to load all those messages off of spinning rust so it could thread 
them.)

I'm less clear on the details but with pan's deletion of messages from 
cache disconnected from its deletion of their listing in the overview/
header pane, it has to track that somehow as well, to avoid messages 
deleted from the header pane coming back when the group is reloaded.

From what you've posted, the behavior you're seeing could be related to 
this latter bit.

But if this specific poster is triggering problems at 5 million "headers" 
and 10 GiB RAM, while you've handled billions in the same group before, 
something's different.

Maybe it's the 1-part-each thing, since that means pan won't be able to 
combine the title strings in its representation like it can with proper 
multiparts, and the 3.5 million individual uncombinable posts from that 
one poster are simply too much for pan, even if it "deletes" them from 
the headers pane, since it is still tracking them.
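
A small sketch of why that might matter, under the assumption that parts 
are grouped by their subject with the "(n/m)" counter stripped off (the 
regular expression, the subject_key function and the sample subjects are 
all invented for illustration; pan's real matching rules may differ):

```python
import re

# Assumed convention: a trailing "(n/m)" or "[n/m]" part counter.
PART_RE = re.compile(r"\s*[\(\[](\d+)/(\d+)[\)\]]\s*$")


def subject_key(subject):
    """Strip the trailing part counter so all parts share one key."""
    return PART_RE.sub("", subject)


# A proper 50-part post: every part reduces to the same key.
real = [f"big.iso ({i}/50)" for i in range(1, 51)]

# 50 fake "multiparts" of one part each, all with different subjects.
fake = [f"junk-{i:07d}.bin (1/1)" for i in range(50)]

real_keys = {subject_key(s) for s in real}
fake_keys = {subject_key(s) for s in fake}
```

The 50 real parts collapse to a single key, while the 50 fake ones need 
50 separate keys, so a flood of them grows any such table linearly.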

But given that it's an apparently deliberate DoS attack on the group, the 
attacker could be varying other headers as well.  Is the name always 
exactly the same, allowing pan to deduplicate it in memory, or is it 
varying, foiling that deduplication and causing pan to use more memory 
there too?

And what about the message-ids?  Obviously they can't be duplicated, but 
I could imagine someone deliberately attempting a DoS attack choosing 
extremely long message-ids as well.


Anyway, while there may be small improvements that can be made to the 
existing code, with the existing non-persistent "load them all and 
rethread them each time" method forcing everything into memory, and 
perhaps little chance for deletions to actually clear memory since some 
of the detail may need to be retained for threading new replies, etc, I 
believe at least the low-hanging-fruit in that area has already been 
picked, and (barring some immediate glaring bug) you won't get much 
improvement there.

The real problem would seem to be that "load everything into memory and 
rethread each time" model, but while this has been discussed many times 
over the years, even thru the pan rewrite from C to C++, the same general 
"load everything into memory to thread, no persistence" model continues 
to be used.

Which means it's likely to be a rather huge project, perhaps multiple 
years of reasonably intense development, to change it and to debug 
whatever replaces it to reasonable stability.  The replacement would need 
to persist some threading data over pan restarts, so we can load it 
without rereading /everything/ and only have to actually read /new/ posts 
and fit them into the model.

(At one point before the C++ rewrite there was talk of using, for instance, 
an sqlite database to handle it all, with what was in memory effectively 
being a moving window covering only part of what was in the database, but 
the rewrite did enough better with the dedup, etc, and I suppose the 
switch to 64-bit and double-digit-gig memory for most has helped too, 
that it really hasn't come up as a burning issue since.  And it was never 
stated, but I got the feeling that neither Charles Kerr nor Heinrich 
Mueller were particularly familiar/comfortable with database programming, 
and were enough more comfortable with text-based storage that they 
continued with it, even thru Charles' C++ rewrite and Heinrich's major 
feature additions, which otherwise might have been reasonable times to 
introduce and stabilize such major changes.  That would certainly go 
quite a way toward explaining why such a thing was never implemented, at 
least in mainline pan.)
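
For what that "moving window over a database" idea could look like, here 
is a minimal sketch using sqlite.  The table layout, the function names 
and the sample data are all assumptions made for illustration; nothing 
like this exists in mainline pan as far as this thread says.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a hypothetical header store."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        " number INTEGER PRIMARY KEY,"
        " message_id TEXT UNIQUE NOT NULL,"
        " parent_id TEXT,"  # threading result, persisted across restarts
        " subject TEXT NOT NULL)"
    )
    return db


def add_article(db, number, message_id, parent_id, subject):
    # Only new posts need to be read and fitted in; old rows stay as-is.
    db.execute(
        "INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?)",
        (number, message_id, parent_id, subject),
    )


def load_window(db, first, count):
    """Fetch only the slice of headers that needs to be in memory."""
    cur = db.execute(
        "SELECT number, message_id, parent_id, subject FROM articles"
        " WHERE number >= ? ORDER BY number LIMIT ?",
        (first, count),
    )
    return cur.fetchall()


db = open_store()
for n in range(1, 1001):
    parent = None if n == 1 else f"<{n - 1}@example.invalid>"
    add_article(db, n, f"<{n}@example.invalid>", parent, f"post {n}")
db.commit()

# Only 20 of the 1000 stored headers are pulled into memory.
window = load_window(db, first=500, count=20)
```

With a file path instead of ":memory:", the parent_id column would carry 
the threading over a restart, and only the requested window would be 
loaded at any one time.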

---
[1] Pan's treatment of HTML as plain-text:  Actually, a few years ago 
there was a discussion of GNKSA and whether pan should continue 100% 
compliance, or not, given how "dated" it is considered to be by many.  
The context was the GNKSA-mandated 4-connections-per-server limit (tho 
there's a workaround in [2] below), but it surprised me how many were in 
favor of holding the line at 100% compliance even including that, fearing 
dropping the 100% support in just one area would eventually lead to 
ignoring it in considered-to-be more important areas such as plain text 
vs. HTML, quote/reply order, etc.

[2] The 4-connections-per-server limit:  There's actually a non-GUI 
bypass for this, allowing those who want to set more connections to do 
so, while still satisfying GNKSA, since it says a complying client can't 
set more than that, but does /not/ say it can't honor a higher setting if 
a user bypasses the client's settings method to configure it manually.  
The setting is found in servers.xml and is easily text-editor modified... 
with pan closed, of course... and pan will honor a higher setting if 
found there upon startup.  Just be careful not to save any GUI changes to 
the server config after that or it'll overwrite the setting back to the 4 
connections allowed by the GUI, and you'll have to manually edit it back, 
once again.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


_______________________________________________
Pan-users mailing list
Pan-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/pan-users
