Daisy Flanders posted on Sat, 19 May 2018 21:53:57 +0000 as excerpted:

> I was wrong, the subject_lookup table is only a part of the problem.
> Articles are never really deleted and apparently never unref their
> members. An Article is created before any filtering rules are applied
> and subsequently removing it frees little if any memory.

First, please refrain from posting HTML, preferably at all but at least
to the pan list. Keep in mind that many regulars here not only accept
but fully approve of pan's treatment of HTML as plain text[1]. Plus,
many regulars (including me) actually use pan to read the list, via
news.gmane.org's list2news, for instance, and I suppose you know how
ugly pan makes your HTML posts look based on seeing what it does to
others.

As for the topic at hand...

Without getting into the code detail (as I'm /not/ a programmer, tho as
a gentooer that follows live-git for many packages including pan and
applies patches to some, again including pan, I can often follow
programmer discussion and even hack up my own patches from time to
time), I /can/ say (as a 1.5+ decade list regular) that pan's memory use
has been a historic problem.

It's certainly better than it used to be. I remember when a couple
hundred K articles/parts would bring pan to a crawl, assuming it wasn't
simply killed due to memory use. Then, after some code changes such as
combining string usage so for instance a common poster's name only
occurred once in memory and the rest were references to it... pan could
handle a couple million. Now you're saying you've seen it handle a
couple billion parts, millions of multipart articles. Quite a bit
better, even if what is considered "normal" memory (and even the size of
the pointer we use to address it, 32-bit to 64-bit) has also scaled in
that time.

But there have always been a couple things that have limited pan. One is
that it doesn't have a persistent threading implementation, so each time
it loads a group, it has to load all the articles and rebuild its
message threading model for the group "from scratch". Of necessity, this
takes quite some memory, and quite some time to load a group,
particularly when the cache is set high enough not to expire messages
out of cache right away.
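
To illustrate the string-combining idea mentioned above (one copy of a
common poster's name, with everything else referring to it), here is a
minimal Python sketch. It is not pan's actual C++ code: StringPool,
load_headers, make_raw and the example.invalid addresses are all
hypothetical names made up for the illustration.

```python
class StringPool:
    """Hypothetical pool that keeps one shared copy of each distinct string."""

    def __init__(self):
        self._pool = {}

    def intern(self, text):
        # Return the copy already stored if there is one; otherwise store
        # this string and return it.
        return self._pool.setdefault(text, text)

    def __len__(self):
        return len(self._pool)


def load_headers(raw_headers, pool):
    """Turn (author, message_id) pairs into headers with pooled authors."""
    # Message-ids are unique per article, so pooling them would gain nothing.
    return [(pool.intern(author), message_id)
            for author, message_id in raw_headers]


def make_raw(count, vary_author):
    """Build fake headers; assumption: one article per header, no multiparts."""
    raw = []
    for i in range(count):
        # Strings are built at runtime, so each starts as a separate object.
        author = "poster{}@example.invalid".format(i if vary_author else 0)
        raw.append((author, "<{}@example.invalid>".format(i)))
    return raw


same_pool = StringPool()
same = load_headers(make_raw(1000, vary_author=False), same_pool)

varied_pool = StringPool()
varied = load_headers(make_raw(1000, vary_author=True), varied_pool)

print(len(same_pool), len(varied_pool))  # prints: 1 1000
```

With one constant author the pool holds a single string no matter how
many headers are loaded. If the author varies on every post, the pool
grows with the header count and the sharing saves nothing.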
(I have multiple pan instances here, with most of my activity being on
my "text" instance as opposed to "binaries", and the text instance set
to not expire and with a large enough cache that it doesn't delete
messages there either. For groups such as this one as well as my old
and now defunct ISP-newsgroups, I have an entire archive, going back to
approximately the turn of the century. It hasn't been a problem since I
switched to SSDs, but back on spinning rust I /had/ to set pan to start
with X/KDE, so it would have everything finally loaded sometime later
when I clicked on it in the tray, and I wouldn't have to wait 5+ minutes
for it to load all those messages off of spinning rust so it could
thread them.)

I'm less clear on the details, but with pan's deletion of messages from
cache disconnected from its deletion of their listing in the
overview/header pane, it has to track that somehow as well, to avoid
messages deleted from the header pane coming back when the group is
reloaded.

From what you've posted, the behavior you're seeing could be related to
this latter bit.

But if this specific poster is triggering problems at 5 million
"headers" and 10 GiB RAM, while you've handled billions in the same
group before, something's different. Maybe it's the 1-part-each thing,
since that means pan won't be able to combine the title strings in its
representation like it can with proper multiparts, and the 3.5 million
individual uncombinable posts from that one poster are simply too much
for pan, even if it "deletes" them from the header pane, since it is
still tracking them.

But given that it's an apparently deliberate DoS attack on the group,
the attacker could be varying other headers as well. Is the name always
exactly the same, allowing pan to deduplicate it in memory, or is it
varying, foiling that deduplication and causing pan to use more memory
there too? And what about the message-ids?
Obviously they can't be duplicated, but I could imagine someone
deliberately attempting a DoS attack choosing extremely long message-ids
as well.

Anyway, while there may be small improvements that can be made to the
existing code, the existing non-persistent "load them all and rethread
them each time" method forces everything into memory, and there's
perhaps little chance for deletions to actually clear memory since some
of the detail may need to be retained for threading new replies, etc. I
believe at least the low-hanging fruit in that area has already been
picked, and (barring some immediate glaring bug) you won't get much
improvement there.

The real problem would seem to be that "load everything into memory for
rethreading each time" model, but while this has been discussed many
times over the years, even thru the pan rewrite from C to C++, the same
general "load everything into memory to thread, no persistence" model
continues to be used. Which means it's likely to be a rather huge
project, perhaps multi-year of reasonably intense development, to change
it, and to debug to reasonable stability whatever replaces it: a model
that persists some threading data over pan restarts, so we can load it
without rereading /everything/ and only have to actually read /new/
posts and fit them in.

(At one point before the C++ rewrite there was talk of using for
instance an sqlite database to handle it all, with what was in memory
effectively being a moving window covering only part of what was in the
database, but the rewrite did enough better with the dedup, etc, and I
suppose the switch to 64-bit and double-digit-gig memory for most has
helped too, that it really hasn't come up as a burning issue since.
And it was never stated, but I got the feeling that neither Charles Kerr
nor Heinrich Mueller were particularly familiar/comfortable with
database programming, and were significantly enough more comfortable
with text-based storage that they continued that, even thru Charles' C++
rewrite and Heinrich's major feature additions, which otherwise might
have been reasonable times to introduce and stabilize such major
changes. That would certainly go quite a way toward explaining why such
a thing was never implemented, at least in mainline pan.)

---
[1] Pan's treatment of HTML as plain-text: Actually, a few years ago
there was a discussion of GNKSA and whether pan should continue 100%
compliance, or not, given how "dated" it is considered to be by many.
The context was the GNKSA-mandated 4-connections-per-server limit (tho
there's a workaround in [2] below), but it surprised me how many were in
favor of holding the line at 100% compliance even including that,
fearing dropping the 100% support in just one area would eventually lead
to ignoring it in considered-to-be more important areas such as plain
text vs. HTML, quote/reply order, etc.

[2] The 4-connections-per-server limit: There's actually a non-GUI
bypass for this, allowing those who want to set more connections to do
so, while still satisfying GNKSA, since it says a complying client can't
set more than that, but does /not/ say it can't honor a higher setting
if a user bypasses the client's settings method to configure it
manually. The setting is found in servers.xml and is easily text-editor
modified... with pan closed, of course... and pan will honor a higher
setting if found there upon startup. Just be careful not to save any GUI
changes to the server config after that or it'll overwrite the setting
back to the 4 connections allowed by the GUI, and you'll have to
manually edit it back, once again.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman _______________________________________________ Pan-users mailing list Pan-users@nongnu.org https://lists.nongnu.org/mailman/listinfo/pan-users