Ron Johnson <ron.l.john...@cox.net> posted 4a4ec4e4.4010...@cox.net, excerpted below, on Fri, 03 Jul 2009 21:56:36 -0500:

> Also (and maybe because I'm a DBA), this problem just *screams* for
> SQLite and a database in the "First Normal Form".

[ OK, this is a very long post, I know (tho I haven't counted the lines, 200? 250? More? I'll let pan show me that when I post and download it). But reading it and following even a few of the included tips should vastly improve your pan experience. =:^) Following all of them... well, that's up to you, but it works well for me! ]

Actually, before the C++ rewrite (the original was coded in C) and the changes that allowed pan to scale from 100k to millions of headers/overviews per group, Charles' plan was, for quite some time, to eventually switch to just that, an sqlite backend. I don't know why he didn't, except that in the 3-ish years during which pan seemed to be abandoned (we later learned he used at least part of that time to do the rewrite), several others (K. Haley I believe being one of them) began to experiment with pan, and some of those folks were database folks (I'm not sure if K. Haley is one of /them/). By the time Charles announced the C++ rewrite (aka new-pan, what we use now), there had actually been some preliminary numbers posted to the pan-devel list. I think that by using some of the data management techniques that Charles /did/ use in new-pan, he actually got it to scale "reasonably", and the numbers probably looked close enough to the preliminary database numbers that he judged a switch not worth the trouble, given the clear benefit of plain text files.

For comparison: now, pan /does/ work when you throw even several million headers at it, with memory use scaling accordingly. Before, 100k headers was bad, and above 200k pan would literally sit there for days, not really increasing memory usage too badly, but just not getting anywhere. It simply didn't scale at all above 200k headers or so, memory or no memory.
But, meanwhile, for those dealing with those huge groups, there are some usage patterns that work rather better than others, and thus some usage patterns that users should avoid in the large groups, if they want a reasonably working pan.

#1 most important, particularly since pan is a GNOME family app and as many Ubuntu users can attest, PAN AND THE GNOME ASSISTIVE TECHNOLOGIES APPLET DO NOT GET ALONG WELL AT ALL!!! When that applet is running, it apparently polls /something/ often enough to keep pan from making efficient progress, at header sorting in particular. What might otherwise take 30 seconds or maybe two minutes (still long enough) ends up taking half an hour... two hours... more... So if you're running that, do yourself a favor and at LEAST shut it off when running pan. Either that, or switch to something other than pan, as the two simply don't get along. For more details, see the list archives.

With that out of the way...

The key to working /with/ pan on large groups, rather than fighting it and thus making the problem worse, is to understand what takes it the most time, and do what you can, including changing your behavior where necessary, to minimize that issue. (Yeah, I know, but the alternatives are to simply wait pan out, as at least it /does/ continue making progress now, unlike with the old code, or to switch to something other than pan. If you'd rather do the latter than change your behavior while using pan, well, there /are/ other solutions out there, tho none quite like pan.)

So here's a bit of a peek behind the curtain, explaining in plain English a bit of what pan's actually doing...
What takes pan the longest time (and uses the most memory too, I believe) isn't the actual downloading of either messages or headers/overviews, but sorting those overviews/headers, plugging new ones in at the correct location in the thread or multipart message as necessary, doing the subject and author string manipulations that help it keep a reasonable handle on memory, etc. It saves that threaded list on exit (of the group or pan itself), so it doesn't have to rethread existing overviews when it comes back to them. But when it starts up, once that list gets above a certain size, it still does enough disk churning verifying the list, and checking what's in cache so the little cached icon displays correctly, that it gets painful on a cold cache. (Once the data's all in cache, unless it's flushed, pan starts up quite fast.)

Here's how I know the effect of that. I take advantage of the fact that pan checks the PAN_HOME environment variable when it starts, to see where its config is, to run several separate pan instances, each with its own config. (It defaults to ~/.pan2 if the variable isn't set. I'm not sure what it does if the variable is set but its contents aren't a sane path.) On my text group instance, I set no-expire on the overviews/headers, and expanded the cache (the setting for that isn't in the GUI, so it's a direct config file edit, preferences.xml) from the default 10 MB to a couple gig, so I could save a decent history. I have posts going back a couple years in several groups, and on some of the gmane list2news list archive groups, I have the entire group history as it appears at gmane. Thus, I have quite a number of overviews/headers archived, but (for my text instance) they're all text groups, so it's only... half a gig or so of actual message cache.
Loading that text instance of pan, cold cache, takes probably a couple or three minutes of disk thrashing -- and that's on a 4-spindle RAID-6, so it goes MUCH faster than it would on a typical machine with a single-spindle pan storage dir. Of course, as I continue to accumulate message overviews and history, that load time continues to increase. =:^( But once pan is loaded and thus the cache hot, I can quit pan and restart close enough to instantly that I don't notice the delay.

As a result, and here's tip #2, I load pan (the text instance) with my KDE session and keep it running more or less constantly, as long as I'm in X. I have 8 gigs RAM, so it's no big deal there, and if I do something that flushes cache, with pan running, I don't lose all of it, at least, so while it might take a few extra seconds to start up, it's not like it is from a cold cache.

So tip #2 is, if your header/overview list and cache are large enough that the pan start time is getting uncomfortably long, consider starting it with your desktop session, letting it load while you do other things. Then it'll be loaded when you get to it. Even if you then quit pan, as long as it hasn't been quit for too long and the cache flushed, it'll restart far faster, since most of that data will still be in cache. But it's generally far more effective to keep pan running while doing anything disk cache intensive than it is to quit pan and restart it afterward. This is because pan doesn't take so much memory once all that data is loaded -- it's the loading from disk that's a pain.

It should be noted that a good portion of this time, however, would be avoidable, if I (1) hadn't fiddled with the default 10 MB cache, and (2) had the overview/header expiry set to something more "reasonable".

That's tip #3, then. There's a tradeoff between saved headers/overviews (and to a lesser degree message cache, but even with a default 10 MB message cache, loading the headers from cold-disk-cache takes time) and from-disk load time.
For binaries especially, once you've processed them, you don't tend to need the headers any more, so I STRONGLY recommend a reasonably short expiry, and for even more effective control of the problem, DELETING MESSAGES (not simply letting them mark-as-read and expire naturally) AS YOU ARE DONE WITH THEM.

Of course, as I said, that's really more workable with binary groups than with text, as often, you want to keep text around for awhile. But you can still set the expiry as short as you can reasonably manage for text groups, which should be all it'll affect on general purpose text/binary instances if you use the delete-binaries-immediately-when-done rule, because they should already be deleted by the time the expiry comes round.

This #3 is in fact probably the most critical (other than #1) for active binary users, especially on servers such as Giganews, with such high retention. If you start actively deleting headers/overviews for binaries when you are done, and set expiry (which will now affect text-only, since you've deleted the binaries) as short as possible, say two weeks, you WILL notice a difference!

Here, we're talking startup time, but as we'll see, it affects overview/header update time as well.

OK, time to explain a bit more about pan's processing. Once it has an existing list of threaded messages, when it updates headers/overviews, it takes a bit of time to plug the new ones into the appropriate place in the existing list. Obviously, the larger the existing list and the more new ones that came in with the update, the longer this sorting process is going to take.

That's where tip #3 affects update as well. If your existing header/overview list is shorter, because you manually deleted the ones you were done with, pan's processing time will be shorter as well.
Thus, it does NOT pay to keep a list of already processed binary group headers/overviews around between sessions (incompletes that you're waiting for completion of being an exception), as that just complicates pan's job, making it take longer to do that processing than it has to. Again, delete messages (headers/overviews) in the binary groups as soon as you're done saving off the binaries and otherwise processing them. It makes a HUGE difference!

Tip #4. For high volume binary groups, or on high retention servers, for ALL binary groups, when you first browse them, DO NOT DOWNLOAD ALL HEADERS/OVERVIEWS AT ONCE. Unfortunately, pan has a get-the-latest-N days/number-of-headers option, but not a get-the-oldest-N option. Thus, if you're wanting to go back quite some time, get the N latest, process what you can (thus in accord with tip #3, deleting the ones you're finished with), then get the next N latest, process them (again deleting what you're finished with), until you've gone back as far as you wish or hit the retention limit.

Like #3, the reason here is simple. Keep the number of overviews pan has to deal with at one time to something reasonable.

Tip #5. The implications of #3 and #4 should be clear enough. Don't let unseen messages in a binary group build up unnecessarily between visits. Just because the server you use has the retention to let you visit a busy binary group every couple of weeks, doesn't mean you're going to be making it easy on pan -- and thus on yourself waiting for pan -- if you wait two weeks between visits. Every day is nice, tho of course there will be days you'll be doing other stuff and don't get to it. But for the busy groups, do at least try to get to them twice a week or so, and if they're indeed that busy, expect a bit of extra trouble if you're waiting even that long.
It follows then, that if it has been awhile since you visited a group, and you know it's a busy group, you may find the incremental approach of tip #4 useful to avoid having pan take such huge bites at once.

Tip #6. This one isn't directly related to the above or to this problem, but it's generally useful and helps with this problem. It's simple enough. Remember that changing groups triggers pan's save-group-state functionality. So does quitting and restarting pan, but that takes longer and is more hassle. Thus, when processing large groups, either text or binary, it can be wise to periodically switch to a different group and back, just so pan saves the state of where you were. Then if pan or the system crashes for some reason, you'll only lose track of the read and deleted messages back to the last time you switched groups. When you're processing thousands of overviews, having pan or the system crash and lose state on a couple thousand overviews worth of work isn't fun, so avoid it, by switching out of and back into the group every 200-500 overviews worth (numbers that seem to work well for me).

As you'll note, I mentioned that pan loses delete state. When you delete a message, it deletes the message itself in cache immediately, but again, doesn't update the group state until you switch groups. If you crash before that, the overviews/headers will show up again (but without the cached messages) as undeleted and probably unread (unless you'd read them, switched out of the group and back, then deleted them, in which case they'll show up as read since pan had that state stored when you switched out and back in).

Tip #7 follows both from #6, and as a consequence of #3-5. Turn off pan's get-new-headers-when options (under preferences, behavior tab, groups). In particular, you don't want it auto-fetching new headers/overviews when you switched out and back into a group just so pan would update its disk-saved state.
However, I've also found that it works better if you let pan start up, then switch to a group and manually get new headers, then when pan's finished with that, switch to the next group... etc. Again, don't give pan too many things to do at once and it works better. (Fortunately for those of us using them, it does seem to cope reasonably well with multiple servers, since it keeps only one common threaded list, not one per server.)

Tip #8. Again, this is a general pan tip. Don't use the mark-entire-group-read functionality, either in preferences (when leaving group, when exiting pan) or manually. Due to the way modern servers work (new posts can come in numbered below the group's sequential high-water-mark), this is broken on many of them and you'll miss posts as a result. It seems to have other somewhat unpredictable but generally undesired effects as well. Just don't use it, and you avoid them.

Instead, when you are done with a group, you can select-all (headers), and use mark read, or delete (tip #3 again), on them.

One caveat with this has to do with ignored and otherwise view-filtered posts. Since they're not displayed, select-all won't select them, and they won't be marked as read or deleted. For that reason, I keep all the match scores options enabled in the view, header pane submenu, and depend on the color-coding in the scores column to alert me to the score, including ignored. For groups with many ignored messages, however, it may be easier to either leave the match ignored option off until the end, or to sort by score (unthreading if necessary) and deal with them first.

Tip #9. This one helps to counteract the negative effects of tips #4 and #5. You can use pan's command-line options to tell it to fetch headers and quit:

    pan headers:group.name

(That's as revealed in the help text, pan --help. I don't actually use this one myself, and you may need --no-gui too.)
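As a rough illustration of tip #9, here is a minimal Python sketch that builds the headers:group.name command for each group and runs the fetches one after another, in keeping with the don't-give-pan-too-much-at-once advice. Only the headers:group.name form and the possible need for --no-gui come from the text above. The function names, the example group names and the position of --no-gui on the command line are assumptions made for the example, and it has not been checked against any particular pan version.

```python
import os
import subprocess


def header_fetch_command(group, no_gui=False):
    """Build the argument list for one 'pan headers:<group>' run."""
    cmd = ["pan"]
    if no_gui:
        # The post only says --no-gui "may" be needed; its position is assumed.
        cmd.append("--no-gui")
    cmd.append(f"headers:{group}")
    return cmd


def fetch_headers(groups, pan_home=None, no_gui=False):
    """Fetch headers for each group in turn, one pan run per group."""
    env = dict(os.environ)
    if pan_home is not None:
        # Point pan at a specific instance's data/config dir.
        env["PAN_HOME"] = pan_home
    for group in groups:
        # check=False: a failure on one group should not stop the rest.
        subprocess.run(header_fetch_command(group, no_gui), env=env, check=False)
```

Calling fetch_headers(["example.group.one", "example.group.two"]) (hypothetical group names) would run pan once per group and return when the last run exits.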
You can then create a script that fetches all the headers from all your groups for you, and use cron or another scheduler to run it periodically, say every hour or two, or just once, say an hour before you get home from work. While that'll accumulate headers, to some extent negating the previous tips, it'll be automated and you won't have to wait for pan to do that sorting, as it'll already be done when you get there.

Unfortunately, pan does not yet have a similar command line (or preferences) option to let you auto-download the messages themselves. There's discussion of adding the feature, based on the score category (so you could download only watched messages, for instance), and Charles was the one who actually mentioned that, so he's definitely thinking about it, but it hasn't been implemented yet.

OK, that's the main tips, tho some more optional usage-style ones follow. As I said, #1 is most important for those running GNOME, as pan's hardly workable if that assistive technologies thing is running. #3 is most critical after that, with #6 and #7 being low-cost-big-effect tips. Follow them all, and I'm quite sure you'll see a marked improvement, especially if you were doing all of them differently before.

Now for the optional, usage style related ones.

Tip #I. As mentioned, it's possible to set up multiple independent pan "instances", with separate settings, cache, everything. What I did here is create a few simple pan starter scripts (bash), calling them pan.bin, pan.text, and pan.test. The first two are obviously for the binary and text instances. The last is for when I'm "just browsing", since pan doesn't fully erase group history when you delete messages and unsubscribe, and I can manually blow it away much more easily when I don't have to worry about blowing away regular group history at the same time.

Each session script can simply set and export the PAN_HOME environment variable, pointing to its separate data (and config) dir, before starting pan.
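The starter scripts described here are bash. As a hedged equivalent, the same idea can be sketched in Python: copy the environment, set PAN_HOME for the chosen instance, and start pan with it. The ~/pan-instances location and the function names are hypothetical, picked only for the example. The one thing taken from the post is that pan reads PAN_HOME at startup.

```python
import os
import subprocess
from pathlib import Path

# Hypothetical layout: one data/config dir per instance under ~/pan-instances.
INSTANCE_ROOT = Path.home() / "pan-instances"


def instance_env(name, base_env=None):
    """Return a copy of the environment with PAN_HOME set for this instance."""
    env = dict(os.environ if base_env is None else base_env)
    env["PAN_HOME"] = str(INSTANCE_ROOT / name)
    return env


def start_instance(name, extra_args=()):
    """Create the instance dir if it is missing, then launch pan using it."""
    env = instance_env(name)
    Path(env["PAN_HOME"]).mkdir(parents=True, exist_ok=True)
    # Popen returns immediately; call .wait() on the result to block until exit.
    return subprocess.Popen(["pan", *extra_args], env=env)
```

With this, start_instance("text") and start_instance("bin") would each bring up a pan that keeps its own settings and cache, as long as the two names map to different directories.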
(Here, I do a couple other things as well, like set the gtkrc locations using a different config var, since I use kde and that's not always set correctly, and I HATE the default color theme pan comes up with if it doesn't get those settings.)

For shared settings files, such as my scorefile and the accels.txt keyboard shortcut config, symlinks work wonderfully, and I only have the one common config file to worry about for all three. Otherwise, the separate instances use the files in their respective data dirs.

As pertaining to pan efficiency, tip #I is useful because it allows me to keep separate text and binary settings, using those most efficient (or that I simply prefer) for each. As you'll note in the tips below, that does help.

Tip #II. Again as mentioned, it's possible to change the default 10 MB message cache size. I already mentioned that I keep the text instance message cache at several gigs, and set no-expire for the servers, altho that would interfere with efficient binary processing. Below I'll explain the way I handle binaries.

The setting is in the preferences.xml file, in the data dir, as set above or ~/.pan2/ by default. The setting is (the 5120 value being for my text instance):

    <int name='cache-size-megs' value='5120'/>

Tip #III. In combination with the multiple instances and custom cache size of tips #I and #II, how I actually deal with binaries is a bit different than outlined in tips #3-5. I set a very large cache, actually a dedicated binary message cache partition, 12 gigs, with the cache-size option set accordingly, and do the following:

Instead of using the normal download function, which automatically saves the files and then deletes the cached messages (but not the headers/overviews, those are marked read), I prefer the download to cache function.
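Since this approach depends on the cache-size option from tip #II, here is a minimal Python sketch of that preferences.xml edit. It assumes the setting appears exactly in the form quoted in tip #II (an int element with name before value) and leaves everything else in the file untouched. The function names are made up for the example. Editing while pan is not running is a precaution I am assuming is wise, not something the post states.

```python
import re
from pathlib import Path

# Matches <int name='cache-size-megs' value='NNN' with either quote style.
_CACHE_RE = re.compile(
    r"""(<int\s+name=(['"])cache-size-megs\2\s+value=)(['"])\d+\3"""
)


def set_cache_size(prefs_text, megs):
    """Return prefs_text with the cache-size-megs value replaced by megs."""
    def _swap(match):
        quote = match.group(3)
        return f"{match.group(1)}{quote}{int(megs)}{quote}"

    new_text, count = _CACHE_RE.subn(_swap, prefs_text)
    if count == 0:
        # Assumption: an absent setting is an error here, not something to add.
        raise ValueError("cache-size-megs setting not found")
    return new_text


def update_prefs_file(path, megs):
    """Rewrite the given preferences.xml in place with the new cache size."""
    prefs = Path(path)
    prefs.write_text(
        set_cache_size(prefs.read_text(encoding="utf-8"), megs),
        encoding="utf-8",
    )
```

For a 12 gig cache partition one might call update_prefs_file on the binary instance's preferences.xml with a value somewhat under 12288, leaving headroom. That margin is a guess, not a documented requirement.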
What I'll do is download the headers/overviews and sort thru them, deleting what I know I don't want. Sometimes I download a sample here and there of stuff I'm not sure about, which lets me delete the entire series without actually downloading it if I don't like it. Then I select-all (or do it with a reasonable size group of overviews at a time, if I want to split it into several jobs so I can start working on the first one before the others are done) and download to cache.

Then I go do something else, maybe go to sleep or to work, or play a computer game, or catch up with my text instance. Whatever. Anyway, I come back to the binary pan instance later, after it has grabbed everything and stashed it in cache. (Obviously, this won't work with the 10 MB default cache size, as after it hits that 10 MB, it'll be deleting them as fast as it downloads them! So this only works in combination with tip #II.)

Then, everything's already downloaded and local, so working with it is pretty fast! I then go thru and do my sorting, saving what I want, deleting the messages and headers/overviews as I'm done with them.

This works far better for me than the download-and-auto-save, because using the download-and-auto-save functionality, everything has to be saved to an intermediate directory, losing the post context in the process. When I then go to the intermediate directory, I have the filenames, but that's it, no who posted it, no date posted, no additional information that might have been in the subject line, etc. It gets all mixed up, and besides that, it's awfully easy to just keep downloading to the intermediate dir, without actually going thru and doing the final processing, with the intermediate dir thus growing and growing, until one gives up and moves everything off to an unsorted dir somewhere, and starts over. Pretty soon one has unsorted1, unsorted2...
But by downloading to cache, then working with everything already local, I can select series and save them all to their final location directly. As I do so, since I'm working from pan itself, I still have all that extra message metadata, who posted it, when, what they said in the subject, etc, if I want to use any of that information in deciding where I'm going to save to, or if I want to create a text file there with additional information. All that would be lost if I used the auto-save functionality and was trying to sort out the jumble of files that ultimately ends up there.

Typically what I'll do is set up the downloads for all my usual binary groups, then do whatever. The message cache thus must be big enough to contain all the downloaded messages from all groups. When I come back, I can start working thru them, deleting the messages and their headers as I'm done with them. When I'm totally done, I shut down pan (well, the binary instance), and manually delete the message cache itself. Then the next time I start the binary pan instance, it's starting with an entirely clean cache. Because I've deleted the headers/overviews as I went as well, when I restart, pan has only the few I left as incomplete still around to try to properly thread new messages into. Beyond those, all it has is the individual article numbers that it has already seen (and that I deleted), as tracked in the newsrc files for each server.

With a clean cache, and no or only the partially complete headers/overviews to worry about, even with a million or two headers coming down in an update, pan performance stays MUCH faster than it would be if it were trying to plug that million or two headers into an existing thread structure of 10 or 20 million headers! It still takes a bit of time, but given the number of headers, that's entirely reasonable.

It's also worth noting that doing it this way, pan's not trying to download the messages, and decode and save the binaries, both at the same time.
It downloads them to cache only, then later, I come back and do the decode and saving bit. This makes both steps individually faster, since neither pan nor the slow disk is having to try to deal with both at the same time.

So there are actually a number of benefits to doing it this way. As mentioned, I still have access to the post metadata when I'm trying to sort the binaries into their final location. That's pretty nice on its own. But it also means pan is far more efficient at processing things, since it doesn't have a huge buildup of cruft. Third, I can set it up and let pan do its downloading while I do something else, and when I do come back and deal with it, it's all local, thus much faster to access. Finally, when pan's downloading, that's all it's doing, it's not trying to decode and save at the same time. And when it's decoding and saving, it's not downloading at the same time. Well, except for those few samples I download individually, before I set it to work on the big batch download.

The negative is that the encoded messages take up more room than just the binaries do. With yEnc, it's only 5% or so, so that's not too bad, but UUE and MIME/Base64 are both 33% overhead, so you need a bit over four gigs of cache to store only 3 gigs of actual binary files. But disk space is cheap these days...

Still, while that's the way that works best for me, it's obviously not everyone's style, or pan would default to downloading to cache, instead of the download and save default it currently has. But that's why I listed these three tips separately and marked them as distinctly optional. It does work well, but it's not for everybody.

Meanwhile, if people just use tips #1-9, or even just #1 and #3 mainly, it'll likely improve their experience dramatically, even if they don't choose to do the whole separate pan instances, huge cache, download-to-cache, then go thru and save, thing.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman