[Pan-users] Re: Creating an local archive of subscribed groups?

Duncan Sun, 22 Aug 2010 23:51:58 -0700

Jurgen Defurne posted on Sun, 22 Aug 2010 18:10:14 +0200 as excerpted:

> I am a regular user of Pan for some high technical newsgroups.
> 
> What I would like is to have the contents of these groups as a local
> archive which can be searched using Pan.
> 
> I have already tried two ways to do this. The first one was using 'Cache
> Article' after selecting all articles, but it seems that when the cache
> gets beyond a certain size, older cached articles disappear.
> 
> I am now trying with 'Save Articles...', but this creates one file,
> which cannot be incrementally updated.
> 
> What other (simple, preferably) possibilities do there exist, not
> necessarily using Pan for storage, but certainly for reading and
> searching?


You're running into pan's default cache size limit, 10 MB.  That setting, 
like several others, *IS* available in the config files, but is not made 
available in the GUI, basically because while pan only requires gtk+, it's 
a gnome family app, and gnome in general caters to the "simple" users who 
are apparently afraid of too many config options, even when they'd be 
seriously useful for some users!  (FWIW, that's one reason that despite 
all the problems with kde4, I'm still a kde user -- kde's comparable 
policy is to create a generally sane default, but expose far more options 
in the configuration for those who wish to use them.  But knode doesn't 
handle binaries as well as pan does and klibido handles binaries but not 
text, and I'm not sure if it was ported to kde4, either, so pan it is.)

Anyway, desktop environment politics aside...

As you may know, pan's config and data are stored in ~/.pan2/ by default.  
In that directory (or whichever one holds pan's files, if you've made use 
of the PAN_HOME environment variable to point pan at a different 
location), find preferences.xml.  As usual, if you're going to edit config 
files, do so with the app you're editing the config for, pan in this case, 
closed.

In preferences.xml, the preferences are grouped by type, and then 
alphabetically by name.  Look for type int, name "cache-size-megs".
Make it whatever integer number of megs you like.
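
For anyone who would rather script that edit than do it by hand, here is a 
rough Python sketch.  It assumes the entry has the same shape as the 
<int name='cache-size-megs' value='...'/> lines quoted in this post, and it 
does a plain text substitution so the rest of the file is left alone.  The 
function names (set_cache_size, update_preferences, pan_home) are made up 
for illustration, and pan should still be closed while it runs.

```python
import os
import re
from pathlib import Path

# Matches the entry in the form quoted in this post, with either quote style.
# Assumption: the name attribute comes before value, as in the quoted lines.
_CACHE_RE = re.compile(
    r"""(<int\s+name=(['"])cache-size-megs\2\s+value=(['"]))\d+(\3)"""
)


def set_cache_size(xml_text, megs):
    """Return xml_text with cache-size-megs set to megs; raise if absent."""
    new_text, count = _CACHE_RE.subn(
        lambda m: f"{m.group(1)}{int(megs)}{m.group(4)}", xml_text
    )
    if count == 0:
        raise ValueError("no cache-size-megs entry found")
    return new_text


def pan_home():
    """PAN_HOME if it is set, otherwise the default ~/.pan2 directory."""
    return Path(os.environ.get("PAN_HOME", Path.home() / ".pan2"))


def update_preferences(prefs_path, megs):
    """Rewrite prefs_path in place, keeping the old file as a .bak copy."""
    prefs = Path(prefs_path)
    original = prefs.read_text(encoding="utf-8")
    updated = set_cache_size(original, megs)  # fails before anything is written
    prefs.with_name(prefs.name + ".bak").write_text(original, encoding="utf-8")
    prefs.write_text(updated, encoding="utf-8")
```

Calling update_preferences(pan_home() / "preferences.xml", 5120) would then 
set a 5 gig cache.  If the entry is missing the sketch refuses to guess 
where it belongs and raises instead.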

Here, I make use of the PAN_HOME environment variable I mentioned to run 
multiple pan "instances", each pointed at a different data dir.  The way I 
have it set up, I have one for text groups, one for binaries, and a third 
for testing, but of course, you can split it up however you like.  I 
mention this by way of explaining how and why I have multiple 
preferences.xml files, each with a different cache size.
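
As an illustration of the multiple-instance idea, a small Python launcher 
could look like the sketch below.  The only thing taken from this post is 
the PAN_HOME variable; the helper names and the example directory are 
hypothetical, and it assumes a `pan` executable on the PATH.

```python
import os
import subprocess
from pathlib import Path


def pan_env(data_dir, base_env=None):
    """Copy of the environment with PAN_HOME pointed at data_dir."""
    env = dict(os.environ if base_env is None else base_env)
    env["PAN_HOME"] = str(Path(data_dir).expanduser())
    return env


def launch_pan(data_dir):
    """Start a pan instance on data_dir; assumes `pan` is on the PATH."""
    # Create the directory up front so a new instance has somewhere to write.
    Path(data_dir).expanduser().mkdir(parents=True, exist_ok=True)
    return subprocess.Popen(["pan"], env=pan_env(data_dir))
```

Something like launch_pan("~/pan-text") would start the text instance and 
launch_pan("~/pan-binaries") the binaries one, each with its own 
preferences.xml and cache.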

For my text groups instance, I have:

<int name='cache-size-megs' value='5120'/>

Since those groups are mostly text and I've set the expiration to none for 
the servers in that instance, I have posts going back years in some 
groups, to when the pan C++ rewrite was introduced with 0.90 (it changed 
file formats for a number of things).  On some gmane.org groups/lists they 
go back further than that, gmane of course being a list2news archive and 
gateway that presents a whole bunch of mailing lists as newsgroups, with 
unexpiring posts.  Even so, the cache is still only ~2 gig, so I'm a long 
way from maxing it out.
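
To see how close a cache is to its limit, a short Python sketch like this 
one can total up the files on disk.  Two assumptions to flag: that the 
cached articles live in a subdirectory of pan's data dir (article-cache is 
the name to look for, but check your own setup), and that a "meg" in 
cache-size-megs means 1024*1024 bytes, which is not confirmed here.

```python
import os


def dir_size_bytes(root):
    """Total size of regular files under root (symlinks are not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def headroom_megs(root, cache_size_megs):
    """Megs left before the files under root reach cache_size_megs.

    Assumption: one meg is 1024 * 1024 bytes.
    """
    return cache_size_megs - dir_size_bytes(root) / (1024 * 1024)
```

For a text instance, headroom_megs(os.path.expanduser("~/.pan2/article-cache"), 
5120) would report roughly how many megs remain under a 5120 meg limit.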

The test instance is I think still at default.  I use a separate test 
instance so I can visit groups without subscribing, say if someone reports 
a problem post that I want to try, and not have pan storing information 
about groups I don't really care about and am not subscribed to, in my 
other instances.

The binaries instance has a cache on a dedicated 12 gig partition, so I've 
set its cache size to an arbitrary number, a bit above 12 gigs.

<int name='cache-size-megs' value='12500'/>

And while I've not actually done binaries in some time (it seems I've just 
too many other things I find interesting to do, and just never get to it), 
I have actually tested that 12 gig a few times, some years ago.  Pan 
handles it fine, or at least did, back then.

So provided you set unexpiring for your server(s), you shouldn't have a 
problem setting a cache size into the double-digit gigs if necessary, or 
maintaining an archive going back as far as you can get messages.  They 
won't expire locally just because they expire on whatever server you're 
using.

The one caveat I have noticed is that the more data you keep around, the 
longer pan takes to load up, especially from cold disk cache.  My way 
around that has been to assign pan its own dedicated desktop (kwin allows 
you to configure specific apps to always appear on a specific desktop, and 
that's what I do with pan), and to start it when I start X/KDE, keeping it 
running pretty much all the time I'm in X, so it only shuts down when I 
shut down X/KDE.  If you like, you can put pan on its own partition, and 
periodically back it up, then wipe the partition and copy everything back, 
thus defragging it, speeding up initial load.

Also, I run a 4-disk kernel/md RAID-1 now, but previously ran a RAID-6, 
which with four spindles is effectively two-way striped for read access.  
I had thought the RAID-6 would be faster because of that striping, but I 
was wrong.  When reading multiple files, as is the case when pan is 
loading, the kernel is apparently good enough at scheduling parallel I/O 
on the RAID-1 to keep all four disks reading data in parallel, and it 
NOTICEABLY shrank my load time when I switched.  So pan loads faster from 
mirrored RAID than from striped RAID.

....

That's one option, all-pan.  The other option would be to run a personal 
news-server installation, like leafnode.  Leafnode would download the 
messages to your local disk and store them there, then serve them locally 
to pan.  Doing it this way, you could leave pan's cache size untouched (or 
maybe even shrink it), and point it at your local server instead of the 
remote.  You'd still set pan not to expire articles, so it'd keep its 
article index intact, but it wouldn't need a big cache, since it's pulling 
from the local leafnode (or whatever) server anyway.  You'd then set 
leafnode to unexpiring as well, so it continued to retain articles back as 
far as you could get.

One advantage to this, if you're doing enough binaries that you're waiting 
on pan to download, anyway, is that the local server would presumably be 
running all the time in the background, downloading messages as they came 
in, so they'd always be available virtually instantly from pan, since 
they're already stored locally.  No waiting on the network connection to 
the server.

But that's probably not that significant an issue unless you're still on 
an analog modem dialup connection.  On anything much faster than that, if 
you're downloading enough data that you're waiting on the network, you'll 
quickly be looking for more room for your archive -- which will soon 
measure in terabytes, not gigabytes.

However, there's another possible advantage, as well.  Pan's loadup should 
be faster if it's only caching the default 10 MB.

.....

Meanwhile, pan does have one serious limitation, in terms of search (and 
of scoring).  It only scores and searches the message overviews -- 
basically, the information in the header pane, author, subject, etc (tho 
message-ids are also in the overviews and form the basis for the watch/
kill/score thread feature).  Pan is unable to score or search on actual 
message content.   If you're happy with pan's searching already, and just 
need a larger cache to search on, that's fine.  But do be aware that if 
you do want/need to search on message content, you'll need to use 
something else.

Of course, with kde's nepomuk/strigi indexing (what I'm familiar with 
since I run kde), or beagle (AFAIK the gnome indexer), or google-desktop 
indexing, or whatever, you can point that at either the pan cache (for 
option one above) or leafnode's article store (for option two), and get 
full content indexing, if that's what you want.  You can then open a 
matching file using whatever editor you have associated, find the subject 
and date info, and use pan to view the whole thread in context, if desired.
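
If a full desktop indexer is more than you want, a brute-force content 
search is also easy to sketch in Python.  This version assumes every 
regular file under the directory you give it holds one article in the 
usual headers, blank line, body layout; the function name search_articles 
is made up here, and nothing in it is specific to pan or leafnode.  It 
reports the Subject and Date of each hit, which is what you'd need to find 
the thread again in pan.

```python
import email
import os
from email import policy


def search_articles(root, needle):
    """Yield (path, subject, date) for articles whose text body has needle.

    Assumption: each regular file under root is a single article. Files
    that can't be parsed or decoded are silently skipped.
    """
    needle = needle.lower()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    msg = email.message_from_binary_file(fh, policy=policy.default)
                body = msg.get_body(preferencelist=("plain",))
                text = body.get_content() if body is not None else ""
            except Exception:
                continue  # not an article, or an unknown charset
            if needle in text.lower():
                yield path, str(msg.get("Subject", "")), str(msg.get("Date", ""))
```

Looping over search_articles(cache_dir, "raid-1") and printing each tuple 
gives a crude case-insensitive grep with the headers attached.  It rereads 
every file on every search, so on a multi-gig archive a real indexer will 
be much faster.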

So pan can still be used to view the thread, once you find a post that 
interests you based on content.  It's just that if you do want full post 
content search, not just subject/author search, you'll need to use 
something else for the initial search, and can then open the thread in pan 
if you like.  If that's a limitation you can live with, great.  Otherwise, 
you should probably look for a different news client, one with full post 
search capability.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


_______________________________________________
Pan-users mailing list
Pan-users@nongnu.org
http://lists.nongnu.org/mailman/listinfo/pan-users
