Re: [Pan-users] size of newsrc-1 file

Duncan Tue, 05 Jul 2016 23:41:13 -0700

Heinz Mezera posted on Tue, 05 Jul 2016 12:47:21 +0200 as excerpted:

> Hello pan-users,
> 
> does the size of newsrc-1 influence pan's time to start, to quit or its
> performance?
> 
> I use Ubuntu's 16.04 version of pan (0.139-5build1) and it takes rather
> long until pan appears on Ubuntu's desktop.
> 
> Can I compact newsrc-1 or reduce its size somehow?


I suspect your problem isn't the newsrc file, but something else...
[discussed below, but first...]

To answer your question somewhat directly, however, the newsrc file(s, 
one per configured server) can indeed be compacted some, and that /might/ 
affect startup time, tho in my own experience there's a far worse trigger 
of startup delay that I suspect is the real problem.  Still, the newsrc 
files can be made more efficient.

These newsrc files follow a standard text-based format and can be edited 
using a standard text editor.  As always, making a backup of the 
unaltered file before you begin is recommended, just in case you screw up 
the edits.

Rather than describe in detail the format, I'll simply provide you a 
google link...

https://www.google.com/search?q=newsrc+file+format

There is however one caveat about pan's usage.  (Current) Pan doesn't use 
the subscription info in the newsrc (tho old C-based pan, 0.14.x, did, 
before the C++ rewrite), because a newsrc is inherently single-server, 
and pan's subscriptions apply across all configured servers that carry 
the group.  So pan uses a different method to track group subscriptions.

What pan /does/ track in the newsrcs, however, is the per-server per-
newsgroup article sequence numbers, so it knows which ones on each server 
you've already seen and doesn't download those headers again.

It's this sequence of comma-separated article numbers that appears at the 
end of the newsrc line for any group you've visited (or seen a cross-
posted message in).

And you can consolidate these article number lists by removing the gaps 
and making the ranges continuous.

It's worth noting that news servers initially communicate what they 
currently have using only a high-water and a low-water mark, plus an /
estimated/ count of the number of messages available, with that estimate 
allowed to be /more/ than the number of currently available messages, but 
never /less/.  These are IOW the lowest numbered message still available 
(unexpired), and the highest numbered message available (the latest 
message to arrive), plus the estimate.  Missing article numbers between 
the high and low water marks are specifically allowed -- this lets 
servers remove messages reported as spam or as copyright violations, 
etc.  Sometimes these missing messages will be filled in later (some 
servers are infamous for doing this, infamous because it screws up some 
news clients).  Often they're not.
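
For illustration, here's a minimal Python sketch of reading those three 
numbers from a server's reply.  It assumes the RFC 3977 GROUP response 
layout ("211 count low high group").  The names and the sample reply are 
made up for the example; this is not pan's code.

```python
from typing import NamedTuple


class GroupStatus(NamedTuple):
    estimate: int  # estimated article count, may exceed the real count
    low: int       # low-water mark: lowest article number still available
    high: int      # high-water mark: highest article number available
    group: str


def parse_group_reply(reply: str) -> GroupStatus:
    """Parse a successful GROUP reply: '211 count low high group'."""
    code, estimate, low, high, group = reply.split()[:5]
    if code != "211":
        raise ValueError(f"not a successful GROUP reply: {reply!r}")
    return GroupStatus(int(estimate), int(low), int(high), group)


status = parse_group_reply("211 1234 3000234 3002322 misc.test")
# Size of the low..high span.  Because the estimate never understates the
# real count, at least (span - estimate) numbers in the span are missing.
span = status.high - status.low + 1
print(status.group, "span:", span, "minimum gaps:", span - status.estimate)
```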

And it's these gaps in the server store, along with simply not visiting 
the newsgroup for longer than its expiration period if your server does 
expire messages (some dedicated news service providers effectively don't 
expire messages, these days), that appear as gaps in pan's sequence 
number lists -- because it never saw those messages.


Now, if you're reasonably sure your server doesn't fill in article 
sequence numbers, only ever increasing them, or if you simply don't care 
to see what are likely old messages if they are filled in, you can cut 
out all the commas and make the list a single range, from 1 or whatever 
the lowest number is in the existing list, to the highest number.  If the 
server does do fill-ins, you might still be able to make the oldest 
messages a continuous range, while leaving the gaps in anything newer 
than say a month old, just in case.

So, to take one example line from the linuxtopia google hit (the first 
hit in the google search above as I write this; note that this page is 
from a book copyrighted in 2003, and its mention of pan as an exception 
to the newsrc format is... dated, as pan does use the format now):

news.software.readers! 1-95504,137265,137274,140059,140091,140117

You can edit that to:

news.software.readers! 1-140117

Much shorter! =:^)

Unfortunately, if you follow a lot of groups, all that manual editing 
could be a big chore (unless you can figure out a nice script to automate 
the process, should be possible), with, I suspect, rather limited results 
in terms of startup.
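
As a rough idea of what such a script might look like, here's a minimal 
Python sketch.  It assumes each line is a group name, then ":" or "!", then 
the comma-separated numbers and ranges, as in the example line above.  The 
function names are my own invention, and anything it doesn't recognise is 
left untouched.  It collapses /everything/ into one range, so the fill-in 
caveat above applies.

```python
import re

# Group name, then ':' (subscribed) or '!' (unsubscribed), then the ranges.
LINE_RE = re.compile(r"^(?P<group>[^\s:!]+)(?P<mark>[:!])\s*(?P<ranges>.*)$")


def compact_line(line: str) -> str:
    """Collapse the article list of one newsrc line into a single range."""
    original = line.rstrip("\n")
    match = LINE_RE.match(line.strip())
    if match is None or not match.group("ranges"):
        return original  # no article list, or not a line we understand

    numbers = []
    try:
        for part in match.group("ranges").split(","):
            part = part.strip()
            if not part:
                continue
            low, _, high = part.partition("-")
            numbers.append(int(low))
            numbers.append(int(high or low))  # a bare number is its own range
    except ValueError:
        return original  # malformed numbers: leave the line alone

    if not numbers:
        return original
    low, high = min(numbers), max(numbers)
    span = str(low) if low == high else f"{low}-{high}"
    return f"{match.group('group')}{match.group('mark')} {span}"


def compact_newsrc(text: str) -> str:
    """Apply compact_line to every line of a newsrc file's contents."""
    return "\n".join(compact_line(line) for line in text.splitlines()) + "\n"


print(compact_line(
    "news.software.readers! 1-95504,137265,137274,140059,140091,140117"
))
```

To use it on a real file you'd read the newsrc contents, pass them through 
compact_newsrc(), and write the result back, with pan closed and a backup 
of the original made first.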


Instead, what I've found to take the real time, particularly on spinning 
rust drives (I'm on SSD now and haven't had to worry about it since I 
upgraded to SSD), is large message caches.

Note that pan's cache size is configurable, but defaults to 10 MB.  That's 
too small to cause a startup problem, but it also means pan will rather 
quickly start dumping already downloaded articles to make room for more, 
particularly if you do binaries.  For a usage pattern that saves off 
attachments directly, with no further use for the messages in cache after 
that, 10 MB is fine.  For 
a usage pattern more like mine, however, where I tend to download a bunch 
of stuff to cache so it's local, and then go thru it later, a cache size 
of several GB may be more appropriate.  Similarly, if you have groups 
that you effectively archive, keeping all messages without expiring them 
at all, as I do with my text groups, a cache of several gigs will likely 
hold several years worth of text-group messages.  (I have text messages 
going back to 2002 in some groups.  My cache for my text-groups pan 
instance[1] is, as of now, 1.4 GiB, so the average usage is 100 MB/year.)

Once that cache gets to a few hundred MiB, you'll start noticing pan 
startup gets slower and slower on *first* startup, as the cache gets 
bigger and bigger.  (Pan will start up faster after the first start, 
since the files are then already in the kernel's disk cache.  At least 
it will if you have enough 
memory to cache into RAM the full pan message cache.  If you're running 1 
GiB or less of RAM... probably not so much.)  This is because pan loads 
those messages every time it starts, in order to rethread them -- it 
keeps track of message threading in memory.

Back when I was on spinning rust, I found a few ways to deal with this.  

One was to set pan to start with my X user session, so it could grind away 
for several minutes loading stuff while I did other things.  A few 
minutes later when I had completed other tasks, pan would generally be up 
(in the system tray) and ready to go.  I'd normally keep pan running 
constantly, in the system tray, until I was ready to end the user X 
session.

Another I found quite by accident.  I periodically do backups of the 
multiple partitions on my system, and every few years, I'll boot to the 
backup, wipe away the normal working partition, and copy things back from 
the backup to the working copy, renewing it.

I found that at least with some filesystems (I was using reiserfs at the 
time), pan evidently fragments the cache files rather heavily.  I believe 
this is most likely to happen when multiple threads are downloading files 
at once, writing them in parallel and fragmenting them in the process.

By backing up the cache files, erasing the working cache copy, and 
copying everything back into place, the new copy was defragmented due to 
the copy process, and pan started up much faster after that, even tho it 
still had the same size cache.

Of course over time it slowed down again as I added new messages to my 
newsgroup archive, but now that I knew the trick, I could defrag the 
cache any time the start time got too long, and pan would start up faster 
again.

And of course as I mentioned, putting it on SSD sped things up 
dramatically, because SSDs have effectively zero seek time, so 
fragmentation doesn't affect them anything close to as badly (tho it can 
still have some effect due to the number of I/O operations per file 
increasing with the number of fragments).


That's definitely what took the load time for me: pan reading all those 
files from cache into memory, so it could rethread them.

There's a simple way to confirm whether this is your problem or not.  
With pan closed, simply rename the article-cache directory to something 
else, so pan will recreate a new, empty cache, when it starts.  If the 
cache is your slowdown, pan should start much faster, likely nearly 
instantly, with no cache to load.

Tho of course if you've never upped your cache size from the default 10 
MB, the cache is unlikely to be the problem, and you probably won't 
notice a difference with the above test.
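
If you'd rather check the cache size before renaming anything, a small 
Python sketch along these lines might do.  It assumes the cache lives at 
~/.pan2/article-cache (the default config location plus the directory name 
mentioned above).  The helper names are hypothetical, and the rename is 
left commented out so nothing moves unless you ask for it.

```python
import os
import time
from pathlib import Path


def dir_stats(path: Path) -> tuple[int, int]:
    """Return (file_count, total_bytes) for everything under path."""
    count = total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
                count += 1
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return count, total


def set_aside(path: Path) -> Path:
    """Rename path to a timestamped sibling and return the new location."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = path.with_name(f"{path.name}.aside-{stamp}")
    path.rename(backup)
    return backup


# Assumed default location; adjust if you run pan with $PAN_HOME set.
CACHE = Path.home() / ".pan2" / "article-cache"

if CACHE.is_dir():
    files, size = dir_stats(CACHE)
    print(f"{CACHE}: {files} files, {size / 2**20:.1f} MiB")
    # With pan closed, uncomment to move the cache out of the way:
    # print("moved to", set_aside(CACHE))
```

The same set_aside() helper would work for the scorefile test, since 
Path.rename handles files as well as directories.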


Finally, I should mention that a big scorefile will slow pan down at 
startup.  There are ways to dramatically optimize the scorefile, but 
that's a different subject, that we can deal with later if you find it to 
be the problem.  Meanwhile, however, you can test it using the same 
technique I suggested above for testing the cache.  Simply rename the 
scorefile and see if pan starts faster with an empty one.  If the 
scorefile turns out to be your problem, post back with the results and we 
can deal with that, then.

---
[1] Text-groups pan instance:  It is possible to have several separately 
configured pan instances, each with their own configuration and cache.  
~/.pan2/ is only the default location.  If the $PAN_HOME variable is 
found to be set in pan's environment as it starts, it will use the 
location found in that variable as its configuration and cache home, 
instead.  I've taken advantage of this to set up a number of pan wrapper 
scripts here, pan.text, pan.test, and pan.bin, that each point at a 
different config and cache.  This lets me manage my unexpiring text-group-
archive cache separately from my binaries cache, also unexpiring and set 
rather large, but cleared manually from time to time.
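
A wrapper along those lines could be sketched in Python as follows.  This 
is not one of my actual scripts; the profile names and directories are 
placeholders, and it assumes the pan binary is on your PATH.

```python
import os
import subprocess
from pathlib import Path

# Hypothetical profile directories; pick whatever layout suits you.
PROFILES = {
    "text": Path.home() / ".pan2-text",
    "bin": Path.home() / ".pan2-bin",
}


def pan_env(profile: str, base=None) -> dict:
    """Copy the environment and point PAN_HOME at the chosen profile."""
    env = dict(os.environ if base is None else base)
    env["PAN_HOME"] = str(PROFILES[profile])
    return env


def launch(profile: str) -> subprocess.Popen:
    """Start pan with its config and cache under the profile directory."""
    return subprocess.Popen(["pan"], env=pan_env(profile))
```

Calling launch("text") would then start an instance that keeps its 
configuration and cache under the "text" profile directory.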

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


_______________________________________________
Pan-users mailing list
Pan-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/pan-users
