Re: [Pan-users] GNU javamail and article number

Duncan Mon, 18 Feb 2013 20:25:12 -0800

Thufir Hawat posted on Mon, 18 Feb 2013 01:38:03 +0000 as excerpted:

> [Developer question: NNTP app in Java]


FWIW, your message is a bit like some of mine, more stream of 
consciousness than well edited and organized.  That makes a response a 
bit harder, 
without quoting the whole long post, anyway, as finding the actual bit to 
reply to and getting the proper context is difficult.  I guess I'm seeing 
a bit how hard it must be to reply to some of my posts.

Anyway, I've slightly edited and rearranged the order, etc...

> I'm looking at the source for:
> http://developer.classpath.org/inet/doc/gnu/inet/nntp/GroupResponse-source.html

>   /*
>    * The last article number in the group.
>    */
>   public int last;
> 
> which looks like last should be the number for the last article for a
> group.  Now, when checking for new articles, what is that number
> compared to?
>
> http://cvs.savannah.gnu.org/viewvc/*checkout*/mail/source/gnu/mail/providers/nntp/NNTPFolder.java?root=classpathx&content-type=text%2Fplain

>               GroupResponse response = ns.connection.group(name);
>               if (response.last > last)
>                 {
>                   hasNew = true;
>                 }
> 
> I'm just trying to figure out how, when connecting to a new server, do you
> know what was the article number for the latest article?  Is that kept,
> generally, in the .newsrc perhaps?

FWIW, the newsrc tracks read messages, not already seen (but possibly 
unread) messages.  The (multi-app-standard) newsrc file assumes only a 
single server, so multiple newsrc files must be used when there's more 
than a single server.  It's the newsgroups.xov file that tracks already 
seen messages -- the server highwater marks.  AFAIK, unlike the newsrc 
file format, newsgroups.xov isn't common to other news clients, and it 
contains entries for multiple servers.  (A comment in the file lists the 
specific format.)

> My concern is that the "id" isn't reliable:
> 
> http://docs.oracle.com/javaee/6/api/javax/mail/Message.html#getMessageNumber%28%29

> "Note that the message number for a particular Message can change during
> a session if other messages in the Folder are deleted and expunged."
> 
> Because, when javax mail (which is utilized in this context) loads a
> folder, it simply *counts* the number of messages in a given folder.
> 
> How does pan handle this?  For simplicity, let's assume just one server
> is being accessed.  Pan keeps the latest article number in a .newsrc
> file and then iterates up?

No... more later.

> What I'm after is not just a method, as above, to check for new articles
> but to return a range of articles which are new -- something along those
> lines.
> 
> Or, maybe, another approach is to just keep the latest article number
> increment it, and request the article until errors are caught.  However,
> that assumes there are no gaps in the article numbers.  And, still, the
> article number *must* be stored somewhere.
> 
> It all starts with *getting* the article number.  Apparently
> NNTPFolder.java is using GroupResponse to handle the article number, so
> I should be also using GroupResponse and see what article numbers it
> gets?

I don't claim to be a coder, tho I can sort of follow along on many 
coding discussions and do occasional limited patching, etc.  And java... 
wouldn't exactly be my choice were I to try to become a coder.  After I 
spent a couple hours last nite trying to make sense of the docs at the 
various links you provided, I /think/ I have some sense of it, but I'm 
more sure than ever that Java isn't my choice of coder's beverage, by a 
LONG shot!

Anyway, it seems there are three sets of... article IDs... we're looking 
at, two from the RFCs, and a third from the Java classes you're working 
with.  The classes' ID is very similar in idea and function to one of the 
two RFC IDs, and *MAY* be identical to it in the NNTPFolder subclassing 
of the general Folder class, but I'm not sure -- I couldn't find anything 
that actually /said/ that, one way or the other.

But the similarity without knowing if they're identical makes things 
extremely confusing, because reading the docs I had to keep reminding 
myself that the article numbering they were talking about was the 
local-client classes numbering, not the one from the server (RFC 
standard)... unless they're identical in the case of NNTPFolder, which I 
never did figure out.

> However,
> the GNU javamail NNTP API seems to have no provision for directly seeing
> those article numbers:
> 
> http://www.gnu.org/software/classpathx/javamail/javadoc/gnu/mail/providers/nntp/NNTPFolder.html
> 
> There's just no method listed for dealing with article numbers, they're
> encapsulated, which I guess is good.  But they're encapsulated so well I
> don't see how to get *new* articles without re-fetching everything.

(In the below I think I use NNTPFolder and FolderNNTP interchangeably, 
forgetting which one was actually used; it's NNTPFolder.  So if you see a 
reference to FolderNNTP that I missed changing, read it as NNTPFolder.)

There are references to article numbering in some of the methods.  But as 
I said, it's ANYTHING but clear whether the article numbering they refer 
to is identical to the RFC one the server's using, or if it's a 
local-only classes version that works similarly, but is independent from 
the RFC article numbers the server is passing.

Were I working on a project using those classes, I'd now be hacking up 
some experimental code to actually SEE the article numbers the classes 
are using, and compare them to what I was seeing actually being passed 
from the server, using ngrep or similar connection sniffing.  That'd 
answer once and for all whether they were identical, or not.

> As suggested here:
> 
> I've been reading RFC's, but that doesn't help with determining what GNU
> javamail is actually doing, versus what it's supposed to do.  (I really
> don't like the Apache API at all -- but if there's a Java API someone
> knows works for this, that would be interesting.  The GNU API is very
> clean, just maybe **too** clean.)

I can't help but think about the various perl and python nntp handling 
modules I've read about...  I've never actually worked with them, but 
I'd /hope/ they're easier to work with for people familiar with the RFCs.  
And given that I believe there's actually several different nntp modules 
to choose from, I expect I'd be more comfortable with at least ONE of 
them, than these Java classes...

But be that as it may, you're working with what you're working with, so 
let's try to deal with it...


FWIW, the RFC in question would appear to be rfc3977.  The GROUP command 
you referred to is covered in section 6.1.1, but it's worth reading about 
the related LISTGROUP (6.1.2), LAST (6.1.3) and NEXT (6.1.4) commands in 
section 6.1, Group and Article Selection, as well.  That can be found 
here:

http://tools.ietf.org/html/rfc3977#section-6.1

The GROUP command's syntax and response formats look like this (section 
6.1.1.1; response codes in general are covered in section 3.2):

   Syntax
     GROUP group

   Responses
     211 number low high group     Group successfully selected
     411                           No such newsgroup

   Parameters
     group     Name of newsgroup
     number    Estimated number of articles in the group
     low       Reported low water mark
     high      Reported high water mark
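
To make the 211 reply format concrete, here's a minimal Java sketch that 
pulls the three numbers and the group name out of such a line.  GroupReply 
is a made-up name for illustration; it is NOT the GNU inetlib 
GroupResponse class and isn't claimed to match how that class parses 
anything.

```java
// Hypothetical helper for illustration only; not part of GNU inetlib or JavaMail.
final class GroupReply {
    final long count;   // estimated number of articles in the group
    final long low;     // reported low water mark
    final long high;    // reported high water mark
    final String group; // name of the newsgroup

    private GroupReply(long count, long low, long high, String group) {
        this.count = count;
        this.low = low;
        this.high = high;
        this.group = group;
    }

    /** Parses a "211 number low high group" line; rejects anything else (e.g. 411). */
    static GroupReply parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length < 5 || !parts[0].equals("211")) {
            throw new IllegalArgumentException("not a 211 GROUP reply: " + line);
        }
        return new GroupReply(Long.parseLong(parts[1]), Long.parseLong(parts[2]),
                Long.parseLong(parts[3]), parts[4]);
    }
}
```

So GroupReply.parse("211 1234 3000234 3002322 misc.test") gives low 
3000234 and high 3002322, and it's that "high" value a client would 
compare against whatever it saved last time.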


As background, it's worth explicitly noting here the three article ID 
forms I mentioned earlier.

First, there's message-id, found as a header in the article, that's 
designed to be a GUID, globally unique ID.  Message-ID is covered in the 
generic Internet Message RFCs covering both mail and news.  Pan, BTW, 
uses the fact that message-ids are GUIDs in its message caching -- pan's 
message cache filenames are message-ids, with a bit of character 
substitution where necessary in order to sanely manage filesystem 
filename compatibility.  That works out pretty well with multi-server as 
well, since message-ids are supposed to be GUIDs and the same message 
will have the same message-id regardless of which server you fetch it 
from, so once the file is cached from one server, it's seen as already 
there and the other server fetch threads simply skip on to the next 
message.
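
As a rough illustration of the message-id-as-filename idea, here's a 
small Java sketch.  I don't know what substitution pan actually does, so 
the percent-style escaping and the CacheNames name below are purely my 
assumptions, chosen only because the mapping stays unique per message-id.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the escaping scheme is an assumption, not pan's actual one.
final class CacheNames {
    /** Maps a message-id to a filename, escaping every byte outside a small safe set. */
    static String fileNameFor(String messageId) {
        StringBuilder out = new StringBuilder();
        for (byte b : messageId.getBytes(StandardCharsets.UTF_8)) {
            char c = (char) (b & 0xFF);
            boolean safe = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9')
                    || c == '.' || c == '-' || c == '_' || c == '@';
            if (safe) {
                out.append(c);
            } else {
                // '%' itself is escaped too, so two different ids never collide.
                out.append('%').append(String.format("%02X", b & 0xFF));
            }
        }
        return out.toString();
    }
}
```

With that, a second server offering the same message-id maps to the same 
filename, so a simple "does the file exist" check is enough to skip the 
fetch.  (On a case-insensitive filesystem two ids differing only in case 
would still collide; the sketch ignores that.)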

The Java class methods do appear to accept message-id as a parameter in a 
number of cases, as do various RFC/NNTP commands.

Second, there's the RFC message numbers, per-server per-group sequential 
message numbering.  It is these numbers that the GROUP command reports 
for the low and high watermarks as listed above -- that's the first and 
last messages potentially available on the server at the time the 
response was issued.

These RFC-standard article numbers are what pan tracks in its newsrcs and 
newsgroups.xov, and are extremely commonly used in all sorts of news 
clients, because they're (nominally, see below) sequential and rather 
less free-form than message-ids tend to be, and thus comparatively easy 
to track and to work with.  The down side is that they're per-server, 
thus the need to reset them if a user changes news server, or if the news 
server itself gets rebuilt and didn't have backups allowing it to restart 
the numbering sequences where it left off.

Additionally, article numbers are /nominally/ sequential, but as rfc3977 
explicitly points out in a number of places, that does NOT mean that 
there are no gaps, or that articles persist consistently by number 
during a particular nntp session.  In particular, common server 
implementations assign article numbers on an incoming message server 
before they've been locally processed, despammed, forwarded to the front-
ends the users (or rather their news clients) actually contact, etc.  
Despamming and the like thus results in article sequence number gaps, and 
additionally, there's no guarantee in terms of local server processing 
order, so it's very common to see say 255346 come in and boost the 
highwater mark from 255205, before numbers 255206 thru 255345 appear.  
These late to transfer articles then "backfill" the sequence, and any 
client which updated after the high number boost but before the backfill 
that is NOT prepared for backfills, will simply miss those posts entirely!
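
To show why a bare high water mark misses backfills, here's a minimal 
Java sketch that instead remembers every article number it has seen.  
SeenArticles is a hypothetical class of my own, not anything from pan or 
the GNU classes, and keeping the whole set in memory is a simplification; 
a real client would more likely store ranges.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper; not pan's or GNU JavaMail's actual bookkeeping.
final class SeenArticles {
    private final TreeSet<Long> seen = new TreeSet<>();

    /** Records the numbers the server currently offers; returns the unseen ones, ascending. */
    List<Long> recordAndReturnNew(Iterable<Long> available) {
        List<Long> fresh = new ArrayList<>();
        for (Long n : available) {
            if (seen.add(n)) { // add() returns false if the number was already known
                fresh.add(n);
            }
        }
        Collections.sort(fresh);
        return fresh;
    }

    /** Highest number seen so far, or 0 if nothing has been seen. */
    long highWater() {
        return seen.isEmpty() ? 0L : seen.last();
    }
}
```

Using the numbers above: after a first update sees 255205 and 255346, a 
second update that offers 255206, 255345 and 255347 returns all three, 
where a plain "greater than 255346" test would only return 255347.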

(FWIW, from what I've seen pan does middling well with this.  It either 
catches most of the backfills or the backfill case isn't as common as 
I've been led to believe, but it can still be useful to manually "fetch 
all headers", as opposed to just new headers, occasionally, as doing so 
does seem to catch the occasional missed post.  They weren't late to 
server sequence numbering or they'd show up with a new headers fetch; 
they were article sequence number backfills that pan didn't catch on its 
own, that only show up with "fetch all headers".  But pan does WAY better 
at that than some other clients I've used, which would sometimes backfill 
more messages than had been fetched the first time!)

The NNTP LISTGROUP command is similar to GROUP, returning the same 211 
information, but in addition, it enumerates the articles actually 
available within the range, listing them one per line in an extended 
response after the initial 211 reply line.
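
For illustration, a minimal Java sketch of reading such a LISTGROUP 
reply, assuming the lines have already been read off the connection and 
had their line endings stripped.  ListGroupReply is a hypothetical name, 
not a class from the GNU libraries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; takes the reply as already-split lines, does no network I/O.
final class ListGroupReply {
    /** Returns the article numbers listed between the 211 status line and the lone "." line. */
    static List<Long> parse(List<String> lines) {
        if (lines.isEmpty() || !lines.get(0).startsWith("211")) {
            throw new IllegalArgumentException("not a 211 LISTGROUP reply");
        }
        List<Long> numbers = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            if (line.equals(".")) {
                return numbers; // a line holding only a dot ends the multi-line reply
            }
            numbers.add(Long.parseLong(line.trim()));
        }
        throw new IllegalArgumentException("reply ended before the terminating dot");
    }
}
```

The returned list only contains numbers the server says actually exist, 
so gaps in the range simply don't show up in it.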

The NNTP NEXT command can be used to iterate thru actually available 
posts, letting the server decide what the next one it has is, instead of 
the client having to guess.  Do however note the above caveat, that 
individual article numbers can appear and disappear over time within a 
session, so the NEXT ordering within a particular range as seen by two 
different clients that time their NEXT requests differently, isn't 
necessarily going to be consistent.  At minimum, I'd suggest an 
implementation using NEXT iterate repeatedly over a range, until no 
further articles appear.  The alternative of course, if the LISTGROUP 
command is available on a particular server, would be to use that to 
check the range again after the first run thru, to see if any further 
articles have appeared.

The NNTP LAST command, counterintuitively, fetches the PREVIOUS (not the 
last) article in the newsgroup, quoting rfc3977: "that is, the highest 
existing article number less than the current article number".  Again, 
the dynamic actually available article status caveat applies.
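
Since the LAST naming trips people up, here's a small Java model of the 
two commands' semantics over a fixed set of existing article numbers.  
It's only a local simulation (GroupCursor is a made-up class, and a real 
server's set can change under you); the 421 and 422 codes in the comments 
are the rfc3977 "no next" and "no previous" replies.

```java
import java.util.Collection;
import java.util.TreeSet;

// Hypothetical local model of NEXT/LAST; does not talk to any server.
final class GroupCursor {
    private final TreeSet<Long> existing;
    private long current;

    GroupCursor(Collection<Long> existing, long current) {
        this.existing = new TreeSet<>(existing);
        this.current = current;
    }

    /** NEXT: move to the lowest existing number above the current one; false models a 421. */
    boolean next() {
        Long n = existing.higher(current);
        if (n == null) {
            return false;
        }
        current = n;
        return true;
    }

    /** LAST: move to the highest existing number below the current one; false models a 422. */
    boolean last() {
        Long n = existing.lower(current);
        if (n == null) {
            return false;
        }
        current = n;
        return true;
    }

    long current() {
        return current;
    }
}
```

With articles 10, 12 and 15 and the cursor on 12, last() moves to 10 
(the previous article, not 15), and a further last() fails and leaves 
the cursor where it was.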

That covers the two RFC article id types, "article number" and "message-
id".

Now we get to the NNTPFolder class article numbers.  As mentioned above, 
these appear to be very similar in idea and function to the rfc "article 
numbers", but it's not AT ALL clear to me whether they're actually 
identical, or whether the java classes do their own independent 
numbering, acting as if they're a server of their own, with their own 
numbers, instead of using the server numbers used in the rfc NNTP 
protocol.

There are some cross-session "stateless" nntp client implementations.  
One example is lynx, the text-based browser, which DOES do nntp, but 
apparently without any way to save state between sessions, so what's 
actually available on the server when you connect is what you get.  The 
whole idea of cross-session stateless/cacheless nntp seems rather strange 
to me, but when you think about it, it's the way people /normally/ use 
the web, so it sort of makes sense for a browser net-news 
implementation.  (I had no idea lynx did news at all until I read about 
it somewhere and had to give it a go.  Sure enough!  Could come in handy 
some day when X isn't working so I can't use pan, but I remember seeing 
the problem discussed in a recent newsgroup post, I just have to get to 
it, in order to see the steps I need to do to get back into X!  
Actually, since I have my pan text instance set to unexpiring and a multi-
gig/multi-year cache, I could probably grep it out of there as well, but 
firing up lynx and heading for the newsgroup would likely be easier if it 
was recent and I remember subject and/or author details well enough to 
find it quickly.)

Reading the NNTPFolder docs, it occurs to me that if the "article 
numbers" they refer to are NOT identical to the rfc's server-supplied 
"article numbers", it may be that this javaclass implementation at least, 
is designed to be just that, cross-session stateless: you get what the 
server has available when you connect, no more, no memory of a previous 
session to save or worry about.

That would certainly simplify the implementation!  Unfortunately, it's 
not particularly useful, in the way nntp is traditionally used, at 
least.  For a particular "browsing" session, sure, but forget about 
saving state!

HOWEVER, it MAY be that the "article numbers" referred to are INDEED the 
rfc-version, as supplied by the server, in which case saving and 
recalling state DOES appear to be reasonable, since the current server 
state as seen by the GROUP commands, etc, is then by definition matchable 
against the previous session's state.
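
IF the numbers are the server's, the "range of new articles" you were 
after could be sketched in Java roughly like this.  ArticleRange and its 
since() method are hypothetical, the saved high water mark is whatever 
your app stored last session, and treating "saved mark above the 
server's current high" as a server reset is just my heuristic.  It also 
does nothing about gaps or late-arriving numbers inside the old range.

```java
// Hypothetical helper; assumes the numbers are the server's rfc3977 article numbers.
final class ArticleRange {
    final long first; // inclusive
    final long last;  // inclusive; the range is empty when first > last

    private ArticleRange(long first, long last) {
        this.first = first;
        this.last = last;
    }

    /** Range to fetch, given last session's saved high mark and the current GROUP low/high. */
    static ArticleRange since(long savedHigh, long low, long high) {
        if (savedHigh > high) {
            // Heuristic (assumption): numbering went backwards, so start over from low.
            return new ArticleRange(low, high);
        }
        // Skip anything already covered, but never ask below the current low water mark.
        return new ArticleRange(Math.max(savedHigh + 1, low), high);
    }

    boolean isEmpty() {
        return first > last;
    }
}
```

For example, since(255205, 250000, 255346) yields 255206 thru 255346, 
while calling it again with a saved mark of 255346 yields an empty range.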

Some bits of the documentation hint at this, tho as I said, I never could 
find anything explicitly STATING it.

So for instance, in regard to your deleted/expunged concern, note that 
the delete/expunge methods don't apply to NNTPFolder as it's read-only.  
At first, the read-only bit appears to back the stateless single-session 
thing, but it DOES mean that you don't have to worry about THAT 
particular renumbering issue.

And the "open" method, while the method enumeration at the top says it 
doesn't apply to NNTPFolder, down in the description, it actually says 
something different, that the "open" method is used to issue the GROUP 
command and to update current state.  So that's how NNTPFolder exposes 
the GROUP command...

Various other bits I was able to infer, by looking at the methods 
inherited from the parent Folder class, altho their NNTPFolder subclass 
usage and implementation differences aren't explicitly documented.

GetMessages appears to be the method that exposes article numbers for use 
in other commands.  In the generic/mail folders case, there's the delete/
expunge and renumbering to worry about, but since delete/expunge doesn't 
apply to NNTPFolder...

What I was looking for was something explicit that says that in NNTPFolder 
subclass (as opposed to the Folder parent class), article numbers aren't 
simply counted, but instead, the server sequence numbers are used.  I 
couldn't find it, but given the read-only nature of NNTPFolders and thus 
the elimination of the delete/expunge, etc, issues, it would seem to be a 
logical subclass extension.

And as best I can see, if the server article numbers ARE used, you'd then 
have a chance to compare and update state in a new session against the 
old one, since you'd have a means to measure current server status 
against saved previous status, while if article numbers are local-only, 
then like you, I don't see any way to measure current session server state 
against that of a previous session, so it may well be that this class 
implementation at least, is "session state only", much like that of lynx 
as I mentioned above, and what you see in that session is what you get, 
no saved state between sessions at all.

So as I said, were I working on the project, I'd be tooling up some 
experimental code right about now, to actually check out those article 
numbers, comparing them against the server assigned article numbers as 
seen in the actual net traffic sniffed with ngrep or the like.

Or, being the coder that I'm not, since the code is available, you can 
actually take a look at the implementing class code to see what it does, 
instead of doing the reverse engineering and experimentation that I'd do.


I don't know how well that answers your questions.  Certainly not as well 
as someone with actual experience with this javaclass code could have.  
But even in the few hours I've spent looking at it, I THINK (famous last 
words) I understand a bit more about it than the "frustrated and at a 
loss" you seemed to be expressing in your message.  Thus, hopefully, it's 
at least /some/ help. =:^/

But really, if you're not wedded to java for some reason, do consider 
looking at a python implementation.  I believe you'll find a reasonable 
amount of existing code, with multiple nntp helper modules to choose 
from, and that you'll find at least one of them rather saner than the java 
classes seen here.  Maybe you're more comfortable with java, I don't 
know, but I'd almost certainly be more comfortable with python, even tho 
I don't claim to be a python coder either.  And of course the same goes 
for perl; multiple helper modules should be available, as well as 
implementing code using them that you can study.  Except I personally 
prefer python, and have thought for several years that if I eventually 
progress beyond bash, python is my logical next step.

(I actually did look into learning perl, but decided it was a bit /too/ 
flexible for me; python's enforced formatting due to its use of 
formatting for block indication, among other things when compared against 
perl, appeals to me.  Plus, there's more python available in my 
environment to study, not a small factor considering that my practical 
knowledge of bash scripting originated with my taking apart and putting 
back together the various initscripts in my first Linux installations, 
Mandrake 8.x, back then.  That's actually why I took a look at perl first 
as well, as the Mandrake package manager was perl-based.  Now of course 
I'm on gentoo, whose portage package manager is python based, as is a 
second gentoo PM implementation, pkgcore; a third, paludis, is C++ 
based.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


