Thufir Hawat posted on Mon, 18 Feb 2013 01:38:03 +0000 as excerpted: > [Developer question: NNTP app in Java]
FWIW, your message is a bit like some of mine, more stream of consciousness than well edited and organized. That does make a response a bit harder without quoting the whole long post, as finding the actual bit to reply to and getting the proper context is difficult. I guess I'm seeing a bit of how hard it must be to reply to some of my posts. Anyway, I've slightly edited and rearranged the order, etc...

> I'm looking at the source for:
> http://developer.classpath.org/inet/doc/gnu/inet/nntp/GroupResponse-source.html
>
>   /*
>    * The last article number in the group.
>    */
>   public int last;
>
> which looks like last should be the number for the last article for a
> group.  Now, when checking for new articles, what is that number
> compared to?
>
> http://cvs.savannah.gnu.org/viewvc/*checkout*/mail/source/gnu/mail/providers/nntp/NNTPFolder.java?root=classpathx&content-type=text%2Fplain
>
>   GroupResponse response = ns.connection.group(name);
>   if (response.last > last)
>   {
>     hasNew = true;
>   }
>
> I'm just trying to figure out how, when connecting to a new server, do you
> know what was the article number for the latest article?  Is that kept,
> generally, in the .newsrc perhaps?

FWIW, the newsrc tracks read messages, not already seen (but possibly unread) messages. The (multi-app-standard) newsrc file assumes only a single server, so multiple newsrc files must be used when there's more than a single server.

In pan, it's the newsgroups.xov file that tracks already seen messages -- the server highwater marks. AFAIK, unlike the newsrc file format, newsgroups.xov isn't common to other news clients, and it contains entries for multiple servers. (A comment in the file lists the specific format.)

> My concern is that the "id" isn't reliable:
>
> http://docs.oracle.com/javaee/6/api/javax/mail/Message.html#getMessageNumber%28%29
> "Note that the message number for a particular Message can change during
> a session if other messages in the Folder are deleted and expunged."
>
> Because, when javax mail (which is utilized in this context) loads a
> folder, it simply *counts* the number of messages in a given folder.
>
> How does pan handle this?  For simplicity, let's assume just one server
> is being accessed.  Pan keeps the latest article number in a .newsrc
> file and then iterates up?

No... more later.

> What I'm after is not just a method, as above, to check for new articles
> but to return a range of articles which are new -- something along those
> lines.
>
> Or, maybe, another approach is to just keep the latest article number
> increment it, and request the article until errors are caught.  However,
> that assumes there are no gaps in the article numbers.  And, still, the
> article number *must* be stored somewhere.
>
> It all starts with *getting* the article number.  Apparently
> NNTPFolder.java is using GroupResponse to handle the article number, so
> I should be also using GroupResponse and see what article numbers it
> gets?

I don't claim to be a coder, tho I can sort of follow along on many coding discussions and do occasional limited patching, etc. And Java... wouldn't exactly be my choice were I to try to become a coder. After I spent a couple hours last nite trying to make sense of the docs at the various links you provided, I /think/ I have some sense of it, but I'm more sure than ever that Java isn't my choice of coder's beverage, by a LONG shot!

Anyway, it seems there are three sets of... article IDs... we're looking at, two from the RFCs, and a third from the Java classes you're working with. The classes' ID is very similar in idea and function to one of the two RFC IDs, and *MAY* be identical to it in the NNTPFolder subclassing of the general Folder class, but I'm not sure -- I couldn't find anything that actually /said/ that, one way or the other.
But the similarity, without knowing if they're identical, makes things extremely confusing, because reading the docs I had to keep reminding myself that the article numbering they were talking about was the local-client classes' numbering, not the one from the server (RFC standard)... unless they're identical in the case of NNTPFolder, which I never did figure out.

> However,
> the GNU javamail NNTP API seems to have no provision for directly seeing
> those article numbers:
>
> http://www.gnu.org/software/classpathx/javamail/javadoc/gnu/mail/providers/nntp/NNTPFolder.html
>
> There's just no method listed for dealing with article numbers, they're
> encapsulated, which I guess is good.  But they're encapsulated so well I
> don't see how to get *new* articles without re-fetching everything.

(In the below I think I use NNTPFolder and FolderNNTP interchangeably, forgetting which one, NNTPFolder, was actually used. So if you see a reference to FolderNNTP that I missed changing, read it as NNTPFolder.)

There are references to article numbering in some of the methods. But as I said, it's ANYTHING but clear whether the article numbering they refer to is identical to the RFC one the server's using, or if it's a local-only classes version that works similarly, but is independent from the RFC article numbers the server is passing.

Were I working on a project using those classes, I'd now be hacking up some experimental code to actually SEE the article numbers the classes are using, and compare them to what I was seeing actually being passed from the server, using ngrep or similar connection sniffing. That'd answer once and for all whether they were identical, or not.

> As suggested here:
>
> I've been reading RFC's, but that doesn't help with determining what GNU
> javamail is actually doing, versus what it's supposed to do.  (I really
> don't like the Apache API at all -- but if there's a Java API someone
> knows works for this, that would be interesting.
> The GNU API is very clean, just maybe **too** clean.)

I can't help but think about the various perl and python nntp handling modules I've read about... I've never actually worked with them, but I'd /hope/ they're easier to work with for people familiar with the RFCs. And given that I believe there are actually several different nntp modules to choose from, I expect I'd be more comfortable with at least ONE of them than with these Java classes...

But be that as it may, you're working with what you're working with, so let's try to deal with it...

FWIW, the RFC in question would appear to be rfc3977. The GROUP command you referred to is covered in section 6.1.1, but it's worth reading about the related LISTGROUP (6.1.2), LAST (6.1.3) and NEXT (6.1.4) commands in section 6.1, Group and Article Selection, as well. That can be found here:

http://tools.ietf.org/html/rfc3977#section-6.1

The GROUP command syntax and response formats (response codes are covered in section 3.2) look like this (section 6.1.1.1):

   Syntax
     GROUP group

   Responses
     211 number low high group     Group successfully selected
     411                           No such newsgroup

   Parameters
     group     Name of newsgroup
     number    Estimated number of articles in the group
     low       Reported low water mark
     high      Reported high water mark

As background, it's worth explicitly noting here the three article ID forms I mentioned earlier.

First, there's message-id, found as a header in the article, that's designed to be a GUID, a globally unique ID. Message-ID is covered in the generic Internet Message RFCs covering both mail and news. Pan, BTW, uses the fact that message-ids are GUIDs in its message caching -- pan's message cache filenames are message-ids, with a bit of character substitution where necessary in order to sanely manage filesystem filename compatibility.
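To make the filename idea concrete, here's a minimal sketch of turning a message-id into a filesystem-safe name. This is NOT pan's actual substitution scheme, which isn't spelled out here; the class name, the set of "safe" characters and the percent-style escaping are all assumptions chosen purely for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not pan's code): message-id to cache filename. */
final class MessageIdCacheName {
    private MessageIdCacheName() {}

    /**
     * Keeps ASCII letters, digits and a few punctuation characters as-is
     * and replaces every other byte with %XX. Since '%' itself is escaped,
     * two different message-ids never map to the same filename.
     */
    static String toFileName(String messageId) {
        StringBuilder out = new StringBuilder();
        for (byte b : messageId.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean safe = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9')
                    || c == '.' || c == '-' || c == '_' || c == '@';
            if (safe) {
                out.append((char) c);
            } else {
                out.append('%').append(String.format("%02X", c));
            }
        }
        return out.toString();
    }
}
```

With this sketch, `MessageIdCacheName.toFileName("<abc/1@example.com>")` gives `%3Cabc%2F1@example.com%3E`, so a cache lookup is just a file-exists check on that name, whichever server the article came from.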
Naming cache files by message-id works out pretty well with multi-server as well, since message-ids are supposed to be GUIDs and the same message will have the same message-id regardless of which server you fetch it from. Once the file is cached from one server, it's seen as already there and the other server fetch threads simply skip on to the next message. The Java class methods do appear to accept message-id as a parameter in a number of cases, as do various RFC/NNTP commands.

Second, there are the RFC article numbers, per-server per-group sequential message numbering. It is these numbers that the GROUP command reports for the low and high watermarks as listed above -- that's the first and last messages potentially available on the server at the time the response was issued. These RFC-standard article numbers are what pan tracks in its newsrcs and newsgroups.xov, and they're extremely commonly used in all sorts of news clients, because they're (nominally, see below) sequential and rather less free-form than message-ids tend to be, and thus comparatively easy to track and to work with.

The down side is that they're per-server, thus the need to reset them if a user changes news server, or if the news server itself gets rebuilt and didn't have backups allowing it to restart the numbering sequences where it left off.

Additionally, article numbers are /nominally/ sequential, but as rfc3977 explicitly points out in a number of places, that does NOT mean that there are no gaps, or that there's a consistent persistence of articles by number during a particular nntp session. In particular, common server implementations assign article numbers on an incoming message server before the messages have been locally processed, despammed, forwarded to the front-ends the users (or rather their news clients) actually contact, etc.
Despamming and the like thus results in article sequence number gaps, and additionally, there's no guarantee in terms of local server processing order, so it's very common to see, say, 255346 come in and boost the highwater mark from 255205, before numbers 255206 thru 255345 appear. These late-to-transfer articles then "backfill" the sequence, and any client that updated after the high number boost but before the backfill, and that is NOT prepared for backfills, will simply miss those posts entirely!

(FWIW, from what I've seen pan does middling well with this. Either it catches most of the backfills or the backfill case isn't as common as I've been led to believe, but it can still be useful to manually "fetch all headers", as opposed to just new headers, occasionally, as doing so does seem to catch the occasional missed post. Those weren't late to server sequence numbering or they'd show up with a new headers fetch; they were article sequence number backfills that pan didn't catch on its own, that only show up with "fetch all headers". But pan does WAY better at that than some other clients I've used, which would sometimes backfill more messages than they had fetched the first time!)

The NNTP LISTGROUP command is similar to GROUP, returning the same 211 information, but in addition, it enumerates the article numbers actually available within the range, listing them one per line in an extended response after the initial 211 reply line.

The NNTP NEXT command can be used to iterate thru actually available posts, letting the server decide what the next one it has is, instead of the client having to guess. Do however note the above caveat, that individual article numbers can appear and disappear over time within a session, so the NEXT ordering within a particular range, as seen by two different clients that time their NEXT requests differently, isn't necessarily going to be consistent.
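The backfill case above boils down to a set difference: whatever the server enumerates now, minus whatever the client has already seen. A minimal sketch, assuming the client keeps its own set of seen article numbers and has already read the LISTGROUP numbers into a collection (the class and method names are made up for illustration, and the network side is left out entirely):

```java
import java.util.Collection;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical helper: finds articles that appeared after an earlier fetch. */
final class BackfillCheck {
    private BackfillCheck() {}

    /**
     * @param listed article numbers the server enumerates now (e.g. via LISTGROUP)
     * @param seen   article numbers the client fetched in earlier passes
     * @return numbers present on the server but not yet seen, in ascending order
     */
    static SortedSet<Long> missing(Collection<Long> listed, Set<Long> seen) {
        SortedSet<Long> result = new TreeSet<>(listed);
        result.removeAll(seen);
        return result;
    }
}
```

A client that had seen 255205 and 255346, and later gets 255205, 255206, 255210 and 255346 back from the server, would get 255206 and 255210 as the backfilled numbers to fetch. Note that this only works if the client remembers individual numbers (or ranges) below the highwater mark, not just the mark itself.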
At minimum, I'd suggest that an implementation using NEXT iterate repeatedly over the range until no further articles appear. The alternative of course, if the LISTGROUP command is available on a particular server, would be to use that to check the range again after the first run thru, to see if any further articles have appeared.

The NNTP LAST command, counterintuitively, selects the PREVIOUS (not the last) article in the newsgroup, quoting rfc3977: "that is, the highest existing article number less than the current article number". Again, the dynamic actually-available article status caveat applies.

That covers the two RFC article id types, "article number" and "message-id".

Now we get to the NNTPFolder class article numbers. As mentioned above, these appear to be very similar in idea and function to the rfc "article numbers", but it's not AT ALL clear to me whether they're actually identical, or whether the Java classes do their own independent numbering, acting as if they're a server of their own, with their own numbers, instead of using the server numbers used in the rfc NNTP protocol.

There are some cross-session "stateless" nntp client implementations. One example is lynx, the text-based browser, which DOES do nntp, but apparently without any way to save state between sessions, so what's actually available on the server when you connect is what you get. The whole idea of cross-session stateless/cacheless nntp seems rather strange to me, but when you think about it, it's the way people /normally/ use the web, so it sort of makes sense for a browser net-news implementation.

(I had no idea lynx did news at all until I read about it somewhere and had to give it a go. Sure enough! Could come in handy some day when X isn't working so I can't use pan, but I remember seeing the problem discussed in a recent newsgroup post, and I just have to get to it in order to see the steps I need to do to get back into X!
Actually, since I have my pan text instance set to unexpiring and a multi-gig/multi-year cache, I could probably grep it out of there as well, but firing up lynx and heading for the newsgroup would likely be easier if it was recent and I remember subject and/or author details well enough to find it quickly.)

Reading the NNTPFolder docs, it occurs to me that if the "article numbers" they refer to are NOT identical to the rfc's server-supplied "article numbers", it may be that this Java class implementation, at least, is designed to be just that: cross-session stateless. You get what the server has available when you connect, no more, with no memory of a previous session to save or worry about. That would certainly simplify the implementation! Unfortunately, it's not particularly useful, in the way nntp is traditionally used, at least. For a particular "browsing" session, sure, but forget about saving state!

HOWEVER, it MAY be that the "article numbers" referred to are INDEED the rfc version, as supplied by the server, in which case saving and recalling state DOES appear to be reasonable, since the current server state as seen by the GROUP command, etc, is then by definition matchable against the previous session's state. Some bits of the documentation hint at this, tho as I said, I never could find anything explicitly STATING it.

So for instance, in regard to your deleted/expunged concern, note that the delete/expunge methods don't apply to NNTPFolder as it's read-only. At first, the read-only bit appears to back the stateless single-session thing, but it DOES mean that you don't have to worry about THAT particular renumbering issue.

And as for the "open" method: while the method enumeration at the top says it doesn't apply to NNTPFolder, the description further down actually says something different, that "open" is used to issue the GROUP command and to update current state. So that's how NNTPFolder exposes the GROUP command...
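IF the numbers do turn out to be the server's own, matching saved state against a fresh GROUP reply could look something like the sketch below. It parses the rfc3977 "211 number low high group" line and works out which range to request. It's plain Java with no javamail involved; the class and method names are made up for illustration, and treating a saved mark above the server's current high as "numbering was reset, start over" is an assumption, not something the RFC or the classes prescribe.

```java
/** Hypothetical holder for a parsed "211 number low high group" reply. */
final class GroupStatus {
    final long count;
    final long low;
    final long high;
    final String group;

    private GroupStatus(long count, long low, long high, String group) {
        this.count = count;
        this.low = low;
        this.high = high;
        this.group = group;
    }

    /** Parses the first line of a successful GROUP reply. */
    static GroupStatus parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length < 5 || !f[0].equals("211")) {
            throw new IllegalArgumentException("not a 211 GROUP reply: " + line);
        }
        return new GroupStatus(Long.parseLong(f[1]), Long.parseLong(f[2]),
                Long.parseLong(f[3]), f[4]);
    }

    /**
     * Returns {first, last} article numbers worth requesting, or null if
     * there is nothing new. Gaps inside the range are still possible.
     */
    static long[] newRange(GroupStatus now, long savedHigh) {
        // Assumption: a saved mark above the current high means the server
        // renumbered (or it's a different server), so start from the low mark.
        long from = (savedHigh > now.high) ? now.low
                                           : Math.max(savedHigh + 1, now.low);
        if (now.count == 0 || from > now.high) {
            return null;
        }
        return new long[] {from, now.high};
    }
}
```

For the rfc's own example reply, `211 1234 3000234 3002322 misc.test`, a saved highwater mark of 3002000 gives the range 3002001 thru 3002322, while a saved mark of 3002322 gives nothing new. The saved mark itself still has to live somewhere between sessions, which is the job pan gives to newsgroups.xov.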
Various other bits I was able to infer by looking at the methods inherited from the parent Folder class, altho their NNTPFolder subclass usage and implementation differences aren't explicitly documented. getMessages appears to be the method that exposes article numbers for use in other commands. In the generic/mail folders case, there's the delete/expunge and renumbering to worry about, but since delete/expunge doesn't apply to NNTPFolder...

What I was looking for was something explicit that says that in the NNTPFolder subclass (as opposed to the Folder parent class), article numbers aren't simply counted, but instead, the server sequence numbers are used. I couldn't find it, but given the read-only nature of NNTPFolders and thus the elimination of the delete/expunge, etc, issues, it would seem to be a logical subclass extension.

And as best I can see, if the server article numbers ARE used, you'd then have a chance to compare and update state in a new session against the old one, since you'd have a means to measure current server status against saved previous status. If article numbers are local-only (simply counted, as you describe), I don't see any way to measure current session server state against that of a previous session, so it may well be that this class implementation, at least, is "session state only", much like that of lynx as I mentioned above: what you see in that session is what you get, with no saved state between sessions at all.

So as I said, were I working on the project, I'd be tooling up some experimental code right about now, to actually check out those article numbers, comparing them against the server-assigned article numbers as seen in the actual net traffic sniffed with ngrep or the like.

Or, you being the coder that I'm not, since the code is available, you can actually take a look at the implementing class code to see what it does, instead of doing the reverse engineering and experimentation that I'd do.

I don't know how well that answers your questions.
Certainly not as well as someone with actual experience with this Java class code could have. But even in the few hours I've spent looking at it, I THINK (famous last words) I understand a bit more about it than the "frustrated and at a loss" you seemed to be expressing in your message. Thus, hopefully, it's at least /some/ help. =:^/

But really, if you're not wedded to Java for some reason, do consider looking at a python implementation. I believe you'll find a reasonable amount of existing code, with multiple nntp helper modules to choose from, and that you'll find at least one of them rather saner than the Java classes seen here. Maybe you're more comfortable with Java, I don't know, but I'd almost certainly be more comfortable with python, even tho I don't claim to be a python coder either.

And of course the same goes for perl; multiple helper modules should be available, as well as implementing code using them that you can study. Except I personally prefer python, and have thought for several years that if I eventually progress beyond bash, python is my logical next step.

(I actually did look into learning perl, but decided it was a bit /too/ flexible for me; python's enforced formatting, due to its use of indentation for block indication, among other things when compared against perl, appeals to me. Plus, there's more python available in my environment to study, not a small factor considering that my practical knowledge of bash scripting originated with my taking apart and putting back together the various initscripts in my first Linux installations, Mandrake 8.x, back then. That's actually why I took a look at perl first as well, as the Mandrake package manager was perl-based. Now of course I'm on gentoo, with its portage package manager being python based, as is a second gentoo PM implementation, pkgcore. There's a third as well, paludis, but it's C++ based.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master."  Richard Stallman