wayne posted <[EMAIL PROTECTED]>, excerpted below, on Sun, 07 Aug 2005 22:46:51 -0700:
> This explains a lot about pan's behavior. Thanks. It looks like a
> problem for all newsreaders. The method I use to determine if there
> is a binary attachment is to examine "lines". If it's over 1,000,
> then more than likely there is an attachment. I have, on occasion,
> seen messages with 10 lines that have an attachment, however. BTW,
> what does "lines" actually mean?

The problem with doing anything automated with lines is that, given Pan's less-than-perfect (for the reasons listed previously) detection of attachments, the last part of a multipart will sometimes get treated as a text message of less than X (whatever your cutoff is) lines. 1000 lines can be quite a lot, particularly with yEnc. I tend to use something closer to 250 lines, which might kill something, but if so, you usually still get the valuable part of the data anyway (as long as it's not a scrambled/encrypted binary, such that the last 250 lines end up being data from all over the binary, or unless it's an executable and therefore has to be byte-perfect).

What does "lines" actually mean? The answer to that is rather more complex than it used to be... Ready for an equally long and complex answer, one that will likely tell you more than you ever wanted to know about the topic? I hope so, because here it is! <g>

Consider... the RFCs (Requests for Comments, which is how the various documents that form today's Internet standards started out: as proposals, or requests for comments) for Internet news / NNTP were originally based *VERY* heavily on the RFCs for mail. That's one of the reasons both have nearly the same headers and are styled much the same, as well as why it's comparatively easy to create a dual-purpose news-and-mail application. In effect, the news spec was the mail spec, with only the changes necessary to turn what was a one-to-one medium, mail, into the one-to-many medium of news.

Also consider that it used to be far more common to have mail-to-news and news-to-mail gateways.
People in some corners of the net didn't directly connect, so all the news for a group for a week or a month would be compiled and delivered via CD-ROM or similar. They'd read thru it, compose any replies, and might then post them using mail, which they had access to, in such a way that a reply would go thru perhaps three or five different format transfers and gateways before it got to the mail2news gateway and was posted to the newsgroup. Thus, it wasn't uncommon for news threads to go on for weeks, with a week or more between posts from one side, the other, or both. With such strange setups, it only made sense to make the messages as nearly compatible as possible.

The mail RFCs, in turn, were designed with /great/ care to ensure compatibility over the widest range of equipment imaginable. Understand that all sorts of hardware was in use: mostly 40- or 80-character-wide, character-mode-only displays (no graphics), some of it printout-only (no CRT display at all), with internal character representations all over the map: ANSI/ASCII (American National Standards Institute / American Standard Code for Information Interchange, the most common character encoding to this day), EBCDIC (Extended Binary Coded Decimal Interchange Code), original BCD; little endian, big endian, variations on middle endian; 6-bit, 7-bit, 8-bit, even 9-bit and 10-bit chars, and even some fractional ones like 8.5-bit chars (two chars in 17 bits). On all these hardware platforms ran operating systems and applications, many of which conflicted as well, each with its OWN mail implementation, designed for local use well before any thought was given to interoperation. Out of all this /mess/, a common set of standards to aid interoperation of the various private networks was /somehow/ hammered out: the Internet standards, literally inter-network standards.
As a result, the standards didn't care /how/ mail was implemented internally, but only specified the message format, headers, and protocols that a message **HAD** to use to be conformant with the standard, and therefore to have any expectation at all of some semblance of interoperation. The result was a fairly confined set of a bit more than 64 "legal" characters (more than 6 bits' worth, so 7 bits became the practical minimum) that were implemented on all systems, and therefore could be expected to survive conversion between all the different sorts of systems a message might pass over in its journey. This consisted of the 26 letters in upper and lower case (therefore 52 letters), the 10 digits (making 62), and the most common punctuation characters, again ONLY including those common to all implementations. The headers and all additional elements of a message could not be composed of anything BUT those characters.

Here I'll pause to mention a couple of those characters specifically, due to their special meaning. To this day, various systems differ on their line-ending representation. The two choices are the Carriage Return (CR) character and the Line Feed (LF) character. Unix systems, and therefore Linux, represent a line end with the LF char. Macintosh systems traditionally represent it instead with the CR char. (Hmm... I just realized that I'm not sure /how/ OS X works in this case, since it's a Mac built on a Unix/BSD core. Maybe it recognizes both chars equally?) The Internet standards use both, in the familiar CRLF order, as do MS platforms. Open a plain-text file created in MS Windows on Linux, and unless the text editor hides it (as some do), you'll see an extraneous CR at the end of every line, before the LF that Linux/Unix recognizes as the line-termination character.
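The three conventions are easy to see at the byte level; here's a quick Python sketch (illustrative only, not from the original post):

```python
# The same two lines of text in the three traditional line-ending styles.
unix_text = b"line one\nline two\n"        # LF:   Unix/Linux
mac_text  = b"line one\rline two\r"        # CR:   classic Mac OS
dos_text  = b"line one\r\nline two\r\n"    # CRLF: MS platforms, and mail/news on the wire

# A line-ending-aware split recognizes all three equally...
for raw in (unix_text, mac_text, dos_text):
    assert raw.splitlines() == [b"line one", b"line two"]

# ...but a tool that only understands LF leaves a stray CR glued to
# every line of a CRLF file (the "extraneous CR" effect described above),
# and sees a CR-only file as one long run-together line.
assert dos_text.split(b"\n")[0] == b"line one\r"
assert mac_text.split(b"\n") == [b"line one\rline two\r"]
```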
Open a Unix file on MSWormOS, and you'll see lines stair-stepping down the page, because MSWormOS won't recognize the LF alone as a full line termination, so it just does a literal line feed in place, without returning to the left side (the CR part). Open a Mac file on MSWormOS, and you'll (probably; I'm not that familiar with Mac-format files) see a line overprinted many times, because there's the carriage return but no line feed to move the print point to the next line. Open a Mac file on Unix, or the reverse, a Linux file on Mac, and you'll see all the lines run together, with an unrecognized character where the line termination would normally be, since the line-termination character in the file isn't recognized as belonging in a text file on the platform it's being opened on.

As mentioned, the mail RFCs were originally designed only for text, using this very limited set of characters. Then folks decided they wanted to be able to send binaries around as attachments, and had to come up with a way to encode the (eventually standardized to 8-bit bytes) binary files using the limited set of available text characters. The first commonly used format for doing so was Unix-to-Unix encoding, aka UUE, still not uncommon today. However, there was never a formal standard for UUE, and there were occasional compatibility problems as a result. Eventually a standard proposal, entitled MIME (Multipurpose Internet Mail Extensions), came about. MIME really formalized more than just file encoding, however, specifying additional message headers and body subformats as necessary to create a solid framework for conveying much more information about the various parts of a post, how they interrelate, and what the file names are if a part is a file.
Among other things, this made possible multipart posts with a normal text portion, an HTML-format portion, and various additional sections with other attachments, such as pictures, that could be referenced from the HTML portion, allowing full web-page formatting, complete with images and anything else a web page might include.

Two encoding formats were part of the MIME specifications. Quoted-printable encoding was designed primarily for text, and remains readable in "raw" form as well, therefore providing a level of backward compatibility with non-MIME clients. Part of the MIME/QP subspec is an escape char, used to tell a MIME-compatible client that the following two characters (hex digits) are to be interpreted as the hexadecimal representation of a character code, usually a non-printable character, or at least one outside the allowed common set. ("=" is that escape character, so any time a literal = appears in the content, the raw form of the message encodes it as =3D, the hexadecimal ASCII code for =, since = is the escape character and thus can't represent itself. Perhaps you've seen messages with such occasional =3D, or other =XX sequences? Now you know what they mean and why they're there!) Because of this escape char, it was /possible/ to encode binary data as quoted-printable, but doing so required three characters for every byte of non-printable binary data, a **VERY** inefficient method of encoding such data.

The second MIME encoding format, base64, was therefore designed with binary data in mind. From the name, and because I mentioned it earlier, you can probably guess that it uses 64 printable characters, the subset judged least troublesome in terms of retention across all the various platforms, in its encoding. 64 is 2^6, so each printed char in base64 represents exactly six bits of data, while taking an 8-bit char to do so.
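Both MIME encodings can be demonstrated with Python's standard quopri and base64 modules. A minimal sketch (not from the original post; the 4:3 figure follows from packing 6 bits of data into each 8-bit output char):

```python
import base64
import quopri

# Quoted-printable: plain ASCII passes through unchanged, so the raw
# message stays human-readable...
assert quopri.encodestring(b"Hello, world") == b"Hello, world"

# ...while the escape char "=" can't represent itself, so it becomes
# =3D (hex 3D = decimal 61, the ASCII code for "=").
assert quopri.encodestring(b"1 + 1 = 2") == b"1 + 1 =3D 2"

# Any byte outside the safe set costs three output chars per byte:
# the two UTF-8 bytes of the e-acute in "café" become =C3=A9.
assert quopri.encodestring("café".encode("utf-8")) == b"caf=C3=A9"

# Base64: every 3 input bytes become exactly 4 output chars, a fixed
# 33% expansion regardless of content.
payload = bytes(range(256)) * 3                # 768 bytes of arbitrary binary
encoded = base64.b64encode(payload)
assert len(encoded) == len(payload) * 4 // 3   # 1024 chars for 768 bytes
assert base64.b64decode(encoded) == payload    # lossless round trip
```

For actual mail transport, `base64.encodebytes()` additionally wraps the output at 76 chars per line, which costs a little extra for the CRLFs.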
Thus, the encoding expansion ratio is fixed: 4 chars represent 3 8-bit bytes of data, an increase in size of exactly 33% over the unencoded file size.

Despite, or really /because/ /of/, the fact that base64 was designed to be as efficient as possible at encoding binary data, given the limitations they were working with, encoding text as base64 makes it unreadable from the human perspective. Thus spammers, for example, often choose to encode their messages as base64, preventing most filters and the like from successfully blocking them. They'll package that message along with an HTML part whose script unpacks and displays the previously obscured message. Thus, for those such as myself who don't like HTML messages due to the security implications anyway, filtering any HTML message as spam or malware neatly covers both the security and HTML-spam bases with the same filter.

FWIW, the RFCs that set out the MIME specifications are RFCs 2045-2049. You can google them if interested in doing a bit more study on the topic. These were some of the first RFCs I ever read, and they are surprisingly interesting, not as technical and dry as I would have imagined, and /certainly/ far easier reading than the typical legalese on most stuff you sign (which you /do/ read before signing, right?). It was really amazing to me the extent to which they went to be backward compatible, yet how powerful and flexible the spec actually is.

OK, with all that covered, I can now deal with /half/ the answer to your question. yEnc changes the picture dramatically. Meanwhile...

While RFC 822 (the Internet message format standard; the transfer protocol itself, SMTP, the Simple Mail Transfer Protocol, is RFC 821) specifies a maximum line length of 1000 characters, 998 plus the standard terminating CRLF, many early clients really didn't work too well with line lengths exceeding the 80-char standard display width I mentioned above, forcing inconvenient manual horizontal scrolling, if they allowed one to view more than 80 characters of width at all.
This is the origin of the 80-character maximum (normal wrap at 72, to allow for quoting) line-length netiquette that forms part of the GNKSA test for news clients, which Pan complies with; hence, it's the reason Pan wraps at 72 chars when wrap-text is enabled. For this reason of compatibility with RFC-NON-compliant, 80-char-limited implementations, altho the spec allowed 998 characters of text per line, MIME restricts raw lines to roughly the same width: 78 chars plus terminating CRLF for text, and no more than 76 chars per line for base64-encoded content. (Note that with the format=flowed attribute, a MIME client observing it actually processes an entire paragraph as a single line, wrapping it dynamically to the window or other specified size, instead of arbitrarily at 72-78 chars.) Likewise, while UUE as an informal standard doesn't specify a hard line-wrap limit, traditionally it too limits lines to 72-78 chars, just to be safe.

Thus, the historical (i.e. non-yEnc) answer to your question would be that "lines" indicates the rough number of max-78-char lines within the post, and from that, given the 4:3 expansion ratio that both UUE and MIME base64 approximate, one can get a pretty good estimate of the size of any attached binary. Your 1000 lines at 78 chars per line would therefore roughly equate to 75 KB of encoded (still expanded) data, maximum (a quick estimate, subtracting 3 KB for headers, any included comment, and the fact that a data K is 1024 bytes, while you said 1000 lines, not 1024). At the 4:3 expansion ratio, that's about 57 KB of decoded binary data. Your 1000-line limit traditionally equates, therefore, to a 57 KB file.

As I previously hinted, yEnc changes these assumptions dramatically.
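The back-of-the-envelope arithmetic above can be written out as a tiny helper (a sketch only; 78 chars per line and the 4:3 ratio are the assumptions from the text, and the function name is mine):

```python
def estimate_decoded_kb(lines, chars_per_line=78):
    """Rough decoded size, in KB, of a UUE/base64-encoded binary
    spanning `lines` encoded lines of at most `chars_per_line` chars."""
    encoded_bytes = lines * chars_per_line      # upper bound on encoded size
    decoded_bytes = encoded_bytes * 3 // 4      # undo the 4:3 expansion
    return decoded_bytes / 1024

# The 1000-line cutoff discussed above:
print(round(estimate_decoded_kb(1000)))   # -> 57 (KB)
```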
While both UUE and MIME base64 encoding were originally designed for mail, and the news specs themselves quite literally originated in the mail specs, the modern Internet news transport system is in practice rather less limited than that, and is, with the exception of certain formatting and protocol issues such as the 998-character line limit and the significance of the CRLF sequence, generally full-8-bit binary compatible. yEnc is an informal encoding scheme (much like UUE in that respect) that is specifically designed to work with Internet news, maximizing the efficiency of binary encoding for news, and ONLY news. Thus, it specifically and deliberately breaks the possibility of easy and direct conversion between mail, with its stricter limits, and news, with its far less strict limits in practice, sacrificing that in order to gain encoding efficiency. As a result of that sacrifice, instead of the 33% overhead of the traditional UUE and MIME encoding methods, yEnc manages to work with only about 5% overhead, altho it's a bit more variable than that of base64.

One of the things yEnc does away with, in the interest of efficiency, is that old 78/80-characters-per-line limit. Instead, it goes with the 998/1000-character (byte, really, since it's not all text character data) per-line limit. Note that as yEncoded, not /all/ lines will be the full 998; some will be less, perhaps only 5 or 10 characters. However, the vast majority of the lines will be 998 chars plus CRLF, so one can estimate an average line length of perhaps 950 chars (bytes). 950 chars per line times your 1000 lines is 950,000 bytes. A quick estimate, therefore, puts your 1000-line limit at 900-ish KB, say 0.9 MB, of data post-decoding.

Your 1000-line limit therefore equates to only about 57 KB of data if traditionally (UUE or MIME base64) encoded, a pretty safe assumption that you aren't missing anything binary. If it's yEncoded, however, the result is VERY different.
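The yEnc version of the same estimate, under the assumptions above (~950 bytes per encoded line, ~5% overhead; a sketch, with a function name of my own invention):

```python
def estimate_yenc_decoded_kb(lines, bytes_per_line=950, overhead=0.05):
    """Rough decoded size, in KB, of a yEnc-encoded binary spanning
    `lines` encoded lines, assuming ~950 bytes/line and ~5% overhead."""
    encoded_bytes = lines * bytes_per_line
    return encoded_bytes / (1 + overhead) / 1024

# The same 1000-line cutoff now hides roughly 0.9 MB rather than ~57 KB:
print(estimate_yenc_decoded_kb(1000))
```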
You could be missing 9/10-MB (0.9 MB) files, or the last part of a multipart not detected as such, up to 9/10 MB in size! That's a decent-size JPEG in the image groups, tho it's not likely to be all that significant at the end of a half-hour MPEG. Of course, if it's an executable or something else that needs to be byte-for-byte perfect, it probably doesn't matter what size the missing part is; it can ruin the download. The same goes, of course, if you are one of those compulsive types where the CRCs /must/ match against the official list, or you might as well dump it.

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman in
http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html

_______________________________________________
Pan-users mailing list
Pan-users@nongnu.org
http://lists.nongnu.org/mailman/listinfo/pan-users