wayne posted <[EMAIL PROTECTED]>, excerpted below, on Sun, 07 Aug 2005 22:46:51 -0700:
> This explains a lot about pan's behavior. Thanks. It looks like a
> problem for all newsreaders. The method I use to determine if there
> is a binary attachment is to examine "lines". If it's over 1,000,
> then more than likely there is an attachment. I have, on occasion,
> seen messages with 10 lines that have an attachment, however. BTW,
> what does "lines" actually mean?

The problem with doing anything automated with lines is that, given Pan's less-than-perfect (for the reasons listed previously) detection of attachments, the last part of a multipart will sometimes get treated as a text message of less than X (whatever your cutoff is) lines. 1000 lines can be quite a lot, particularly with yEnc. I tend to use something closer to 250 lines, which might kill something, but if so, you usually still get the valuable part of the data anyway (as long as it's not a scrambled/encrypted binary, such that the last 250 lines end up being data from all over the binary, or unless it's an executable and therefore has to be byte-perfect).

What does "lines" actually mean? The answer to that is rather more complex than it used to be... Ready for an equally long and complex answer, one that will likely tell you more than you ever wanted to know about the topic? I hope so, because here it is! <g>

Consider... the RFCs (Requests for Comments, which is how the various documents that form today's Internet standards started out: as proposals, or requests for comments) for Internet news / NNTP were originally based *VERY* heavily on the RFCs for mail. That's one of the reasons both have nearly the same headers and are styled much the same, as well as why it's comparatively easy to create a dual-purpose news-and-mail application. In effect, the news spec was the mail spec, with only the changes necessary to turn what was a one-to-one medium, mail, into the one-to-many medium of news.

Also consider that it used to be far more common to have mail-to-news and news-to-mail gateways.
People in some corners of the net didn't directly connect, so all the news for a group for a week or a month would be compiled and delivered via CD-ROM or similar. They'd read thru it, compose any replies, and might then post them using mail, which they had access to, in such a way that a reply would go thru perhaps three or five different format transfers and gateways before it got to the mail2news gateway and was posted to the newsgroup. Thus, it wasn't uncommon for news threads to go on for weeks, with a week or more between posts from one side, the other, or both. With such strange setups, it only made sense to make the messages as nearly compatible as possible.

The mail RFCs, in turn, were designed with /great/ care to ensure compatibility over the widest range of equipment imaginable. Understand that all sorts of hardware was in use: mostly 40- or 80-character-wide, character-mode-only displays (no graphics), some of it printout-only (no CRT display at all), with internal character representations all over the map: ANSI/ASCII (American National Standards Institute / American Standard Code for Information Interchange, the most common character encoding to this day), EBCDIC (Extended Binary Coded Decimal Interchange Code), original BCD; little endian, big endian, variations on middle endian; 6-bit, 7-bit, 8-bit, even 9-bit and 10-bit chars, and even some fractional ones like 8.5-bit chars (two chars in 17 bits). On all these hardware platforms ran operating systems and applications, many of which conflicted as well, each with its OWN mail implementation, designed for local use well before any thought was given to interoperation. Out of all this /mess/, a common set of standards to aid interoperation of the various private networks was /somehow/ hammered out: the Internet standards, literally inter-network standards.
As a result, the standards didn't care /how/ mail was implemented internally, but only specified the message format, headers, and protocols that a message **HAD** to use to be conformant with the standard, and therefore to have any expectation at all of some semblance of interoperation. The result was a fairly confined set of a bit more than 64 "legal" characters (more than 6 bits' worth, so 7 bits became the practical minimum) that were implemented on all systems, and therefore could be expected to survive conversion between all the different sorts of systems a message might pass over in its journey. This consisted of the 26 letters in upper and lower case (therefore 52 letters), the 10 digits (making 62), and the most common punctuation characters, again ONLY including those common to all implementations. The headers and all additional elements of a message could not be composed of anything BUT those characters.

Here I'll pause to mention a couple of those characters specifically, due to their special meaning. To this day, various systems differ on their line-ending representation. The two choices are the Carriage Return (CR) character and the Line Feed (LF) character. Unix systems, and therefore Linux, represent a line end with the LF char. Macintosh systems traditionally represent it instead with the CR char. (Hmm... I just realized that I'm not sure /how/ OS X works in this case, since it's a Mac built on a Unix/BSD core. Maybe it recognizes both chars equally?) The Internet standards use both, in the familiar CRLF order, as do MS platforms. Open a plain-text file created in MS Windows on Linux, and unless the text editor hides it (as some do), you'll see an extraneous CR at the end of every line, before the LF that Linux/Unix recognizes as the line-termination character.
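The three conventions are easy to see at the byte level; here's a quick Python sketch (illustrative only, not from the original post):

```python
# The same two lines of text in the three traditional line-ending styles.
unix_text = b"line one\nline two\n"        # LF:   Unix/Linux
mac_text  = b"line one\rline two\r"        # CR:   classic Mac OS
dos_text  = b"line one\r\nline two\r\n"    # CRLF: MS platforms, and mail/news on the wire

# A line-ending-aware split recognizes all three equally...
for raw in (unix_text, mac_text, dos_text):
    assert raw.splitlines() == [b"line one", b"line two"]

# ...but a tool that only understands LF leaves a stray CR glued to
# every line of a CRLF file (the "extraneous CR" effect described above),
# and sees a CR-only file as one long run-together line.
assert dos_text.split(b"\n")[0] == b"line one\r"
assert mac_text.split(b"\n") == [b"line one\rline two\r"]
```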
Open a Unix file on MSWormOS, and you'll see lines stair-stepping down the page, because MSWormOS won't recognize the LF alone as a full line termination, so it just does a literal line feed in place, without returning to the left side (the CR part). Open a Mac file on MSWormOS, and you'll (probably; I'm not that familiar with Mac-format files) see a line overprinted many times, because there's the carriage return but no line feed to move the print point to the next line. Open a Mac file on Unix, or the reverse, a Linux file on Mac, and you'll see all the lines run together, with an unrecognized character where the line termination would normally be, since the line-termination character in the file isn't recognized as belonging in a text file on the platform it's being opened on.

As mentioned, the mail RFCs were originally designed only for text, using this very limited set of characters. Then folks decided they wanted to be able to send binaries around as attachments, and had to come up with a way to encode the (eventually standardized to 8-bit bytes) binary files using the limited set of available text characters. The first commonly used format for doing so was Unix-to-Unix encoding, aka UUE, still not uncommon today. However, there was never a formal standard for UUE, and there were occasional compatibility problems as a result. Eventually a standard proposal, entitled MIME (Multipurpose Internet Mail Extensions), came about. MIME really formalized more than just file encoding, however, specifying additional message headers and body subformats as necessary to create a solid framework for conveying much more information about the various parts of a post, how they interrelate, and what the file names are if a part is a file.
Among other things, this made possible multipart posts with a normal text portion, an HTML-format portion, and various additional sections with other attachments, such as pictures, that could be referenced from the HTML portion, allowing full web-page formatting, complete with images and anything else a web page might include.

Two encoding formats were part of the MIME specifications. Quoted-printable encoding was designed primarily for text, and remains readable in "raw" form as well, therefore providing a level of backward compatibility with non-MIME clients. Part of the MIME/QP subspec is an escape char, used to tell a MIME-compatible client that the following two characters (hex digits) are to be interpreted as the hexadecimal representation of a character code, usually a non-printable character, or at least one outside the allowed common set. ("=" is that escape character, so any time a literal = appears in the content, the raw form of the message encodes it as =3D, the hexadecimal ASCII code for =, since = is the escape character and thus can't represent itself. Perhaps you've seen messages with such occasional =3D, or other =XX sequences? Now you know what they mean and why they're there!) Because of this escape char, it was /possible/ to encode binary data as quoted-printable, but doing so required three characters for every byte of non-printable binary data, a **VERY** inefficient method of encoding such data.

The second MIME encoding format, base64, was therefore designed with binary data in mind. From the name, and because I mentioned it earlier, you can probably guess that it uses 64 printable characters, the subset judged least troublesome in terms of retention across all the various platforms, in its encoding. 64 is 2^6, so each printed char in base64 represents exactly six bits of data, while taking an 8-bit char to do so.
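Both MIME encodings can be demonstrated with Python's standard quopri and base64 modules. A minimal sketch (not from the original post; the 4:3 figure follows from packing 6 bits of data into each 8-bit output char):

```python
import base64
import quopri

# Quoted-printable: plain ASCII passes through unchanged, so the raw
# message stays human-readable...
assert quopri.encodestring(b"Hello, world") == b"Hello, world"

# ...while the escape char "=" can't represent itself, so it becomes
# =3D (hex 3D = decimal 61, the ASCII code for "=").
assert quopri.encodestring(b"1 + 1 = 2") == b"1 + 1 =3D 2"

# Any byte outside the safe set costs three output chars per byte:
# the two UTF-8 bytes of the e-acute in "café" become =C3=A9.
assert quopri.encodestring("café".encode("utf-8")) == b"caf=C3=A9"

# Base64: every 3 input bytes become exactly 4 output chars, a fixed
# 33% expansion regardless of content.
payload = bytes(range(256)) * 3                # 768 bytes of arbitrary binary
encoded = base64.b64encode(payload)
assert len(encoded) == len(payload) * 4 // 3   # 1024 chars for 768 bytes
assert base64.b64decode(encoded) == payload    # lossless round trip
```

For actual mail transport, `base64.encodebytes()` additionally wraps the output at 76 chars per line, which costs a little extra for the CRLFs.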
Thus, the encoding expansion ratio is fixed: 4 chars represent 3 8-bit bytes of data, an increase in size of exactly 33% over the unencoded file size.

Despite, or really /because/ /of/, the fact that base64 was designed to be as efficient as possible at encoding binary data, given the limitations they were working with, encoding text as base64 makes it unreadable from the human perspective. Thus spammers, for example, often choose to encode their messages as base64, preventing most filters and the like from successfully blocking them. They'll package that message along with an HTML part whose script unpacks and displays the previously obscured message. Thus, for those such as myself who don't like HTML messages due to the security implications anyway, filtering any HTML message as spam or malware neatly covers both the security and HTML-spam bases with the same filter.

FWIW, the RFCs that set out the MIME specifications are RFCs 2045-2049. You can google them if interested in doing a bit more study on the topic. These were some of the first RFCs I ever read, and they are surprisingly interesting, not as technical and dry as I would have imagined, and /certainly/ far easier reading than the typical legalese on most stuff you sign (which you /do/ read before signing, right?). It was really amazing to me the extent to which they went to be backward compatible, yet how powerful and flexible the spec actually is.

OK, with all that covered, I can now deal with /half/ the answer to your question. yEnc changes the picture dramatically. Meanwhile...

While RFC 822 (the Internet message format standard; the transfer protocol itself, SMTP, the Simple Mail Transfer Protocol, is RFC 821) specifies a maximum line length of 1000 characters, 998 plus the standard terminating CRLF, many early clients really didn't work too well with line lengths exceeding the 80-char standard display width I mentioned above, forcing inconvenient manual horizontal scrolling, if they allowed one to view more than 80 characters of width at all.
This is the origin of the 80-character maximum (normal wrap at 72, to allow for quoting) line-length netiquette that forms part of the GNKSA test for news clients, which Pan complies with; hence, it's the reason Pan wraps at 72 chars when wrap-text is enabled. For this reason of compatibility with RFC-NON-compliant, 80-char-limited implementations, altho the spec allowed 998 characters of text per line, MIME restricts raw lines to roughly the same width: 78 chars plus terminating CRLF for text, and no more than 76 chars per line for base64-encoded content. (Note that with the format=flowed attribute, a MIME client observing it actually processes an entire paragraph as a single line, wrapping it dynamically to the window or other specified size, instead of arbitrarily at 72-78 chars.) Likewise, while UUE as an informal standard doesn't specify a hard line-wrap limit, traditionally it too limits lines to 72-78 chars, just to be safe.

Thus, the historical (i.e. non-yEnc) answer to your question would be that "lines" indicates the rough number of max-78-char lines within the post, and from that, given the 4:3 expansion ratio that both UUE and MIME base64 approximate, one can get a pretty good estimate of the size of any attached binary. Your 1000 lines at 78 chars per line would therefore roughly equate to 75 KB of encoded (still expanded) data, maximum (a quick estimate, subtracting 3 KB for headers, any included comment, and the fact that a data K is 1024 bytes, while you said 1000 lines, not 1024). At the 4:3 expansion ratio, that's about 57 KB of decoded binary data. Your 1000-line limit traditionally equates, therefore, to a 57 KB file.

As I previously hinted, yEnc changes these assumptions dramatically.
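The back-of-the-envelope arithmetic above can be written out as a tiny helper (a sketch only; 78 chars per line and the 4:3 ratio are the assumptions from the text, and the function name is mine):

```python
def estimate_decoded_kb(lines, chars_per_line=78):
    """Rough decoded size, in KB, of a UUE/base64-encoded binary
    spanning `lines` encoded lines of at most `chars_per_line` chars."""
    encoded_bytes = lines * chars_per_line      # upper bound on encoded size
    decoded_bytes = encoded_bytes * 3 // 4      # undo the 4:3 expansion
    return decoded_bytes / 1024

# The 1000-line cutoff discussed above:
print(round(estimate_decoded_kb(1000)))   # -> 57 (KB)
```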
While both UUE and MIME base64 encoding were originally designed for mail, and the news specs themselves quite literally originated in the mail specs, the modern Internet news transport system is in practice rather less limited than that, and is, with the exception of certain formatting and protocol issues such as the 998-character line limit and the significance of the CRLF sequence, generally full-8-bit binary compatible. yEnc is an informal encoding scheme (much like UUE in that respect) that is specifically designed to work with Internet news, maximizing the efficiency of binary encoding for news, and ONLY news. Thus, it specifically and deliberately breaks the possibility of easy and direct conversion between mail, with its stricter limits, and news, with its far less strict limits in practice, sacrificing that in order to gain encoding efficiency. As a result of that sacrifice, instead of the 33% overhead of the traditional UUE and MIME encoding methods, yEnc manages to work with only about 5% overhead, altho it's a bit more variable than that of base64.

One of the things yEnc does away with, in the interest of efficiency, is that old 78/80-characters-per-line limit. Instead, it goes with the 998/1000-character (byte, really, since it's not all text character data) per-line limit. Note that as yEncoded, not /all/ lines will be the full 998; some will be less, perhaps only 5 or 10 characters. However, the vast majority of the lines will be 998 chars plus CRLF, so one can estimate an average line length of perhaps 950 chars (bytes). 950 chars per line times your 1000 lines is 950,000 bytes. A quick estimate, therefore, puts your 1000-line limit at 900-ish KB, say 0.9 MB, of data post-decoding.

Your 1000-line limit therefore equates to only about 57 KB of data if traditionally (UUE or MIME base64) encoded, a pretty safe assumption that you aren't missing anything binary. If it's yEncoded, however, the result is VERY different.
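The yEnc version of the same estimate, under the assumptions above (~950 bytes per encoded line, ~5% overhead; a sketch, with a function name of my own invention):

```python
def estimate_yenc_decoded_kb(lines, bytes_per_line=950, overhead=0.05):
    """Rough decoded size, in KB, of a yEnc-encoded binary spanning
    `lines` encoded lines, assuming ~950 bytes/line and ~5% overhead."""
    encoded_bytes = lines * bytes_per_line
    return encoded_bytes / (1 + overhead) / 1024

# The same 1000-line cutoff now hides roughly 0.9 MB rather than ~57 KB:
print(estimate_yenc_decoded_kb(1000))
```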
You could be missing 9/10-MB (0.9 MB) files, or the last part of a multipart not detected as such, up to 9/10 MB in size! That's a decent-size JPEG in the image groups, tho it's not likely to be all that significant at the end of a half-hour MPEG. Of course, if it's an executable or something else that needs to be byte-for-byte perfect, it probably doesn't matter what size the missing part is; it can ruin the download. The same goes, of course, if you are one of those compulsive types where the CRCs /must/ match against the official list, or you might as well dump it.

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman in
http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html

_______________________________________________
Pan-users mailing list
Pan-users@nongnu.org
http://lists.nongnu.org/mailman/listinfo/pan-users