On 4/28/05, Henrique de Moraes Holschuh <[EMAIL PROTECTED]> wrote:
>
> What you *have* to do to get such a thing accepted is to, instead, write
> something that *fixes* the headers with the following capabilities:
>
> 1. Notion of a default source charset, which is a hint of the charset to
>    encode *from* (because the input data does not have that information)
>
> 2. Either a configurable destination charset, or use UTF-8 (I would much
>    rather you went the full way and made it configurable, I believe at
>    least the CJK people would appreciate that a lot).
>
> 3. Functionality:
>    3.1: Detect illegal 8-bit in headers, and apply the correction
>         algorithm described below (configurable)
>    3.2: Pass-through any non-8bit headers.
>    3.3: Reject messages with 8-bit headers.
>
> Algo for charset conversion:
>
> Step 1: Look for certain hints of charsets, to try to determine the
> correct source charset: UTF signatures, ISO-2022 escape sequences, etc.
>
> Step 2: If not found, use the default source charset.
>
> Step 3: Verify if the input sequence is *100% valid and correct* in the
> chosen/detected charset. If it is not, reject the message.
>
> Step 4: Convert to the destination charset (option: detected
> charset, configured destination charset), and RFC-2047 encode the
> header.
>
> This needs to be done *before* any sieve processing, etc.
>
> So far, nobody that keeps complaining about the "X" things has taken the
> time to do the above.

Hello,

Why not encode the header using unknown-8bit as the charset? This is
simpler, and has many advantages:

1. No information is lost. No silent change is made to a message. No
   confusion can be caused by converting a valid word/name into a
   different valid word/name.

   (Silently converting to "X" is regarded as unethical by many network
   administrators. It can cause pain to innocent users. Imagine a message
   subject which is converted into "Our beloved Xenia has died", but
   Xenia is alive. In European languages not every letter is 8-bit, so
   you will not receive XXXXXXXX but possibly some meaningful text.
   X-ing can cause damage to business, can be contrary to contracts, can
   be illegal.)

   If you cannot do good, at least do no harm.

   Rejecting mail can also be unethical (important messages will not be
   delivered because of an 8-bit character) and can also cause problems.
   If it is a mailing list message, the receiver will be silently
   unsubscribed and will not receive any error telling him to fix his
   mailer.

2. It is theoretically reversible at any time in the future (if messages
   are later reprocessed, archived, downloaded using fetchmail, etc.).

3. unknown-8bit is registered by IANA:
   http://www.iana.org/assignments/character-sets
   It is used by some mail programs. See
   http://www.google.com/search?q=unknown-8bit&start=10&sa=N

4. Broken messages can be found with a search at any time (but false
   positives are possible). Senders can be notified. Statistics can be
   made. Filters can be used (like: if message is from newsgroup
   soc.culture.xyz and contains ....... do .........).

5. MUAs can use whatever heuristic they want to find the real charset.
   They may use user preferences or system defaults, they may use rules
   specific to a certain newsgroup, mailing list or sender, or they may
   check whether a subject is plausible using a dictionary. They may
   also warn the user, display the undecoded text and so on.

   I've done tests with some MUAs, with good results. Some of them
   include support for unknown-8bit because it can be used in contexts
   like "Content-Type: text/plain; charset=unknown-8bit" (I believe
   sendmail may generate this, following RFC 1428. How will Cyrus search
   in such a message, BTW?). They seem happy to accept it in an RFC-2047
   header, possibly as an unintended side-effect of the charset-handling
   code. I don't know whether they use the same charset detection code
   as for a message with raw 8-bit characters or a different one.

   Others will treat it as an unknown charset and display the undecoded
   string.

   I've found none, but there may be some broken programs which will
   crash, corrupt the mail store or execute the mail content.

6. This will fix this Cyrus problem forever. (By contrast, no heuristic
   can achieve this: it will need to be adapted, patched and improved,
   and still it will not be perfect; it may find the wrong charset and
   confuse users. The closer the heuristic is to the user, the better it
   will work. So MUAs should guess the charset.)

7. The same code can be used to implement a site default. (Replace
   unknown-8bit with what you want and __do__ some plausibility checks.)

8. It will work at any point in the mail path. Cyrus will do this, but
   MTAs and news servers can also convert headers to unknown-8bit, or
   reverse the conversion if necessary.

PROBLEMS:

1. How to handle appending messages obtained through other
   servers/protocols to a Cyrus mailbox. I don't know the IMAP protocol
   well enough, but I believe the client will have the old headers and
   will not expect them to be changed, so it may crash, corrupt the
   messages or do some other unwanted things. How is this handled with
   X-ing? Rejecting and doing some conversion on the client store before
   appending may be a solution.

2. 8-bit characters in bodies are not fixed by this proposal. MTAs
   should handle this via MIME. Cyrus should reject 8-bit characters in
   bodies.

3. What to do with a message inside a message. How does the present code
   handle this? Or are such messages not parsed by Cyrus? (Haven't
   tried.)

4. Care should be taken not to RFC-2047-encode text which must be ASCII.
   Even when properly encoded, non-ASCII is not valid everywhere in
   headers.

Best regards,
Adrian Buciuman

---
Cyrus Home Page: http://asg.web.cmu.edu/cyrus
Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
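
As an illustration of the unknown-8bit proposal in the message above, here
is a minimal sketch in Python. It is not Cyrus code: the function names
(has_8bit, encode_unknown_8bit, decode_unknown_8bit) are hypothetical, and
the choice of base64 ("B") encoding is an assumption made for brevity. It
assumes the value belongs to an unstructured header such as Subject, leaves
header folding out, and splits long values on byte boundaries. RFC 2047
wants each encoded-word to hold whole characters, which cannot be
guaranteed when the charset is unknown.

```python
import base64
import re

CHARSET = "unknown-8bit"   # label registered with IANA (see RFC 1428)
MAX_ENCODED_WORD = 75      # RFC 2047 length limit for one encoded-word

# Matches base64 encoded-words only, which is all this sketch produces.
_ENCODED_WORD = re.compile(rb"=\?([^?\s]+)\?[Bb]\?([A-Za-z0-9+/=]*)\?=")


def has_8bit(raw: bytes) -> bool:
    """True if the raw header value contains a byte outside 7-bit ASCII."""
    return any(b >= 0x80 for b in raw)


def encode_unknown_8bit(raw: bytes, charset: str = CHARSET) -> bytes:
    """Wrap a raw 8-bit header value in base64 RFC 2047 encoded-words.

    Pure-ASCII values pass through unchanged. The bytes are never
    interpreted, so nothing is replaced or lost. Passing another
    single-byte charset label gives the "site default" variant; a
    multi-byte charset would need splitting on character boundaries.
    """
    if not has_8bit(raw):
        return raw
    overhead = len(f"=?{charset}?B??=")
    # Base64 maps 3 bytes to 4 characters; keep every word within the limit.
    chunk = (MAX_ENCODED_WORD - overhead) // 4 * 3
    words = []
    for i in range(0, len(raw), chunk):
        payload = base64.b64encode(raw[i:i + chunk])
        words.append(b"=?" + charset.encode("ascii") + b"?B?" + payload + b"?=")
    # Whitespace between adjacent encoded-words is ignored by decoders.
    return b" ".join(words)


def decode_unknown_8bit(value: bytes) -> bytes:
    """Reverse encode_unknown_8bit and return the original raw bytes."""
    words = _ENCODED_WORD.findall(value)
    if not words:
        return value
    return b"".join(base64.b64decode(payload) for _, payload in words)


# Latin-1 bytes with no declared charset, as a broken mailer might send.
raw = b"Our beloved \xc9milie has arrived"
encoded = encode_unknown_8bit(raw)
assert decode_unknown_8bit(encoded) == raw  # the original bytes come back
```

The round trip at the end is the point of the sketch: the stored header is
pure ASCII, yet the original bytes can be recovered later by an MUA, an
archiver or a reprocessing step, which can then apply whatever charset
heuristic it prefers.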