[Gossip] Re: The Great UTF-8 SWITCHEROO
Jeff Breidenbach writes:

> I've seen a small but not tiny number of messages where the
> Mail User Agent is sticking raw iso-8859-1 characters (outside
> the ASCII range) inside the Subject: header.

It's invalid, but it's not uncommon. It's getting rarer, though, as more and more legacy mail user agents end up on the scrap heap.

One puzzling thing I've seen, though, is that it's quite common for Chinese-language and Russian-language messages to have unencoded characters (big-5 and koi8, respectively) in the Subject header. I haven't really looked into why, but I've had to add support for default charset handling to Gmane to make certain lists at all legible. That is, when doing the conversion to utf-8, I keep a per-list list of charsets to try, which I feed to the converter.

An alternative approach is to use charset-guessing software (which is supposed to be pretty good these days), or to look for clues in the rest of the message as to what the charset most likely is -- there may be a Content-Type with a charset parameter, even though the headers aren't RFC2047 encoded. Etc.

Or you can just ignore the problem with these invalid email messages. :-)

-- 
(domestic pets only, the antidote for overdose, milk.)
  [EMAIL PROTECTED] * Lars Magne Ingebrigtsen

___
Discussion list for The Mail Archive
Gossip@jab.org
http://jab.org/cgi-bin/mailman/listinfo/gossip
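The per-list default charset idea from the message above could be sketched roughly as follows. This is only an illustration, not Gmane's actual converter: the function name `decode_header_bytes` and the final latin-1 fallback are assumptions made for the example. It tries utf-8 first, then each charset configured for the list, in order.

```python
from typing import Iterable


def decode_header_bytes(raw: bytes, list_charsets: Iterable[str]) -> str:
    """Decode raw (non-RFC2047) header bytes to text.

    Hypothetical helper: tries utf-8 first, then each per-list charset in
    the order given, and finally falls back to latin-1 (an assumption of
    this sketch), which accepts any byte sequence.
    """
    for charset in ["utf-8", *list_charsets]:
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or a charset name Python does not know: try the next.
            continue
    return raw.decode("latin-1")


# Example: a Russian list configured with koi8-r as its default charset.
subject = decode_header_bytes("Привет".encode("koi8-r"), ["koi8-r"])
```

Order matters here: single-byte charsets such as koi8-r accept any byte sequence, so anything listed after one of them is never tried.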
[Gossip] Re: mailman import question
"Jeff Breidenbach" writes: > Does anyone have a "mailman archive to mbox" converter > script in their back pocket? And when I say mailman archive, > I'm talking about the "gzip'd text" like this: > > http://listas.asteriskbrasil.org/pipermail/asteriskbrasil/ I've included the simple script I use below. > Note the lack of headers - ugh. I have no idea what the mailman > folks were thinking, but this is really inconvenient - and makes it > very hard for folks who want to import their list data into M-A. Yup. That they've stripped so many headers is a shame -- it's a major loss of information. -- (domestic pets only, the antidote for overdose, milk.) [EMAIL PROTECTED] * Lars Magne Ingebrigtsen ___ Discussion list for The Mail Archive Gossip@jab.org http://jab.org/cgi-bin/mailman/listinfo/gossip
[Gossip] Re: mailman import question
"Jeff Breidenbach" writes: > Thanks, Lars. I don't see the script - could you please resend? Hm. Looks like the attachment was stripped? Here it is again: #!/usr/bin/perl $url = $ARGV[0]; $compressed = $ARGV[1]; unlink "/tmp/piper-complete.txt"; if ($compressed) { $suffix = "txt.gz"; } else { $suffix = "txt"; } system("wget", "--no-check-certificate", "-O", "/tmp/piper", $url); open(PIPER, "/tmp/piper") || die; while () { $files[$i++] = $1 if /"(.*.$suffix)"/; } close PIPER; for (; $i >= 0; $i--) { print($files[$i], "\n"); system("wget", "--no-check-certificate", "-O", "/tmp/piper.$suffix", $url . $files[$i]); if ($compressed) { system("zcat /tmp/piper.txt.gz | /gmane/fix-piper >> /tmp/piper-complete.txt"); } else { system("/gmane/fix-piper < /tmp/piper.txt >> /tmp/piper-complete.txt"); } } and fix-piper is the following: #!/usr/bin/perl while () { s/^(From:? .*) (at|en) /\1\@/; s/^Date: ([A-Z][a-z][a-z]) +([A-Z][a-z][a-z]) +([0-9]+) +([0-9:]+) +([0-9]+)/Date: \1, \3 \2 \5 \4 +/; print; } -- (domestic pets only, the antidote for overdose, milk.) [EMAIL PROTECTED] * Lars Magne Ingebrigtsen ___ Discussion list for The Mail Archive Gossip@jab.org http://jab.org/cgi-bin/mailman/listinfo/gossip