[Gossip] Re: The Great UTF-8 SWITCHEROO

2005-06-30 Thread Lars Magne Ingebrigtsen
Jeff Breidenbach writes:

> I've seen a small but not tiny number of messages where the
> Mail User Agent is sticking raw iso-8859-1 characters (outside
> the ASCII range) inside the Subject: header.

It's invalid, but it's not uncommon.  It's getting rarer, though, as
more and more legacy mail user agents end up on the scrap heap.

One puzzling thing I've seen, though, is that it's quite common for
Chinese-language and Russian-language messages to carry unencoded
characters (big-5 and koi8, respectively) in the Subject header.  I
haven't really looked into why, but I've had to add support for
default charset handling into Gmane to make certain lists at all
legible.  That is, when doing the conversion to utf-8, I have a
per-list charset list to be used, which I feed to the converter.
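
To illustrate the per-list fallback idea, here is a minimal Python sketch.
It is not Gmane's actual code: the list names, the DEFAULT_CHARSETS table
and the decode_raw_header function are all made up for the example, and
the order of candidates (ASCII, UTF-8, the per-list defaults, then
Latin-1 as a last resort) is just one plausible choice.

```python
# Hypothetical per-list defaults; names and charsets are illustrative only.
DEFAULT_CHARSETS = {
    "example.chinese.list": ["big5"],
    "example.russian.list": ["koi8_r"],
}


def decode_raw_header(raw: bytes, list_name: str) -> str:
    """Decode an unencoded (raw 8-bit) header value to a Python string.

    Tries ASCII and UTF-8 first, then the charsets configured for the
    list, and finally Latin-1, which accepts any byte sequence.
    """
    candidates = ["ascii", "utf-8"] + DEFAULT_CHARSETS.get(list_name, [])
    for charset in candidates:
        try:
            return raw.decode(charset)
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte to a character, so this cannot fail.
    return raw.decode("latin-1")
```

The strict decoders come first because a wrong guess there fails loudly,
while Latin-1 never fails and so would hide a wrong guess.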

An alternative approach is to use charset-guessing software (which is
supposed to be pretty good, these days), or look for clues in the
rest of the message for what the charset most likely is -- there may
be a Content-Type with a charset parameter, even though the headers
aren't RFC2047 encoded.  Etc.
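
The "look for clues in the rest of the message" variant could be sketched
like this, using Python's standard email package.  The function names are
invented for the example, and taking the first charset parameter found on
any MIME part is an assumption, not a rule from any standard.

```python
import email
from typing import Optional


def guess_charset_from_message(raw_message: bytes) -> Optional[str]:
    """Return the first Content-Type charset found in the message, if any."""
    msg = email.message_from_bytes(raw_message)
    # walk() yields the message itself first, then any MIME sub-parts.
    for part in msg.walk():
        charset = part.get_content_charset()
        if charset:
            return charset
    return None


def decode_header_with_hint(raw_header: bytes, raw_message: bytes) -> str:
    """Decode a raw 8-bit header value using the body's declared charset."""
    charset = guess_charset_from_message(raw_message)
    try:
        return raw_header.decode(charset or "ascii")
    except (UnicodeDecodeError, LookupError):
        # Unknown charset name or undecodable bytes: fall back to Latin-1.
        return raw_header.decode("latin-1")
```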

Or you can just ignore the problem with these invalid email
messages.  :-)

-- 
(domestic pets only, the antidote for overdose, milk.)
  [EMAIL PROTECTED] * Lars Magne Ingebrigtsen

___
Discussion list for The Mail Archive
Gossip@jab.org
http://jab.org/cgi-bin/mailman/listinfo/gossip


[Gossip] Re: mailman import question

2006-10-06 Thread Lars Magne Ingebrigtsen
"Jeff Breidenbach" writes:

> Does anyone have a "mailman archive to mbox" converter
> script in their back pocket? And when I say mailman archive,
> I'm talking about the "gzip'd text" like this:
>
> http://listas.asteriskbrasil.org/pipermail/asteriskbrasil/

I've included the simple script I use below.

> Note the lack of headers - ugh. I have no idea what the mailman
> folks were thinking, but this is really inconvenient - and makes it
> very hard for folks who want to import their list data into M-A.

Yup.  That they've stripped so many headers is a shame -- it's a major
loss of information.


-- 
(domestic pets only, the antidote for overdose, milk.)
  [EMAIL PROTECTED] * Lars Magne Ingebrigtsen


[Gossip] Re: mailman import question

2006-10-07 Thread Lars Magne Ingebrigtsen
"Jeff Breidenbach" writes:

> Thanks, Lars. I don't see the script - could you please resend?

Hm.  Looks like the attachment was stripped?

Here it is again:

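#!/usr/bin/perl
# Usage: SCRIPT URL [COMPRESSED]
# URL is the pipermail index page (with a trailing slash); pass a true
# second argument if the monthly archives are gzipped.

use strict;
use warnings;

my ($url, $compressed) = @ARGV;
die "usage: $0 URL [COMPRESSED]\n" unless defined $url;

my $suffix = $compressed ? "txt.gz" : "txt";

unlink "/tmp/piper-complete.txt";

# Fetch the index page that links to the per-period archive files.
system("wget", "--no-check-certificate", "-O", "/tmp/piper", $url);

# Collect every quoted file name that ends in the wanted suffix.
my @files;
open(my $piper, "<", "/tmp/piper") || die "Can't open /tmp/piper: $!";
while (<$piper>) {
    push @files, $1 if /"(.*\.\Q$suffix\E)"/;
}
close $piper;

# Go through the index backwards, as the original loop did.
for my $file (reverse @files) {
    print "$file\n";
    system("wget", "--no-check-certificate", "-O", "/tmp/piper.$suffix",
           $url . $file);
    if ($compressed) {
        system("zcat /tmp/piper.txt.gz | /gmane/fix-piper >> /tmp/piper-complete.txt");
    } else {
        system("/gmane/fix-piper < /tmp/piper.txt >> /tmp/piper-complete.txt");
    }
}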

and

fix-piper is the following:

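#!/usr/bin/perl
# Filter: repair pipermail's obfuscated addresses and Date lines.

while (<>) {
    # Undo the "user at example.com" obfuscation ("en" in some archives).
    s/^(From:? .*) (?:at|en) /$1\@/;
    # "Date: Mon Oct  2 12:00:00 2006" -> "Date: Mon, 2 Oct 2006 12:00:00 +"
    # (the replacement ends in a bare " +", kept exactly as posted).
    s/^Date: ([A-Z][a-z][a-z]) +([A-Z][a-z][a-z]) +([0-9]+) +([0-9:]+) +([0-9]+)/Date: $1, $3 $2 $5 $4 +/;
    print;
}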


-- 
(domestic pets only, the antidote for overdose, milk.)
  [EMAIL PROTECTED] * Lars Magne Ingebrigtsen
