Thanks for the helpful reply -- some comments interspersed below:

On Wednesday, November 26, 2025 01:30:09 PM Greg Wooledge wrote:
> On Wed, Nov 26, 2025 at 12:29:14 -0500, [email protected] wrote:
> > Does anybody here know of an AWK or sed program to convert mbox files to
> > HTML? [...]
> > I know that maildir is the currently favored approach for mail storage,
> > but I have well over 100 MB of emails (or pseudo emails) stored in mbox
> > files, and want to convert them for easy viewing on the Internet (by
> > anyone).
> 
> Why did you specifically ask for awk or sed?  They don't seem like the
> best choices for programming languages to implement this.

I thought they would be languages I could reasonably "handle" -- I have little 
knowledge of or experience with Perl, C[++], Python, or Tcl, for example.  
(The last general-purpose languages I was reasonably fluent in were Algol and 
Pascal, though I might be forgetting some.)

If I found a reasonably well written and well documented program in some other 
language that already does most of what I need, I imagine that I could modify 
it as required.

> With that large of an input, I would avoid bash.  It'll be slow.  Also,
> it has no useful libraries.
> 
> You're processing a large amount of text, in a fairly well-defined format,
> so any language that's good at text processing should do the job.  Perl,
> Python, or Tcl would be my picks, but that's probably my personal bias.
> 
> I'm guessing that what you want to end up with would be a directory
> containing one file per message, plus some sort of index.html file that
> links to all of them.  

I hadn't thought that far ahead, but that seems like a good approach.
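
For reference, a minimal sketch of that layout in Python (one of the languages 
suggested above), using the standard-library mailbox module.  The function 
name mbox_to_html and the msgNNNNN.html naming scheme are hypothetical, and the 
sketch assumes plain, non-MIME messages in UTF-8 or ASCII; multipart messages 
are only flagged, not converted:

```python
import html
import mailbox
from pathlib import Path


def mbox_to_html(mbox_path, out_dir):
    """Write one HTML page per message plus an index.html; return the count."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    box = mailbox.mbox(str(mbox_path), create=False)
    links = []
    try:
        for n, msg in enumerate(box, start=1):
            subject = html.escape(str(msg.get("Subject", "(no subject)")))
            if msg.is_multipart():
                # MIME parts are deliberately not handled in this sketch.
                text = "(multipart message, not converted)"
            else:
                # Assumption: bodies are UTF-8 or ASCII; bad bytes are replaced.
                raw = msg.get_payload(decode=True) or b""
                text = raw.decode("utf-8", errors="replace")
            name = f"msg{n:05d}.html"  # hypothetical naming scheme
            page = (
                '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
                f"<title>{subject}</title></head>\n"
                f"<body><h1>{subject}</h1>\n"
                f"<pre>{html.escape(text)}</pre></body></html>\n"
            )
            (out / name).write_text(page, encoding="utf-8")
            links.append(f'<li><a href="{name}">{subject}</a></li>')
    finally:
        box.close()
    # The index is just an unordered list of links, in mbox order.
    index = (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        "<title>Messages</title></head>\n<body><ul>\n"
        + "\n".join(links)
        + "\n</ul></body></html>\n"
    )
    (out / "index.html").write_text(index, encoding="utf-8")
    return len(links)
```

Calling mbox_to_html("notes.mbox", "site") (both names made up) would leave 
site/index.html plus one page per message.  Subjects and bodies are passed 
through html.escape so that stray "<" or "&" characters in a note cannot break 
the page.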

> If all the messages were plain pre-MIME "header and
> body", you could probably write a program to do that in less than an hour.
> 
> It's going to be tricky if you need to parse MIME attachments.  At that
> point, you'll probably need to break out whatever MIME libraries your
> chosen language has.  Even if it's just to discard the attachments, using
> a MIME library is a better approach than scrubbing out the MIME metadata
> lines with raw text manipulation.  If you actually want to preserve and
> link to the attachments, then the MIME libraries become indispensable.

Yeah, MIME.  The "pseudo emails" I referred to are basically my own plain-text 
notes without attachments.  But I do want to deal with "real" emails as well, 
and for those I will have to deal with MIME.  That requires more thought, or I 
may defer it until some indefinite time in the future.
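
As a rough illustration of the "use a MIME library" advice, here is a small 
sketch using Python's standard email package to find the attachments in one 
raw message so they can be dropped or linked.  The function name 
attachment_names and the "(unnamed)" placeholder are my own inventions, and the 
sketch assumes the sending program marked its attachments in the usual way:

```python
from email import message_from_bytes, policy


def attachment_names(raw_message):
    """Return the filenames of the attachments found in one raw message."""
    # policy.default selects the modern EmailMessage API.
    msg = message_from_bytes(raw_message, policy=policy.default)
    # iter_attachments() skips the body part(s) and yields the rest;
    # a plain, non-multipart message yields nothing.
    return [part.get_filename() or "(unnamed)" for part in msg.iter_attachments()]
```

An empty list means there is nothing to discard, which should be the case for 
plain-text notes.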

> Finally, you need to think about what you want to do with multipart
> messages.  A whole bunch of email these days is written in either HTML
> or some kind of "rich text", and then gets sent out as a multipart
> message, with the original HTML (or rich text converted to HTML) as the
> "preferred" part, and the same HTML or rich text converted to regular
> text as a "fallback" part.  Would you attempt to offer both parts
> somehow?  Or just offer the HTML part "as is" (probably with some of
> the headers reattached above it)?

As with MIME, my notes don't need this (they are not multipart), so I may 
defer it until some indefinite time in the future.
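
If and when that time comes, the "which part do I offer" question might be 
sketched like this with Python's standard email package.  The function name 
pick_body and its default preference order (plain text first, HTML second) 
are assumptions for illustration, not a recommendation from the quoted reply:

```python
from email import message_from_bytes, policy


def pick_body(raw_message, prefer=("plain", "html")):
    """Return (subtype, text) for the preferred body part, or (None, "")."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    # get_body() walks multipart/alternative and returns the best match
    # for the preference list; a plain message is returned as-is.
    part = msg.get_body(preferencelist=prefer)
    if part is None:
        return None, ""
    return part.get_content_subtype(), part.get_content()
```

Passing prefer=("html", "plain") would select the HTML part instead, so both 
versions could be offered by calling the function twice.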
