Hi,

Is anybody willing to share their experience with extracting content from
mailing list archives so that it can be indexed by Lucene or Solr?

Imagine that we have access to the archive of some mailing list (e.g.
http://www.mail-archive.com/mailman-users%40python.org/) and we would like
to index individual emails. Is there an easy way to extract just the text
written by the sender of each email? I am interested in the content written
by a particular sender, omitting the quoted original text. We can either
access individual emails via the web or download the monthly archive in
plain text format (but the content of individual emails depends on the
author's email client, i.e. plain text, HTML, HTML mixed with plain text in
<table> ... etc., so it is very messy).
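
To make the question concrete, here is a rough sketch of the kind of
extraction I have in mind, using only the Python standard library. It
assumes the monthly archive can be read as an mbox file, that quoted text is
marked with a leading ">" or introduced by an "On ... wrote:" line, and that
signatures follow the "-- " delimiter convention. None of these assumptions
hold for every mail client, and the function names (strip_quoted,
html_to_text, body_text, iter_documents) are made up for illustration.

```python
import mailbox
import re
from email import policy
from email.parser import BytesParser
from html.parser import HTMLParser

# Assumption: reply attributions look like "On <date>, <name> wrote:".
QUOTE_INTRO = re.compile(r"^On .+ wrote:\s*$")


class _TextOnly(HTMLParser):
    """Collects the text nodes of an HTML document and drops the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(markup):
    # Crude conversion: keeps all text nodes, including any inside
    # <script>, <style> or <blockquote>.
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


def strip_quoted(text):
    """Drop ">"-quoted lines, attribution lines and the signature."""
    kept = []
    for line in text.splitlines():
        if line.lstrip().startswith(">"):
            continue
        if QUOTE_INTRO.match(line.strip()):
            continue
        if line.rstrip() == "--":  # "-- " signature delimiter
            break
        kept.append(line)
    return "\n".join(kept).strip()


def body_text(msg):
    """Return the preferred text body of an EmailMessage as a string."""
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return ""
    content = part.get_content()  # may raise on unknown charsets
    if part.get_content_type() == "text/html":
        content = html_to_text(content)
    return content


def iter_documents(mbox_path):
    """Yield one dict per message, ready to be sent to an indexer."""
    # The factory makes mailbox return EmailMessage objects, which
    # provide get_body() and get_content().
    box = mailbox.mbox(
        mbox_path,
        factory=BytesParser(policy=policy.default).parse,
        create=False,
    )
    for msg in box:
        yield {
            "from": str(msg.get("From", "")),
            "subject": str(msg.get("Subject", "")),
            "body": strip_quoted(body_text(msg)),
        }
```

Each dict from iter_documents("2010-January.txt") (a hypothetical file
name) could then be posted to Solr as one document. The weak spots are
exactly the messy cases described above: HTML replies that quote with
<blockquote> instead of ">", and top-posted replies whose attribution line
does not match the pattern.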

I would prefer information about mailing lists managed by Mailman, but I
don't want to limit the scope of this question, so any general ideas are
welcome.

Regards,
Lukas
