Hi, is anybody willing to share their experience of extracting content from mailing list archives so that it can be indexed by Lucene or Solr?
Imagine that we have access to the archive of some mailing list (e.g. http://www.mail-archive.com/mailman-users%40python.org/) and we would like to index the individual emails. Is there an easy way to extract just the text written by the sender of each email? I am interested only in the content generated by that particular sender, omitting the quoted original text.

We can either access individual emails via the web or download the monthly archive in plain text format. However, the content of individual emails depends on the author's email client (plain text, HTML, HTML mixed with plain text in <table>, etc.), so it is very messy.

I would prefer information about mailing lists managed by Mailman, but I don't want to limit the scope of this question, so any general ideas are welcome.

Regards,
Lukas
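
P.S. To make the question more concrete, here is a minimal sketch of the kind of processing I mean. It is not a tested solution. It assumes the monthly archive can be parsed as an mbox file by Python's standard mailbox module, that replies quote the original text with leading ">" characters, and that a signature starts at a "-- " line. The function names (strip_quoted, plain_text_body, authored_texts) and the attribution pattern are made up for illustration. HTML-only messages are simply returned as empty text here, which is exactly the messy case I am asking about.

```python
import mailbox
import re

# Heuristic (an assumption): attribution lines such as "On <date>, <name> wrote:".
ATTRIBUTION = re.compile(r"^On .+ wrote:\s*$")


def strip_quoted(text):
    """Drop quoted lines, attribution lines, and everything after a signature delimiter."""
    kept = []
    for line in text.splitlines():
        if line.rstrip() == "--":          # conventional "-- " signature delimiter
            break
        if line.lstrip().startswith(">"):  # quoted original text
            continue
        if ATTRIBUTION.match(line):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


def plain_text_body(msg):
    """Return the first text/plain part of a message, or "" if there is none."""
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            return payload.decode(charset, errors="replace")
        except LookupError:                # unknown charset name in the header
            return payload.decode("utf-8", errors="replace")
    return ""


def authored_texts(mbox_path):
    """Yield (sender, subject, text without quotes) for each message in an mbox file."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            yield (
                msg.get("From", ""),
                msg.get("Subject", ""),
                strip_quoted(plain_text_body(msg)),
            )
    finally:
        box.close()
```

Each tuple yielded by authored_texts could then be turned into one Lucene or Solr document, with the sender and subject as separate fields. What I am missing is a robust way to handle the HTML and mixed-format messages and the many different quoting styles.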