Re: Extracting content from mailman managed mail list archive

Lukáš Vlček Mon, 08 Mar 2010 07:04:54 -0800

I just checked popular search services and it seems that neither
lucidimagination search nor search-lucene support this:
http://www.lucidimagination.com/search/document/954e8589ebbc4b16/terminating_slashes_in_url_normalization
http://www.search-lucene.com/m?id=510143ac0608042241k49f4afe7wcd25df3fbacc7...@mail.gmail.com||mailman


Markmail does not support this either:
http://markmail.org/message/papbjx3aoz3uvbhh

Hmmm....
I think it would be useful to extract just the *NEW* content, without the
quoted text, because indexing the quotes along with each reply influences
Lucene scoring.
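
To make the idea concrete, here is a rough sketch in Python (my own
assumption of how it could be done, not something any of the services above
is known to do). It drops lines starting with ">", drops English-style
"On ... wrote:" attribution lines, and cuts at the conventional "-- "
signature delimiter. The name strip_quotes is hypothetical, and the
attribution pattern will miss other locales and attributions that wrap over
two lines.

```python
import re

# Attribution line such as "On Mon, Mar 8, 2010 at 3:55 PM, Someone wrote:".
# Assumption: English-language mail clients and a single-line attribution.
ATTRIBUTION = re.compile(r"^On\s.+\swrote:\s*$")
# Any line whose first non-blank character is ">" is treated as quoted text.
QUOTED = re.compile(r"^\s*>")


def strip_quotes(body: str) -> str:
    """Return only the lines the sender wrote, dropping quoted replies."""
    kept = []
    for line in body.splitlines():
        if QUOTED.match(line) or ATTRIBUTION.match(line):
            continue
        # Conventional signature delimiter: drop it and everything after it.
        if line == "-- ":
            break
        kept.append(line)
    # Collapse the runs of blank lines left behind by the removed quotes.
    text = "\n".join(kept)
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```

Only the text returned by strip_quotes would then go into the indexed body
field; the full message could still be stored in a separate, unindexed field
for display.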

Regards,
Lukas

On Mon, Mar 8, 2010 at 3:55 PM, Lukáš Vlček <lukas.vl...@gmail.com> wrote:

> Hi,
>
> is anybody willing to share their experience of extracting content from
> mailing list archives in order to have it indexed by Lucene or Solr?
>
> Imagine that we have access to the archive of some mailing list (e.g.
> http://www.mail-archive.com/mailman-users%40python.org/) and we would like
> to index individual emails. Is there an easy way to extract just the text
> written by the sender of each email? I am interested in the content
> generated by a particular sender, omitting the original quoted text. We
> can either access individual emails via the web or download the monthly
> archive in plain text format (but the content of individual emails depends
> on the email client of the author, i.e. plain text, HTML, HTML mixed with
> plain text in <table> ... etc., so it is very messy).
>
> I would prefer information about mailing lists managed by mailman but I
> don't want to limit the scope of this question so any general ideas are
> welcome.
>
> Regards,
> Lukas
>
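
For the monthly-archive route, a possible starting point is Python's standard
mailbox and email modules. The sketch below assumes the downloaded file is in
(or close enough to) mbox format, which I have not verified for every
archive. It prefers the text/plain part of each message and falls back to
tag-stripped HTML. The names iter_archive, message_text and html_to_text are
my own, not part of any library.

```python
import mailbox
from email import policy
from email.parser import BytesParser
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect only the character data of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Strip tags and normalise whitespace (crude, but copes with <table> mail)."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def message_text(msg) -> str:
    """Return the plain-text body, falling back to HTML with tags removed."""
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return ""
    content = part.get_content()
    if part.get_content_type() == "text/html":
        return html_to_text(content)
    return content


def iter_archive(path):
    """Yield (from, subject, text) for every message in an mbox file."""
    # Parse with the modern policy so that get_body() is available.
    parse = BytesParser(policy=policy.default).parse
    box = mailbox.mbox(path, factory=parse, create=False)
    try:
        for msg in box:
            yield msg["From"], msg["Subject"], message_text(msg)
    finally:
        box.close()
```

The text returned for each message still contains quoted replies, so
quote removal would have to be applied to it as a separate step before
indexing.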
